Kernel Planet

June 04, 2021

Matthew Garrett: Mike Lindell's Cyber "Evidence"

Mike Lindell, notable for absolutely nothing relevant in this field, today filed a lawsuit against a couple of voting machine manufacturers in response to them suing him for defamation after he claimed that they were covering up hacks that had altered the course of the US election. Paragraph 104 of his suit asserts that he has evidence of at least 20 documented hacks, including the number of votes that were changed. The citation is just a link to a video called Absolute 9-0, which claims to present sufficient evidence that the US supreme court will come to a 9-0 decision that the election was tampered with.

The claim is that Lindell was provided with a set of files on the 9th of January, and gave these to some cyber experts to verify. These experts identified them as packet captures. The video contains scrolling hex, and we are told that this is the raw encrypted data from the files. In reality, the hex values correspond very clearly to printable ASCII, and appear to just be the Pennsylvania voter roll. They're not encrypted, and they're not packet captures (they contain no packet headers).
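Both observations are easy to check mechanically. The following is a minimal sketch, not the analysis actually performed for this post: it tests whether a blob starts with one of the well-known pcap or pcapng magic numbers, and measures how much of it is printable ASCII. The sample hex string is made up for illustration and is not taken from the video.

```python
import string

# Magic numbers that open classic pcap files (either byte order, micro- and
# nanosecond timestamp variants) and pcapng files.
PCAP_MAGICS = {
    bytes.fromhex("d4c3b2a1"),
    bytes.fromhex("a1b2c3d4"),
    bytes.fromhex("4d3cb2a1"),
    bytes.fromhex("a1b23c4d"),
    bytes.fromhex("0a0d0d0a"),  # pcapng section header block
}

PRINTABLE = set(string.printable.encode("ascii"))


def looks_like_pcap(data: bytes) -> bool:
    """True if the first four bytes are a known capture-file magic number."""
    return data[:4] in PCAP_MAGICS


def printable_fraction(data: bytes) -> float:
    """Fraction of bytes that are printable ASCII (0.0 for empty input)."""
    if not data:
        return 0.0
    return sum(b in PRINTABLE for b in data) / len(data)


# Hypothetical sample standing in for the hex scrolling past on screen.
sample = bytes.fromhex("534d4954482c4a4f484e2c50412c3137313031")
print(sample.decode("ascii"))
print(looks_like_pcap(sample), printable_fraction(sample))
```

Encrypted data would look close to uniformly random, so a printable fraction near 1.0 is a strong hint that the bytes are plain text.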

20 of these packet captures were then selected and analysed, giving us the tables contained within Exhibit 12. The alleged source IPs appear to correspond to the networks the tables claim, and the latitude and longitude presumably just come from a geoip lookup of some sort (although clearly those values are far too precise to be accurate). But if we look at the target IPs, we find something interesting. Most of them resolve to the website for the county that was the nominal target (eg, 198.108.253.104 is www.deltacountymi.org). So, we're supposed to believe that in many cases, the county voting infrastructure was hosted on the county website.

Unfortunately we're not given the destination port, but 198.108.253.104 isn't listening on anything other than 80 and 443. We're told that the packet data is encrypted, so presumably it's over HTTPS. So, uh, how did they decrypt this to figure out how many votes were switched? If Mike's hackers have broken TLS, they really don't need to be dealing with this.

We're also given some background information on how it's impossible to reconstruct packet captures after the fact (untrue), or that modifying them would change their hashes (true, but in the absence of known good hash values that tells us nothing), but it's pretty clear that nothing we're shown actually demonstrates what we're told it does.

In summary: yes, any supreme court decision on this would be 9-0, just not the way he's hoping for.

Update: It was pointed out that this data appears to be part of a larger dataset. This one is even more dubious - it somehow has MAC addresses for both the source and destination (which is impossible), and almost none of these addresses are in actual issued ranges.


June 04, 2021 05:49 AM

Linux Plumbers Conference: Performance and Scalability Microconference Accepted into 2021 Linux Plumbers Conference

We are pleased to announce that the Performance and Scalability Microconference has been accepted into the 2021 Linux Plumbers Conference.

All parts of the Linux ecosystem, kernel and userspace, should account for performance and scalability. The purpose of this microconference is for developers from different projects to meet and collaborate, as the entire stack must perform well for the user to see good results. Because performance and scalability are very generic topics, this microconference focuses on issues that may also be addressed in other, more specific sessions.

The structure will be similar to what was followed in previous years, including topics such as synchronization primitives, bottlenecks in memory management, testing/validation, lockless algorithms and RCU, among others.

Here are some of the outcomes from the last time the event was held in 2018:

This year’s topics tentatively include:

Come and join us in the discussion of improving performance and scalability of your system.

We hope to see you there.

June 04, 2021 01:12 AM

June 03, 2021

Brendan Gregg: An Unbelievable Demo

This is the story of the most unbelievable demo I've been given in the world of open source. You can't make this stuff up. It was 2005, and I felt like I was in the eye of a hurricane. I was an independent performance consultant and Sun Microsystems had just released DTrace, a tool that could instrument all software. This gave performance analysts like myself X-ray vision. While I was busy writing and publishing advanced performance tools using DTrace (my open source [DTraceToolkit] and other [DTrace tools]), I noticed something odd: I was producing more DTrace tools than were coming out of Sun itself. Perhaps there was some internal project that was consuming all their DTrace expertise?


DTraceToolkit v0.96 tools (2006)

As I wasn't a Sun Microsystems employee I wasn't privy to Sun's internal projects. However, I was doing training and consulting for Sun, helping their customers with system administration and performance. Sun sometimes invited me to their own customer meetings and other events I might be interested in, as a local expert. I was living in Sydney, Australia.

This time I was told that there was a Very Important Person visiting from the US whom I'd want to meet. I didn't recognize the name, but was told that he was a DTrace expert and developer at Sun, and was on a world tour demonstrating Sun's new DTrace-based product. Ah-hah – this must be the internal project! But this would be no ordinary project. I'd seen some amazing technologies from Sun, but I'd never seen a developer on a world tour. This was going to be big, and would likely blow away my earlier DTrace work.

The VIP was returning to Sydney for a few days before going to the next Australian city, so we agreed to meet at the Sun Sydney office.

The Meeting

The DTrace expert arrived wearing casual business attire and a heavy American accent, and seemed a bit weary from his world tour. He had just been to South Africa and New Zealand, and listed other countries and cities he was heading to next.

Two other Australian Sun staff joined the meeting, and one introduced me with: "Brendan teaches some classes for us, and has been doing some DTrace stuff." Low-key introductions are the norm in Australia (especially for Australians) and I wondered whether he knew of this cultural difference. Another difference was that there were few roles in Australia for engineers in 2005, unlike the US. The Sun Microsystems Australia jobs, for example, were all in support and none in development, and other tech giants had not yet arrived (unlike today). So back then in Australia you could find amazing engineers doing whatever roles were available.
I tried to expand on the "stuff" a bit by saying that I’d written the DTraceToolkit, but he wasn't impressed. He didn't recognize my name, nor had he heard of the DTraceToolkit. To him, I was just some random guy.

He was kind enough to give me a quick demo anyway. His DTrace product was an add-on for a larger Sun GUI that I was already familiar with. After it loaded, he showed how you could run one of several DTrace tools by double clicking an icon. Either the raw output would be printed in a separate window, or the results would be shown as a line graph.

This seemed __quite underwhelming__. The GUI already had this functionality: Showing the raw output of tools or drawing a line graph. I was hoping for a new GUI feature. The only new work was the tools themselves, of which there were several.

He gave a quick sales pitch about the new and amazing observability they provided, something he must have said many times to impress customers. I got the feeling he wasn't expecting me to properly appreciate their value. But I _did_ understand these tools, since I had coded similar functionality for my own DTraceToolkit. They were useful, but...I was expecting a hurricane of awesome _new_ DTrace content.

"I've done these before – I've written tools that do these things myself!"

"Yeah, sure." He didn’t quite say it, but gave me a look like he didn't really believe me, or that I could even truly understand what they were. This was an important innovation by Sun Microsystems, a US-based multinational company worth billions. I was just some random Aussie.

Socket Tracing

I browsed the GUI icons for something new, and the closest was a tool for tracing socket I/O. I had tried this in 2004 ([socketsnoop.d]) and published it as open source, but my tool was incomplete: I didn't have access to the kernel source code so I had to figure out everything the hard way using black box analysis.
It worked for most TCP traffic types but not others, which I warned about in the script comments. I'd also not included it in the DTraceToolkit yet as I didn't consider it finished. So of all the tools he had, I was most interested to see this one. Sun could do a much better job just by referring to the source code they were instrumenting, and actually finish this tool.

"Can I see the socket I/O script?". I fired up a terminal.

He looked alarmed at first, as if I wasn't supposed to look behind the curtain, then realized another selling feature: "Well, sure, you could even add more tools to the GUI!" and after a pause, added "if you have them".

Sure, I have them all right.

He gave me a path to start looking under, and after a bit of searching I found the directory with all the tools he had been demoing. The tools all had familiar names. One was even called socketsnoop.d. A new possibility dawned on me. No way.

I printed socketsnoop.d. The screen filled with _my own script_. It was the same incomplete attempt I had hacked up a year earlier, and published as open source. It included some weird code that only made sense when I wrote it (use of PFORMAT, prior to defaultargs) and was written in my earlier coding style. I was looking at _my own fucking script_.

"This is MY script."

I printed the other tools and saw the same – they were _all mine_. This hot new Sun product that Mr. VIP was touring the world showing off was actually just my own open source tools. My jaw was on the floor.

He didn't seem to believe me.

You Can't Do That

I used grep to search all his tools for my name, which was in the header comment of all my tools, to prove beyond a doubt that these were mine. But I found nothing. My name had been stripped. Some of my tools had even included the line:

# Author: Brendan Gregg  [Sydney, Australia]

And now, here he was, in Sydney, Australia, trying to sell Brendan Gregg's tools to Brendan Gregg.

One of the Australian Sun staff interrupted: "Those say copyright Sun Microsystems." Most of my tools had my own copyright and a GPLv2 or CDDL license. But these only had Sun's standard copyright message, and the open source licenses had been stripped.

"You deleted my name! And the copyrights and licenses!"

The other Aussie added, to the VIP: "You can't do that."

A silence fell over the room as the magnitude of what had happened sunk in. While some at Sun were encouraging open source contributions and building a community, others were ripping off that same community. Taking their work, changing the licence and copyrights, and then selling it.

The VIP wasn't prepared for this and had a look of confusion. He didn't say much, other than that he didn't know what had happened, and that he may have gotten the tools from someone else already like this (ie, don't blame me). He seemed to be only half believing what we were saying.

The meeting ended quickly. I suggested that he get newer copies of my tools, directly from the DTraceToolkit, since these older versions from my homepage were out of date, and some had errors that I had already fixed. I also reminded him to keep my name, copyright, and license on all of them.

In his defense, perhaps the meeting may have gone differently had I not been given a low-key Australian introduction. That's an Australian cultural problem (tall poppy syndrome). To an Australian, introductions in the US can sound boastful, but they can also be useful as a quick way to share one's specialties.

Other Cases

Of all the tools I had published as open source, I still can't believe socketsnoop.d was included. It wasn't even very good. Later on I wrote much better socket tools (in my [DTrace] and [BPF] books).

A few years later, Apple added dozens of my tools to OS X.
They left my name, copyright, and CDDL open source license intact, and even improved and enhanced some of them. Years later, Oracle did the same for Oracle Solaris 11, and the BSD community did for FreeBSD. My thanks to all of you.

You might say that this wasn't really Sun the company doing this, but rather, a careless individual. But there was something in Sun's culture that contributed to this kind of carelessness. It was something I and my consulting colleagues had run into before: The belief at Sun that only Sun could make good use of its own technologies, and anything created outside of Sun was trash. When these Sun employees found something that was good, they were inclined to assume it came from Sun, and it was therefore safe to reuse and rebrand (and relicense) as they assumed they already held the copyrights.

There were also others at Sun that did try hard to do the right thing by me and my work. On at least four other occasions my DTraceToolkit was built into observability products, without stripping licenses. (In one case they wanted to relicense to GPL, and talked to me and Sun legal about it, but that's another story.)

This also wasn't the last time someone unwittingly tried to sell me my own work, it was just the first. I've learned not to tell sales people that I invented what they are showing me, as they then give me funny looks like I'm a crazy person, but instead to simply say "I have a lot of experience with that technology" and leave it at that.

I'm reminded of this first case since my BPF tools are now appearing in observability products, and will grow to a scale much bigger than my DTrace tools. I'll write about it more in future posts, but my immediate advice to developers is this: Please do not rewrite my BPF tools and the bcc libraries; build upon them as-is (either bcc Python or bcc libbpf-tool versions).
They are still works-in-progress, and rewriting (forking) them divides engineering resources and has your customers using out of date versions. We are better off with all the wood behind one arrow. (Note that I think my flame graph software is different: Since it is a simple algorithm that doesn't need much maintenance, I don't see a big problem with people rewriting it. It is nice to get some thanks, however, just as I have done for those that inspired flame graphs.)

As for the unbelievable demo: This wasn't the great DTrace product I imagined when hearing about a world tour. It was, in fact, my own tools.

I suspect that it's not uncommon for an open source developer to discover, at some point, that their own code has been rebranded. But the circumstance in this case may be a little unusual. A US developer got a world tour for software he didn't write, which included giving a sales pitch and demo in Australia, unwittingly, to the author.

I don't think he even said thank you.

[socketsnoop.d]: http://www.brendangregg.com/DTrace/socketsnoop.d
[DTrace]: /dtrace.html
[BPF]: /bpf-performance-tools-book.html
[DTraceToolkit]: /dtracetoolkit.html
[DTrace tools]: /dtrace.html

June 03, 2021 02:00 PM

June 02, 2021

Matthew Garrett: Producing a trustworthy x86-based Linux appliance

Let's say you're building some form of appliance on top of general purpose x86 hardware. You want to be able to verify the software it's running hasn't been tampered with. What's the best approach with existing technology?

Let's split this into two separate problems. The first is to do as much as we can to ensure that the software can't be modified without our consent[1]. This requires that each component in the boot chain verify that the next component is legitimate. We call the first component in this chain the root of trust, and in the x86 world this is the system firmware[2]. This firmware is responsible for verifying the bootloader, and the easiest way to do this on x86 is to use UEFI Secure Boot. In this setup the firmware contains a set of trusted signing certificates and will only boot executables with a chain of trust to one of these certificates. Switching the system into setup mode from the firmware menu will allow you to remove the existing keys and install new ones.

(Note: You shouldn't use the trusted certificate directly for signing bootloaders - instead, the trusted certificate should be used to sign another certificate and the key for that certificate used to sign your bootloader. This way, if you ever need to revoke the signing certificate, you can simply sign a new one with the trusted parent and push out a revocation update instead of having to provision new keys)

But what do you want to sign? In the general purpose Linux world, we use an intermediate bootloader called Shim to bridge from the Microsoft signing authority to a distribution one. Shim then verifies the signature on grub, and grub in turn verifies the signature on the kernel. This is a large body of code that exists because of the use cases that general purpose distributions need to support - primarily, booting on arbitrary off the shelf hardware, and allowing arbitrary and complicated boot setups. This is unnecessary in the appliance case, where the hardware target can be well defined, where there's no need for interoperability with the Microsoft signing authority, and where the boot configuration can be extremely static.

We can skip all of this complexity using systemd-boot's unified Linux image support. This has the format described here, but the short version is that it's simply a kernel and initramfs linked into a small EFI executable that will run them. Instructions for generating such an image are here, and if you follow them you'll end up with a single static image that can be directly executed by the firmware. Signing this avoids dealing with a whole host of problems associated with relying on shim and grub, but note that you'll be embedding the initramfs as well. Again, this should be fine for appliance use-cases, but you'll need your build system to support building the initramfs at image creation time rather than relying on it being generated on the host.

At this point we have a single image that can be verified by the firmware and will get us to the point of a running kernel and initramfs. Unless you've got enough RAM that you can put your entire workload in the initramfs, you're going to want a filesystem as well, and you're going to want to verify that that filesystem hasn't been tampered with. The easiest approach to this is to use dm-verity, a device-mapper layer that uses a hash tree to verify that the filesystem contents haven't been modified. The kernel needs to know what the root hash is, so this can either be embedded into your initramfs image or into the kernel command line. Either way, it'll end up in the signed boot image, so nobody will be able to tamper with it.
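To illustrate why a single root hash is enough to cover an entire filesystem, here is a simplified binary hash tree. It is only a sketch of the idea: real dm-verity uses its own on-disk layout, packs many hashes into each hash block and mixes in a salt, none of which is modelled here. The block size and the padding rule for odd levels are assumptions of this example.

```python
import hashlib

BLOCK_SIZE = 4096          # assumed data block size for this sketch
ZERO_HASH = bytes(32)      # padding used when a level has an odd length


def root_hash(data: bytes) -> bytes:
    """Return the root of a binary SHA-256 hash tree over fixed-size blocks."""
    # Leaf level: one hash per data block, with the last block zero-padded.
    blocks = [
        data[i:i + BLOCK_SIZE].ljust(BLOCK_SIZE, b"\0")
        for i in range(0, len(data), BLOCK_SIZE)
    ] or [bytes(BLOCK_SIZE)]
    level = [hashlib.sha256(block).digest() for block in blocks]

    # Hash pairs of nodes together until a single root remains.
    while len(level) > 1:
        if len(level) % 2:
            level.append(ZERO_HASH)
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


image = b"A" * (BLOCK_SIZE * 3)
tampered = image[:BLOCK_SIZE * 2] + b"B" + image[BLOCK_SIZE * 2 + 1:]
print(root_hash(image).hex())
print(root_hash(image) == root_hash(tampered))
```

Changing any byte in any block changes that block's leaf hash and therefore every hash on the path up to the root, so the kernel only needs a trusted copy of the root value.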

It's important to note that a dm-verity partition is read-only - the kernel doesn't have the cryptographic secret that would be required to generate a new hash tree if the partition is modified. So if you require the ability to write data or logs anywhere, you'll need to add a new partition for that. If this partition is unencrypted, an attacker with access to the device will be able to put whatever they want on there. You should treat any data you read from there as untrusted, and ensure that it's validated before use (ie, don't just feed it to a random parser written in C and expect that everything's going to be ok). On the other hand, if it's encrypted, remember that you can't just put the encryption key in the boot image - an attacker with access to the device is going to be able to dump that and extract it. You'll probably want to use a TPM-sealed encryption secret, which will be discussed later on.

At this point everything in the boot process is cryptographically verified, and so should be difficult to tamper with. Unfortunately this isn't really sufficient - on x86 systems there's typically no verification of the integrity of the secure boot database. An attacker with physical access to the system could attach a programmer directly to the firmware flash and rewrite the secure boot database to include keys they control. They could then replace the boot image with one that they've signed, and the machine would happily boot code that the attacker controlled. We need to be able to demonstrate that the system booted using the correct secure boot keys, and the only way we can do that is to use the TPM.

I wrote an introduction to TPMs a while back. The important thing to know here is that the TPM contains a set of Platform Configuration Registers that are large enough to contain a cryptographic hash. During boot, each component of the boot process will generate a "measurement" of other security critical components, including the next component to be booted. These measurements are a representation of the data in question - they may simply be a hash of the object being measured, or the hash of a structure containing various pieces of metadata. Each measurement is passed to the TPM, along with the PCR it should be measured into. The TPM takes the new measurement, appends it to the existing value, and then stores the hash of this concatenated data in the PCR. This means that the final PCR value depends not only on the measurement, but also on every previous measurement. Without breaking the hash algorithm, there's no way to set the PCR to an arbitrary value. The hash values and some associated data are stored in a log that's kept in system RAM, which we'll come back to later.
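The extend operation described above can be sketched in a few lines. This models a SHA-256 PCR bank in software purely for illustration; a real TPM performs the operation internally, and the component names below are placeholders.

```python
import hashlib

PCR_SIZE = hashlib.sha256().digest_size  # 32 bytes for a SHA-256 bank


def extend(pcr: bytes, measurement: bytes) -> bytes:
    """New PCR value = hash(old PCR value || measurement)."""
    return hashlib.sha256(pcr + measurement).digest()


def measure_chain(components) -> bytes:
    """Measure each component in order into a PCR that starts as all zeros."""
    pcr = bytes(PCR_SIZE)
    for component in components:
        measurement = hashlib.sha256(component).digest()
        pcr = extend(pcr, measurement)
    return pcr


# Placeholder boot components, standing in for real binaries.
boot_order = [b"firmware", b"bootloader", b"kernel"]
print(measure_chain(boot_order).hex())
print(measure_chain(boot_order) == measure_chain(list(reversed(boot_order))))
```

Because each new value is a hash over the previous one, the same measurements applied in a different order give a different result, and there is no input that resets the register to a chosen value short of breaking the hash.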

Different PCRs store different pieces of information, but the one that's most interesting to us is PCR 7. Its use is documented in the TCG PC Client Platform Firmware Profile (section 3.3.4.8), but the short version is that the firmware will measure the secure boot keys that are used to boot the system. If the secure boot keys are altered (such as by an attacker flashing new ones), the PCR 7 value will change.

What can we do with this? There's a couple of choices. For devices that are online, we can perform remote attestation, a process where the device can provide a signed copy of the PCR values to another system. If the system also provides a copy of the TPM event log, the individual events in the log can be replayed in the same way that the TPM would use to calculate the PCR values, and then compared to the actual PCR values. If they match, that implies that the log values are correct, and we can then analyse individual log entries to make assumptions about system state. If a device has been tampered with, the PCR 7 values and associated log entries won't match the expected values, and we can detect the tampering.
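The log replay step might look roughly like the following. This is a sketch under several assumptions: the `Event` structure is hypothetical and much simpler than the real TCG event log format, every PCR is assumed to start as zeros, and verifying the TPM's signature over the quoted PCR values is left out entirely.

```python
import hashlib
import hmac
from typing import Dict, Iterable, NamedTuple


class Event(NamedTuple):
    pcr_index: int      # which PCR the measurement was extended into
    digest: bytes       # the measurement that was passed to the TPM
    description: str    # human-readable note, not part of the hash


def replay(events: Iterable[Event]) -> Dict[int, bytes]:
    """Recompute PCR values by applying each logged digest in order."""
    pcrs: Dict[int, bytes] = {}
    for event in events:
        current = pcrs.get(event.pcr_index, bytes(32))  # assumed zero start
        pcrs[event.pcr_index] = hashlib.sha256(current + event.digest).digest()
    return pcrs


def log_matches_quote(events: Iterable[Event], quoted: Dict[int, bytes]) -> bool:
    """True if replaying the log reproduces every quoted PCR value."""
    replayed = replay(events)
    return all(
        hmac.compare_digest(replayed.get(index, bytes(32)), value)
        for index, value in quoted.items()
    )
```

Only once the replayed values match the signed PCR values is it reasonable to trust the individual log entries and inspect, for example, which secure boot keys were measured into PCR 7.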

If a device is offline, or if there's a need to permit local verification of the device state, we still have options. First, we can perform remote attestation to a local device. I demonstrated doing this over Bluetooth at LCA back in 2020. Alternatively, we can take advantage of other TPM features. TPMs can be configured to store secrets or keys in a way that renders them inaccessible unless a chosen set of PCRs have specific values. This is used in tpm2-totp, which uses a secret stored in the TPM to generate a TOTP value. If the same secret is enrolled in any standard TOTP app, the value generated by the machine can be compared to the value in the app. If they match, the PCR values the secret was sealed to are unmodified. If they don't, or if no numbers are generated at all, that demonstrates that PCR 7 is no longer the same value, and that the system has been tampered with.
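For reference, the TOTP calculation that both the machine and the app perform is small. The sketch below follows RFC 6238 with HMAC-SHA1 and a 30 second time step, which are the defaults most authenticator apps assume; it does not model the TPM sealing at all, so here the secret is just a byte string held in memory.

```python
import hashlib
import hmac
import struct
import time
from typing import Optional


def totp(secret: bytes, now: Optional[float] = None,
         step: int = 30, digits: int = 6) -> str:
    """Compute an RFC 6238 TOTP code using HMAC-SHA1."""
    timestamp = time.time() if now is None else now
    counter = int(timestamp // step)
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation (RFC 4226): the low nibble of the last byte
    # selects which four bytes of the MAC become the code.
    offset = mac[-1] & 0x0F
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


# Secret taken from the RFC 6238 test vectors, not a real enrolment.
shared_secret = b"12345678901234567890"
print(totp(shared_secret))
```

On a real system the machine can only run this calculation if the TPM agrees to unseal the secret, which is what ties a matching code to the PCR values.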

Unfortunately, TOTP requires that both sides have possession of the same secret. This is fine when a user is making that association themselves, but works less well if you need some way to ship the secret on a machine and then separately ship the secret to a user. If the user can simply download the secret via some API, so can an attacker. If an attacker has the secret, they can modify the secure boot database and re-seal the secret to the new PCR 7 value. That means having to add some form of authentication, along with a strong binding of machine serial number to a user (in order to avoid someone with valid credentials simply downloading all the secrets).

Instead, we probably want some mechanism that uses asymmetric cryptography. A keypair can be generated on the TPM, which will refuse to release an unencrypted copy of the private key. The public key, however, can be exported and stored. If it's acceptable for a verification app to connect to the internet then the public key can simply be obtained that way - if not, a certificate can be issued to the key, and this exposed to the verifier via a QR code. The app then verifies that the certificate is signed by the vendor, and if so extracts the public key from that. The private key can have an associated policy that only permits its use when PCR 7 has an appropriate value, so the app then generates a nonce and asks the user to type that into the device. The device generates a signature over that nonce and displays that as a QR code. The app verifies the signature matches, and can then assert that PCR 7 has the expected value.

Once we can assert that PCR 7 has the expected value, we can assert that the system booted something signed by us and thus infer that the rest of the boot chain is also secure. But this is still dependent on the TPM obtaining trustworthy information, and unfortunately the bus that the TPM sits on isn't really terribly secure (TPM Genie is an example of an interposer for i2c-connected TPMs, but there's no reason an LPC one can't be constructed to attack the sort usually used on PCs). TPMs do support encrypted communication channels, but bootstrapping those isn't straightforward without firmware support. The easiest way around this is to make use of a firmware-based TPM, where the TPM is implemented in software running on an ancillary controller. Intel's solution is part of their Platform Trust Technology and runs on the Management Engine, AMD run it on the Platform Security Processor. In both cases it's not terribly feasible to intercept the communications, so we avoid this attack. The downside is that we're then placing more trust in components that are running much more code than a TPM would and which have a correspondingly larger attack surface. Which is preferable is going to depend on your threat model.

Most of this should be achievable using Yocto, which now has support for dm-verity built in. It's almost certainly going to be easier using this than trying to base on top of a general purpose distribution. I'd love to see this become a largely push button receive secure image process, so might take a go at that if I have some free time in the near future.

[1] Obviously technologies that can be used to ensure nobody other than me is able to modify the software on devices I own can also be used to ensure that nobody other than the manufacturer is able to modify the software on devices that they sell to third parties. There's no real technological solution to this problem, but we shouldn't allow the fact that a technology can be used in ways that are hostile to user freedom to cause us to reject that technology outright.
[2] This is slightly complicated due to the interactions with the Management Engine (on Intel) or the Platform Security Processor (on AMD). Here's a good writeup on the Intel side of things.


June 02, 2021 04:36 PM

May 28, 2021

Brendan Gregg: Moving my US tech job to Australia

I've moved from the San Francisco Bay Area to Sydney, Australia, where I will continue the best job so far of my career: Performance engineering at Netflix. I'm grateful for the support of Netflix engineering management, Netflix HRBPs, and others for helping to make this happen. While my move is among the first from the Linux cloud teams, Netflix has had staff in Australia for years (for content, marketing, and the FreeBSD OCA).

It's been a privilege and an adventure to work in Silicon Valley with so many amazing people. But I'm now excited about my new adventure: Doing an advanced tech role remotely from Australia. I know others who have also left the Bay Area or are planning to. Back in 2015 we'd have BPF (iovisor) meetups in Santa Clara and most contributors would be there in person, with some having travelled. Now we're more scattered, either to other US cities or worldwide. As another indicator of tech moving elsewhere, last year brought the [headline]: "Bay Area's share of VC deals predicted to fall below 20% for first time in 2021."

Day to day things won't be much different. I'm still online, doing the same work, answering the same emails. And many of us expect (when travel is possible) to make regular visits to the US for company-wide meetings and events. I think some coworkers will still see me occasionally in the US office and won't even realize I've moved.

Why Australia?

When I told people I was moving to Australia they'd guess why: "Is it because of X? Or Y? ... or Z?" Well, the answer is yes, all of the above. I began discussing Australian tech roles with different companies in Jan 2020. The pandemic then added another reason to move. Both the US and Australia have their pros and cons, and I have many favorite places and people in both (sorry I didn't come say goodbye: We'll meet again). But in the end I'm a proud Australian and I do prefer Australia for various reasons, many of which Deirdré wrote about in [Why move to Australia?]. Additional reasons for me included visa uncertainty (and the abuse it leads to), voting rights, and complex international taxation. (Disclaimer: Netflix is an exception, as they have been great with visa workers including myself).

Another reason is that the tech market became stronger in Australia. I moved to the US in 2006 as there were many more opportunities there, especially in kernel engineering and performance. Now, in 2021, Australia has a thriving tech market. Sydney has AWS and Google offices and even a small Netflix office, just to name a few. There is also a wider variety of roles available. If you want to do kernel engineering work you no longer need to move to California to work for Sun Microsystems in the MPK17 building. You can work on Linux anywhere.

Linux is Already Remote

Linux has been described as the world's most successful open source project, and it's all engineers working remotely. There's no Linux kernel headquarters where all the engineers sit in an open office layout, typing furiously then dashing for the break room coffee during kernel builds, and where maintainers can yell across the room at someone for their bad patch (when it's Linus yelling, everyone takes off their headphones to listen). That doesn't happen. Engineers are remote, and may only meet once or twice a year at Linux kernel conferences. And it's worked very well for years. Another example of remote work I've already done is book writing. Last year I published [Systems Performance 2nd Edition], which I wrote from my home office with help from remote contributors. The entire project was run via emails, a Google drive, and Google docs, and was delivered to the publisher on time.

Making it Work

While tech workers are well suited for remote work (savvy with communications technologies) there are benefits with office work, and I don't think remote work is for everyone. (One benefit I'll miss is playing in the Netflix cricket team.) In the future I'd expect hybrid teams, where the remote workers visit the office on a regular cadence (e.g., once a quarter) for meetings. This is a model that's already been successfully used by some teams, including at Netflix.

Update: I was asked on Twitter about my work hours. I set my own schedule where I start work around 7am, which gives between 3 and 5 hours overlap with California time (depending on daylight savings). About once a month I'll have a 4am meeting. Back when I did [SRE oncall] for Netflix I'd have more wakeups at unpredictable times, so this feels easier to manage. (I also had prior jobs in the Bay Area where I'd be in the office most days past midnight, so compared to that this is like a health retreat!) As more people move to other timezones I think this will improve further. Some meetings may move to an asynchronous format, and others may be run twice for world coverage, at 9am and 4pm California time.
To work remote I think you have to really want it and be willing to put in extra effort, including doing the occasional early meeting. Personally, I use a stopwatch to help me stay productive: I pause it whenever I have an interruption, and measure how many hours of uninterrupted work I get done each day, log it, and then plot it on graphs to see the trends. Yes, I'm performance analyzing myself. It's been a slow process, but I've been figuring out how to become more productive each day. It's really satisfying to finish a full day's work and then realize I'm no longer in the Bay Area, but instead have a two minute walk to the beach. It's just one of many reasons to put in that extra effort. [Why move to Australia?]: http://www.beginningwithi.com/2020/12/01/why-move-to-australia/ [headline]: https://www.bizjournals.com/sanjose/news/2020/12/14/bay-area-vc-deal-share-predicted-to-fall-below-20.html [Systems Performance 2nd Edition]: /systems-performance-2nd-edition-book.html [SRE oncall]: /blog/2016-05-04/srecon2016-perf-checklists-for-sres.html

May 28, 2021 02:00 PM

May 23, 2021

David Sterba: Authenticated hashes for btrfs (part 1)

There was a request to provide authenticated hashes in btrfs, natively as one of the btrfs checksum algorithms. Sounds fun but there’s always more to it, even if this sounds easy to implement.

Johannes T., at that time at SUSE, sent a patchset adding support for SHA256 [1], along with a Labs conference paper summarizing existing solutions and giving details about the proposed implementation and use cases.

The first version of the patchset got some feedback: issues were found and some ideas were suggested. Things have stalled a bit, but the feature is still very interesting and really not hard to implement. The support for additional checksums has provided enough support code to just plug in the new algorithm and enhance the existing interfaces to provide the key bytes.

So far I’ve assumed you know what an authenticated hash means, but for clarity and in simple terms: it is a checksum that depends on a key. The main point is that it’s impossible to generate the same checksum for given data without knowing the key, where impossible is used in the cryptographic-strength sense: there’s an almost zero probability of doing that by chance, and a brute force attack is not practical.
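To make "a checksum that depends on a key" concrete, here is a minimal Python sketch. It uses HMAC-SHA256 from the standard library purely as an illustration; the key, the block contents and the function name are made up for the example, and this is not meant to describe the exact construction btrfs would use.

```python
import hashlib
import hmac


def keyed_checksum(key: bytes, data: bytes) -> bytes:
    """Return an HMAC-SHA256 tag: a checksum that depends on both key and data."""
    return hmac.new(key, data, hashlib.sha256).digest()


# Hypothetical key and block contents, for illustration only.
block = b"filesystem block contents"
tag = keyed_checksum(b"secret key", block)

# The same key and data reproduce the tag ...
same = hmac.compare_digest(tag, keyed_checksum(b"secret key", block))
# ... while a different key or modified data give a different tag.
other_key = keyed_checksum(b"other key", block)
modified = keyed_checksum(b"secret key", block + b"!")

# A plain (unkeyed) checksum can be recomputed by anyone who alters the data.
plain = hashlib.sha256(block).digest()
```

The difference from the plain hash is only the key argument: whoever changes the block can recompute `plain`, but not `tag`.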

Auth hash, fsverity

A notable existing solution is fsverity, which works in a read-only fashion: the key is securely hidden and used only to verify that data read from the media haven’t been tampered with. A typical use case is an OS image in your phone. But that’s not all. OS images appear in all sorts of boxed devices and IoT. Nowadays, with the explosion of edge computing, assuring the integrity of end devices is a fundamental requirement.

Where btrfs can add some value is read AND write support with an authenticated hash. This brings questions around key handling, and not everybody is OK with a device that could potentially store malicious/invalid data with a proper authenticated checksum. So yeah, use something else, this is not your use case, or maybe there’s another way to make sure the key won’t be compromised easily. This is beyond the scope of what a filesystem can do, though.

An example use case of a writable filesystem with an authenticated hash: detecting outside tampering with on-disk data, eg. while the filesystem was unmounted. Filesystem metadata formats are public and interesting data can be located by patterns on the device, so with a plain checksum, changing a few bytes and updating the checksum(s) is not hard.

There’s one issue that was brought up, and I think it’s not hard to observe anyway: there’s a total dependency on the key to verify even the basic integrity of the data. Ie. without the key it’s not possible to say whether the data are valid, as it would be if a basic checksum was used. Such a check might still be useful for read-only access to the filesystem, but the absence of the key makes it impossible.

Existing implementations

As was noted in the LWN discussion [2], ZFS uses two checksums: one is authenticated and one is not. I point you to the comment stating that, as I was not able to navigate far enough in the ZFS code to verify the claim, but the idea is clear. It’s said that the authenticated hash is eg. SHA512 and the plain hash is SHA256, split half/half in the bytes available for the checksum. Each checksum is simply truncated to its first 16 bytes (128 bits) and the two are stored consecutively. As both hashes are cryptographically strong, the first 16 bytes should provide enough strength despite the truncation.

When I was thinking about that, I had a different idea of how to do it. Not that copying the scheme would not work for btrfs: anything that the Linux kernel crypto API provides is usable, so the same is achievable. I’m not judging the decisions about what hashes to use or how to do the split; it works and I don’t see a problem in the strength. Where I see potential for improvement is performance, without sacrificing strength too much. Trade-offs.

A CPU or software implementation of SHA256 is comparatively slower than checksums with hardware aids (like CRC32C instructions) or hashes designed to perform well on CPUs. That was the topic of the previous round of new hashes, so we now compete against BLAKE2b and XXHASH. There are CPUs with native instructions to calculate SHA256 and the performance improvement is noticeable, orders of magnitude better. But the support is not as widespread as eg. for CRC32C. Anyway, there’s always choice and hardware improves over time. The number of hashes may seem to explode but as long as it’s manageable inside the filesystem, we take it. And a coffee please.

Secondary hash

The checksum scheme proposed is to use a cryptographic hash and a non-cryptographic one. Given the current support for SHA256 and BLAKE2b, the cryptographic hash is given. There are two of them and that’s fine. I’m not drawing an exact parallel with ZFS; the common point for the cryptographic hash is that there are limited options and the calculation is expensive by design. This is where the non-cryptographic hash can be debated. Also I want to call it the secondary hash, with the obvious meaning that it’s not too important by default and comes second when the authenticated hash is available.

We have CRC32C and XXHASH to choose from. Note that there are already two hashes from the start so supporting both secondary hashes would double the number of final combinations. We’ve added XXHASH to enhance the checksum collision space from 32 bits to 64 bits. What I propose is to use just XXHASH as the secondary hash, resulting in two new hashes for the authenticated and secondary hash. I haven’t found a good reason to also include CRC32C.

Another design point was where to do the split and truncation. As the XXHASH has fixed length, this could be defined as 192 bits for the cryptographic hash and 64 bits for full XXHASH.

Here we are, we could have authenticated SHA256 accompanied by XXHASH, or the same with BLAKE2b. The checksum split also splits the decision tree of what to do when the checksum partially matches. For a single checksum it’s a simple yes/no decision. The partial match is the interesting case:

This leads to 4 outcomes of the checksum verification, compared to 2. A boolean type can simply represent the yes/no outcome but for two hashes it’s not that easy. It depends on the context, though I think it should still be straightforward to decide what to do in the code. Nevertheless, this has to be updated in all calls to checksum verification and has to reflect the key availability, eg. in cases where the data are auto-repaired during scrub or when there’s a copy.
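The four outcomes can be sketched in Python as follows. This is only an illustration of the 192-bit plus 64-bit split under loud assumptions: HMAC-SHA256 stands in for the authenticated hash, an unkeyed 64-bit BLAKE2b digest stands in for XXHASH (which is not in the Python standard library), and the names `Verdict`, `make_checksum` and `verify` are hypothetical. It does not say what the filesystem should do in each case.

```python
import hashlib
import hmac
from enum import Enum

AUTH_BYTES = 24       # 192 bits kept from the authenticated hash
SECONDARY_BYTES = 8   # 64 bits for the secondary hash


class Verdict(Enum):
    BOTH_MATCH = "both match"
    AUTH_ONLY = "authenticated matches, secondary does not"
    SECONDARY_ONLY = "secondary matches, authenticated does not"
    NEITHER = "neither matches"


def secondary_hash(data: bytes) -> bytes:
    # Stand-in for XXHASH: any unkeyed 64-bit hash serves for the sketch.
    return hashlib.blake2b(data, digest_size=SECONDARY_BYTES).digest()


def make_checksum(key: bytes, data: bytes) -> bytes:
    # Truncated authenticated hash followed by the full secondary hash.
    auth = hmac.new(key, data, hashlib.sha256).digest()[:AUTH_BYTES]
    return auth + secondary_hash(data)


def verify(key: bytes, data: bytes, stored: bytes) -> Verdict:
    expected = make_checksum(key, data)
    auth_ok = hmac.compare_digest(expected[:AUTH_BYTES], stored[:AUTH_BYTES])
    sec_ok = expected[AUTH_BYTES:] == stored[AUTH_BYTES:]
    if auth_ok and sec_ok:
        return Verdict.BOTH_MATCH
    if auth_ok:
        return Verdict.AUTH_ONLY
    if sec_ok:
        return Verdict.SECONDARY_ONLY
    return Verdict.NEITHER


data = b"metadata block"
stored = make_checksum(b"right key", data)
print(verify(b"right key", data, stored).value)
print(verify(b"wrong key", data, stored).value)
```

With the wrong key only the secondary half matches, which is the partial-match case a single boolean cannot express.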

Performance considerations

The performance comparison should now be clear: we have the potentially slow SHA256 but fast XXHASH, for each metadata and data block, vs slow SHA512 and slow SHA256. I reckon it’s possible to also select a SHA256/SHA256 split in ZFS, but that can’t beat SHA256/XXHASH.

The key availability seems to be the key point in all that, puns notwithstanding. The initial implementation assumed, for simplicity, that the raw key bytes are provided to the kernel and to the userspace utilities. This is maybe OK for a prototype, but under no circumstances can it survive until a final release. There’s key management wired deep into the Linux kernel, there’s a library for the whole API and command line tools. We ought to use that. Pass the key by name, not the raw bytes.

Key management has its own pitfalls and surprises (key owned vs possessed), but let’s assume that there’s a standardized way to obtain the key bytes from the key name. In the kernel it’s “READ_USER_KEY_BYTES”, in userspace it’s either keyctl_read from libkeyutils or a raw syscall to keyctl. Problem solved, on the low level. But, well, don’t try that over ssh.

Accessing a btrfs image for various reasons (check, image, restore) now needs the key to verify data or even the key itself to perform modifications (check + repair). The command line interface has to be extended for all commands that interact with the filesystem offline, ie. the image and not the mounted filesystem.

This results in a global option, like btrfs --auth-key 1234 inspect-internal dump-tree, as opposed to a per-command option like btrfs inspect-internal dump-tree --auth-key 1234. This is not finalized, but a global option is now the preferred choice.

Final words

I have a prototype that does not work in all cases but at least passes mkfs and mount. The number of checksum verification cases grew beyond what I was able to fix by the time of writing this. I think this has enough material on its own, so I’m pushing it out as part 1. There are open questions regarding the command line interface, and also some kind of proof or discussion regarding attacks. Stay tuned.

References:

May 23, 2021 10:00 PM

May 22, 2021

Brendan Gregg: What is Observability

It's a made-up computer word that my word processor decorates with a wiggly red you-can't-spell line. At least it did until I clicked "Add to Dictionary" (it got too annoying as I was writing a book on computer observability).

Observability: The ability to observe.
Observe-ability. Observability. In computer engineering we use it to describe the tools, data sources, and methods for understanding (observing!) how a technology is operating. We don't use the _real_ word "observable" since that implies the wrong thing. Imagine "observable metrics": Are there metrics that _aren't_ observable?

Using observability in sentences:

- What observability tools are installed? (Means: What tools exist that only read state?)
- What observability does that database have? (Means: What metrics and logs does it have?)
- Let me try some observability first. (Means: Let me look at the system without changing it.)

Wait, aren't all performance tools observability tools? No. _Experimental_ tools change the state of the system to understand it. For example, benchmarks. As an analogy, a car's dashboard is a collection of observability tools that let you understand how the car is operating (speed, rpm, temperature). A car's 0-60 mph time is an _experiment_.

When I was a performance consultant I'd show up to random companies who wanted me to fix their computer performance issues. If they trusted me with a login to their production servers, I could help them a lot quicker. To get that trust I knew which tools looked but didn't touch: Which were observability tools and which were experimental tools. "I'll start with observability tools only" is something I'd say at the start of every engagement.

Note that observability tools aren't completely harmless: Their execution consumes resources, usually negligible, but in some cases it's enough to perturb the target of study. This is the "observer effect."

Another use of the term observability is as a reminder to switch between tool types, and not to get stuck on one. A colleague (Roch Bourbonnais from memory) once told me:
"You have two hands. Observability and experimentation."
It stuck with me as it also makes the point that when you're only using one type to solve a performance problem __you're working one-handed__.

May 22, 2021 02:00 PM

May 20, 2021

Linux Plumbers Conference: Scheduler Microconference Accepted into 2021 Linux Plumbers Conference

We are pleased to announce that the Scheduler Microconference has been accepted into the 2021 Linux Plumbers Conference! The scheduler is an important part of the Linux kernel, deciding which process gets to run when, where, and for how long. With different topologies and workloads, it is no easy task to give the user the best experience possible. Schedulers are one of the most discussed topics on the Linux Kernel Mailing List, but many of these topics need further discussion in a conference format. Indeed, the Scheduler microconference has helped many topics make progress.

At last year’s meet up, the Scheduler microconference achieved the following results:

Not only were enhancements made, but the meetup also helped prove that some topics were not feasible and we do not need to spend more time on them.

This year’s topics to be discussed include:

Come and join us in the discussion of controlling what tasks get to run on your machine and when. We hope to see you there!

May 20, 2021 01:23 AM

May 14, 2021

Linux Plumbers Conference: Confidential Computing Microconference Accepted into 2021 Linux Plumbers Conference

We are pleased to announce that the Confidential Computing Microconference has been accepted into the 2021 Linux Plumbers Conference! In this microconference we will discuss how Linux can support encryption technologies which protect data during processing on the CPU. Examples are AMD SEV, Intel TDX, IBM Secure Execution for s390x and ARM Secure Virtualization. These are recent additions compared to technologies which protect data while in transit (SSL, VPNs) and at rest (disk encryption).

The Linux kernel recently gained support for SEV-ES and support for Intel TDX is upcoming. AMD SEV will be further enhanced by Secure Nested Paging (SNP). Support for these technologies requires intrusive changes to the Linux kernel for memory integrity and secure interrupt delivery to virtual machines. Designing these changes in a way that works for different confidential computing technologies is one goal of this microconference.

Topics to be included, but not limited to, are:

Please come and join us in the discussion for solutions to the open problems for supporting these technologies.

We hope to see you there!

May 14, 2021 01:09 AM

May 13, 2021

James Bottomley: The Community Corrosive Effects of CLAs

As one of the kernel DCO advocates, I’ve written many times about using the DCO instead of a CLA for copyright and patent contributions under open source licences. In spite of my obvious biases, I’ll try to give a factual overview of the cases for the DCO and CLA system.

First, it should be noted that both the DCO and any CLA are types of Contribution Agreements (a set of terms by which contributors are agreeing to be bound). It should also be acknowledged that the DCO is a far more recent invention than CLAs. The DCO was first pioneered by the Linux kernel in 2004 (having been designed by Diane Peters, then of OSDL) and was subsequently adopted by a broad range of open source projects. However, in legal terms, the DCO is much less well understood than a standard CLA type agreement between the contributor and some entity, which is largely the reason you find a number of lawyers still advocating for the use of CLAs in various open source projects: because they’d like to stick with something that has more miles on it, or because they’re invested in the older model of community, largely pioneered by Apache.

The biggest problem today is that the operation of most CLAs is asymmetrical: they take from the contributor more rights than the open source code actually needs, so let’s begin with a summary of each type of Contribution Agreement.

DCO

The DCO is a legal representation by the contributor to everyone who might ever use the code. It requires no second party on the other side to countersign it or act as the receiving entity, so it exactly mirrors the inbound=outbound licensing model first coined by Richard Fontana. The DCO explicitly grants to all downstream recipients only the exact rights the Open Source licence requires (and nothing more). In this sense it is fully symmetrical: the rights granted by the contributor are the same as the rights received by the downstream (i.e. inbound=outbound). Every contributor under the DCO retains their own copyright (or their company does if the contribution is a work for hire).

The main alleged disadvantage of the DCO is that it encourages distributed ownership and makes it very hard to change the licence of the project because each contributor has only granted the rights necessary for the current licence, so if the new one requires more or different rights, all the current contributors have to re-grant those new or different rights (which can be a huge number of people for large long running projects).

Since the DCO is a representation to everyone and requires no receiving entity, the project collecting the code doesn’t require any formal legal entity, like a foundation, to operate and thus the DCO gives rise to a truly lightweight structure for any project. The other big advantage of the DCO is that all of the representations are tracked by the Signed-off-by: tag on the commit, which goes in the git repository of the project code, so anyone with a clone of the repository has complete access to information about who changed what and where their DCO signoff is.

CLA

All current Open Source CLAs are structured as agreements between the contributor and a second party. Most often, the second party is a Foundation or a Corporation, making them quite heavyweight in terms of setup, admin and overhead. Every current CLA that I know about takes more rights from the contributor than the open source licence actually requires. For instance the Apache Individual CLA grants the right to copy, derive and sublicence to the Apache foundation who then relicence the contribution to the project usually under the Apache 2.0 licence. This is a classic asymmetric grant because the Apache foundation receives far more rights in the contribution than it grants to the downstream recipients. The FSF CLA is even more extreme because they require assignment of the copyright (so they will own the code and you, the author, will have no further right or interest in it except possibly for minimal moral rights to be named the author).

Apart from the asymmetric grant, which places the receiving entity in a privileged position in the ecosystem, the other problem with CLAs is that they’re legal agreements, so they require a lawyer to prepare them, a mechanism to ensure people sign them and a mechanism to keep all the signatures … sometimes this can be in filing cabinets if paper instead of electronic copies are used. This repository of agreements then isn’t available to anyone except the tracking entity, meaning that if someone needs to know if John Doe signed a CLA, they have to reach out and ask. In some cases the actual filing cabinets got lost as projects changed offices, so some CLA based projects don’t actually have complete records of all their CLAs.

CLAs Catalyse Community Corrosion

The main driver of community corrosion is the temptation to abuse a position of power (this temptation becomes irresistible over time because, as Baron Acton put it, “all power corrupts”). Since CLAs by their nature force a power imbalance between the contributor and the receiving entity, they act as focal points for this corrosion. Communities are very sensitive to what they see as their work being misused, so the fastest way to lose community trust is to abuse the power the CLA gave you to go against the community itself.

There are numerous examples of this in the Corporate World, the most topical one today being the Elastic change from Apache 2.0 to SSPL to better monetize the code the community contributed freely to.

One might think the solution to this is never to sign a CLA if the holder of the power imbalance is a corporation … i.e. only do it if the other entity is a not for profit foundation. But ask yourself, how much do you trust the people running the foundation and do its bylaws guarantee your rights in the code? Relicensing for commercial gain isn’t the only way the community could be abused, so how sure are you that the power you’re handing to a foundation (which, after all, is an entity governed by some type of board, all of whom likely have political agendas) won’t be abused? To see some examples of foundations not being in tune with their community, one only has to look at the FSF and Richard Stallman. Based on all of this I conclude, like Drew DeVault, that you should never sign a CLA under any circumstances.

The bottom line is that if you do sign a CLA some decision will happen at some point that you don’t agree with but which you already gave away the power to block because of the rights imbalance inherent in the CLA you signed. Inevitably this decision will cause you to feel betrayed because your views are being ignored and as a contributor you feel you should be heard, so you’ll sour on the project. This is the community corrosion catalyst buried deep inside all CLAs.

One final thing to note is that it is possible to craft a CLA that only takes the rights it needs, in the same way the DCO does; it’s just that no project I know has ever done this. However, even if this experiment were attempted, you would still need a recipient entity, plus all the infrastructure to do signing and track the signed agreements, so you’d still be better off using a lightweight DCO process.

Conclusion: For Community Small is Beautiful

The way to avoid the community corrosion problem is to do everything minimally: use a DCO to take only the rights the downstream requires and to avoid all the heavyweight recipient, signing and tracking infrastructure. Don’t set up a foundation unless you absolutely need an entity, say to handle cash, and if you must set one up, never give it any control over the project (like appointing a change control or architecture control board for instance). Everything you set up should be as small as possible and clearly serve the project and its community. Above all, don’t use a CLA because it will cause a rights imbalance that corrodes your community and it will require a large amount of overhead to run.

May 13, 2021 10:51 PM

May 11, 2021

Paul E. McKenney: Stupid RCU Tricks: Which tests do I run???

The rcutorture test suite has quite a few options, including locktorture, rcuscale, refscale, and scftorture in addition to rcutorture itself. These tests can be run with the assistance of either KASAN or KCSAN. Given that RCU contains kernel modules, there is the occasional need for an allmodconfig build. Testing of kvfree_rcu() is currently a special case of rcuscale. Some care is required to adapt some of the tests to the test system, for example, based on the number of available CPUs. Both rcuscale and refscale have varying numbers of primitives that they test, so how to keep up with the inevitable additions and deletions? How much time should be devoted to each of locktorture, scftorture, and rcutorture, which, in contrast with rcuscale and refscale, do not have natural accuracy-driven durations? And finally, if you do run all of these things, you end up with about 100 gigabytes of test artifacts scattered across more than 50 date-stamped directories in tools/testing/selftests/rcutorture/res.

Back in the old days, I kept mental track of the -rcu tree and ran the tests appropriate to whatever was queued there. This strategy broke down in late 2020 due to family health issues (everyone is now fine, thank you!), resulting in a couple of embarrassing escapes. Some additional automation was clearly required.

This automation took the form of a new torture.sh script. This is not intended to be the main testing mechanism, but instead an overnight touch-test of the full rcutorture suite that is run occasionally, for example, just after accepting a large patch series or just before sending a pull request.

By default, torture.sh runs everything both with and without KASAN, and with a 10-minute “duration base”. The translation from “duration base” into wall-clock time is a bit indirect. The fewer CPUs you have, the more tests you run, and the longer it takes your system to build a kernel, the more wall-clock time that “10 minutes” will turn into. On my 16-hardware-thread laptop, running everything (including the non-default KCSAN runs) turns that 10-minute duration base into about 11 hours. Increasing the duration base by five minutes increases the total wall-clock time by about 100 minutes.
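As a rough aid for planning a run, the two figures above can be turned into a back-of-the-envelope estimate. This Python sketch assumes wall-clock time grows linearly with the duration base, which is only an extrapolation from the numbers quoted for one 16-hardware-thread laptop; the function name is made up and other systems will differ.

```python
def estimated_wall_clock_minutes(duration_base: int) -> int:
    """Linear extrapolation from the two laptop figures quoted in the text."""
    baseline_base = 10         # default duration base, in minutes
    baseline_wall = 11 * 60    # about 11 hours for a full run at the default
    per_minute = 100 / 5       # about 100 extra minutes per 5 minutes of base
    return round(baseline_wall + (duration_base - baseline_base) * per_minute)


for base in (10, 15, 20):
    print(base, estimated_wall_clock_minutes(base))
```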

This is therefore not a test to be integrated into a per-commit CI system. However, manually selecting specific tests for the most recent RCU-related commit is far easier than keeping the entire -rcu stack in one's head. And torture.sh assists with this by providing sets of --configs- and --do- parameters.

The --configs- parameters are as follows:


  1. --configs-rcutorture.
  2. --configs-locktorture.
  3. --configs-scftorture.
These arguments are passed to the --configs argument of kvm.sh for the --torture rcu, --torture lock, and --torture scf cases, respectively. By default, --configs CFLIST is passed. You may accumulate a long list via multiple --configs- arguments, or you can just as easily pass a long quoted list of scenarios through a single --configs- argument.

The --do- parameters are as follows:

  1. --do-all, which enables everything, including non-default options such as KCSAN.
  2. --do-allmodconfig, which does a single allmodconfig kernel build without running anything, and without either KASAN or KCSAN.
  3. --do-clocksourcewd, which does a short test of the clocksource watchdog, verifying that it can tell the difference between delay-based skew and clock-based skew.
  4. --do-kasan, which enables KASAN on everything except --do-allmodconfig.
  5. --do-kcsan, which enables KCSAN on everything except --do-allmodconfig.
  6. --do-kvfree, which runs a special rcuscale test of the kvfree_rcu() primitive.
  7. --do-locktorture, which enables a set of locktorture runs.
  8. --do-none, which disables everything. Yes, you can give a long series of --do-all and --do-none arguments if you really want to, but the usual approach is to follow --do-none with the lists of tests you want to enable, for example, --do-none --do-clocksourcewd will test only the clocksource watchdog, and do so in but a few minutes.
  9. --do-rcuscale, which enables rcuscale update-side performance tests, adapted to the number of CPUs on your system.
  10. --do-rcutorture, which enables rcutorture stress tests.
  11. --do-refscale, which enables refscale read-side performance tests, adapted to the number of CPUs on your system.
  12. --do-scftorture, which enables scftorture stress tests for smp_call_function() and friends, adapted to the number of CPUs on your system.
Each of these --do- parameters has a corresponding --do-no- parameter, with the exception of --do-all and --do-none, each of which is the other's --do-no- parameter. This allows all-but runs, for example, --do-all --do-no-rcutorture would run everything (even KCSAN), but none of the rcutorture runs.

As of early 2021, KCSAN is still a bit picky about compiler versions, so the --kcsan-kmake-arg parameter allows you to specify arguments to the --kmake-arg argument to kvm.sh. For example, right now, I use --kcsan-kmake-arg "CC=clang-11".

As noted earlier, both rcuscale and refscale can have tests added and removed over time. The torture.sh script deals with this by doing a grep through the rcuscale.c and refscale.c source code, respectively, and running all of the tests that it finds.

The --duration argument specifies the duration base, which, as noted earlier, defaults to 10 minutes. This duration base is apportioned across the kvm.sh script's --duration parameter, with 70% for rcutorture, 10% for locktorture, and 20% for scftorture. So if you specify --duration 20 to torture.sh, the rcutorture kvm.sh runs will specify --duration 14, the locktorture kvm.sh runs will specify --duration 2, and the scftorture kvm.sh runs will specify --duration 4.
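The apportioning described above is simple percentage arithmetic, sketched here in Python. The function name is made up, and rounding down to whole minutes is an assumption for the illustration; the percentages and the --duration 20 example come from the text, and the actual shell script may round differently.

```python
def apportion(duration_base: int) -> dict:
    """Split a torture.sh duration base (minutes) across the kvm.sh runs."""
    shares = {"rcutorture": 70, "locktorture": 10, "scftorture": 20}  # percent
    # Integer division is an assumption about how fractions are handled.
    return {name: duration_base * pct // 100 for name, pct in shares.items()}


print(apportion(20))  # {'rcutorture': 14, 'locktorture': 2, 'scftorture': 4}
```

With the default 10-minute duration base the same arithmetic gives 7, 1, and 2 minutes.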

The 100GB full run is addressed at least partially by compressing KASAN vmlinux files, which gains roughly a factor of two overall, courtesy of the 1GB size of each such file. Normally, torture.sh uses all available CPUs to do the compression, but you can restrict it using the --compress-kasan-vmlinux parameter. At the extreme, --compress-kasan-vmlinux 0 will disable compression entirely, which can be an attractive option given that compressing takes about an hour of wall-clock time on my 16-CPU laptop.

Finally, torture.sh places all of its output under a date-stamped directory suffixed with -torture, for example, tools/testing/selftests/rcutorture/res/2021.05.03-20.10.12-torture. This allows bulky torture.sh directories to be more aggressively cleaned up when disks start getting full.

Taking all of this together, torture.sh provides a very useful overnight “acceptance test” for RCU.

May 11, 2021 10:30 PM

May 08, 2021

Brendan Gregg: Poor Disk Performance

People often tell me they don't understand performance tool output because they can't tell what's "good" or "bad." It can be hard as performance is subjective. What's good for one user may be bad for another. There are also cases where I can't tell either: The tools only provide clues for further analysis. I recently encountered terrible disk performance and thought it'd be useful to collect Linux tool screenshots and share them for reference. E.g., iostat(1):

$ iostat -xz 10
[...]
Device      r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
nvme0n1    4.40    6.00     42.00     43.20     0.00     4.30   0.00  41.75    6.45    0.80   0.03     9.55     7.20   0.15   0.16
dm-0       4.40   10.30     42.00     43.20     0.00     0.00   0.00   0.00    6.55    0.47   0.03     9.55     4.19   0.54   0.80
dm-1       4.40    9.80     42.00     43.20     0.00     0.00   0.00   0.00    6.55    0.49   0.03     9.55     4.41   0.56   0.80
sdb        4.50    0.00    576.00      0.00     0.00     0.00   0.00   0.00  434.31    0.00   1.98   128.00     0.00 222.22 100.00
It's the sdb disk and I'm first looking at the r_await column to see the average time in milliseconds for reads. An average of 434 ms is awful, and a small queue size (aqu-sz) indicates it's a problem with the disk and not the workload applied. I want to see distributions and event logs. But first, about this disk...
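As a cross-check on how these columns relate, here is a small Python sketch that re-derives two of them from the sdb line above. The header and values are copied from the iostat output; the variable names are mine, and using Little's law (requests in flight is roughly arrival rate times time in system) to sanity-check aqu-sz is my own illustration rather than anything iostat documents.

```python
HEADER = ("Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await "
          "w_await aqu-sz rareq-sz wareq-sz svctm %util").split()
SDB = ("sdb 4.50 0.00 576.00 0.00 0.00 0.00 0.00 0.00 434.31 "
       "0.00 1.98 128.00 0.00 222.22 100.00").split()

# Map each column name to its numeric value for the sdb row.
row = dict(zip(HEADER[1:], map(float, SDB[1:])))

# Average read size: throughput divided by read rate, matching rareq-sz.
avg_read_kb = row["rkB/s"] / row["r/s"]

# Little's law: reads/s times average read latency in seconds,
# which should land close to the reported aqu-sz.
in_flight = row["r/s"] * row["r_await"] / 1000

print(f"avg read size: {avg_read_kb:.1f} KB (rareq-sz {row['rareq-sz']})")
print(f"estimated in flight: {in_flight:.2f} (aqu-sz {row['aqu-sz']})")
```

Only about two requests are outstanding on average while each takes over 400 ms, which is consistent with the reading that the disk, not the workload, is the problem.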
See the dust on this disk? ## Flying height Were you ever taught in computer science that the size of a dust particle dwarfs the distance between the disk head and the platter? Something like:
It's called "[flying height]" or "fly height," and (from that reference) was about 5 nanometers for 2011 drives. Particles of dust can be 1000x bigger. The heads "float" on a film of air, and this is sometimes described as "air lubrication." To quote from an article about hard drive [air filters]: "some hard drives are not rated to exceed 7,000 feet while operating because the air pressure would be too low inside the drive to float the heads properly." Such hard drives have air ports, and air filters, to equalize pressure with the outside air. (Update: Some modern drives after 2015 are sealed with [helium].) I was first told about the ratio between fly height and particles of dust in a computer studies class at school, with the teacher drawing this diagram on a chalkboard. I assumed that a speck of dust would destroy a drive head at 7200 rpm. Right? I just found a Quora article with a better diagram than mine, which also asks the question So, what do YOU think would happen if the disk read/write head were to run over a speck of dust? (The article doesn't answer.) ## What happened The disk photo is an 80 Gbyte Western Digital IDE disk I found when packing up to move house. Missing its lid. Dusty. I'd also recently bought a [SATA/IDE to USB hub] and couldn't resist seeing if the disk was readable despite the dust, and finding out what was on it (I'd forgotten). Surely it's unreadable, right?...
The drive failed immediately. The disk sped up, the head clicked, then sped down with an error. I found the lid but no drive screws, and rested it on top. Still errored. By pushing down on the lid, however, (simulating screws) it sped up and down a few times before failing. The harder I pushed the less it vibrated and the more it worked, until I finally had it returning I/O, albeit slowly. (This may be the opposite of my famous [shouting video]: This time I'm suppressing vibration to make a disk work.)

I managed to read over 99.9999% of disk sectors successfully. It took several hours so I left a bottle of apple juice pressing the lid down. Performance was still poor, but the head wasn't obliterated. Only an 8-Kbyte sequential chunk failed and could not be read (big bit of dust?).

The iostat output from earlier (and the screenshots below) are the performance of this disk, dust-n-all. While dust may have been a factor, I think the biggest cause for poor performance was vibration with the lid unscrewed, based on how much faster it worked when I used my body weight to hold the lid down. I could hear it spin faster. It seemed to have several set speeds, and when pushing hard it would try a faster speed for a couple of seconds, then a faster one, until it found the fastest it could operate (presumably it tries faster speeds until it begins to get sector-ECC errors). The way it tried faster speeds somehow reminded me of how 32x CDROM drives operated.

## Screenshots

Back to my opening line: The following screenshots may help you better understand these tool outputs. I'll start with the worst performance and then show moderately-poor performance. From these outputs I try to determine if the problem is:

- **The workload**: High-latency disk I/O is commonly caused by the workload applied. It may be due to queueing, especially from file systems that send a batch of writes. It can also be simply large I/O, or the presence of other disk commands that slow subsequent I/O.
- **The disk**: If it isn't the workload applied, then slow I/O may well be caused by a bad disk. Analysis is similar whether the disk is rotational magnetic or flash-memory based. Rotational disks have extra latency from head seeks for random I/O, and spin ups from the idle state.

The workload is 128 Kbyte sequential reads using the dd(1) utility. I'd guess they'd normally take between 1 and 2 ms for this disk.

### Worst performance

iostat(1), printing 10-second summaries:
$ iostat -xz 10
Linux 4.15.0-66-generic (lgud-bgregg) 	12/16/2020 	_x86_64_	(8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7.70    0.01    2.03    0.09    0.00   90.17
[...]

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7.90    0.00    2.07   10.87    0.00   79.15

Device      r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
nvme0n1    0.40   15.30      2.00    167.20     0.00     2.70   0.00  15.00    7.00    0.81   0.01     5.00    10.93   0.13   0.20
dm-0       0.40   18.00      2.00    167.20     0.00     0.00   0.00   0.00    7.00    7.69   0.14     5.00     9.29   0.33   0.60
dm-1       0.30   17.80      1.60    167.20     0.00     0.00   0.00   0.00    6.67    7.78   0.14     5.33     9.39   0.29   0.52
dm-2       0.10    0.00      0.40      0.00     0.00     0.00   0.00   0.00    8.00    0.00   0.00     4.00     0.00   8.00   0.08
sdb        7.30    0.00    934.40      0.00     0.00     0.00   0.00   0.00  269.70    0.00   1.97   128.00     0.00 136.88  99.92

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7.70    0.00    1.66   10.97    0.00   79.68

Device      r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
nvme0n1    4.40    6.00     42.00     43.20     0.00     4.30   0.00  41.75    6.45    0.80   0.03     9.55     7.20   0.15   0.16
dm-0       4.40   10.30     42.00     43.20     0.00     0.00   0.00   0.00    6.55    0.47   0.03     9.55     4.19   0.54   0.80
dm-1       4.40    9.80     42.00     43.20     0.00     0.00   0.00   0.00    6.55    0.49   0.03     9.55     4.41   0.56   0.80
sdb        4.50    0.00    576.00      0.00     0.00     0.00   0.00   0.00  434.31    0.00   1.98   128.00     0.00 222.22 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           6.89    0.00    1.90   10.99    0.00   80.23

Device      r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
nvme0n1    0.30    7.60      1.20    119.20     0.00     4.40   0.00  36.67    2.67    1.63   0.01     4.00    15.68   0.20   0.16
dm-0       0.30   12.00      1.20    119.20     0.00     0.00   0.00   0.00    2.67    2.30   0.03     4.00     9.93   0.55   0.68
dm-1       0.30   11.40      1.20    119.20     0.00     0.00   0.00   0.00    2.67    2.42   0.03     4.00    10.46   0.58   0.68
sdb        3.50    0.00    448.00      0.00     0.00     0.00   0.00   0.00  579.66    0.00   1.99   128.00     0.00 285.71 100.00
This output shows 10-second statistical summaries. Massive r_await with little aqu-sz, as mentioned earlier. The read size is large (128 Kbyte average as seen in iostat(1)), but that's not excessive.

biolatency (this is my BPF tool from [bcc]), printing 60-second histograms, per disk (-D):
# biolatency -D 60 1
Tracing block device I/O... Hit Ctrl-C to end.


disk = 'nvme0n1'
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 12       |*                                       |
        16 -> 31         : 318      |****************************************|
        32 -> 63         : 210      |**************************              |
        64 -> 127        : 106      |*************                           |
       128 -> 255        : 65       |********                                |
       256 -> 511        : 29       |***                                     |
       512 -> 1023       : 31       |***                                     |
      1024 -> 2047       : 81       |**********                              |
      2048 -> 4095       : 93       |***********                             |
      4096 -> 8191       : 76       |*********                               |

disk = 'sdb'
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 0        |                                        |
      8192 -> 16383      : 0        |                                        |
     16384 -> 32767      : 1        |                                        |
     32768 -> 65535      : 15       |**                                      |
     65536 -> 131071     : 214      |****************************************|
    131072 -> 262143     : 84       |***************                         |
    262144 -> 524287     : 46       |********                                |
    524288 -> 1048575    : 7        |*                                       |
   1048576 -> 2097151    : 0        |                                        |
   2097152 -> 4194303    : 1        |                                        |
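For readers new to this histogram layout: each row covers a power-of-two range of microseconds. The following is a minimal Python sketch of that bucketing, inferred only from the row labels shown here and not taken from the bcc source; the function name `log2_bucket` is made up for illustration.

```python
def log2_bucket(usecs):
    """Return the (low, high) row of a power-of-two histogram that holds usecs."""
    if usecs < 2:
        return (0, 1)  # the first row covers both 0 and 1
    low = 1 << (usecs.bit_length() - 1)  # largest power of two <= usecs
    return (low, 2 * low - 1)


# Latencies taken from the iostat and biosnoop output in this post, in microseconds.
samples = [
    ("fast read, 1.60 ms", 1600),
    ("slow read, 77.96 ms", 77960),
    ("r_await, 434.31 ms", 434310),
]

for label, usecs in samples:
    low, high = log2_bucket(usecs)
    print(f"{label}: row {low} -> {high}")
```

Because the rows double in width, a single histogram can show microsecond NVMe reads and multi-second outliers side by side, at the cost of coarse resolution in the slow rows.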
Note the sdb latencies range from about 16 ms to over 2 seconds!

biosnoop (this is my BPF tool from [bcc]), printing every disk event:
# biosnoop
TIME(s)     COMM           PID    DISK    T SECTOR     BYTES  LAT(ms)
0.000000    dd             16014  sdb     R 37144544   131072   77.96
0.008933    biosnoop       21118  nvme0n1 R 652936664  4096      7.53
0.143268    dd             16014  sdb     R 37144800   131072  143.20
0.333243    dmcrypt_write  347    nvme0n1 W 244150736  4096      2.72
0.333256    dmcrypt_write  347    nvme0n1 W 244150744  4096      2.49
0.333259    dmcrypt_write  347    nvme0n1 W 244150752  4096      1.38
0.361565    dd             16014  sdb     R 37145056   131072  218.24
0.463294    dd             16014  sdb     R 37145312   131072  101.70
0.590237    dd             16014  sdb     R 37145568   131072  126.92
0.734682    dd             16014  sdb     R 37145824   131072  144.38
0.864665    Cache2 I/O     6515   nvme0n1 R 694714632  4096      0.10
0.961290    dd             16014  sdb     R 37146080   131072  226.55
1.063137    dd             16014  sdb     R 37146336   131072  101.79
1.198111    dd             16014  sdb     R 37146592   131072  134.91
1.425886    dd             16014  sdb     R 37146848   131072  227.74
1.619342    dd             16014  sdb     R 37147104   131072  193.38
1.754445    dd             16014  sdb     R 37147360   131072  135.04
1.856156    dd             16014  sdb     R 37147616   131072  101.65
2.000656    dd             16014  sdb     R 37147872   131072  144.42
2.102591    dd             16014  sdb     R 37148128   131072  101.83
2.204427    dd             16014  sdb     R 37148384   131072  101.77
2.397540    dd             16014  sdb     R 37148640   131072  193.05
2.567098    dd             16014  sdb     R 37148896   131072  169.52
2.576776    dmcrypt_write  347    nvme0n1 W 94567816   57344     7.46
2.577205    dmcrypt_write  347    nvme0n1 W 499469088  12288     0.02
2.577272    dmcrypt_write  347    nvme0n1 W 499469112  16384     0.04
2.580759    dmcrypt_write  347    nvme0n1 W 499469144  4096      2.03
2.752098    dd             16014  sdb     R 37149152   131072  184.94
2.945566    dd             16014  sdb     R 37149408   131072  193.41
3.039011    dd             16014  sdb     R 37149664   131072   93.38
3.165834    dd             16014  sdb     R 37149920   131072  126.76
3.401771    dd             16014  sdb     R 37150176   131072  235.87
3.536805    dd             16014  sdb     R 37150432   131072  134.95
3.705294    dd             16014  sdb     R 37150688   131072  168.43
3.772291    Cache2 I/O     6515   nvme0n1 R 694703744  4096      7.55
3.873563    dd             16014  sdb     R 37150944   131072  168.21
4.018151    dd             16014  sdb     R 37151200   131072  144.53
4.253137    dd             16014  sdb     R 37151456   131072  234.92
4.310591    dmcrypt_write  347    nvme0n1 W 220635024  16384     2.70
[...]
This shows individual I/O to disk sdb taking 100 ms and more (LAT(ms)). If I ran this for long enough I should see outliers reaching up to over 2 seconds.

I don't see evidence of queueing in this biosnoop output: One tell-tale sign of queueing is when I/O latencies ramp up (e.g.: 10ms, 20ms, 30ms, 40ms, etc.) with a steady completion time between them (seen in the TIME(s) column). This can be when the disk is working through its queue, so later I/O have steadily increasing latency. But the completion times and latencies in this output show that the disk doesn't appear to have a deep queue. It's just plain slow.

### Poor performance

By pressing hard on the disk lid it was able to operate faster, but performance was still somewhat poor.
# biosnoop
TIME(s)     COMM           PID    DISK    T SECTOR     BYTES  LAT(ms)
[...]
2.643276    dd             16014  sdb     R 46133728   131072    1.60
2.660996    dd             16014  sdb     R 46133984   131072   16.98
2.671327    dd             16014  sdb     R 46134240   131072   10.31
2.673299    dd             16014  sdb     R 46134496   131072    1.94
2.675298    dd             16014  sdb     R 46134752   131072    1.97
2.685624    dd             16014  sdb     R 46135008   131072   10.29
2.705410    dd             16014  sdb     R 46135264   131072   19.76
2.707425    dd             16014  sdb     R 46135520   131072    1.96
2.710357    dd             16014  sdb     R 46135776   131072    1.66
2.716280    dd             16014  sdb     R 46136032   131072    1.62
2.739534    dd             16014  sdb     R 46136288   131072   19.07
2.741464    dd             16014  sdb     R 46136544   131072    1.90
2.743432    dd             16014  sdb     R 46136800   131072    1.93
2.745563    dd             16014  sdb     R 46137056   131072    1.57
2.756934    dd             16014  sdb     R 46137312   131072   10.11
2.783863    dd             16014  sdb     R 46137568   131072   26.90
2.785830    dd             16014  sdb     R 46137824   131072    1.93
2.787835    dd             16014  sdb     R 46138080   131072    1.97
2.790935    dd             16014  sdb     R 46138336   131072    2.55
[...]
The latencies here look like they are a mix of normal speed (~1.9 ms) and slower ones (~10ms and slower). Given it's a 7,200 rpm disk, a revolution takes ~8ms, so if it needs to retry sectors I'd expect to see latencies of 2ms, 10ms, 18ms, 26ms, etc.

Here's the biolatency(1) histograms when the disk is running faster:
disk = 'sdb'
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 13       |******                                  |
      2048 -> 4095       : 82       |****************************************|
      4096 -> 8191       : 0        |                                        |
      8192 -> 16383      : 9        |****                                    |
     16384 -> 32767      : 7        |***                                     |
     32768 -> 65535      : 41       |********************                    |
     65536 -> 131071     : 77       |*************************************   |
    131072 -> 262143     : 2        |                                        |
    262144 -> 524287     : 1        |                                        |
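The retry estimate given before this histogram (a revolution of roughly 8 ms on a 7,200 rpm disk) can be worked through in a few lines of Python. This is only a sketch: it assumes, as the text does, that each retry costs one full extra revolution on top of a normal read of about 2 ms, and the constant and function names are illustrative.

```python
RPM = 7200
BASE_MS = 2.0  # assumed normal 128 Kbyte read latency, rounded from ~1.9 ms


def revolution_ms(rpm):
    """Time for one platter revolution, in milliseconds."""
    return 60_000 / rpm


def retry_latencies_ms(retries, rpm=RPM, base_ms=BASE_MS):
    """Expected latency after 0..retries extra revolutions."""
    rev = revolution_ms(rpm)
    return [round(base_ms + n * rev, 1) for n in range(retries + 1)]


print(f"one revolution: {revolution_ms(RPM):.2f} ms")
print(retry_latencies_ms(3))
```

Using the unrounded 8.33 ms revolution gives roughly 10.3, 18.7 and 27.0 ms for one, two and three retries, which sits close to several of the slower biosnoop entries (10.31, 10.29, 19.07 and 26.90 ms).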
The distribution is bimodal. The faster mode will be the sequential reads, the slower mode shows the retries.

And the iostat(1) output when the disk is in this faster state:
$ iostat -xz 10
[...]
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          11.78    0.00    2.68    2.82    0.00   82.72

Device      r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
nvme0n1    3.50   11.70     15.60    146.40     0.40     2.30  10.26  16.43    2.40    0.21   0.00     4.46    12.51   0.05   0.08
dm-0       3.90   14.00     15.60    146.40     0.00     0.00   0.00   0.00    2.87    0.17   0.01     4.00    10.46   0.54   0.96
dm-1       1.40   13.70      5.60    146.40     0.00     0.00   0.00   0.00    4.29    0.18   0.01     4.00    10.69   0.29   0.44
dm-2       2.50    0.00     10.00      0.00     0.00     0.00   0.00   0.00    2.08    0.00   0.01     4.00     0.00   2.08   0.52
sdb      321.40    0.00  41139.20      0.00     0.00     0.00   0.00   0.00    5.11    0.00   1.64   128.00     0.00   3.01  96.88
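As a rough cross-check on how the sdb columns relate to each other, the sketch below applies Little's law (average number in flight is about arrival rate times average time in the system) to the sdb rows shown in this post. It is an approximation for illustration, not iostat's actual calculation, and the helper name is made up.

```python
def estimated_queue_size(reads_per_sec, r_await_ms):
    """Little's law: average I/O in flight ~= rate (1/s) x latency (s)."""
    return reads_per_sec * (r_await_ms / 1000.0)


# (r/s, r_await in ms, reported aqu-sz) for sdb, copied from the iostat output.
sdb_rows = [
    (7.30, 269.70, 1.97),
    (4.50, 434.31, 1.98),
    (3.50, 579.66, 1.99),
    (321.40, 5.11, 1.64),
]

for rate, await_ms, reported in sdb_rows:
    est = estimated_queue_size(rate, await_ms)
    print(f"r/s={rate:7.2f}  r_await={await_ms:7.2f} ms  "
          f"estimated aqu-sz={est:.2f}  reported={reported:.2f}")
```

Each estimate lands within a few hundredths of the reported aqu-sz: there are about two reads in flight whether the disk is crawling or running quickly, which is consistent with reading the high r_await as a slow disk rather than a deep queue.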
The average (r_await) of 5.11 ms really doesn't tell the full story like the histogram or per-event output does.

## More questions

What's happening to all that dust? Is it stuck to the platter surface, or does it bounce around when the disk is spinning? The photo I included was after I read the entire disk, so the dust didn't end up in the internal air filters. It was still on the platter.

Would a 1 TB disk be as tolerant to dust as this old 80 GB disk? (When I was a sysadmin, I heard a story of how old VAX drives would stall, so holes had been drilled in them with tape over the holes. When stalled, the sysadmin would peel back the tape and use their finger to spin-start them. Those even older drives must have been more tolerant of dust!)

And at what point is there too much dust? I don't recommend you try this, but if I had time or interest I'd create a perspex lid and see how much dust a drive can keep working with.

At least I answered one question. I found that these hard drive heads were not destroyed by dust, and could read almost everything from a dusty disk, albeit slowly. Perhaps that's not the case with more modern SMR disks with smaller tolerances, but I'd have to try, given the surprising result this time.

[flying height]: https://en.wikipedia.org/wiki/Flying_height
[SATA/IDE to USB hub]: https://www.amazon.com/gp/product/B01NAUIA6G/
[shouting video]: http://www.brendangregg.com/blog/2008-12-31/unusual-disk-latency.html
[air filters]: https://www.karlstechnology.com/blog/hard-drive-air-filters
[bcc]: https://github.com/iovisor/bcc
[helium]: https://techreport.com/news/27031/shingled-platters-breathe-helium-inside-hgsts-10tb-hard-drive/

May 08, 2021 02:00 PM

May 06, 2021

Matthew Garrett: More doorbell adventures

Back in my last post on this topic, I'd got shell on my doorbell but hadn't figured out why the HTTP callbacks weren't always firing. I still haven't, but I have learned some more things.

Doorbird sell a chime, a network connected device that is signalled by the doorbell when someone pushes a button. It costs about $150, which seems excessive, but would solve my problem (ie, that if someone pushes the doorbell and I'm not paying attention to my phone, I miss it entirely). But given a shell on the doorbell, how hard could it be to figure out how to mimic the behaviour of one?

Configuration for the doorbell is all stored under /mnt/flash, and there's a bunch of files prefixed 1000eyes that contain config (1000eyes is the German company that seems to be behind Doorbird). One of these was called 1000eyes.peripherals, which seemed like a good starting point. The initial contents were {"Peripherals":[]}, so it seemed likely that it was intended to be JSON. Unfortunately, since I had no access to any of the peripherals, I had no idea what the format was. I threw the main application into Ghidra and found a function that had debug statements referencing "initPeripherals" and read a bunch of JSON keys out of the file, so I could simply look at the keys it referenced and write out a file based on that. I did so, and it didn't work - the app stubbornly refused to believe that there were any defined peripherals. The check that was failing was pcVar4 = strstr(local_50[0],PTR_s_"type":"_0007c980);, which made no sense, since I very definitely had a type key in there. And then I read it more closely. strstr() wasn't being asked to look for "type":, it was being asked to look for "type":". I'd left a space between the : and the opening " in the value, which meant it wasn't matching. The rest of the function seems to call an actual JSON parser, so I have no idea why it doesn't just use that for this part as well, but deleting the space and restarting the service meant it now believed I had a peripheral attached.
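The failure mode is easy to reproduce outside the firmware. This Python sketch puts a plain substring search (standing in for the strstr() call) next to a real JSON parser. The sample config strings are invented for illustration and are not the doorbell's actual file contents.

```python
import json

NEEDLE = '"type":"'  # the exact character sequence the check searches for


def firmware_sees_type(raw):
    """Stand-in for the strstr() check: a plain substring search."""
    return NEEDLE in raw


with_space = '{"Peripherals":[{"type": "DoorChime"}]}'
without_space = '{"Peripherals":[{"type":"DoorChime"}]}'

# A JSON parser treats both documents as identical...
assert json.loads(with_space) == json.loads(without_space)

# ...but the substring search only matches the one without the space.
print(firmware_sees_type(with_space))     # False
print(firmware_sees_type(without_space))  # True
```

Any whitespace that JSON permits between the colon and the value is enough to defeat the substring match, even though the document is still valid JSON.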

The mobile app that's used for configuring the doorbell now showed a device in the peripherals tab, but it had a weird corrupted name. Tapping it resulted in an error telling me that the device was unavailable, and on the doorbell itself generated a log message showing it was trying to reach a device with the hostname bha-04f0212c5cca and (unsurprisingly) failing. The hostname was being generated from the MAC address field in the peripherals file and was presumably supposed to be resolved using mDNS, but for now I just threw a static entry in /etc/hosts pointing at my Home Assistant device. That was enough to show that when I opened the app the doorbell was trying to call a CGI script called peripherals.cgi on my fake chime. When that failed, it called out to the cloud API to ask it to ask the chime[1] instead. Since the cloud was completely unaware of my fake device, this didn't work either. I hacked together a simple server using Python's HTTPServer and was able to return data (another block of JSON). This got me to the point where the app would now let me get to the chime config, but would then immediately exit. adb logcat showed a traceback in the app caused by a failed assertion due to a missing key in the JSON, so I ran the app through jadx, found the assertion and from there figured out what keys I needed. Once that was done, the app opened the config page just fine.
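For anyone wanting to try something similar, a fake chime along these lines might look like the sketch below. It is not the code used here: the JSON payload is a placeholder (the real keys are not listed in this post), the use of GET is an assumption, and the port and all names are arbitrary.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical payload: the keys the app actually expects are not given here.
FAKE_CHIME_INFO = {"placeholder": "replace with the keys the app expects"}


def build_body(info=FAKE_CHIME_INFO):
    """Serialise the fake chime's reply as UTF-8 JSON bytes."""
    return json.dumps(info).encode("utf-8")


class FakeChimeHandler(BaseHTTPRequestHandler):
    def do_GET(self):  # assumption: the doorbell uses GET for this call
        # Strip any query string and answer only the CGI name mentioned above.
        path = self.path.split("?", 1)[0]
        if path.endswith("peripherals.cgi"):
            body = build_body()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)


def make_server(host="0.0.0.0", port=8080):
    """Create (but do not start) the server; the port is an arbitrary choice."""
    return HTTPServer((host, port), FakeChimeHandler)

# To run it: make_server().serve_forever()
```

Logging every request path before answering is a cheap way to discover which other CGI calls the doorbell attempts.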

Unfortunately, though, I couldn't edit the config. Whenever I hit "save" the app would tell me that the peripheral wasn't responding. This was strange, since the doorbell wasn't even trying to hit my fake chime. It turned out that the app was making a CGI call to the doorbell, and the thread handling that call was segfaulting just after reading the peripheral config file. This suggested that the format of my JSON was probably wrong and that the doorbell was not handling that gracefully, but trying to figure out what the format should actually be didn't seem easy and none of my attempts improved things.

So, new approach. Rather than writing the config myself, why not let the doorbell do it? I should be able to use the genuine pairing process if I could mimic the chime sufficiently well. Hitting the "add" button in the app asked me for the username and password for the chime, so I typed in something random in the expected format (six characters followed by four zeroes) and a sufficiently long password and hit ok. A few seconds later it told me it couldn't find the device, which wasn't unexpected. What was a little more unexpected was that the log on the doorbell was showing it trying to hit another bha-prefixed hostname (and, obviously, failing). The hostname contains the MAC address, but I hadn't told the doorbell the MAC address of the chime, just its username. Some more digging showed that the doorbell was calling out to the cloud API, giving it the 6 character prefix from the username and getting a MAC address back. Doing the same myself revealed that there was a straightforward mapping from the prefix to the MAC address - changing the final character from "a" to "b" incremented the MAC by one. It's actually just a base 26 encoding of the MAC, with aaaaaa translating to 00408C000000.
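A minimal Python sketch of that mapping, built only from the two facts above (aaaaaa is 00408C000000, and bumping the last letter adds one). It assumes a=0 through z=25 with the most significant letter first, and it only covers MACs within 26^6 of the base address; the real scheme may well differ outside those two data points. Function names are illustrative.

```python
import string

BASE_MAC = 0x00408C000000          # MAC that "aaaaaa" maps to
ALPHABET = string.ascii_lowercase  # assumed digit order: a=0 ... z=25


def prefix_to_mac(prefix):
    """Decode a six-letter username prefix into a MAC address string."""
    if len(prefix) != 6 or any(c not in ALPHABET for c in prefix):
        raise ValueError("expected six lowercase letters")
    offset = 0
    for ch in prefix:  # most significant letter first
        offset = offset * 26 + ALPHABET.index(ch)
    return f"{BASE_MAC + offset:012X}"


def mac_to_prefix(mac):
    """Inverse mapping: MAC address string back to the six-letter prefix."""
    offset = int(mac, 16) - BASE_MAC
    if not 0 <= offset < 26 ** 6:
        raise ValueError("MAC outside the range six letters can encode")
    letters = []
    for _ in range(6):
        offset, digit = divmod(offset, 26)
        letters.append(ALPHABET[digit])
    return "".join(reversed(letters))


print(prefix_to_mac("aaaaaa"))
print(prefix_to_mac("aaaaab"))
print(mac_to_prefix("00408C00001A"))
```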

That explained how the hostname was being generated, and in return I was able to work backwards to figure out which username I should use to generate the hostname I was already using. Attempting to add it now resulted in the doorbell making another CGI call to my fake chime in order to query its feature set, and by mocking that up as well I was able to send back a file containing X-Intercom-Type, X-Intercom-TypeId and X-Intercom-Class fields that made the doorbell happy. I now had a valid JSON file, which cleared up a couple of mysteries. The corrupt name was because the name field isn't supposed to be ASCII - it's base64 encoded UTF16-BE. And the reason I hadn't been able to figure out the JSON format correctly was because it looked something like this:

{"Peripherals":[]{"prefix":{"type":"DoorChime","name":"AEQAbwBvAHIAYwBoAGkAbQBlACAAVABlAHMAdA==","mac":"04f0212c5cca","user":"username","password":"password"}}]}


Note that there's a total of one [ in this file, but two ]s? Awesome. Anyway, I could now modify the config in the app and hit save, and the doorbell would then call out to my fake chime to push config to it. Weirdly, the association between the chime and a specific button on the doorbell is only stored on the chime, not on the doorbell. Further, hitting the doorbell didn't result in any more HTTP traffic to my fake chime. However, it did result in some broadcast UDP traffic being generated. Searching for the port number led me to the Doorbird LAN API and a complete description of the format and encryption mechanism in use. Argon2i is used to turn the first five characters of the chime's password (which is also stored on the doorbell itself) into a 256-bit key, and this is used with ChaCha20 to decrypt the payload. The payload then contains a six character field describing the device sending the event, and then another field describing the event itself. Some more scrappy Python and I could pick up these packets and decrypt them, which showed that they were being sent whenever any event occurred on the doorbell. This explained why there was no storage of the button/chime association on the doorbell itself - the doorbell sends packets for all events, and the chime is responsible for deciding whether to act on them or not.
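As an aside on the config format shown earlier, the name field can be checked with nothing but the Python standard library. This is a minimal sketch of the base64-over-UTF-16BE encoding described above; the helper names are illustrative.

```python
import base64


def decode_name(field):
    """base64 string -> bytes -> UTF-16BE text."""
    return base64.b64decode(field).decode("utf-16-be")


def encode_name(name):
    """UTF-16BE text -> base64 string, as stored in the config."""
    return base64.b64encode(name.encode("utf-16-be")).decode("ascii")


stored = "AEQAbwBvAHIAYwBoAGkAbQBlACAAVABlAHMAdA=="
print(decode_name(stored))  # Doorchime Test
```

Treating those bytes as ASCII puts a NUL before every character, which would explain the corrupted name the app displayed for the hand-written config.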

On closer examination, it turns out that these packets aren't just sent if there's a configured chime. One is sent for each configured user, avoiding the need for a cloud round trip if your phone is on the same network as the doorbell at the time. There was literally no need for me to mimic the chime at all; suitable events were already being sent.

Still. There's a fair amount of WTFery here, including the strstr() based JSON parsing, the invalid JSON, the symmetric encryption that uses device passwords as the key (requiring the doorbell to be aware of the chime's password) and the use of only the first five characters of the password as input to the KDF. It doesn't give me a great deal of confidence in the rest of the device's security, so I'm going to keep playing.

[1] This seems to be to handle the case where the chime isn't on the same network as the doorbell


May 06, 2021 06:26 AM

May 04, 2021

Linux Plumbers Conference: Dates for Virtual Linux Plumbers now 20-24 September

We took a look at all the events that were announced at the same time as OSS, including KVM Forum. The dates 20-24 September still seem to be clear of conference overlaps so we thought we’d grab them for Plumbers before someone else does. We also thought the timezone last year (Atlantic, 1h ahead of US Eastern and 5h behind central European) worked well, so we’ll plan to hold the conference mostly in that timezone (although Microconference sessions can vary this if participants need to, and our conference architecture will be available 24h).

May 04, 2021 02:37 PM

May 03, 2021

Linux Plumbers Conference: Containers and Checkpoint/Restore Microconference Accepted into 2021 Linux Plumbers Conference

We are pleased to announce that the Containers and Checkpoint/Restore Microconference has been accepted into the 2021 Linux Plumbers Conference! The Containers and Checkpoint/Restore micro-conference brings together kernel developers, runtime maintainers, and developers working on container- and sandboxing related technologies in general to discuss current problems and agree on new features.

Last year’s meetup resulted in:

This year’s edition of the Containers and Checkpoint/Restore micro-conference will focus on a variety of topics that are in need of discussion. The list of ideas is constantly evolving and we expect even more topics to pop up during the coming months as past experience has shown. Here is an excerpt:

Come join us and participate in the discussion with what holds “The Cloud” together.

We hope to see you there!

May 03, 2021 05:35 PM

April 30, 2021

Linux Plumbers Conference: Linux Plumbers Goes Fully Virtual

You may have noticed that the Linux Foundation has announced moving OSS+ELC from Dublin to Seattle, WA due to survey results and vaccination rates in Europe. Since we agreed to co-locate with OSS+ELC this year, we’ve been debating following suit or going virtual. Unfortunately, the safety protocols imposed by event venues in the US require masks and social distancing, making it impossible to hold the interactive part of Plumbers (the Microconferences). Since Microconferences are a differentiating feature of Plumbers, we felt that rather than lose such an essential element we’d move the entire conference on-line and hope to be back in-person next year.

As with last year, we’ll be using BigBlueButton for the main video interactions, but, following the example of FOSDEM, we’ll be using Matrix for the chat portion (and following feedback, we’ll be trying to integrate the matrix chat into the BBB chat window).

OSS+ELC in Seattle is now across our original dates, so we’ll try to find new ones that don’t clash with existing events; stay tuned for an update.

April 30, 2021 09:26 PM

April 29, 2021

Pete Zaitcev: Swift in 2021

A developer meet-up for OpenStack, known as PTG, occurred a week ago. I attended the Swift track, where somewhat to my surprise we had two new contributors show up.

I got into a habit of telling people that I did not want Swift to end up like AFS: great software, but dead, with nobody using it. Today I looked it up, and what do you know: OpenAFS made a release in June 2020 (and apparently they also screwed up and had to post an emergency release in October).

So, I was chatting with Matt O. at PTG and he said, "oh yeah, we won some contracts when I was at SuSE, Swift was beating the competition." Not entirely a surprise, but it got me thinking: is it too early to declare Swift dead, or even AFS level dead?

Since NVIDIA gobbled up Swift, I was full of concerns for the centralization. NVIDIA uses Swift as a hyperscaler, in support of their own clusters. They already started to divest themselves from Swiftstack's customer base. I envisioned a future where NVIDIA assembles all the core contributors, then fires them all and closes the project. But then I learned that Lustre went through a cycle like that, being acquired, but then sold out to a smaller, more focused company (to DDN).

To sum up, I see a possibility for Swift to remain relevant through a three-step strategy, if you will. First, Swift remains open, aligned to technology, and performant. Thanks to that, it wins new deployments (in HPC and Telco in particular). And because of the field use, it will find a corporate stewardship. So, basically, suck less for success.

P.S. Also at PTG I learned that S3 Inventory existed. Seemed like implementing it in Swift could be a satisfying accomplishment for someone new.

April 29, 2021 05:23 AM

April 27, 2021

Paul E. Mc Kenney: Stupid RCU Tricks: A tour through rcutorture

Although Linux-kernel RCU gets most of the attention, without rcutorture, RCU would not be what it is today. To see this, note that the old saying “If it ain't tested, it don't work!” is if anything more valid today than it was back then. After all, software has not gotten any simpler, workloads have not become less demanding, and systems have not grown smaller, except in terms of physical size. That said, the decrease in size has been truly impressive. Back when Jack and I invented RCU, the hardware contained in my laptop would have filled no fewer than fifteen standard racks, and that ignores the hardware that simply was not available back then, and also ignores the reliability issues that would have resulted from such an imposing agglomeration of hardware.

It is rcutorture's job to make sure that Linux-kernel RCU actually works, and so it is worthwhile getting to know rcutorture a bit better. The following blog posts cover design of, use of, and experience with this test suite:


  1. Stupid RCU Tricks: So you want to torture RCU? (use)
  2. Stupid RCU Tricks: So rcutorture is Not Aggressive Enough For You? (use)
  3. Stupid RCU Tricks: Failure Probability and CPU Count (use)
  4. Stupid RCU Tricks: Enlisting the Aid of a Debugger (use)
  5. Stupid RCU Tricks: Torturing RCU Fundamentally, Part I (design)
  6. Stupid RCU Tricks: Torturing RCU Fundamentally, Part II (design)
  7. Stupid RCU Tricks: Torturing RCU Fundamentally, Part III (design)
  8. Stupid RCU Tricks: Torturing RCU Fundamentally, Parts IV and V (design)
  9. Stupid RCU Tricks: So rcutorture is Still Not Aggressive Enough For You? (use)
  10. Stupid RCU Tricks: rcutorture fails to find an RCU bug (experience)
  11. Stupid RCU Tricks: The design of rcutorture (design)
  12. Stupid RCU Tricks: Which tests do I run??? (use)

And here are a few older posts covering rcutorture:

  1. Hunting Heisenbugs (experience, 2009)
  2. Hunting More Heisenbugs (experience, 2009)
  3. Stupid RCU Tricks: RCU Priority Inversion (design, 2010)
  4. And it used to be so simple... (design, 2011)
  5. Stupid RCU Tricks: Bug Found by Refactored Tests (design, experience, and use, 2014)
  6. Stupid RCU Tricks: rcutorture Catches an RCU Bug (experience, 2014)
  7. Stupid RCU Tricks: rcutorture Accidentally Catches an RCU Bug (experience, 2017)
Ah, but what about formal verification? But of course! Please see this series, and especially this post.

I hope that this series is helpful, and I further hope that it will inspire more aggressive torturing of other software!

April 27, 2021 11:54 PM

April 24, 2021

Paul E. Mc Kenney: Stupid RCU Tricks: The design of rcutorture

This installment of the rcutorture series takes a high-level look at its design. At the highest level, rcutorture is a stress test with a few unit-test components thrown in for good measure. It also includes scripts to handle both single-system and distributed testing. All of this code is of course paying homage to the many moods of Mr. Murphy.

The Many Moods of Mr. Murphy

As I have progressed through my career, I seem to have progressively miffed Mr. Murphy.

I completed my first professional (but pro bono) project in the mid-1970s. It had one user. Any million-year bugs it might have contained took the full million years to appear. This meant that Murphy was actually a pretty nice guy. Sure, whatever could happen would. Eventually. Maybe in geologic time.

In the 1980s, I completed a number of contract-programming projects that might have had installed bases of as many as 100 units. A million-year bug could be expected to appear about once per 10,000 years. In the 1990s, I worked on Sequent's DYNIX/ptx proprietary-UNIX operating system, which had an installed base of perhaps 6,000 systems. A million-year bug could be expected to appear not quite once per two centuries.

Shortly after the year 2000, I started working on the Linux kernel. There are at best rough estimates of the Linux kernel's installed base, and as of 2017, there were an estimated 20 billion systems of one sort or another running the Linux kernel, including smartphones, automobiles, household appliances, and much more. A million-year bug could be expected to appear more than once per hour across this huge installed base. In other words, over a period of about 40 years, Murphy has transitioned from being a pretty nice guy to being a total jerk!
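
The arithmetic behind these estimates can be reproduced with a short sketch. It assumes that every system fails independently, so that failure rates simply add across the installed base; the function name and the labels are made up for illustration.

```python
HOURS_PER_YEAR = 365.25 * 24  # 8766 hours


def years_between_occurrences(installed_base, bug_mtbf_years=1_000_000):
    """Expected years between occurrences of the bug somewhere in the installed base.

    Assumes independent systems, so the aggregate failure rate is the
    per-system rate multiplied by the number of systems.
    """
    return bug_mtbf_years / installed_base


for label, systems in [
    ("1970s project", 1),
    ("1980s contract work", 100),
    ("DYNIX/ptx", 6_000),
    ("Linux, 2017 estimate", 20_000_000_000),
]:
    years = years_between_occurrences(systems)
    print(f"{label}: one occurrence every {years:g} years "
          f"({years * HOURS_PER_YEAR:g} hours)")
```

For the 20-billion-system estimate the interval comes out at well under an hour, which is where "more than once per hour" comes from.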

Worse yet, should the Linux kernel capture even a modest fraction of the Internet-of-things market, a million-year bug could be expected to appear every few minutes across the installed base. Which might well result in Murphy becoming nothing less than a homicidal maniac.

Fortunately, there are some validation strategies that might help keep Murphy on the straight and narrow.

If You Cannot Beat Him, Join Him!

Given that everything that can happen eventually will, the task at hand is to try to make it happen in the comparative comfort and safety of the lab. This means aiding and abetting Mr. Murphy, at least within the lab environment. And this is the whole point of rcutorture, whose tricks include the following:

  1. Temporal fuzzing.
  2. Exercising race conditions.
  3. Anticipating abuse.
Of course, none of these tricks are new, but it does not hurt to review them.

Temporal Fuzzing

But why not go for the full effect and apply straight-up fuzzing? The answer to this question may be found in RCU's core API:
void rcu_read_lock(void);
void rcu_read_unlock(void);
void synchronize_rcu(void);
void call_rcu(struct rcu_head *head, rcu_callback_t func);
For the first three functions, there is nothing to fuzz, unless you are trying to test your compiler. For the last function, fuzzing of pointers—and most especially pointers to functions—is reserved for the truly brave and for those wishing to test their kernel's exception handling.

But it does make sense to fuzz the timing of calls to these functions, and that is exactly what rcutorture does. RCU readers and updaters are invoked at random times, with readers and updaters cooperating to detect any too-short grace periods, memory misordering, and so on. Much of the fuzzing is randomly generated at run time, but there are also module parameters that insert delays in specific locations. This strategy is straightforward, but can also be powerful. For example, careful choice of delays and other configuration settings decreased the mean time between failures (MTBF) of a memorable heisenbug from hundreds of hours to less than five hours. This had the beneficial effect of de-heisening this bug.
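
The idea can be illustrated with a toy Monte Carlo model. This is not how rcutorture is implemented: it is a userspace sketch with made-up names, in which a deliberately broken "grace period" waits a fixed time instead of waiting for readers, and randomly timed readers and updates are thrown at it.

```python
import random


def count_too_short_grace_periods(gp_wait, max_reader_len, trials=10_000, seed=0):
    """Count trials in which a pre-existing reader outlives the grace period.

    The modeled grace period is deliberately broken: it lasts a fixed
    gp_wait time units after the update and ignores readers entirely.
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable
    failures = 0
    for _ in range(trials):
        # Randomize when the reader starts, how long it runs,
        # and when the update happens.
        reader_start = rng.uniform(0.0, 10.0)
        reader_end = reader_start + rng.uniform(0.0, max_reader_len)
        update_time = rng.uniform(0.0, 10.0)
        gp_end = update_time + gp_wait
        # A reader that began before the update might still hold the old
        # data, so it must finish before the grace period ends.
        if reader_start < update_time and reader_end > gp_end:
            failures += 1
    return failures


print(count_too_short_grace_periods(gp_wait=1.0, max_reader_len=1.0))
print(count_too_short_grace_periods(gp_wait=1.0, max_reader_len=5.0))
```

With readers no longer than the fixed wait, the broken grace period is never caught. Widening the range of reader delays exposes it, which is the sense in which a careful choice of delays can turn a heisenbug into an ordinary bug.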

Exercising Race Conditions

Many of the most troublesome bugs involve rare operations, and one way to join forces with Murphy is to make rare operations less rare during validation. And rcutorture takes this approach often, including for the following operations:

  1. CPU hotplug.
  2. Transitions to and from idle, including transitions to and from the whole system being idle.
  3. Long RCU readers.
  4. Readers from interrupt handlers.
  5. Complex readers, for example, those overlapping with irq-disable regions.
  6. Delayed grace periods, for example, allowing a CPU to go offline and come back online during grace-period initialization.
  7. Racing call_rcu() invocations against rcu_barrier().
  8. Periodic forced migrations to other CPUs.
  9. Substantial testing of less-popular grace-period mechanisms.
  10. Processes running on the hypervisor to preempt code running in rcutorture guest OSes.
  11. Process exit.
  12. “Near misses” where the RCU grace-period guarantee is almost violated.
  13. Moving CPUs to and from rcu_nocbs callback-offloaded mode.
This exercising of race conditions might be reminiscent of the Netflix Chaos Monkey.

Anticipating Abuse

There are things that RCU users are not supposed to do. Just as users of the fork() system call are not supposed to code up forkbombs, RCU users are not supposed to code up endless blasts of call_rcu() invocations (see Documentation/RCU/checklist.rst item 8). Nevertheless, rcutorture does engage in (carefully limited forms of) call_rcu() abuse in order to find stress-related RCU bugs. This abuse is enabled by default and may be controlled by the rcutorture.fwd_progress module parameter and friends.

In addition, rcutorture inserts the occasional long-term delay in preemptible RCU readers and exercises code paths that must avoid deadlocks involving the scheduler and RCU.

Meta-Murphy, AKA Test the Test

Of course, one danger of joining Murphy is that things can go wrong in test code just as easily as they can go wrong in the code under test.

For this reason, rcutorture provides the rcutorture.object_debug module parameter that verifies that the code checking for double call_rcu() invocations is working properly. In addition, the rcutorture.stall_cpu module parameter and friends may be used to force RCU CPU stall warning messages of various types.

The rcutorture tests of more fundamental RCU properties may be enabled by using the rcutorture.torture_type module parameter. For example, rcutorture.torture_type=busted selects a broken RCU implementation, which may also be selected using the BUSTED scenario. Either way, rcutorture had jolly well better complain about too-short grace periods. In addition, rcutorture.torture_type=busted_srcud forces rcutorture to run compound readers against SRCU, which does not support this notion. In this case also, rcutorture had better complain about too-short grace periods for these compound readers. The rcutorture.leakpointer module parameter tests the CONFIG_RCU_STRICT_GRACE_PERIOD Kconfig option's ability to detect pointers leaked from RCU read-side critical sections. Finally, the rcutorture tests of RCU priority boosting can themselves be tested by using the BUSTED-BOOST scenario, which must then complain about priority-boosting failures.
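
The principle of pointing the checker at a known-broken implementation can be sketched in a few lines. This is a toy model with hypothetical names, not rcutorture code: readers are (start, end) intervals, and the checker counts readers that began before an update yet outlive the grace period.

```python
def stale_readers(gp_end_fn, update_time, readers):
    """Count readers that began before the update yet outlive the grace period."""
    gp_end = gp_end_fn(update_time, readers)
    return sum(1 for start, end in readers
               if start < update_time and end > gp_end)


def correct_gp_end(update_time, readers):
    # Wait for every reader that started before the update.
    return max([update_time] +
               [end for start, end in readers if start < update_time])


def busted_gp_end(update_time, readers):
    # Deliberately broken: the grace period ends immediately.
    return update_time


readers = [(0.0, 2.0), (1.0, 4.0), (3.5, 5.0)]
update_time = 3.0

print("correct:", stale_readers(correct_gp_end, update_time, readers))
print("busted: ", stale_readers(busted_gp_end, update_time, readers))
```

If the checker reported zero stale readers for the busted variant, the bug would be in the checker, which is exactly what such a self-test is meant to reveal.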

Additional unscheduled tests of rcutorture testing are of course provided by bugs in RCU itself. Perhaps these are rare examples of Murphy working against himself, but they normally do not feel that way at the time!

Enlisting Darwin

Those who are willing to consider the possibility that natural selection applies to non-living objects might do well to consider validation such as that provided by rcutorture to be a selection function. Now, some developers might object to the thought that their carefully created changes are random mutations, but the sad fact is that long experience has often supported that view.

With this in mind, a good validation suite will select against bugs, resulting in robust software, right?

Wrong.

You see, bugs are a form of software. An undesirable form, perhaps, but a form nevertheless. Bugs will therefore adapt to any fixed validation suite and accumulate in your software, degrading its robustness. This means that any bugs located by end users must also be considered bugs against the validation suite, which after all failed to find those bugs. Modifying the validation suite to successfully find those bugs is therefore important, as are independent efforts to make the validation suite more capable. The hope is that modifying the test suite will make it more difficult for bugs to adapt to it.

But even that is insufficient. Blindly adding tests and test cases will eventually bloat your test suite to the point where it is no longer feasible to run all of it. It is therefore also necessary to review test cases and work out how to make them find bugs faster with less hardware, whether by merging tests, running more tests concurrently, or by more vigorously enlisting Mr. Murphy's assistance. It might also be necessary to eliminate test cases that are no longer relevant, for example, now that RCU no longer has a synchronize_rcu_bh(), there is no point in testing it.

In short, the price of robust software is eternal test development.

April 24, 2021 12:02 AM

April 23, 2021

Matthew Garrett: An accidental bootsplash

Back in 2005 we had Debconf in Helsinki. Earlier in the year I'd ended up invited to Canonical's Ubuntu Down Under event in Sydney, and one of the things we'd tried to design was a reasonable graphical boot environment that could also display status messages. The design constraints were awkward - we wanted it to be entirely in userland (so we didn't need to carry kernel patches), and we didn't want to rely on vesafb[1] (because at the time we needed to reinitialise graphics hardware from userland on suspend/resume[2], and vesa was not super compatible with that). Nothing currently met our requirements, but by the time we'd got to Helsinki there was a general understanding that Paul Sladen was going to implement this.

The Helsinki Debconf ended up being an extremely strange event, involving me having to explain to Mark Shuttleworth what the physics of a bomb exploding on a bus were, many people being traumatised by the whole sauna situation, and the whole unfortunate water balloon incident, but it also involved Sladen spending a bunch of time trying to produce an SVG of a London bus as a D-Bus logo and not really writing our hypothetical userland bootsplash program, so on the last night, fueled by Koff that we'd bought by just collecting all the discarded empty bottles and returning them for the deposits, I started writing one.

I knew that Debian was already using graphics mode for installation despite having a textual installer, because they needed to deal with more complex fonts than VGA could manage. Digging into the code, I found that it used BOGL - a graphics library that made use of the VGA framebuffer to draw things. VGA had a pre-allocated memory range for the framebuffer[3], which meant the firmware probably wouldn't map anything else there and hitting those addresses probably wouldn't break anything. This seemed safe.

A few hours later, I had some code that could use BOGL to print status messages to the screen of a machine booted with vga16fb. I woke up some time later, somehow found myself in an airport, and while sitting at the departure gate[4] I spent a while staring at VGA documentation and worked out which magical calls I needed to make to have it behave roughly like a linear framebuffer. Shortly before I got on my flight back to the UK, I had something that could also draw a graphical picture.

Usplash shipped shortly afterwards. We hit various issues - vga16fb produced a 640x480 mode, and some laptops were not inclined to do that without a BIOS call first. 640x400 worked basically everywhere, but meant we had to redraw the art because circles don't work the same way if you change the resolution. My brief "UBUNTU BETA" artwork that was me literally writing "UBUNTU BETA" on an HP TC1100 shortly after I'd got the Wacom screen working did not go down well, and thankfully we had better artwork before release.

But 16 colours is somewhat limiting. SVGALib offered a way to get more colours and better resolution in userland, retaining our prerequisites. Unfortunately it relied on VM86, which doesn't exist in 64-bit mode on Intel systems. I ended up hacking the X.org x86emu into a thunk library that exposed the same API as LRMI, so we could run it without needing VM86. Shockingly, it worked - we had support for 256 colour bootsplashes in any supported resolution on 64 bit systems as well as 32 bit ones.

But by now it was obvious that the future was having the kernel manage graphics support, both in terms of native programming and in supporting suspend/resume. Plymouth is much more fully featured than Usplash ever was, but relies on functionality that simply didn't exist when we started this adventure. There's certainly an argument that we'd have been better off making reasonable kernel modesetting support happen faster, but at this point I had literally no idea how to write decent kernel code and everyone should be happy I kept this to userland.

Anyway. The moral of all of this is that sometimes history works out such that you write some software that a huge number of people run without any idea of who you are, and also that this can happen without you having any fucking idea what you're doing.

Write code. Do crimes.

[1] vesafb relied on either the bootloader or the early stage kernel performing a VBE call to set a mode, and then just drawing directly into that framebuffer. When we were doing GPU reinitialisation in userland we couldn't guarantee that we'd run before the kernel tried to draw stuff into that framebuffer, and there was a risk that that was mapped to something dangerous if the GPU hadn't been reprogrammed into the same state. It turns out that having GPU modesetting in the kernel is a Good Thing.

[2] ACPI didn't guarantee that the firmware would reinitialise the graphics hardware, and as a result most machines didn't. At this point Linux didn't have native support for initialising most graphics hardware, so we fell back to doing it from userland. VBEtool was a terrible hack I wrote to try to re-execute the system's graphics hardware through a range of mechanisms, and it worked in a surprising number of cases.

[3] As long as you were willing to deal with 640x480 in 16 colours

[4] Helsinki-Vantaan had astonishingly comfortable seating for the time


April 23, 2021 07:14 PM

April 19, 2021

Dave Airlie (blogspot): DOOM (Vulkan) + lavapipe

For the fun of it I decided to run some real apps on lavapipe.

Talos Principle is still randomly crashing on startup, but occasionally whatever magic value ends up in uninitialized memory happens to be right and it suddenly runs fine.

I started Rise of the Tomb Raider, and it renders really slowly up to the menu.

Then I gave DOOM 2016 with the Vulkan renderer a go, and with a few lavapipe hacks to enable some feature bits, I managed to get it to load a game image. It's taking 5-6s per frame to render. However, most of the slowness in the frame is the BPTC texture loading, which is a path that I've done no tuning for, so it is definitely running very slowly. I think RoTR is also hitting that slow path so I guess I've some incentive to look at cleaning it up.

 


April 19, 2021 05:58 AM

April 15, 2021

Paul E. Mc Kenney: Stupid RCU Tricks: rcutorture fails to find an RCU bug

I recently took a close look at rcutorture's console output and noticed the following string: rtbf: 0 rtb: 0. The good news is that there were no rcutorture priority-boosting failures (rtbf: 0). The bad news is that this was only because there was no priority-boosting testing (rtb: 0). And as we all know, if it isn't tested, it doesn't work, so this implied bugs in RCU priority boosting itself.

What is RCU Priority Boosting?

If you are running a kernel built with CONFIG_PREEMPT=y, RCU read-side critical sections can be preempted by higher-priority tasks, regardless of whether these tasks are executing kernel or userspace code. If there are enough higher-priority tasks, and especially if someone has foolishly disabled realtime throttling, these RCU read-side critical sections might remain preempted for a good long time. And as long as they remain preempted, RCU grace periods cannot complete. And if RCU grace periods cannot complete, your system has an OOM in its future.

This is where RCU priority boosting comes in, at least in kernels built with CONFIG_RCU_BOOST=y. If a given grace period is blocked only by preempted RCU read-side critical sections, and that grace period is at least 500 milliseconds old (this timeout can be adjusted using the RCU_BOOST_DELAY Kconfig option), then RCU starts boosting the priority of these RCU readers to the level specified by the rcutree.kthread_prio kernel boot parameter, which defaults to FIFO priority 2. RCU does this using one rcub kthread per rcu_node structure. Given a default Kconfig, this works out to one rcub kthread per 16 CPUs.

Why did rcutorture Fail to Test RCU Priority Boosting?

As with many things in life, this happened one step at a time:

  1. A bug I was chasing a few years back reproduced much more quickly if I enabled CPU hotplug on the TREE03 rcutorture scenario.
  2. And in addition, x86 no longer supports configurations where CPUs cannot be hotplugged (mumble mumble security mumble mumble), which means that the rcutorture scripting is always going to test CPU hotplug.
  3. TREE03 was the one scenario that tested RCU priority boosting.
  4. But RCU priority-boost testing assumes that CPU hotplug was disabled. So much so that it would disable itself if CPU-hotplug testing was enabled. Which it now always was.
  5. So RCU priority boosting has gone completely untested for quite a few years.
  6. Quite a few more years back, I learned that firmware sometimes lies about the number of CPUs. I learned this from bug reports noting that RCU was sometimes creating way more kthreads than made any sense on small systems.
  7. So the spawning of kthreads that are per-CPU or per-group-of-CPUs is done at CPU-online time. Which ensures that systems get the right number of RCU kthreads even in the presence of lying firmware. In the case of the RCU boost kthreads, the code verifies that the rcu_node structure in question has at least one online CPU before spawning the corresponding kthread.
  8. Except that it is now quite possible for the incoming CPU to not be fully online at the time that rcutree_online_cpu() executes, in part due to RCU being much more careful about CPU hotplug. This means that the RCU boost kthread will be spawned when the second CPU corresponding to a given rcu_node structure comes online.
  9. Which means that rcu_node structures that have only one CPU never have an RCU boost kthread, and in turn that RCU readers preempted on such CPUs will never be boosted. This problematic situation is unusual, requiring 17, 33, 49, 65, ... CPUs on the system, assuming default RCU kconfig options. But it can be made to happen, especially when using the rcutorture scripting. (--kconfig "CONFIG_NR_CPUS=17" ...)
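
The CPU counts in the last step follow from simple arithmetic, which the sketch below models. It is a toy: it assumes 16 CPUs per leaf rcu_node structure filled in order, ignores the boot CPU and single-CPU systems, and uses invented function names. The count_incoming_cpu flag stands in for the difference between the buggy behaviour (the incoming CPU is not yet counted as online) and the fix described just after this list.

```python
def leaf_sizes(nr_cpus, leaf_fanout=16):
    """CPUs per leaf rcu_node structure, filling leaves in order (toy model)."""
    full, rest = divmod(nr_cpus, leaf_fanout)
    return [leaf_fanout] * full + ([rest] if rest else [])


def leaves_missing_kthread(nr_cpus, count_incoming_cpu, leaf_fanout=16):
    """Indexes of leaves that never get a boost kthread in this model."""
    missing = []
    for index, size in enumerate(leaf_sizes(nr_cpus, leaf_fanout)):
        spawned = False
        fully_online = 0
        for _ in range(size):
            # A CPU is coming online. The kthread is spawned only if the
            # leaf appears to have at least one online CPU at this point.
            visible = fully_online + (1 if count_incoming_cpu else 0)
            if visible >= 1:
                spawned = True
            fully_online += 1
        if not spawned:
            missing.append(index)
    return missing


for nr_cpus in (16, 17, 32, 33):
    print(nr_cpus, leaf_sizes(nr_cpus),
          leaves_missing_kthread(nr_cpus, count_incoming_cpu=False),
          leaves_missing_kthread(nr_cpus, count_incoming_cpu=True))
```

In this model, the only leaves left without a kthread are those holding exactly one CPU, which happens when the CPU count is one more than a multiple of 16.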

The fix is to refactor the creation of rcub kthreads so that a CPU coming online is assumed to eventually make it online, which means that one online CPU suffices to spawn an rcub kthread.

Additional Testing Challenges

The rcu_torture_boost() function required additional rework because CPUs can fail to pass through a quiescent state for some seconds from time to time, and there is nothing that RCU priority boosting can do about this. There are now checks for this condition, and rcutorture refrains from reporting an error in such cases.

Worse yet, this testing proceeds by disabling the aforementioned realtime throttling, then running a FIFO realtime priority 1 kthread on each CPU. This sort of abuse is a great way to break your kernel, yet nothing less abusive will reliably and efficiently test RCU priority boosting. It just so happens that many of RCU's kthreads will do just fine because in this configuration they run at FIFO realtime priority 2. Unfortunately, timers often run in a ksoftirqd kthread, which runs at a non-realtime priority. This means that although RCU's grace-period kthread runs just fine, if it tries to sleep for (say) three milliseconds, it won't awaken until RCU priority boosting testing has completed, which is a great way to force this testing to fail.

Therefore, rcutorture now takes the rude and crude approach of checking to see if it is built into the kernel (as opposed to running as a kernel module), and if so, it forces all of the ksoftirqd kthreads to run at FIFO realtime priority 2. (Needless to say, don't try this at home.)

The usual way to asynchronously determine when a grace period has ended is to post an RCU callback using call_rcu(). Except that in realtime configurations, RCU callbacks are often offloaded to rcuo kthreads. It is the system administrator's responsibility to decide where to run these, and, failing that, the Linux-kernel scheduler's responsibility. Neither of which should be expected to do the right thing in the presence of a full set of CPU-bound unthrottled real-time-priority boost-test kthreads.

Fortunately, RCU now has polling APIs for managing grace periods. The start_poll_synchronize_rcu() function starts a new grace period if needed and returns a “cookie” that can be passed to poll_state_synchronize_rcu(), which will return true if the needed grace period has completed. These functions do not rely on RCU callbacks, and thus will function correctly even if the rcuo kthreads are inauspiciously scheduled, or even if these kthreads are not scheduled at all. Thus, rcutorture's test of RCU priority boosting now uses these two functions.
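
The cookie-based polling pattern can be modeled in userspace as follows. This is only a toy with invented names: the kernel's grace-period sequence numbers are more involved, and here a cookie is simply the number of the next grace period to complete.

```python
class ToyGracePeriods:
    """Toy model of cookie-based grace-period polling (not the kernel's code)."""

    def __init__(self):
        self.completed = 0  # number of grace periods that have ended
        self.requested = 0  # highest grace-period number anyone has asked for

    def start_poll(self):
        """Request a grace period if needed and return a cookie for it."""
        cookie = self.completed + 1
        self.requested = max(self.requested, cookie)
        return cookie

    def poll_state(self, cookie):
        """Return True once the grace period named by the cookie has ended."""
        return self.completed >= cookie

    def end_grace_period(self):
        """Called by the model's grace-period machinery, never by the poller."""
        if self.completed < self.requested:
            self.completed += 1


gp = ToyGracePeriods()
cookie = gp.start_poll()
before = gp.poll_state(cookie)  # grace period still in flight
gp.end_grace_period()
after = gp.poll_state(cookie)   # grace period has now ended
print(before, after)
```

The point of the pattern is that the poller never waits for a callback to be invoked. It only compares its cookie against state that the grace-period machinery updates, so it keeps working no matter where or whether callback-invoking threads get to run.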

With all of this in place, RCU priority boosting lives again!

But untested software does not work, and that includes the tests themselves. Thus, a new BUSTED-BOOST scenario tests RCU priority boosting on a kernel built with CONFIG_RCU_BOOST=n, which does not do RCU priority boosting. This scenario fails within a few tens of seconds, so the test being tested might actually be working!

April 15, 2021 12:57 AM

April 08, 2021

Pavel Machek: Using PinePhone

I was asking at the mailing lists about ofono configuration for PinePhone... and apparently it is not exactly simple to get it to work. (One thing is that there's no "RING" indication on AT channels, and it looks there's more.)

I'm looking for working calls and working SMSes, ideally with ringtones played when SMS arrives. So far postmarketOS with Plasma Mobile was closest... but the UI is really unstable, in what looks like hard to debug way. Is there something closer to working? Right now I guess getting Mobian to work and hacking incoming SMS notifications might be easiest..

April 08, 2021 06:49 PM

April 07, 2021

Dave Airlie (blogspot): lavapipe reporting Vulkan 1.1 (not compliant)

The lavapipe vulkan software rasterizer in Mesa is now reporting Vulkan 1.1 support.

It passes all CTS tests for those new features in 1.1 but it still fails all the same 1.0 tests so isn't that close to conformant. (lines/point rendering are the main areas of issue).

There are also a bunch of the 1.2 features implemented so that might not be too far away though 16-bit shader ops and depth resolve are looking a bit tricky.

If there are any specific features anyone wants to see or any crazy places/ideas for using lavapipe out there, please either file a gitlab issue or hit me up on twitter @DaveAirlie


April 07, 2021 08:22 PM

April 05, 2021

Kees Cook: security things in Linux v5.9

Previously: v5.8

Linux v5.9 was released in October, 2020. Here’s my summary of various security things that I found interesting:

seccomp user_notif file descriptor injection
Sargun Dhillon added the ability for SECCOMP_RET_USER_NOTIF filters to inject file descriptors into the target process using SECCOMP_IOCTL_NOTIF_ADDFD. This lets container managers fully emulate syscalls like open() and connect(), where an actual file descriptor is expected to be available after a successful syscall. In the process I fixed a couple bugs and refactored the file descriptor receiving code.

zero-initialize stack variables with Clang
When Alexander Potapenko landed support for Clang’s automatic variable initialization, it did so with a byte pattern designed to really stand out in kernel crashes. Now he’s added support for doing zero initialization via CONFIG_INIT_STACK_ALL_ZERO, which besides actually being faster, has a few behavior benefits as well. “Unlike pattern initialization, which has a higher chance of triggering existing bugs, zero initialization provides safe defaults for strings, pointers, indexes, and sizes.” Like the pattern initialization, this feature stops entire classes of uninitialized stack variable flaws.

common syscall entry/exit routines
Thomas Gleixner created architecture-independent code to do syscall entry/exit, since much of the kernel’s work during a syscall entry and exit is the same. There was no need to repeat this in each architecture, and having it implemented separately meant bugs (or features) might only get fixed (or implemented) in a handful of architectures. It means that features like seccomp become much easier to build since it wouldn’t need per-architecture implementations any more. Presently only x86 has switched over to the common routines.

SLAB kfree() hardening
To reach CONFIG_SLAB_FREELIST_HARDENED feature-parity with the SLUB heap allocator, I added naive double-free detection and the ability to detect cross-cache freeing in the SLAB allocator. This should keep a class of type-confusion bugs from biting kernels using SLAB. (Most distro kernels use SLUB, but some smaller devices prefer the slightly more compact SLAB, so this hardening is mostly aimed at those systems.)
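
To see why such detection is called naive, here is a toy userspace model. It assumes the check only compares the object being freed against the most recently freed one; the class and its names are invented for illustration and are not the allocator's code.

```python
class ToyFreeArray:
    """Toy free list that only checks against the most recently freed object."""

    def __init__(self):
        self.freed = []

    def free(self, obj):
        # Naive check: catches only an immediate repeat of the last free.
        if self.freed and self.freed[-1] == obj:
            raise RuntimeError("double free detected")
        self.freed.append(obj)


cache = ToyFreeArray()
cache.free("a")
try:
    cache.free("a")  # immediate double free: caught
except RuntimeError as err:
    print(err)

other = ToyFreeArray()
for obj in ("a", "b", "a"):  # interleaved double free: slips through
    other.free(obj)
print(other.freed)
```

A check like this costs almost nothing per free, which is the trade-off: it catches the simplest mistakes and leaves interleaved double frees undetected.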

new CAP_CHECKPOINT_RESTORE capability
Adrian Reber added the new CAP_CHECKPOINT_RESTORE capability, splitting this functionality off of CAP_SYS_ADMIN. The need for the kernel to correctly checkpoint and restore a process (e.g. used to move processes between containers) continues to grow, and it became clear that the security implications were lower than those of CAP_SYS_ADMIN yet distinct from other capabilities. Using this capability is now the preferred method for doing things like changing /proc/self/exe.
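
A process can check whether it holds the new capability by decoding the hexadecimal CapEff mask from /proc/self/status. The sketch below is illustrative rather than canonical: the helper names are made up, and the capability numbers (21 for CAP_SYS_ADMIN, 40 for CAP_CHECKPOINT_RESTORE) are taken from the kernel's uapi capability header.

```python
CAP_SYS_ADMIN = 21
CAP_CHECKPOINT_RESTORE = 40


def has_capability(cap_mask_hex, cap):
    """Test one capability bit in a hex mask such as the CapEff field."""
    return bool((int(cap_mask_hex, 16) >> cap) & 1)


def effective_caps_of_self(status_path="/proc/self/status"):
    """Return this process's CapEff mask as a hex string (Linux only)."""
    with open(status_path) as status:
        for line in status:
            if line.startswith("CapEff:"):
                return line.split()[1]
    return None


# A mask with only CAP_CHECKPOINT_RESTORE set, for illustration.
only_cr = format(1 << CAP_CHECKPOINT_RESTORE, "016x")
print(only_cr, has_capability(only_cr, CAP_CHECKPOINT_RESTORE),
      has_capability(only_cr, CAP_SYS_ADMIN))
```

On a Linux system, has_capability(effective_caps_of_self(), CAP_CHECKPOINT_RESTORE) would report whether the calling process currently has the capability in its effective set.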

debugfs boot-time visibility restriction
Peter Enderborg added the debugfs boot parameter to control the visibility of the kernel’s debug filesystem. The contents of debugfs continue to be a common area of sensitive information being exposed to attackers. While this was effectively possible by unsetting CONFIG_DEBUG_FS, that wasn’t a great approach for system builders needing a single set of kernel configs (e.g. a distro kernel), so now it can be disabled at boot time.

more seccomp architecture support
Michael Karcher implemented the SuperH seccomp hooks, Guo Ren implemented the C-SKY seccomp hooks, and Max Filippov implemented the xtensa seccomp hooks. Each of these included the ever-important updates to the seccomp regression testing suite in the kernel selftests.

stack protector support for RISC-V
Guo Ren implemented -fstack-protector (and -fstack-protector-strong) support for RISC-V. This is the initial global-canary support while the patches to GCC to support per-task canaries are getting finished (similar to the per-task canaries done for arm64). This will mean nearly all stack frame write overflows are no longer useful to attackers on this architecture. It’s nice to see this finally land for RISC-V, which is quickly approaching architecture feature parity with the other major architectures in the kernel.

new tasklet API
Romain Perier and Allen Pais introduced a new tasklet API to make their use safer. Much like the timer_list refactoring work done earlier, the tasklet API is also a potential source of simple function-pointer-and-first-argument controlled exploits via linear heap overwrites. It’s a smaller attack surface since it’s used much less in the kernel, but it is the same weak design, making it a sensible thing to replace. While the use of the tasklet API is considered deprecated (replaced by threaded IRQs), it’s not always a simple mechanical refactoring, so the old API still needs refactoring (since that CAN be done mechanically in most cases).

x86 FSGSBASE implementation
Sasha Levin, Andy Lutomirski, Chang S. Bae, Andi Kleen, Tony Luck, Thomas Gleixner, and others landed the long-awaited FSGSBASE series. This provides task switching performance improvements while keeping the kernel safe from modules accidentally (or maliciously) trying to use the features directly (which exposed an unprivileged direct kernel access hole).

filter x86 MSR writes
While it’s been long understood that writing to CPU Model-Specific Registers (MSRs) from userspace was a bad idea, it has been left enabled for things like MSR_IA32_ENERGY_PERF_BIAS. Boris Petkov has decided enough is enough and has now enabled logging and kernel tainting (TAINT_CPU_OUT_OF_SPEC) by default and a way to disable MSR writes at runtime. (However, since this is controlled by a normal module parameter and the root user can just turn writes back on, I continue to recommend that people build with CONFIG_X86_MSR=n.) The expectation is that userspace MSR writes will be entirely removed in future kernels.

uninitialized_var() macro removed
I made treewide changes to remove the uninitialized_var() macro, which had been used to silence compiler warnings. The rationale for this macro was weak to begin with (“the compiler is reporting an uninitialized variable that is clearly initialized”) since it was mainly papering over compiler bugs. However, it creates a much more fragile situation in the kernel since now such uses can actually disable automatic stack variable initialization, as well as mask legitimate “unused variable” warnings. The proper solution is to just initialize variables the compiler warns about.

function pointer cast removals
Oscar Carter has started removing function pointer casts from the kernel, in an effort to allow the kernel to build with -Wcast-function-type. The future use of Control Flow Integrity checking (which does validation of function prototypes matching between the caller and the target) tends not to work well with function casts, so it’d be nice to get rid of these before CFI lands.

flexible array conversions
As part of Gustavo A. R. Silva’s on-going work to replace zero-length and one-element arrays with flexible arrays, he has documented the details of the flexible array conversions, and the various helpers to be used in kernel code. Every commit gets the kernel closer to building with -Warray-bounds, which catches a lot of potential buffer overflows at compile time.

That’s it for now! Please let me know if you think anything else needs some attention. Next up is Linux v5.10.

© 2021, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.

April 05, 2021 11:24 PM

Dave Airlie (blogspot): crocus: gallium for the gen4-7 generation

The crocus project was recently mentioned in a phoronix article. The article covered most of the background for the project.

Crocus is a gallium driver to cover the gen4-gen7 families of Intel GPUs. The basic GPU list is 965, GM45, Ironlake, Sandybridge, Ivybridge and Haswell, with some variants thrown in. This hardware currently uses the Intel classic 965 driver. This hardware is all gallium capable, and since we'd like to put the classic drivers out to pasture and remove support for the old infrastructure, it would be nice to have these generations supported by a modern gallium driver.

The project was initiated by Ilia Mirkin last year, and I've expended some time in small bursts on moving it forward. There have been some other small contributions from the community. The basis of the project is a fork of the iris driver with the old relocation based batchbuffer and state management added back in. I started my focus mostly on the older gen4/5 hardware since it was simpler and only supported GL 2.1 in the current drivers. I've tried to clean up support for Ivybridge along the way.

The current status of the driver is in my crocus branch.

Ironlake is the best supported: it runs openarena and supertuxkart, piglit has only around 100 tests delta vs i965 (mostly edgeflag related), and there is only one missing feature (vertex shader push constants).

Ivybridge has just stopped hanging on the second batch submission, and glxgears now runs on it. Openarena starts to the menu but is misrendering, and a piglit run completes with some gpu hangs and a quite large delta. I expect IVB to move faster now that I've solved the worst hang.

Haswell runs glxgears as well.

I think once I take a closer look at Ivybridge/Haswell and can get Ilia (or anyone else) to do some rudimentary testing on Sandybridge, I will start taking a closer look at upstreaming it into Mesa proper.


April 05, 2021 02:38 AM

March 31, 2021

Paul E. Mc Kenney: Stupid RCU Tricks: So rcutorture is Still Not Aggressive Enough For You?

An earlier post discussed ways of making rcutorture more aggressive, but even with these techniques, rcutorture's level of aggression is limited by build time on the one hand and the confines of a single system on the other. This post describes some recent ways around those limitations.

Play It Again, Sam!

A full rcutorture run will do about 20 kernel builds, which can take some tens of minutes or, on slower systems, well over an hour. This can be extremely annoying when you simply want to re-run the last test in order to obtain better failure statistics or to get more test time on a recent bug fix.

The traditional rcutorture way of avoiding rebuilds is to optionally edit the qemu-cmd files for each scenario to be re-run, then manually invoke sh on each resulting file. The editing step allows you to avoid overwriting the previous run's console output, but may be omitted if you don't care about that console output or if you have already saved it off somewhere. This works, but is painstaking and error-prone.

This is where the new kvm-again.sh script comes in. Its first argument is the path to the directory for the old run, for one example on my laptop, tools/testing/selftests/rcutorture/res/2021.03.31-10.52.56. This can be a relative pathname as in this example, but use of absolute pathnames can make your life easier when reviewing output from prior kvm-again.sh runs. By default, the new run will have the same duration as the old run, but the --duration argument may be used to specify the new run's duration. Also by default, kvm-again.sh will generate the new run's directory based on the current date and time (suffixed with -again), but the --rundir argument may be used to specify some other location. Finally, and again by default, hard links are used to “copy” the needed data from the old run directory (such as the Linux kernel), but the --link argument can be used to specify soft links or explicit copy operations. The full set of scenarios generates some 20 kernels, each of which is somewhat larger than they would have been in the past. You may therefore need to exercise some caution when using --link copy, especially if you are doing repeated kvm-again.sh runs.
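Why hard links are the cheap default can be seen in a minimal Python sketch. It is not part of rcutorture, and the file names are made up for illustration. A hard link is just a second name for the same inode, so the “copy” consumes no additional data blocks, whereas an explicit copy (as with --link copy) duplicates the data:

```python
import os
import shutil
import tempfile


def link_vs_copy(size=1024 * 1024):
    """Create a file, hard-link it and copy it; report which ones share an inode."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical stand-in for a kernel image built by an earlier run.
        original = os.path.join(tmp, "vmlinux")
        hard = os.path.join(tmp, "vmlinux.hardlink")
        copy = os.path.join(tmp, "vmlinux.copy")

        with open(original, "wb") as f:
            f.write(b"\0" * size)

        os.link(original, hard)          # a new name for the same inode
        shutil.copyfile(original, copy)  # a second, independent set of data blocks

        ino = os.stat(original).st_ino
        return {
            "hardlink_shares_inode": os.stat(hard).st_ino == ino,
            "copy_shares_inode": os.stat(copy).st_ino == ino,
            "link_count": os.stat(original).st_nlink,
        }


result = link_vs_copy()
print(result)
```

With some 20 kernels per run, the difference between one shared inode and a full duplicate per kernel adds up quickly across repeated runs.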

The re-run file in the new run directory gives the pathname of the old run directory. Although you can give a run directory produced by a prior kvm-again.sh invocation to a later kvm-again.sh invocation, best practice is to continue specifying the original run directory. If nothing else, following this best practice avoids ever-growing qemu-cmd files.

Of course, the shorter the runs, the greater an advantage kvm-again.sh provides. In the extreme case, it can be amazingly helpful when testing for rare boot-time failures.

Strength in Numbers

It seems likely that there are quite a few more people with access to eight 16-CPU systems than there are people with access to a single 128-CPU system. You can of course run kvm.sh on each of eight 16-CPU systems, but working out which scenarios to run on each of those systems can be time-consuming and error-prone. And this is why the new kvm-remote.sh script exists.

Build or Buy?

This script can be invoked in two different modes. In both cases, the first argument is a quoted list of system names, as in names that the ssh command understands. Specifying localhost or any of its synonyms might work, but is an option for the brave at this point. Should this prove useful, it will be taken care of in a later version of this script.

The first form builds all needed kernels on the system on which the kvm-remote.sh script is run. In this case, the second and subsequent arguments can be anything accepted by the kvm.sh script.

In the second form, the second and subsequent arguments must be suitable for the kvm-again.sh script, that is, the second argument must specify the path to an old run directory and the third and subsequent arguments can be --duration, --rundir, and --link.

In both forms, once the kernels are available, a tarball of all scenarios is downloaded to all of the systems. Each such download is run sequentially, which means that downloading can take significant time, especially if low-bandwidth network links are involved. Once all systems have had the tarball downloaded and expanded, batches of scenarios are parceled out among the systems specified by the first argument. If there are more batches than there are systems, once a system completes its current batch, it will be given another batch.
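The "give the next batch to whichever system finishes first" policy described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the logic of kvm-remote.sh itself: the system names and per-batch durations are hypothetical, and the simulation ignores download and upload time.

```python
import heapq


def parcel_out(systems, batch_minutes):
    """Hand out batches in order; the system that becomes idle first gets the next one.

    Returns (assignment, makespan): assignment maps each system name to the list
    of batch indexes it ran, makespan is the simulated time the last batch ends.
    """
    # Each heap entry is (time at which the system becomes idle, system name).
    idle = [(0, name) for name in systems]
    heapq.heapify(idle)
    assignment = {name: [] for name in systems}

    for index, minutes in enumerate(batch_minutes):
        free_at, name = heapq.heappop(idle)
        assignment[name].append(index)
        heapq.heappush(idle, (free_at + minutes, name))

    makespan = max(free_at for free_at, _ in idle)
    return assignment, makespan


# Two hypothetical systems, four batches with made-up durations in minutes.
assignment, makespan = parcel_out(["sys1", "sys2"], [30, 10, 10, 10])
print(assignment, makespan)
```

In this made-up case one system spends the whole run on the long batch while the other works through the three short ones, which is why handing out work on completion tends to balance better than a fixed up-front split.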

Once all batches have completed, the results from each system are uploaded back to the system running the kvm-remote.sh script, where the usual end-of-run error-checking and analysis is carried out.

This script assumes that all systems have the same number of CPUs. Addressing this limitation is future work. In the meantime, one workaround is to do multiple --buildonly runs of kvm.sh, one for each type of system. Then multiple runs of the second form of the kvm-remote.sh script can safely be run concurrently on the same build system. Because all the pre-built kernels for each type of system are safely collected up in the corresponding old-run directory, the multiple invocations of kvm-remote.sh will not interfere with each other.

Why ssh?

The kvm-remote.sh script uses ssh to do all downloading, control, and uploading operations. This might seem to be a poor choice in this age of Kubernetes and friends, but the fact remains that ssh is widely available, easy to configure, and reasonably robust. In contrast, there is a wide variety of Kubernetes-like systems, and they can be configured in a wide variety of ways. It would be impossible to choose just one of these systems, and it would be quite difficult to accommodate all of the configurations, versions, and variants of even one of them.

However, please note that kvm-remote.sh assumes that all of the systems have been set up properly. This means that low-level virtualization support must be in place, and it also means that running an ssh command to any of the specified systems must complete without the need for any human interaction. For example, if ssh foo date does not open a connection to system foo, run the date command, and print the result without any need to type any sort of password or passphrase, then system foo is not yet set up properly.
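One way to check this ahead of time is sketched below in Python. It is not something kvm-remote.sh provides. The helper names are hypothetical, and the sketch relies on the standard OpenSSH client option BatchMode=yes, which makes ssh fail instead of prompting for a password or passphrase.

```python
import subprocess


def ssh_check_command(host):
    """Build an ssh command line that fails rather than prompting for input."""
    # BatchMode=yes disables password/passphrase prompts; ConnectTimeout bounds the wait.
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10", host, "date"]


def is_set_up(host):
    """Return True if 'ssh host date' completes with no human interaction."""
    try:
        done = subprocess.run(
            ssh_check_command(host),
            stdin=subprocess.DEVNULL,   # nothing to type, so nothing can be asked
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # Success means ssh exited cleanly and the remote date command printed something.
    return done.returncode == 0 and done.stdout.strip() != b""
```

Calling is_set_up("foo") for each name in the system list before a long run gives an early warning about any system that would otherwise stall waiting for a passphrase.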

Similarly, kvm-remote.sh does not take any actions that might be necessary to reserve system foo for your exclusive use, nor does it do anything to release this system upon completion of the test. Thus, these system-configuration, reservation, and release operations are jobs for which you may wish to enlist the help of Kubernetes or of similar frameworks. For example, I use (admittedly crude) scripts that interact with Facebook's internal environment to reserve and configure the desired number and type of systems, invoke kvm-remote.sh once everything is set up, and then release those systems.

What Might The Future Hold?

Although the kvm-remote.sh approach of using ssh works reasonably well on a few tens of systems, if someone wanted to run rcutorture on thousands of systems, something else would likely be required. On the other hand, there are not that many sites where one would reasonably devote anywhere near that many systems to rcutorture. There might be downloading improvements at some point, most likely in the form of allowing a script to be provided to allow kvm-remote.sh to use some site-specific optimized multi-system download utility. Both kvm-again.sh and kvm-remote.sh might someday need a way to specify that only a subset of a prior run's scenarios be re-run, for example, to chase down a bug that occurred in only a few of those scenarios.

And as mentioned earlier, perhaps a future version of kvm-remote.sh will gracefully handle remote systems with varying numbers of CPUs or running actual tests on the system running the kvm-remote.sh script.

But if things go as they usually do, a fair fraction of the future changes will come as complete surprises.

March 31, 2021 11:29 PM

March 30, 2021

James Bottomley: Owning Your Own Copyrights in Open Source

This article covers several aspects: owning the copyrights you develop outside of your employed time and the more thorny aspect of owning the copyrights in open source projects you work on for your employer. It will also take a look at the middle ground of being a contract entity doing paid work on open source. This article follows the historical sweep of my journey through this field, so some aspects may be outdated and all are within the bounds of the US legal system. It’s most certainly not complete, just a description of what I did and what I learned.

Why Should you Own your Own Source code?

In the early days of open source, everything was a hobby project and everyone owned their own contributions. Owning your own contribution was a sort of mark of franchise in the project. Of course, there were some projects, notably the FSF ones, which didn’t believe in distributed ownership and insisted you contribute ownership of your copyrights to them so they could look after the project for you. Obviously, since I’m a Linux Kernel developer and with the Linux Kernel being a huge distributed copyright project, it’s easy to see which side of the argument I fall on.

The main rights you give up if you don’t own the code you create are the right to re-licence and the right to enforce. It probably hadn’t occurred to you that if you actually find a licence violation in a project you contribute to for your employer, you’ll have no standing to demand that the problem get addressed. In fact, any enforcement on the code would have to be done by the proper owner: your employer. Plus your employer can control the ultimate destination of that ownership, including selling your code to a copyright troll if they so wished … while you may trust your employer now you work for them, do you trust them to do the right thing for all time, especially since they may be bought out by EvilCorp on down the road?

The relicensing problem can also be thorny: as a strong open source contributor you’ve likely been on the receiving end of requests to relicense (“I really like the code in your project X and would like to incorporate it in my open source project Y, but there’s a licence compatibility problem, would you dual license it?”) and thought nothing about saying “yes”. However, if your employer owns the code, you were likely lying when you said “yes” because you have no relicensing rights and you must ask your employer for permission to do the relicensing.

All the above points up the dangers in the current ecosystem. Project contributors often behave like they own the code but if they don’t they can be leaving a legal minefield in their wakes. The way to fix this is to own your own code … or at least understand the limitations of your rights if you don’t.

Open Source in Your Own Time

It’s a mistake to think that just because you work on something in your own time it isn’t actually owned by your employer. Historically, at least in the US, employment agreements contain incredibly broad provisions for invention ownership which basically try to claim anything you invent at any hour of the day or night that might be even vaguely related to your employment. Not unnaturally this caused huge volumes of litigation around startups where former employees successfully develop innovations their prior employer declined to pursue (at least until it started making money). This has led to a slew of state based legal safe harbour protections for employee inventions. Most of them, like the Illinois Statute I first used, have similar wording:

A provision in an employment agreement which provides that an employee shall assign or offer to assign any of the employee’s rights in an invention to the employer does not apply to an invention for which no equipment, supplies, facilities, or trade secret information of the employer was used and which was developed entirely on the employee’s own time … is … void and unenforceable.

765 ILCS 1060/2

In fact most states now require the wording to appear in the employment contract, so you likely don’t have to look up the statute to figure out what to do. The biggest requirements are that it be on your own time and you not be using any employer equipment, so the most important thing is to make sure you have your own laptop or computer. If you follow the requirements to the letter, you should be safe enough in owning your own time open source code. However, if you really want a guarantee you need to take extra precautions.

Own Time Open Source Carve Outs in Employment agreements

When you join a company, one of the things you’ll sign is a prior invention disclosure form, usually as an appendix to the invention assignment agreement as part of your employment contract. Here’s an example one from the SEC database (ironically for a Chinese subsidiary). Look particularly at section 2(a) “Inventions Retained and Licensed”. It’s basically pure CYA for the company, and most people leave Exhibit A blank, but you shouldn’t do that. What you should do is list all your current and future (by doing sweeping guesswork) own open source projects. The most useful clause in 2(a) says “I agree that I will not incorporate any Prior Inventions into any products …” so you and your employer have now agreed that all the listed projects are outside the scope of your employment agreement.

As far as I can tell, no-one really looks at Exhibit A at all, so I’ve been really general and put things like “The Linux Kernel”, “Open Source UEFI software” and “Open Source cryptography such as gnupg, openssl and gnutls” and never been challenged on it.

One legitimate question, which will probably arise if your carve outs are very broad, is what happens if your employer specifically asks you to work on a project you’ve declared in Exhibit A? Ideally you could use this as an opportunity to negotiate an addendum to your contract covering your ownership of open source. However, if you don’t want to rock the boat, you can simply do nothing and rely on the fact that the agreement has something to say about this. The sample section 2(a) above goes on to give your employer a non-exclusive licence, which you could take as agreement to your continued ownership of the copyrights in the code, even though your employer is now instructing you (and paying you) to work on it. However, the say-nothing approach has never been tested in court and may be vulnerable to challenge, so a safer course is to send your manager an email pointing out the issue and proposing to follow the licence in the employment contract. If they do nothing, thinking the matter settled, as most managers do, then you have legal cover for continuing to own your own copyrights. You can make it as vague as you like, so using the above sample agreement, something like “You’ve asked me to work on Project X which was listed in Exhibit A of my employment agreement. To move forward, I’m happy to licence all future works on this project to you under the terms of section 2(a)”. It looks innocuous, but it’s actually a statement that your company doesn’t get copyright ownership, because the actual wording in section 2(a) says the company gets a non-exclusive licence if you incorporate any works listed in Exhibit A. Remember to save the email somewhere safe (and any reply, which is additional proof it was seen) just in case.

Owning Open Source Produced on Company Time

The first thing to note is that if your employer pays for you to work on open source, absent any side agreement, the code that you produce will be owned by your employer. This isn’t some US specific thing, this is a general principle of employment the world over (they pay you, so they own it). So even if you work in Europe, your employer will still own your open source copyrights if they pay you to work on the project, moral rights arguments notwithstanding. The only way to change this is to get some sort of explicit or implicit (if you want to go the carve out route above) agreement about the ownership.

Although I’ve negotiated both joint and exclusive ownership of open source via employment agreements, the actual agreements are still the property of the relevant corporations and thus, unfortunately, while I can describe some of the elements, I can’t publish the text (employment agreements are the crown jewels the HR dragons guard).

How to Negotiate

Most employers (or at least their lawyers) will refuse point blank to change the wording of employment agreements. However, what you want can be a side agreement and usually doesn’t require rewording the employment agreement at all. All you need is the understanding that the side agreement will get executed. One big problem can be that most negotiations over employment agreements occur with people from HR, which is a department with the least understanding of open source, so you don’t want to be negotiating the side agreement with them, you want to talk to the person that is hiring you. You also need to present your request as reasonable, so find out if anyone inside your prospective new company has done something similar. Often they have, and they’ll likely be someone in open source you’ve at least heard of so you can approach them and ask for details. “But you gave a copyright ownership side agreement to X” is often a great way to advance your cause. Don’t be afraid to ask and argue politely but firmly … hiring talented developers is very competitive nowadays so they have (or at least the manager who wants to hire you has) a vested interest in keeping you happy.

Consider Joint Ownership

Joint ownership is a specific legal term meaning the rights in a copyright are shared by the joint owners. Effectively this sharing means that either party may enforce without consulting the other and either party may license the work without consulting the other (but here they must share any profits from the licence equally among joint owners).

Joint ownership is often a good solution because it gives you the right to relicence and the right to enforce, while also giving your employer a share in what they paid to produce. Joint ownership is often far easier to sell to corporations than one or other of you having exclusive ownership because it gives them all the rights they would have had anyway. The only slight concern you may have down the road is it does give them the right to relicence or sell on their ownership, say to an open core business or to an enforcement troll. However, the good news is that as joint owner you now have a right to a half share of any profit they (and the new owner) make out of such a rights transfer, which can potentially act as a deterrent to the transaction if you remind them of this requirement.

Open Source as a Contractor

In some ways this is the best relationship. There are no work for hire assumptions about companies you contract for owning your free time, so doing other open source projects is easy. However, a contractor is bound by whatever contract you sign, so you need someone with legal training to help you make sure it is actually equitable. You can’t get around this legal requirement: the protections that exist for employees don’t exist for contractors, so if you sign a contract saying in exchange for a certain sum company X owns the entirety of your output, you will be bound by it. So remember: read the contract and negotiate the terms.

Copyright Ownership as a Contractor

Surprisingly, in a relationship where you’re contracted to get something upstream, it’s often in the client’s best interest to have the contractor own the copyrights in Open Source. It means the contractor is responsible for all the nitty gritty of pushing patches and dealing with contribution agreements and the client simply gets the end product: the thing they wanted upstream. I’ve found this a surprisingly easy sell to most legal departments. Even if the client does want some sort of ownership of the code, you can offer joint ownership as the easy route to you taking on all the hassle and them getting the benefits of ownership.

Trade Secrets

As a contractor, you’ll likely be forced to sign an NDA never to reveal client secrets. This is pretty usual, but the pitfall in open source, particularly if you’re doing a driver for a device whose programming manual is under NDA, is that you are going to be revealing them contrary to the NDA. You need this handled in an equitable fashion in the contract to avoid unpleasant problems long after the job is done. The simplest phrase you need is something like “Client understands that open source is developed in public and authorizes that all information necessary to producing X under this contract be disclosed to the public”.

Patents

Patents can be a huge minefield with contract open source, because as a contractor who owns the copyrights and negotiates the contribution agreements, you have no authority to bind your client’s patents. You really don’t want to find yourself being used as a conduit for a patent ambush on open source (where a client contracts with you to put code into a project which reads on a patent they hold and then turns around and patent trolls the ecosystem) so you need contract language binding the client patents at least in the work you’re doing for them. Something simple like “Client grants a perpetual and irrevocable licence, consistent with the terms of the open source licence for X, to all contributions made by contractor to X that read on patents client holds now or may in future acquire”. This latter is pretty narrow, so you could start out by trying to get a patent licence for the entirety of project X and negotiate down from there.

Conclusions

Owning your own copyrights in open source is possible provided you’re careful. The strategies outlined above are based on my own experiences (all in the US) as a contract employee from 1995-2008 and thereafter as a regular employee, but they are not the only ones you could pursue, so ask around to see what others have done as well. The main problem with all the strategies above is that they work well when you’re negotiating your employment. If you’re already working at some corporation they’re unlikely to be helpful to you unless you really have a simple own time open source project. Oh, and just remember that while the snippets I quoted above for the contract case may actually have been in contracts I signed, this isn’t legal advice and you should have a lawyer advise you how best to incorporate the various points raised.

March 30, 2021 11:18 PM

March 25, 2021

Dave Airlie (blogspot): sketchy vulkan benchmarks: lavapipe vs swiftshader

Mike, the zink dev, mentioned that swiftshader seemed slow at some stuff and I realised I've never expended much effort in checking swiftshader vs llvmpipe in benchmarks.

The thing is CPU rendering is pretty much going to top out on memory bandwidth pretty quickly but I decided to do some rough napkin benchmarks using the vulkan samples from Sascha Willems.

I'd also thought that, since swiftshader has a few devs and is used instead of mesa by google for lots of things, llvmpipe would be slower, as it hasn't really gotten dedicated development resources.

I picked a random smattering of Vulkan samples and ran them on my Ryzen workstation without doing anything else, in their default window size.

The first number is lavapipe fps, the second swiftshader.

I guess the swift is just a good marketing name. Now I'm not sure why llvmpipe/lavapipe isn't more of a development target for those devs; imagine how much better it could be if it had full-time dedicated devs on it.

March 25, 2021 09:08 PM

Paul E. Mc Kenney: Parallel Programming: Second Edition

The second edition of “Is Parallel Programming Hard, And, If So, What Can You Do About It?” is now available. I have no plans to create a dead-tree version, but I have no objection to others doing so, whether individually or in groups.

Big-animal changes over the First Edition include:


  1. A full rewrite of the memory-barriers section, which is now its own chapter. This new chapter includes discussion of the Linux-kernel memory model, courtesy of Akira Yokosawa, who kindly pulled in the LWN article.
  2. A number of new tools have been added to the formal-verification chapter.
  3. A new section on SMP real-time programming.
  4. The “Tools of the Trade” chapter has been dragged kicking and screaming into the 2020s, courtesy of Akira Yokosawa, Junchang Wang, and Slavomir Kaslev.
  5. Hyperlinking between quizzes and answers, courtesy of Paolo Bonzini and Akira Yokosawa.
  6. Improved formatting and build system, courtesy of Akira Yokosawa.
  7. Bibliographic facelift, courtesy of Stamatis Karnouskos and Akira Yokosawa.
  8. Grammatical fixes from a great many people, but especially from translators SeongJae Park and Motohiro Kanda.
  9. Several new cartoons.
  10. Performance results from a system with hundreds of CPUs, courtesy of my employer, Facebook.
  11. Substantial updates pretty much everywhere else. (Yes, this might be the first time in a long time that I read through the entire book. Why do you ask?)

Contributors include Akira Yokosawa; SeongJae Park; Junchang Wang; Borislav Petkov; Stamatis Karnouskos; Palik, Imre; Paolo Bonzini; Praveen Kumar; Tobias Klauser; Andreea-Cristina Bernat; Balbir Singh; Bill Pemberton; Boqun Feng; Emilio G. Cota; Namhyung Kim; Andrew Donnellan; Dominik Dingel; Igor Dzreyev; Pierre Kuo; Yubin Ruan; Chris Rorvick; Dave; Mike Rapoport; Nicholas Krause; Patrick Marlier; Patrick Yingxi Pan; Slavomir Kaslev; Zhang, Kai; and Zygmunt Bazyli Krynicki. On behalf of all who read this book, I thank you all for all you did to help make this second edition a reality!

March 25, 2021 03:36 AM

Pete Zaitcev: A small billion-object Swift cluster

In the latest of Swift numbers: talked to someone today who mentioned that they have 1,025,311,000 objects, or almost exactly a billion. They are spread over only 480 disks. That is, if my arithmetic is correct, 2,000 times smaller than Amazon S3 was in 2013. But hey, not everyone is S3. And they aren't having any particular problems, things just work.
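The arithmetic can be checked with a few lines of Python. The object and disk counts come from the post. The S3 figure is an assumption that is not stated there: roughly two trillion objects, the number Amazon published in 2013.

```python
objects = 1_025_311_000              # from the post
disks = 480                          # from the post
s3_objects_2013 = 2_000_000_000_000  # assumption: about two trillion objects in 2013

objects_per_disk = objects / disks
ratio_to_s3 = s3_objects_2013 / objects

print(f"{objects_per_disk:,.0f} objects per disk")
print(f"S3 in 2013 was about {ratio_to_s3:,.0f} times larger")
```

Under that assumption the cluster holds a little over two million objects per disk, and the ratio comes out slightly below 2,000, in line with the estimate above.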

March 25, 2021 01:08 AM

March 24, 2021

Pete Zaitcev: ~avg on NoSQL

Just saving it from LinkedIn:

The real difference between SQL-based (and other relational databases) and NoSQL glorified KV stores is the presence of algebraic structure (i.e. Codd algebra). Algebra is basically all about transformations between equivalent expressions to arrive at a desirable form (i.e. simplified, or factorized, or whatever the goal is). These transformations have another name: optimizations.

Basically, when you have a real SQL database, you have the ability to optimize execution plans, which could easily yield orders of magnitude of improvement in performance.

(And, yes, modern relational databases (i.e. Snowflake) do internally convert semi-structured data into tabular form so that the optimizations are applicable to these as well).
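The point about execution plans can be made concrete with a small sketch using Python's built-in sqlite3 module. SQLite is used here only because it ships with Python; the table and index names are made up for illustration. The query text never changes, yet the planner picks a different, equivalent way to execute it once an index exists:

```python
import sqlite3


def query_plan(conn, sql):
    """Return the planner's description of how it would execute sql."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # Each row is (id, parent, notused, detail); detail is the readable part.
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (k INTEGER, v TEXT)")  # hypothetical table
conn.executemany("INSERT INTO kv VALUES (?, ?)", [(i, str(i)) for i in range(1000)])

sql = "SELECT v FROM kv WHERE k = 500"
plan_before = query_plan(conn, sql)   # no index yet: the whole table is scanned

conn.execute("CREATE INDEX kv_k ON kv (k)")
plan_after = query_plan(conn, sql)    # same query, now answered by an index search

answer = conn.execute(sql).fetchall()
print(plan_before)
print(plan_after)
```

The exact wording of the plan text varies between SQLite versions, but the first plan is a table scan and the second a search using the index, while the query returns the same row either way.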

If I had something to say about this, it would be something about stable, dependable performance having a value of its own. That is why TokyoCabinet was such a revelation and prompted the NoSQL revolution, which later ended with Mongo and reaction, like any revolution. But this is not my field, so let's just save it for future reference.

March 24, 2021 12:46 AM

March 22, 2021

Michael Kerrisk (manpages): man-pages-5.11 is released

Alex Colomar and I have released man-pages-5.11. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

This release resulted from patches, bug reports, reviews, and comments from around 40 contributors. A number of wide-ranging global edits by Alex and me have resulted in one of the largest releases since I became involved with man-pages some 20 years ago. The release includes around 480 commits that changed around 950 (more than 90% of the) manual pages. The diff runs to more than 50k lines (which makes it the third largest release measured by lines changed).

The most notable of the changes in man-pages-5.11 are the following:

March 22, 2021 10:27 AM

March 18, 2021

Linux Plumbers Conference: CFP Open – Microconferences

We are pleased to announce the Call for Microconferences for the 2021 edition of the Linux Plumbers Conference, which we plan to hold in Dublin, Ireland, in the last week of September in conjunction with The Linux Foundation Open Source Summit. If an in-person conference should prove to be impossible due to the circumstances at that time, Linux Plumbers will switch to a virtual only conference. Microconference runners should ideally be able to attend in person if circumstances permit, although arrangements may be possible to do so remotely. Please see our website or social media for regular updates.

A microconference is a collection of collaborative sessions focused on problems in a particular area of Linux plumbing, which includes the kernel, libraries, utilities, services, UI, and so forth, but can also focus on cross-cutting concerns such as security, scaling, energy efficiency, toolchains, container runtimes, or a particular use case. Good microconferences result in solutions to these problems and concerns, while the best microconferences result in patches that implement those solutions.

For more information on submitting a microconference proposal, visit our CfP page.

The microconference submission process differs from that for presentations in that the submissions may be (and, indeed, are expected to be) updated over time. The initial submission should include the topic of the microconference, a list of problems that are expected to be discussed, and a list of key developers that can make decisions for solutions to those problems. The Linux Plumbers program committee will work with the authors of the microconference submissions to help clarify the objectives of the microconference.

Microconferences that have been at a previous Linux Plumbers should also include in the submission a list of accomplishments that were a result of that previous meet up. The topics listed for this year’s meet up should include a new set of topics and follow up work from the previous year’s topics.

Topics of a microconference should be thought of as “problem statements” and not an “abstract” like a presentation. Topics are meant to be mostly discussion oriented or presentations to facilitate discussions, but should not be a presentation to simply demonstrate what has already been accomplished. Microconferences are to discuss problems of today and tomorrow, and not to discuss accomplishments of yesterday.

Acceptance of microconferences will be done in the order the submissions become ready for acceptance. The microconference submitters should be prepared to write a blog entry advertising their microconference.

March 18, 2021 09:07 PM

March 16, 2021

Linux Plumbers Conference: RFQ SW development – Linux Plumbers Conference 2021

Reference: LPC2021-RFQ01

The Linux Plumbers Conference committee seeks to contract one or more suppliers for the development of Open Source software improvements to BigBlueButton; Matrix; and other associated work.

Offers must be received by Thursday March 25th end of day (EOD). Offers must be submitted electronically to contact@linuxplumbersconf.org.

Details of the RFQ are available publicly online here.

The RFQ document might be updated to reflect answers to questions or provide additional information.

The Linux Plumbers Conference successfully made use of BigBlueButton in 2020 and is planning to deploy it again in 2021. One improvement the committee is seeking is the ability to integrate instant messaging more effectively in 2021. The Matrix project, and related client and server components, looks very promising. We look forward to working together with members of these communities to improve the projects through features funding.

The details are in the RFQ document. More can be provided based on the information required by the projects.

March 16, 2021 02:26 AM

March 15, 2021

Matthew Garrett: Exploring my doorbell

I've talked about my doorbell before, but started looking at it again this week because sometimes it simply doesn't send notifications to my Home Assistant setup - the push notifications appear on my phone, but the doorbell simply doesn't trigger the HTTP callback it's meant to[1]. This is obviously suboptimal, but it's also tricky to debug a device when you have no access to it.

Normally I'd just head straight in with a screwdriver, but the doorbell is shared with the other units in this building and it seemed a little anti-social to interfere with a shared resource. So I bought some broken units from ebay and pulled one of them apart. There's several boards inside, but one of them had a conveniently empty connector at the top with "TX", "RX" and "GND" labelled. Sticking a USB-serial converter on this gave me output from U-Boot, and then kernel output. Confirmation that my doorbell runs Linux, but unfortunately it didn't give me a shell prompt. My next approach would often be to just dump the flash and look for vulnerabilities that way, but this device uses TSOP-48 packaged NAND flash rather than the more convenient SPI NOR flash that I already have adapters to access. Dumping this sort of NAND isn't terribly hard, but the easiest way to do it involves desoldering it from the board and plugging it into something like a Flashcat USB adapter, and my soldering's not good enough to put it back on the board afterwards. So I wanted another approach.

U-Boot gave a short countdown to hit a key before continuing with boot, and for once hitting a key actually did something. Unfortunately it then prompted for a password, and giving the wrong one resulted in boot continuing[2]. In the past I've had good luck forcing U-Boot to drop to a prompt by simply connecting one of the data lines on SPI flash to ground while it's trying to read the kernel - the failed read causes U-Boot to error out. It turns out the same works fine on raw NAND, so I just edited the kernel boot arguments to append "init=/bin/sh" and soon I had a shell.

From here on, things were made easier by virtue of the device using the YAFFS filesystem. Unlike many flash filesystems, it's read/write, so I could make changes that would persist through to the running system. There was a convenient copy of telnetd included, but it segfaulted on startup, which reduced its usefulness. Fortunately there was also a copy of Netcat[3]. If you make a fifo somewhere on the filesystem, you can cat the fifo to a shell, pipe the shell to a netcat listener, and then pipe netcat's output back to the fifo. The shell's output all gets passed to whatever connects to netcat, and whatever's sent to netcat gets passed through the fifo back to the shell. This is, obviously, horribly insecure, but it was enough to get a root shell over the network on the running device.

The doorbell runs various bits of software, one of which is Lighttpd to provide a local API and access to the device. Another component ("nxp-client") connects to the vendor's cloud infrastructure and passes cloud commands back to the local webserver. This is where I found something strange. Lighttpd was refusing to start because its modules wanted library symbols that simply weren't present on the device. My best guess is that a firmware update went wrong and left the device in a partially upgraded state - and without a working local webserver, there was no way to perform any further updates. This may explain why this doorbell was sitting on ebay.

Anyway. Now that I had shell, I could simply dump the flash by copying it directly off the /dev/mtdblock devices - since I had netcat, I could just pipe stuff through that back to my actual computer. Now I had access to the filesystem I could extract that locally and start digging into it more deeply. One incredibly useful tool for this is qemu-user. qemu is a general purpose hardware emulation platform, usually used to emulate entire systems. But in qemu-user mode, it instead only emulates the CPU. When a piece of code tries to make a system call to access the kernel, qemu-user translates that to the appropriate calling convention for the host kernel and makes that call instead. Combined with binfmt_misc, you can configure a Linux system to be able to run Linux binaries from other architectures. One of the best things about this is that, because they're still using the host convention for making syscalls, you can run the host strace on them and see what they're doing.

What I found was that nxp-client was calling back to the cloud platform, setting up an encrypted communication channel (using ChaCha20 and a bunch of key setup stuff I couldn't be bothered picking apart) and then waiting for commands from the cloud. It would then proxy those through to the local webserver. Since I couldn't run the local lighttpd, I just wrote a trivial Python app using http.server and waited to see what requests I got. The first was a GET to a CGI script called editcgi.cgi, along with a path name. I mocked up the GET request to respond with what was on the actual filesystem. The cloud then proceeded to POST to editcgi.cgi, with the same pathname and with new file contents. editcgi.cgi is apparently able to read and write to files on the filesystem.
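
For anyone wanting to try the same trick, here is a minimal sketch of that kind of stand-in server using Python's http.server. It is not the script I actually ran: the handler name, the seen_requests list and the placeholder reply are all made up for illustration, and it answers every request with the same body rather than real file contents.

```python
import http.server
import threading

# Every request the stub sees is appended here as (method, path, body).
seen_requests = []


class RecordingHandler(http.server.BaseHTTPRequestHandler):
    """Record each request and answer with a small placeholder body."""

    def _record(self, body=b""):
        seen_requests.append((self.command, self.path, body))
        reply = b"ok\n"  # placeholder; a real mock would return file contents
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)

    def do_GET(self):
        self._record()

    def do_POST(self):
        # Read exactly the number of bytes the client said it would send.
        length = int(self.headers.get("Content-Length", 0))
        self._record(self.rfile.read(length))

    def log_message(self, fmt, *args):
        pass  # keep stderr quiet; seen_requests is the log


def start_stub(port=0):
    """Serve on localhost in a background thread; port 0 picks a free port."""
    server = http.server.HTTPServer(("127.0.0.1", port), RecordingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling start_stub() and pointing the client at server.server_address is enough to see which paths get requested and what gets POSTed to them.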

But this is on the interface that's exposed to the cloud client, so this didn't appear immediately useful - and, indeed, trying to hit the same CGI binary over the local network gave me a 401 unauthorized error. There's a local API spec for these doorbells, but they all refer to scripts in the bha-api namespace, and this script was in the plain cgi-bin namespace. But then I noticed that the bha-api namespace didn't actually exist in the filesystem - instead, lighttpd's mod_alias was configured to rewrite requests to bha-api through to files in cgi-bin. And by using the documented API to get a session token, I could call editcgi.cgi to read and write arbitrary files on the doorbell. Which means I can drop an extra script in /etc/rc.d/rc3.d and get a shell on my doorbell.

This all requires the ability to have local authentication credentials, so it's not a big security deal other than it allowing you to retain access to a monitoring device even after you've moved out and had your credentials revoked. I'm sure it's all fine.

[1] I can ping the doorbell from the Home Assistant machine, so it's not that the network is flaky
[2] The password appears to be hy9$gnhw0z6@ if anyone else ends up in this situation
[3] https://twitter.com/mjg59/status/654578208545751040

March 15, 2021 07:04 PM

March 12, 2021

James Bottomley: Papering Over our TPM 2.0 TSS Divisions

For years I’ve been hoping that the Trusted Computing Group (TCG) based IBM and Intel TSS (TCG Software Stack) would simply integrate with one another into a single package. The rationale is pretty simple: the Intel TSS is already quite a large collection of libraries so adding one more (the IBM TSS has a single library) wouldn’t be too much of a burden. Both TSSs are based on TCG specifications, except that the IBM TSS is based on the TPM 2.0 Library Specification and the Intel TSS is based on the TPM Software Stack (also, not at all confusingly, abbreviated TSS). There’s actually very little overlap between these specifications so co-existence seems very reasonable. Before we get into the stories of these two stacks and what they do, I should confess my biases: while I’ve worked with the TCG over the years, I’ve always harboured the view that the complete lack of adoption of TPM 2.0’s predecessor (TPM 1.2) was because of the hugely complicated nature of the TCG mandated software stack which was implemented in Linux by trousers. It is my firm belief that the complexity of the API led to the lack of uptake, even though I made several efforts over the years to make use of it.

My primary interest in the TPM has been as a secure laptop keystore (since I already paid for a TPM, I didn’t see the need to fork out again for one of the new security dongles; plus the TPM is infinitely scalable in the number of keys, unlike most dongles). The key to making the TPM usable in this form is integration with existing Cryptographic systems (via plugins if they do them). Since openssl has an engine plugin, I’ve already produced an openssl TPM2 engine, patches for gnupg and engine integration patches for openvpn (upstream in 2.5) and openssh as well as a PKCS11 exporter (to make file based engine keys exportable as PKCS11 tokens). Note a lot of the patches aren’t strictly TPM patches, they’re actually making openssl engines work in places they previously didn’t. However, the one thing most of the patches that actually touch the TPM have in common is that they have to pick one or other of the available TSSs to operate with. Before describing the TSS agnostic solution, let’s look at why these two TSSs exist, what the difference is between them and why you might choose one over the other.

Schizophrenia at the TCG

As I said in the introduction, both TSSs are based on TCG specifications. These standards aren’t ambiguous: they lay out in excruciating detail what the header files are called and what the prototypes and structures have to be. Both TSS implementations are the way they are because they wouldn’t be following the standards if they deviated even slightly. The problem is the standards don’t agree with each other in meaningful ways. For instance the TPM Library standards define every structure in terms of the fundamental unit of TPM data: the TPM2B structure, which defines a 16 bit big endian length followed by a data unit of that length. The TPM Library standards (in Part 4 section 9.10.6) lay out that every TPM2B_X structure shall be a union of a ‘b’ element which is a TPM2B and a ‘t’ element which is the actual structure. However the TPM Software Stack specification eliminates the plain TPM2B, so the TPM2B_X structures in the latter specification are not unions, they are simply the ‘t’ form of the structure. This means that although TPM2B_X structures in each specification are byte for byte the same, they are definitionally different when written as C code and can’t be assigned to each other … oops. The TPM Library standard lays out additional structures for an elaborate calling convention for the TPM2_Command interfaces which are completely different from the ESYS_Command interfaces in the TPM Software Stack.
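
To make the wire format concrete, here is a small Python sketch of the TPM2B layout described above: a 16 bit big endian length followed by that many bytes of data. The helper names tpm2b_pack() and tpm2b_unpack() are invented for illustration and come from neither TSS. Both C representations serialise to exactly these bytes, which is why the two stacks interoperate on the wire even though their structure definitions don’t match.

```python
import struct


def tpm2b_pack(payload: bytes) -> bytes:
    """Encode payload as a TPM2B: 16-bit big-endian length, then the data."""
    if len(payload) > 0xFFFF:
        raise ValueError("payload too long for a 16-bit length field")
    return struct.pack(">H", len(payload)) + payload


def tpm2b_unpack(buffer: bytes):
    """Decode one TPM2B from the front of buffer; return (payload, remainder)."""
    if len(buffer) < 2:
        raise ValueError("buffer too short to hold the length field")
    (size,) = struct.unpack_from(">H", buffer)
    end = 2 + size
    if len(buffer) < end:
        raise ValueError("buffer shorter than the declared length")
    return buffer[2:end], buffer[end:]
```

For example, tpm2b_pack(b"abc") gives the five bytes 00 03 61 62 63, and tpm2b_unpack() splits them back into the payload plus whatever follows it in the buffer.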

The reason it’s all done this way? Well, the specifications were built by completely different committees for what the committees saw as separate use cases, so they didn’t see a need to reconcile the differences. As long as the definitions were byte for byte compatible, everything would work out correctly on the wire. The problem was the TPM Library specification was released nearly a decade ahead of the TPM Software Stack specification, so the first TSS created had to follow the former because the latter didn’t exist.

Sessions, HMAC and Encryption

One of the perennial problems of a TPM is that integrity and security of the information going over the wire is the responsibility of the user. However, the encryption and integrity computations involved, particularly the key derivations, are incredibly involved (even though well documented in the TPM Library specification), so naturally everyone would like the TSS to do this. The problem the TPM Software Stack had is that all the way up to its ESAPI specification, the security and integrity computations were still the responsibility of the user, so it didn’t begin to be useful until ESAPI was finalized a couple of years ago.

The Resource Manager Problem

TPM 2.0 was designed to be far leaner in terms of resources than TPM 1.2, which meant there was a very small limit to the number of sessions and volatile objects it could contain at any one time. This necessitated the use of a “resource manager” to control access; otherwise applications would get unexpected out of resource errors. The Intel TSS has its own resource manager. However, the Linux Kernel itself incorporated a resource manager in the TPM device in 4.12 and the IBM TSS avoids the need for its own resource manager by using this, and will therefore not work correctly on earlier kernel versions.

Inside the IBM TSS

Even though the IBM TSS is based on a solid and easily comprehensible and detailed specification, that specification itself suffers from a couple of defects. The first is that it assumes you’re submitting to a physical TPM, so the specification has no functional (library based) submission API for TPM commands, and the IBM TSS had to invent an API it called TSS_Execute(), which is a way of sending TPM commands directly to the physical TPM over the kernel’s device interfaces. Secondly, the standard contains no routing interfaces (telling it what destination the TPM is on: should it open the /dev/tpmrm0 device or send the commands to the TPM over an IP socket), so this is controlled in the IBM TSS by several environment variables: TPM_INTERFACE_TYPE can be either “dev” or “socsim”, for a physical device or a network socket respectively. The endpoints are controlled by TPM_DEVICE for the “dev” type, which specifies which device to use and defaults to /dev/tpmrm0, or by TPM_SERVER_NAME and TPM_PLATFORM_PORT for “socsim”.
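
As a rough illustration of that routing decision, the Python sketch below reads the same environment variables and reports which endpoint would be used. The resolve_tpm_endpoint() helper is hypothetical and is not how the IBM TSS implements it. The only default it applies is the /dev/tpmrm0 one mentioned above; for everything else it refuses to guess and either returns the raw value or raises.

```python
import os


def resolve_tpm_endpoint(env=None):
    """Return a tuple describing where TPM commands would be sent.

    Hypothetical helper for illustration only; it just mirrors the
    environment variables named in the text.
    """
    env = os.environ if env is None else env
    kind = env.get("TPM_INTERFACE_TYPE")
    if kind == "dev":
        # Physical device (or kernel resource manager) via a device node.
        return ("dev", env.get("TPM_DEVICE", "/dev/tpmrm0"))
    if kind == "socsim":
        # Network socket; values are returned as set, None if unset.
        return ("socsim", env.get("TPM_SERVER_NAME"), env.get("TPM_PLATFORM_PORT"))
    # Assumption: this sketch does not guess the library's built-in default.
    raise ValueError("TPM_INTERFACE_TYPE must be 'dev' or 'socsim', got %r" % (kind,))
```

Passing a plain dict instead of os.environ makes it easy to check a configuration without touching the real environment.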

The invented TSS_Execute() API also does all the encryption and HMAC parts necessary for secure and integrity verified communication with the TPM, so it acts as a fully functional TSS. The main drawback of the IBM TSS is that it stores essential information about the sessions and handles in files which will, by default, be dropped into the local directory. Most users of the IBM TSS have to set TPM_DATA_DIR to be a specially created directory under /tmp to avoid leaving messy artifacts in users home directories.

Inside the Intel TSS

The TPM Software Stack consists of a large number of different specifications, including the resource manager (which is now unnecessary for kernels from 4.12 onwards) and the TCTI, which specifies the routing information for the TPM. It turns out that even in the Intel TSS, environment variables are the most convenient form to specify this information but, unfortunately, the name of the environment variable has been left up to each use case instead of being standardised in the library, meaning you’ll have to consult the man page to figure out what it is. The next set of standards, SAPI and ESAPI, define functional interfaces to the TPM with one submission API for each command and additionally a corresponding ..._Async()/..._Finish() pair for asynchronous programming. The only real difference between SAPI and ESAPI is that the latter also does the necessary session cryptography for security and integrity, so it’s pretty much the only usable interface for TPM commands. Unfortunately, the ESAPI interface, as constructed by the TCG, has several cases of premature abstraction, the worst of which is a separate abstraction for the TPM handle interface which lives only as long as the lifetime of the connection object and which necessitates multiple conversions to and from internal handle objects if your session or object lives longer than the connection (which can be the case).

One final wrinkle is that in the handle abstraction, ESAPI has no API for retrieving the real TPM handle. I’d always wondered why the Intel TSS tpm2 tools always saved the objects they create to a context instead of simply returning the handle to them, but this is the reason: without the ability to transform an internal handle to an external one, you either save the context or let the object die when the connection terminates. This problem is one forced by the ESAPI standard, but eventually it became enough of a problem that the Intel TSS introduced its own additional API to remedy it.

The other major difference between the Intel and IBM TSSs is memory handling for returned results: The IBM TSS requires pre-allocated structures whereas the Intel TSS insists on allocation on return. It looks like the Intel TSS should be able to tell if the return pointer is allocated or NULL, but right at the moment it always allocates and overwrites the pointer.

Constructing a unifying Interface for both the IBM and Intel TSSs

In essence the process for converting something that runs with the IBM TSS to being TSS Agnostic is a fairly simple three step process which I’ll illustrate by reference to the openssl tpm2 engine which has already been converted:

  1. Hide the structural differences by inserting a set of macros: VAL() and VAL_2B() which hide most of the TCG induced structure schizophrenia.
  2. Convert the API call structure to be functional instead of via a single TSS_Execute() call. This is quite involved so I did it by adding tpm2_Function() wrappers for each specific invocation.
  3. Introduce the correct premature abstraction for internal and external representation of handles. This was the nastiest step for me because handles are stored in long lived engine structures, and the internal and external representations are both forms of uint32_t even in ESAPI (meaning the compiler won’t complain if you assign one to the other) so it was incredibly painful to get this conversion correct.

Once this is done, the remaining step was to introduce a header which did the impedance matching between the Intel and IBM TSSs and an autoconf macro to detect which TSS is installed, and the resulting configure and compile just works. The resulting code will now build and run under either TSS. I should point out that the Intel TSS is missing several helper routines, but these are added into the intel-tss.h header file by copying them from the original IBM TSS. Finally an autoconf check is added to look for the missing internal to external handle transform, and everything is ready to go.

It does seem like it would be easier to port an existing Intel TSS application to the IBM TSS, since points 2 and 3 will already be sorted out. However, all the major TSS library using applications are IBM TSS based, so I haven’t actually been able to verify this.

Remaining Problems and Anomalies

The biggest remaining issue was the test scripts. The openssl TPM2 engine has 27 of them all told, all designed to check the engine function by invoking it via openssl when connected to a software TPM. These scripts are all highly dependent on the IBM TSS command line binaries and the Intel TSS versions seem to be very unstable in terms of argument structure making it pretty much impossible to convert, so I elected finally to have the tests run only if the IBM TSS CLI is installed. The next problem was that the Intel TSS version of the engine didn’t actually pass all the tests. However this was quickly narrowed down to a bug in the Intel TSS when using bound sessions on the NULL seed.

The sole remaining issue is a curious performance anomaly. When running time make check with the IBM TSS, the result is:

real 0m6.100s
user 0m2.827s
sys 0m0.822s

and the same command with the Intel TSS (running one fewer test and skipping the NULL seed) is:

real	0m10.948s
user	0m6.822s
sys	0m0.859s

Showing that the Intel TSS is nearly twice as slow as the IBM one with most of the time differential being user time. Since the tests use a software TPM which can perform the cryptographic operations at the speed of the main CPU, this is showing some type of issue with the command transmission system of the Intel TSS, likely having to do with the fact that most applications use synchronous TPM operations (the engine certainly does) but in the Intel TSS, the synchronous operations are implemented as the corresponding asynchronous pair. Regardless of the root cause, this is unlikely to be a problem with real world TPM crypto where the time taken for any operation will be dominated by the slowness of the physical TPM.
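
The arithmetic behind that claim can be checked with a few lines of Python using the two sets of timings above. The variable names are just for illustration, and since the Intel run executes one fewer test the comparison is only approximate.

```python
# Timings in seconds, copied from the two "time make check" runs above.
ibm = {"real": 6.100, "user": 2.827, "sys": 0.822}
intel = {"real": 10.948, "user": 6.822, "sys": 0.859}

# How much longer the Intel TSS run takes overall.
slowdown = intel["real"] / ibm["real"]

# Where the extra wall clock time goes.
extra_real = intel["real"] - ibm["real"]
extra_user = intel["user"] - ibm["user"]
extra_sys = intel["sys"] - ibm["sys"]
user_share = extra_user / extra_real

print("slowdown: %.2fx" % slowdown)                      # slowdown: 1.79x
print("extra real: %.3fs" % extra_real)                  # extra real: 4.848s
print("extra user: %.3fs" % extra_user)                  # extra user: 3.995s
print("extra sys: %.3fs" % extra_sys)                    # extra sys: 0.037s
print("user share of extra: %.0f%%" % (100 * user_share))  # user share of extra: 82%
```

So roughly four of the nearly five extra seconds are user time, while system time barely moves.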

Conclusion

The TSS agnostic scheme adopted by the openssl TPM2 engine should be easily adaptable for all the other non-engine TPM code bases, and thus should pave the way for users not having to choose between applications which only support the Intel or IBM TSSs and can choose to install the best supported one on their distribution. The next steps are to investigate adapting this infrastructure to the existing gnupg patches (done and upstream) and also see if it can be used to solve the gnutls conundrum over supporting TPM based keys.

March 12, 2021 11:59 PM

March 09, 2021

Matthew Garrett: Unauthenticated MQTT endpoints on Linksys Velop routers enable local DoS

(Edit: this is CVE-2021-1000002)

Linksys produces a series of wifi mesh routers under the Velop line. These routers use MQTT to send messages to each other for coordination purposes. In the version I tested against, there was zero authentication on this - anyone on the local network is able to connect to the MQTT interface on a router and send commands. As an example:
mosquitto_pub -h 192.168.1.1 -t "network/master/cmd/nodes_temporary_blacklist" -m '{"data": {"client": "f8:16:54:43:e2:0c", "duration": "3600", "action": "start"}}'
will ask the router to block the client with MAC address f8:16:54:43:e2:0c from the network for an hour. Various other MQTT topics pass parameters to shell scripts without quoting them or escaping metacharacters, so more serious outcomes may be possible.
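
To illustrate why unquoted parameters matter, here is a hedged Python sketch of the general pattern. It is not the router's code: block_client is a made-up command name, and the three helpers just build command strings without running anything. The first splices the parameter in raw, the second quotes it with shlex.quote(), and the third rejects anything that isn't a MAC address before building the string at all.

```python
import re
import shlex

# Six colon-separated hex byte pairs, e.g. f8:16:54:43:e2:0c.
MAC_RE = re.compile(r"(?:[0-9a-f]{2}:){5}[0-9a-f]{2}", re.IGNORECASE)


def unsafe_command(client):
    """Splice the parameter in raw; shell metacharacters survive intact."""
    return "block_client " + client  # block_client is a hypothetical script


def quoted_command(client):
    """Quote the parameter so a shell treats it as one literal argument."""
    return "block_client " + shlex.quote(client)


def validated_command(client):
    """Refuse anything that is not a MAC address before building the command."""
    if not MAC_RE.fullmatch(client):
        raise ValueError("not a MAC address: %r" % (client,))
    return "block_client " + client
```

With a parameter such as "f8:16:54:43:e2:0c; reboot", the first helper produces a string that a shell would read as two commands, the second wraps the whole value in single quotes, and the third raises an error.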

The vendor has released two firmware updates since the report - I have not verified whether either fixes this, but the changelog does not indicate any security issues were addressed.

Timeline:

2020-07-30: Submitted through the vendor's security vulnerability report form, indicating that I plan to disclose in either 90 days or after a fix is released. The form turns out to file a Bugcrowd submission.
2020-07-30: I claim the Bugcrowd submission.
2020-08-19: Vendor acknowledges the issue, is able to reproduce and assigns it a P3 priority.
2020-12-15: I ask if there's an update.
2021-02-02: I ask if there's an update.
2021-02-03: Bugcrowd raise a blocker on the issue, asking the vendor to respond.
2021-02-17: I ask for permission to disclose.
2021-03-09: In the absence of any response from the vendor since 2020-08-19, I violate Bugcrowd disclosure policies and unilaterally disclose.

March 09, 2021 08:14 PM

March 08, 2021

Linux Plumbers Conference: CFP Open – Refereed Presentations

The Call for Refereed Presentation Proposals for the 2021 edition of the Linux Plumbers Conference is now open. We plan to hold the conference in Dublin, Ireland the last week of September in conjunction with The Linux Foundation Open Source Summit. If an in-person conference should prove to be impossible due to the circumstances at that time, Linux Plumbers will switch to a virtual only conference. Submitters should ideally be able to give their presentation in person if circumstances permit, although presenting remotely will always be possible. Please see our website or social media for regular updates.

Refereed Presentations are 45 minutes in length and should focus on a specific aspect of the “plumbing” in the Linux system. Examples of Linux plumbing include core kernel subsystems, init systems, core libraries, windowing systems, management tools, device support, media creation/playback, and so on. The best presentations are not about finished work, but rather problem statements, proposals, or proof-of-concept solutions that require face-to-face discussions and debate.

The Refereed Presentations track will be running throughout all three days of the conference. Note that the Linux Plumbers Refereed track may overlap with the Open Source Summit.

Linux Plumbers Conference Program Committee members will be reviewing all submitted proposals.  High-quality submissions that cannot be accepted due to the limited number of slots will be forwarded to both the Open Source Summit and to organizers of suitable  Linux Plumbers Microconferences for further consideration.

To submit a Refereed Track Presentation proposal follow the instructions here [1]

Submissions are due on or before June 12 at 11:59PM UTC.

[1] https://linuxplumbersconf.org/event/11/abstracts/

March 08, 2021 05:52 PM

February 26, 2021

Rusty Russell: A Model for Bitcoin Soft Fork Activation

TL;DR: There should be an option, taproot=lockintrue, which allows users to set lockin-on-timeout to true. It should not be the default, though.

As stated in my previous post, we need actual consensus, not simply the appearance of consensus. I’m pretty sure we have that for taproot, but I would like a template we can use in future without endless debate each time.

This triumvirate model may seem familiar, being widely used in various different governance systems. It seems the most robust to me, and is very close to what we have evolved into already. Formalizing it reduces uncertainty for any future changes, as well.

February 26, 2021 02:17 AM

February 21, 2021

Matthew Garrett: Making hibernation work under Linux Lockdown

Linux draws a distinction between code running in kernel (kernel space) and applications running in userland (user space). This is enforced at the hardware level - in x86-speak[1], kernel space code runs in ring 0 and user space code runs in ring 3[2]. If you're running in ring 3 and you attempt to touch memory that's only accessible in ring 0, the hardware will raise a fault. No matter how privileged your ring 3 code, you don't get to touch ring 0.

Kind of. In theory. Traditionally this wasn't well enforced. At the most basic level, since root can load kernel modules, you could just build a kernel module that performed any kernel modifications you wanted and then have root load it. Technically user space code wasn't modifying kernel space code, but the difference was pretty semantic rather than useful. But it got worse - root could also map memory ranges belonging to PCI devices[3], and if the device could perform DMA you could just ask the device to overwrite bits of the kernel[4]. Or root could modify special CPU registers ("Model Specific Registers", or MSRs) that alter CPU behaviour via the /dev/msr interface, and compromise the kernel boundary that way.

It turns out that there were a number of ways root was effectively equivalent to ring 0, and the boundary was more about reliability (ie, a process running as root that ends up misbehaving should still only be able to crash itself rather than taking down the kernel with it) than security. After all, if you were root you could just replace the on-disk kernel with a backdoored one and reboot. Going deeper, you could replace the bootloader with one that automatically injected backdoors into a legitimate kernel image. We didn't have any way to prevent this sort of thing, so attempting to harden the root/kernel boundary wasn't especially interesting.

In 2012 Microsoft started requiring vendors ship systems with UEFI Secure Boot, a firmware feature that allowed[5] systems to refuse to boot anything without an appropriate signature. This not only enabled the creation of a system that drew a strong boundary between root and kernel, it arguably required one - what's the point of restricting what the firmware will stick in ring 0 if root can just throw more code in there afterwards? What ended up as the Lockdown Linux Security Module provides the tooling for this, blocking userspace interfaces that can be used to modify the kernel and enforcing that any modules have a trusted signature.

But that comes at something of a cost. Most of the features that Lockdown blocks are fairly niche, so the direct impact of having it enabled is small. Except that it also blocks hibernation[6], and it turns out some people were using that. The obvious question is "what does hibernation have to do with keeping root out of kernel space", and the answer is a little convoluted and is tied into how Linux implements hibernation. Basically, Linux saves system state into the swap partition and modifies the header to indicate that there's a hibernation image there instead of swap. On the next boot, the kernel sees the header indicating that it's a hibernation image, copies the contents of the swap partition back into RAM, and then jumps back into the old kernel code. What ensures that the hibernation image was actually written out by the kernel? Absolutely nothing, which means a motivated attacker with root access could turn off swap, write a hibernation image to the swap partition themselves, and then reboot. The kernel would happily resume into the attacker's image, giving the attacker control over what gets copied back into kernel space.

This is annoying, because normally when we think about attacks on swap we mitigate it by requiring an encrypted swap partition. But in this case, our attacker is root, and so already has access to the plaintext version of the swap partition. Disk encryption doesn't save us here. We need some way to verify that the hibernation image was written out by the kernel, not by root. And thankfully we have some tools for that.

Trusted Platform Modules (TPMs) are cryptographic coprocessors[7] capable of doing things like generating encryption keys and then encrypting things with them. You can ask a TPM to encrypt something with a key that's tied to that specific TPM - the OS has no access to the decryption key, and nor does any other TPM. So we can have the kernel generate an encryption key, encrypt part of the hibernation image with it, and then have the TPM encrypt that key. We store the encrypted copy of the key in the hibernation image as well. On resume, the kernel reads the encrypted copy of the key, passes it to the TPM, gets the decrypted copy back and is able to verify the hibernation image.

That's great! Except root can do exactly the same thing. This tells us the hibernation image was generated on this machine, but doesn't tell us that it was done by the kernel. We need some way to be able to differentiate between keys that were generated in kernel and ones that were generated in userland. TPMs have the concept of "localities" (effectively privilege levels) that would be perfect for this. Userland is only able to access locality 0, so the kernel could simply use locality 1 to encrypt the key. Unfortunately, despite trying pretty hard, I've been unable to get localities to work. The motherboard chipset on my test machines simply doesn't forward any accesses to the TPM unless they're for locality 0. I needed another approach.

TPMs have a set of Platform Configuration Registers (PCRs), intended for keeping a record of system state. The OS isn't able to modify the PCRs directly. Instead, the OS provides a cryptographic hash of some material to the TPM. The TPM takes the existing PCR value, appends the new hash to that, and then stores the hash of the combination in the PCR - a process called "extension". This means that the new value of the PCR depends not only on the value of the new data, but also on the previous value of the PCR - and, in turn, that previous value depended on its previous value, and so on. The only way to get to a specific PCR value is to either (a) break the hash algorithm, or (b) perform exactly the same sequence of writes. On system reset the PCRs go back to a known value, and the entire process starts again.
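As a rough illustration of extension, here is a minimal Python sketch. It assumes a SHA-256 PCR bank and a reset value of all zeroes; the function name `extend` and the measured strings are made up for the example, and a real TPM does this internally rather than in OS code.

```python
import hashlib

PCR_SIZE = hashlib.sha256().digest_size  # 32 bytes for a SHA-256 PCR bank (assumption)


def extend(pcr: bytes, data: bytes) -> bytes:
    """Return the PCR value that results from extending `pcr` with `data`."""
    measurement = hashlib.sha256(data).digest()          # the OS hashes the material
    return hashlib.sha256(pcr + measurement).digest()    # TPM hashes old value || new hash


reset_value = bytes(PCR_SIZE)  # known value after reset (all zeroes assumed here)

# The same two measurements in a different order give a different final value,
# so reaching a given PCR value means replaying exactly the same sequence.
a_then_b = extend(extend(reset_value, b"bootloader"), b"kernel")
b_then_a = extend(extend(reset_value, b"kernel"), b"bootloader")

print(a_then_b.hex())
print(a_then_b != b_then_a)
```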

Some PCRs are different. PCR 23, for example, can be reset back to its original value without resetting the system. We can make use of that. The first thing we need to do is to prevent userland from being able to reset or extend PCR 23 itself. All TPM accesses go through the kernel, so this is a simple matter of parsing the write before it's sent to the TPM and returning an error if it's a sensitive command that would touch PCR 23. We now know that any change in PCR 23's state will be restricted to the kernel.
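The filtering step might look something like the following Python sketch. This is not the code from the patchset: the function names are hypothetical, and the command codes and header layout (big-endian tag, size, command code, followed by the PCR handle) are taken from my reading of the TPM 2.0 specification and should be checked against it. It also ignores other commands that can touch a PCR, such as TPM2_PCR_Event.

```python
import struct

# Values as I understand them from the TPM 2.0 spec; verify before relying on them.
TPM2_CC_PCR_RESET = 0x0000013D
TPM2_CC_PCR_EXTEND = 0x00000182
TPM2_CC_GET_RANDOM = 0x0000017B
TPM2_ST_SESSIONS = 0x8002

RESTRICTED_PCR = 23
HEADER = struct.Struct(">HII")  # tag, commandSize, commandCode (all big-endian)


def is_restricted(command: bytes) -> bool:
    """Return True if a userland command would reset or extend PCR 23."""
    if len(command) < HEADER.size + 4:
        return False  # too short to carry a handle; nothing for us to block
    _tag, _size, code = HEADER.unpack_from(command)
    if code not in (TPM2_CC_PCR_EXTEND, TPM2_CC_PCR_RESET):
        return False
    # For both commands the first handle after the header is the PCR index.
    (handle,) = struct.unpack_from(">I", command, HEADER.size)
    return handle == RESTRICTED_PCR


def build_command(code: int, handle: int) -> bytes:
    """Build a truncated command (header plus one handle) for the demo."""
    body = struct.pack(">I", handle)
    return HEADER.pack(TPM2_ST_SESSIONS, HEADER.size + len(body), code) + body


print(is_restricted(build_command(TPM2_CC_PCR_EXTEND, 23)))  # blocked
print(is_restricted(build_command(TPM2_CC_PCR_EXTEND, 16)))  # other PCR, allowed
print(is_restricted(build_command(TPM2_CC_GET_RANDOM, 23)))  # other command, allowed
```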

When we encrypt material with the TPM, we can ask it to record the PCR state. This is given back to us as metadata accompanying the encrypted secret. Along with the metadata is an additional signature created by the TPM, which can be used to prove that the metadata is both legitimate and associated with this specific encrypted data. In our case, that means we know what the value of PCR 23 was when we encrypted the key. That means that if we simply extend PCR 23 with a known value in-kernel before encrypting our key, we can look at the value of PCR 23 in the metadata. If it matches, the key was encrypted by the kernel - userland can create its own key, but it has no way to extend PCR 23 to the appropriate value first. We now know that the key was generated by the kernel.

But what if the attacker is able to gain access to the encrypted key? Let's say a kernel bug is hit that prevents hibernation from resuming, and you boot back up without wiping the hibernation image. Root can then read the key from the partition, ask the TPM to decrypt it, and then use that to create a new hibernation image. We probably want to prevent that as well. Fortunately, when you ask the TPM to encrypt something, you can ask that the TPM only decrypt it if the PCRs have specific values. "Sealing" material to the TPM in this way allows you to block decryption if the system isn't in the desired state. So, we define a policy that says that PCR 23 must have the same value at resume as it did on hibernation. On resume, the kernel resets PCR 23, extends it to the same value it did during hibernation, and then attempts to decrypt the key. Afterwards, it resets PCR 23 back to the initial value. Even if an attacker gains access to the encrypted copy of the key, the TPM will refuse to decrypt it.
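To make the flow concrete, here is a toy Python model of sealing against PCR 23. Everything in it is illustrative: `ToyTPM`, `KERNEL_TAG` and the two kernel helpers are hypothetical names, the "sealed blob" is just a lookup handle rather than real encryption, and the rule that userland may call `seal`/`unseal` but never `reset_pcr23`/`extend_pcr23` stands in for the kernel-side filtering described earlier.

```python
import hashlib
import secrets


class ToyTPM:
    """Toy stand-in for a TPM: one resettable PCR and a PCR-bound secret store."""

    def __init__(self):
        self.pcr23 = bytes(32)
        self._vault = {}

    def reset_pcr23(self):
        self.pcr23 = bytes(32)

    def extend_pcr23(self, data: bytes):
        digest = hashlib.sha256(data).digest()
        self.pcr23 = hashlib.sha256(self.pcr23 + digest).digest()

    def seal(self, secret: bytes):
        """Bind the secret to the current PCR 23 value; return handle and metadata."""
        handle = secrets.token_hex(8)
        self._vault[handle] = (self.pcr23, secret)
        return handle, self.pcr23

    def unseal(self, handle: str) -> bytes:
        policy_pcr, secret = self._vault[handle]
        if self.pcr23 != policy_pcr:
            raise PermissionError("PCR 23 does not match the sealing policy")
        return secret


KERNEL_TAG = b"hibernate-kernel"  # hypothetical known value extended by the kernel


def kernel_hibernate(tpm: ToyTPM, key: bytes):
    tpm.reset_pcr23()
    tpm.extend_pcr23(KERNEL_TAG)
    blob = tpm.seal(key)
    tpm.reset_pcr23()  # leave PCR 23 at its initial value afterwards
    return blob


def kernel_resume(tpm: ToyTPM, handle: str) -> bytes:
    tpm.reset_pcr23()
    tpm.extend_pcr23(KERNEL_TAG)
    try:
        return tpm.unseal(handle)
    finally:
        tpm.reset_pcr23()


tpm = ToyTPM()
key = secrets.token_bytes(32)
handle, creation_pcr = kernel_hibernate(tpm, key)

# Userland can ask for decryption but cannot move PCR 23 to the sealed value.
try:
    tpm.unseal(handle)
    userland_got_key = True
except PermissionError:
    userland_got_key = False

print(userland_got_key)
print(kernel_resume(tpm, handle) == key)
```

The `creation_pcr` value returned by `seal` plays the role of the creation metadata: on resume the kernel can compare it against the value it expects from extending with its own tag.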

And that's what this patchset implements. There's one fairly significant flaw at the moment, which is simply that an attacker can just reboot into an older kernel that doesn't implement the PCR 23 blocking and set up state by hand. Fortunately, this can be avoided using another aspect of the boot process. When you boot something via UEFI Secure Boot, the signing key used to verify the booted code is measured into PCR 7 by the system firmware. In the Linux world, the Shim bootloader then measures any additional keys that are used. By either using a new key to tag kernels that have support for the PCR 23 restrictions, or by embedding some additional metadata in the kernel that indicates the presence of this feature and measuring that, we can have a PCR 7 value that verifies that the PCR 23 restrictions are present. We then seal the key to PCR 7 as well as PCR 23, and if an attacker boots into a kernel that doesn't have this feature the PCR 7 value will be different and the TPM will refuse to decrypt the secret.

While there's a whole bunch of complexity here, the process should be entirely transparent to the user. The current implementation requires a TPM 2, and I'm not certain whether TPM 1.2 provides all the features necessary to do this properly - if so, extending it shouldn't be hard, but also all systems shipped in the past few years should have a TPM 2, so that's going to depend on whether there's sufficient interest to justify the work. And we're also at the early days of review, so there's always the risk that I've missed something obvious and there are terrible holes in this. And, well, given that it took almost 8 years to get the Lockdown patchset into mainline, let's not assume that I'm good at landing security code.

[1] Other architectures use different terminology here, such as "supervisor" and "user" mode, but it's broadly equivalent
[2] In theory rings 1 and 2 would allow you to run drivers with privileges somewhere between full kernel access and userland applications, but in reality we just don't talk about them in polite company
[3] This is how graphics worked in Linux before kernel modesetting turned up. XFree86 would just map your GPU's registers into userland and poke them directly. This was not a huge win for stability
[4] IOMMUs can help you here, by restricting the memory PCI devices can DMA to or from. The kernel then gets to allocate ranges for device buffers and configure the IOMMU such that the device can't DMA to anything else. Except that region of memory may still contain sensitive material such as function pointers, and attacks like this can still cause you problems as a result.
[5] This describes why I'm using "allowed" rather than "required" here
[6] Saving the system state to disk and powering down the platform entirely - significantly slower than suspending the system while keeping state in RAM, but also resilient against the system losing power.
[7] With some handwaving around "coprocessor". TPMs can't be part of the OS or the system firmware, but they don't technically need to be an independent component. Intel have a TPM implementation that runs on the Management Engine, a separate processor built into the motherboard chipset. AMD have one that runs on the Platform Security Processor, a small ARM core built into their CPU. Various ARM implementations run a TPM in Trustzone, a special CPU mode that (in theory) is able to access resources that are entirely blocked off from anything running in the OS, kernel or otherwise.


February 21, 2021 08:37 AM

February 18, 2021

Rusty Russell: Bitcoin Consensus and Solidarity

Bitcoin’s consensus rules define what is valid, but this isn’t helpful when we’re looking at changing the rules themselves. The trend in Bitcoin has been to make such changes in an increasingly inclusive and conservative manner, but we are still feeling our way through this, and appreciating more nuance each time we do so.

To use Bitcoin, you need to remain in the supermajority of consensus on what the rules are. But you can never truly know if you are. Everyone can signal, but everyone can lie. You can’t know what software other nodes or miners are running: even expensive testing of miners by creating an invalid block only tests one possible difference, may still give a false negative, and doesn’t mean they can’t change a moment later.

This risk of being left out is heightened greatly when the rules change. This is why we need to rely on multiple mechanisms to reassure ourselves that consensus will be maintained:

  1. Developers assure themselves that the change is technically valid, positive and has broad support. The main tools for this are open communication, and time. Developers signal support by implementing the change.
  2. Users signal their support by upgrading their nodes.
  3. Miners signal their support by actually tagging their blocks.

We need actual consensus, not simply the appearance of consensus. Thus it is vital that all groups know they can express their approval or rejection, in a way they know will be heard by others. In the end, the economic supermajority of Bitcoin users can set the rules, but no other group or subgroup should have inordinate influence, nor should they appear to have such control.

The Goodwill Dividend

A Bitcoin community which has consensus and knows it is not only safest from a technical perspective: the goodwill and confidence gives us all assurance that we can make (or resist!) changes in future.

It will also help us defend against the inevitable attacks and challenges we are going to face, which may be a more important effect than any particular soft-fork feature.

February 18, 2021 03:29 AM

February 09, 2021

Kees Cook: security things in Linux v5.8

Previously: v5.7

Linux v5.8 was released in August, 2020. Here’s my summary of various security things that caught my attention:

arm64 Branch Target Identification
Dave Martin added support for ARMv8.5’s Branch Target Identification (BTI), which is enabled in userspace at execve() time, and all the time in the kernel (which required manually marking up a lot of non-C code, like assembly and JIT code).

With this in place, Jump-Oriented Programming (JOP, where code gadgets are chained together with jumps and calls) is no longer available to the attacker. An attacker’s code must make direct function calls. This basically reduces the “usable” code available to an attacker from every word in the kernel text to only function entries (or jump targets). This is a “low granularity” forward-edge Control Flow Integrity (CFI) feature, which is important (since it greatly reduces the potential targets that can be used in an attack) and cheap (implemented in hardware). It’s a good first step to strong CFI, but (as we’ve seen with things like CFG) it isn’t usually strong enough to stop a motivated attacker. “High granularity” CFI (which uses a more specific branch-target characteristic, like function prototypes, to track expected call sites) is not yet a hardware supported feature, but the software version will be coming in the future by way of Clang’s CFI implementation.

arm64 Shadow Call Stack
Sami Tolvanen landed the kernel implementation of Clang’s Shadow Call Stack (SCS), which protects the kernel against Return-Oriented Programming (ROP) attacks (where code gadgets are chained together with returns). This backward-edge CFI protection is implemented by keeping a second dedicated stack pointer register (x18) and keeping a copy of the return addresses stored in a separate “shadow stack”. In this way, manipulating the regular stack’s return addresses will have no effect. (And since a copy of the return address continues to live in the regular stack, no changes are needed for back trace dumps, etc.)

It’s worth noting that unlike BTI (which is hardware based), this is a software defense that relies on the location of the Shadow Stack (i.e. the value of x18) staying secret, since the memory could be written to directly. Intel’s hardware ROP defense (CET) uses a hardware shadow stack that isn’t directly writable. ARM’s hardware defense against ROP is PAC (which is actually designed as an arbitrary CFI defense — it can be used for forward-edge too), but that depends on having ARMv8.3 hardware. The expectation is that SCS will be used until PAC is available.
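A toy Python model of the idea may help. It is only a sketch: `ToyCallStacks` is a made-up name, the two lists stand in for the regular stack and the shadow stack reached through x18, and real SCS is emitted by the compiler in function prologues and epilogues rather than implemented as a class.

```python
class ToyCallStacks:
    """Toy model: return addresses are pushed to both stacks, popped from the shadow."""

    def __init__(self):
        self.stack = []   # regular stack: writable by a buffer overflow
        self.shadow = []  # shadow stack: location assumed unknown to the attacker

    def call(self, return_address: int):
        self.stack.append(return_address)   # copy kept for backtraces
        self.shadow.append(return_address)  # copy actually used on return

    def ret(self) -> int:
        self.stack.pop()
        return self.shadow.pop()


stacks = ToyCallStacks()
stacks.call(0x1000)
stacks.call(0x2000)

# A stack overflow rewrites the saved return address on the regular stack...
stacks.stack[-1] = 0xDEADBEEF

# ...but the function still returns to the address saved on the shadow stack.
first_return = stacks.ret()
second_return = stacks.ret()
print(hex(first_return), hex(second_return))
```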

Kernel Concurrency Sanitizer infrastructure added
Marco Elver landed support for the Kernel Concurrency Sanitizer, which is a new debugging infrastructure to find data races in the kernel, via CONFIG_KCSAN. This immediately found real bugs, with some fixes having already landed too. For more details, see the KCSAN documentation.

new capabilities
Alexey Budankov added CAP_PERFMON, which is designed to allow access to perf(). The idea is that this capability gives a process access to only read aspects of the running kernel and system. No longer will access be needed through the much more powerful abilities of CAP_SYS_ADMIN, which has many ways to change kernel internals. This allows for a split between controls over the confidentiality (read access via CAP_PERFMON) of the kernel vs control over integrity (write access via CAP_SYS_ADMIN).

Alexei Starovoitov added CAP_BPF, which is designed to separate BPF access from the all-powerful CAP_SYS_ADMIN. It is designed to be used in combination with CAP_PERFMON for tracing-like activities and CAP_NET_ADMIN for networking-related activities. For things that could change kernel integrity (i.e. write access), CAP_SYS_ADMIN is still required.

network random number generator improvements
Willy Tarreau made the network code’s random number generator less predictable. This will further frustrate any attacker’s attempts to recover the state of the RNG externally, which might lead to the ability to hijack network sessions (by correctly guessing packet states).

fix various kernel address exposures to non-CAP_SYSLOG
I fixed several situations where kernel addresses were still being exposed to unprivileged (i.e. non-CAP_SYSLOG) users, though usually only through odd corner cases. After refactoring how capabilities were being checked for files in /sys and /proc, the kernel modules sections, kprobes, and BPF exposures got fixed. (Though in doing so, I briefly made things much worse before getting it properly fixed. Yikes!)

RISCV W^X detection
Following up on his recent work to enable strict kernel memory protections on RISCV, Zong Li has now added support for CONFIG_DEBUG_WX as seen for other architectures. Any writable and executable memory regions in the kernel (which are lovely targets for attackers) will be loudly noted at boot so they can get corrected.

execve() refactoring continues
Eric W. Biederman continued working on execve() refactoring, including getting rid of the frequently problematic recursion used to locate binary handlers. I used the opportunity to dust off some old binfmt_script regression tests and get them into the kernel selftests.

multiple /proc instances
Alexey Gladkov modernized /proc internals and provided a way to have multiple /proc instances mounted in the same PID namespace. This allows for having multiple views of /proc, with different features enabled. (Including the newly added hidepid=4 and subset=pid mount options.)

set_fs() removal continues
Christoph Hellwig, with Eric W. Biederman, Arnd Bergmann, and others, have been diligently working to entirely remove the kernel’s set_fs() interface, which has long been a source of security flaws due to weird confusions about which address space the kernel thought it should be accessing. Beyond things like the lower-level per-architecture signal handling code, this has needed to touch various parts of the ELF loader, and networking code too.

READ_IMPLIES_EXEC is no more for native 64-bit
The READ_IMPLIES_EXEC flag was a work-around for dealing with the addition of non-executable (NX) memory when x86_64 was introduced. It was designed as a way to mark a memory region as “well, since we don’t know if this memory region was expected to be executable, we must assume that if we need to read it, we need to be allowed to execute it too”. It was designed mostly for stack memory (where trampoline code might live), but it would carry over into all mmap() allocations, which would mean sometimes exposing a large attack surface to an attacker looking to find executable memory. While normally this didn’t cause problems on modern systems that correctly marked their ELF sections as NX, there were still some awkward corner-cases. I fixed this by splitting READ_IMPLIES_EXEC from the ELF PT_GNU_STACK marking on x86 and arm/arm64, and declaring that a native 64-bit process would never gain READ_IMPLIES_EXEC on x86_64 and arm64, which matches the behavior of other native 64-bit architectures that correctly didn’t ever implement READ_IMPLIES_EXEC in the first place.

array index bounds checking continues
As part of the ongoing work to use modern flexible arrays in the kernel, Gustavo A. R. Silva added the flex_array_size() helper (as a cousin to struct_size()). The zero/one-member into flex array conversions continue with over a hundred commits as we slowly get closer to being able to build with -Warray-bounds.
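For a feel of what these helpers compute, here is a Python sketch of the arithmetic. The real kernel macros take a pointer and a member name and derive the sizes with sizeof(); the explicit size arguments below, and the assumption of a 64-bit size_t, are simplifications for illustration. The behaviour modelled is that the result saturates at SIZE_MAX instead of wrapping, so an oversized allocation fails rather than silently coming back too small.

```python
SIZE_MAX = 2**64 - 1  # assuming a 64-bit size_t


def flex_array_size(elem_size: int, count: int) -> int:
    """Bytes needed for `count` trailing elements, saturating on overflow."""
    total = elem_size * count
    return SIZE_MAX if total > SIZE_MAX else total


def struct_size(header_size: int, elem_size: int, count: int) -> int:
    """Bytes needed for the fixed header plus the flexible array, saturating."""
    total = header_size + flex_array_size(elem_size, count)
    return SIZE_MAX if total > SIZE_MAX else total


# Ordinary case: 16-byte header plus four 8-byte elements.
print(struct_size(16, 8, 4))

# Hostile count: naive 64-bit arithmetic wraps around to a tiny size,
# while the saturating helper reports SIZE_MAX.
huge = 2**61
naive = (16 + 8 * huge) & SIZE_MAX
print(naive, struct_size(16, 8, huge) == SIZE_MAX)
```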

scnprintf() replacement continues
Chen Zhou joined Takashi Iwai in continuing to replace potentially unsafe uses of sprintf() with scnprintf(). Fixing all of these will make sure the kernel avoids nasty buffer concatenation surprises.
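The kind of surprise involved can be modelled in Python. This sketch assumes the usual C convention that snprintf() returns the length the output would have needed, while scnprintf() returns the number of bytes actually written; the Python functions are models of those return values, not bindings to the kernel functions.

```python
def snprintf(buf: bytearray, offset: int, size: int, text: bytes) -> int:
    """Write at most size-1 bytes plus a NUL; return the length text WOULD need."""
    if size > 0:
        n = min(len(text), size - 1)
        buf[offset:offset + n] = text[:n]
        buf[offset + n] = 0
    return len(text)


def scnprintf(buf: bytearray, offset: int, size: int, text: bytes) -> int:
    """Same write, but return the number of bytes actually stored."""
    if size <= 0:
        return 0
    wanted = snprintf(buf, offset, size, text)
    return min(wanted, size - 1)


parts = (b"hello ", b"world")

# Accumulating snprintf() return values walks the offset past the buffer end.
buf = bytearray(8)
pos_snprintf = 0
for part in parts:
    pos_snprintf += snprintf(buf, pos_snprintf, len(buf) - pos_snprintf, part)

# Accumulating scnprintf() return values keeps the offset inside the buffer.
buf = bytearray(8)
pos_scnprintf = 0
for part in parts:
    pos_scnprintf += scnprintf(buf, pos_scnprintf, len(buf) - pos_scnprintf, part)

print(pos_snprintf, pos_scnprintf, bytes(buf))
```

In C the first pattern is worse than it looks here: the next call would compute a remaining size of `size - pos`, which goes negative and becomes a huge unsigned value.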

That’s it for now! Let me know if there is anything else you think I should mention here. Next up: Linux v5.9.

© 2021, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.

February 09, 2021 12:47 AM

February 05, 2021

Greg Kroah-Hartman: 8 bits are enough for a version number...

As was pointed out to us stable kernel maintainers last week, the overflow of the .y release number was going to happen soon, and our proposed solution for it (use 16 bits instead of 8) turns out to break a userspace-visible API.

As we can’t really break this, I did a release of the 4.4.256 and 4.9.256 releases today that contain nothing but a new version number. See the links for the full technical details if curious.
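The overflow is easy to see with a little arithmetic. The sketch below assumes the traditional packing used by the kernel's KERNEL_VERSION() macro, with 8 bits each for the minor and .y fields; the Python function is just a model of that macro.

```python
def kernel_version(a: int, b: int, c: int) -> int:
    """Model of the traditional KERNEL_VERSION(a, b, c) packing."""
    return (a << 16) + (b << 8) + c


def unpack(code: int):
    """Split a packed version code back into (major, minor, y)."""
    return (code >> 16, (code >> 8) & 0xFF, code & 0xFF)


last_ok = kernel_version(4, 4, 255)
overflowed = kernel_version(4, 4, 256)

print(unpack(last_ok))      # still reads back as 4.4.255
print(unpack(overflowed))   # the 256 carries into the minor field
print(overflowed == kernel_version(4, 5, 0))
```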

Right now I’m asking everyone who uses these older kernel releases to upgrade to this release, and do a full rebuild of their systems in order to see what might, or might not, break. If problems happen, please let us know on the stable@vger.kernel.org mailing list as soon as possible, as I can only hold off on doing new stable releases for these branches for a single week (i.e. until February 12, 2021).

February 05, 2021 02:29 PM

February 03, 2021

Greg Kroah-Hartman: Helping out with LTS kernel releases

A recent email thread about “Why isn’t the 5.10 stable kernel listed as supported for 6 years yet!” on the linux-kernel mailing list ended up generating a bunch of direct emails to me asking what different companies and individuals could do to help out. What exactly was I looking for here?

Instead of having to respond to private emails with the same information over and over, I figured it was better to just put it here so that everyone can see what exactly I am expecting with regards to support in order to be able to maintain a kernel for longer than 2 years:

What I need help with

All I request is that people test the -rc releases when I announce them, and let me know if they work or not for their systems/workloads/tests/whatever.

If you look at the -rc announcements today, you will see a number of different people/groups responding with this information. If they want, they can provide a Tested-by: ... line that I will add to the release commit, or not, that’s up to them.

Here and here and here and here and here are all great examples of how people let me know that all is ok with the -rc kernels so that I know it is “safe” to do the release.

I also have a few companies send me private emails that all is good, there’s no requirement to announce this in public if you don’t want to (but it is nice, as kernel development should be done in public.)

Some companies can’t do tests on -rc releases due to their build infrastructures not handling that very well, so they email me after the stable release is out, saying all is good. Worst case, we end up reverting a patch in a released kernel, but it’s better to quickly do that based on testing than to miss it entirely because no one is testing at all.

And that’s it!

But, if you want to do more, I always really appreciate when people email me, or stable@vger.kernel.org, git commit ids that are needed to be backported to specific stable kernel trees because they found them in their testing/development efforts. You know what problems you hit better than anyone, and once those issues are found and fixed, making sure they get backported is a good thing, so I always want to know that.

Again, if you look on the stable@vger.kernel.org list, you will see different companies and developers providing backports of things they want backported, or just a list of the git commit ids if the backports apply cleanly.

Does that sound reasonable? I want to make sure that the LTS kernels that you rely on actually work for you without regressions, so testing is key, as is finding any fixes that are needed for them.

It’s not much, but I can’t do it alone :)

So, 6 years or not for 5.10?

The above is what I need in order to be able to support a kernel for 6 years: constant testing by users of the kernels. If we don’t have that, then why even do these releases, because that must mean that no one is using them? So email me and let me know.

As of this point in time (February 3, 2021), I do not have enough commitments by companies to help out with this effort to be able to say I can do this for 6 years (note, no response yet from the company that originally asked this question…) Hopefully that changes soon, and if it does, the kernel.org release page will be updated with the new date.

February 03, 2021 04:37 PM

January 08, 2021

Linux Plumbers Conference: Welcome to the 2021 Linux Plumbers Conference

Planning for the 2021 Linux Plumbers Conference is well underway. The hope is to be in Dublin co-located with OSS EU (although with hopefully non-overlapping dates). However, the Linux Foundation is still negotiating for a suitable venue so we can’t fully confirm the location yet.

There is an outside (and hopefully receding) chance that we may have to go back to being fully on-line this year, but if that happens, we’ll be sure to alert you through the usual channels of this blog and twitter.

January 08, 2021 10:58 PM

December 31, 2020

James Bottomley: Deploying Encrypted Images for Confidential Computing

In the previous post I looked at how you build an encrypted image that can maintain its confidentiality inside AMD SEV or Intel TDX. In this post I’ll discuss how you actually bring up a confidential VM from an encrypted image while preserving secrecy. However, first a warning: This post represents the state of the art and includes patches that are certainly not deployed in distributions and may not even be upstream, so if you want to follow along at home you’ll need to patch things like qemu, grub and OVMF. I should also add that, although I’m trying to make everything generic to confidential environments, this post is based on AMD SEV, which is the only confidential encrypted[1] environment currently shipping.

The Basics of a Confidential Computing VM

At their base, current confidential computing environments are about using encrypted memory to run the virtual machine and guarding the encryption key so that the owner of the host system (the cloud service provider) can’t get access to it. Both SEV and TDX have the encryption technology inside the main memory controller, meaning the L1 cache isn’t encrypted (still vulnerable to cache side channels) and DMA to devices must also be done via unencrypted memory. This latter also means that both the BIOS and the Operating System of the guest VM must be enlightened to understand which pages to encrypt and which they must not. For this reason, all confidential VM systems use OVMF[2] to boot because this contains the necessary enlightening. To a guest, the VM encryption looks identical to full memory encryption on a physical system, so as long as you have a kernel which supports Intel or AMD full memory encryption, it should boot.

Each confidential computing system has a security element which sits between the encrypted VM and the host. In SEV this is an aarch64 processor called the Platform Security Processor (PSP) and in TDX it is an SGX enclave running Intel proprietary code. The job of the PSP is to bootstrap the VM, including encrypting the initial OVMF and inserting the encrypted pages. The security element also includes a validation certificate, which incorporates a Diffie-Hellman (DH) key. Once the guest owner obtains and validates the DH key it can use it to construct a one time ECDH encrypted bundle that can be passed to the security element on bring up. This bundle includes an encryption key which can be used to encrypt secrets for the security element and a validation key which can be used to verify measurements from the security element.

The way QEMU boots a Q35 machine is to set up all the configuration (including a disk device attached to the VM Image), load up the OVMF into ROM memory and start the system running. OVMF pulls in the QEMU configuration and constructs the necessary ACPI configuration tables before executing grub and the kernel from the attached storage device. In a confidential VM, the first task is to establish a Guest Owner (the person whose encrypted VM it is) which is usually different from the Host Owner (the person running or controlling the Physical System). Ownership is established by transferring an encrypted bundle to the Secure Element before the VM is constructed.

The next step is for the VMM (QEMU in this case) to ask the secure element to provision the OVMF Firmware. Since the initial OVMF is untrusted, the Guest Owner should ask the Secure Element for an attestation of the memory contents before the VM is started. Since all paths lead through the Host Owner, who is also untrusted, the attestation contains a random nonce to prevent replay and is HMAC’d with a Guest Supplied key from the Launch Bundle. Once the Guest Owner is happy with the VM state, it supplies the Wrapped Key to the secure element (along with the nonce to prevent replay) and the Secure Element unwraps the key and provisions it to the VM where the Guest OS can use it for disk encryption. Finally, the enlightened guest reads the encrypted disk to unencrypted memory using DMA but uses the disk encryptor to decrypt it to encrypted memory, so the contents of the Encrypted VM Image are never visible to the Host Owner.
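The nonce-plus-HMAC part of this exchange can be sketched in Python. This is a generic illustration, not the SEV wire format: the message layout (firmware digest followed by nonce), the function names and the sample firmware bytes are all assumptions made for the example, and the real measurement covers additional fields defined by the SEV specification.

```python
import hashlib
import hmac
import secrets


def make_measurement(key: bytes, firmware: bytes, nonce: bytes) -> bytes:
    """What the secure element would compute over the loaded firmware (toy layout)."""
    digest = hashlib.sha256(firmware).digest()
    return hmac.new(key, digest + nonce, hashlib.sha256).digest()


def verify_measurement(key: bytes, expected_digest: bytes,
                       nonce: bytes, measurement: bytes) -> bool:
    """Guest owner recomputes the HMAC from the firmware hash it expects."""
    expected = hmac.new(key, expected_digest + nonce, hashlib.sha256).digest()
    return hmac.compare_digest(expected, measurement)


integrity_key = secrets.token_bytes(16)   # shared via the launch bundle
good_firmware = b"OVMF build the guest owner approved"
known_good_digest = hashlib.sha256(good_firmware).digest()

# Honest launch: the host loaded the firmware the guest owner expected.
nonce = secrets.token_bytes(16)
measurement = make_measurement(integrity_key, good_firmware, nonce)
honest_ok = verify_measurement(integrity_key, known_good_digest, nonce, measurement)

# Tampered launch: the host swapped in different firmware.
bad_nonce = secrets.token_bytes(16)
bad_measurement = make_measurement(integrity_key, b"tampered firmware", bad_nonce)
tampered_ok = verify_measurement(integrity_key, known_good_digest,
                                 bad_nonce, bad_measurement)

print(honest_ok, tampered_ok)
```

Because the host never sees the integrity key, it cannot forge a matching measurement, and because the nonce changes on every launch, an old measurement cannot be replayed for a new one.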

The Gaps in the System

The most obvious gap is that EFI booting systems don’t go straight from the OVMF firmware to the OS, they have to go via an EFI bootloader (grub, usually) which must be an efi binary on an unencrypted vFAT partition. The second gap is that grub must be modified to pick the disk encryption key out of wherever the Secure Element has stashed it. The third is that the key is currently stashed in VM memory before OVMF starts, so OVMF must know not to use or corrupt the memory. A fourth problem is that the current recommended way of booting OVMF has a flash drive for persistent variable storage which is under the control of the host owner and which isn’t part of the initial measurement.

Plugging The Gaps: OVMF

To deal with the problems in reverse order: the variable issue can be solved simply by not having a persistent variable store, since any mutable configuration information could be used to subvert the boot and leak the secret. This is achieved by stripping all the mutable variable handling out of OVMF. Solving key stashing simply means getting OVMF to set aside a page for a secret area and having QEMU recognise where it is for the secret injection. It turns out AMD were already working on a QEMU configuration table at a known location by the Reset Vector in OVMF, so the secret area is added as one of these entries. Once this is done, QEMU can retrieve the injection location from the OVMF binary so it doesn’t have to be specified in the QEMU Machine Protocol (QMP) command. Finally OVMF can protect the secret and package it up as an EFI configuration table for later collection by the bootloader.

The final OVMF change (which is in the same patch set) is to pull grub inside a Firmware Volume and execute it directly. This certainly isn’t the only possible solution to the problem (adding secure boot or an encrypted filesystem were other possibilities) but it is the simplest solution that gives a verifiable component that can be invariant across arbitrary encrypted boots (so the same OVMF can be used to execute any encrypted VM securely). This latter is important because traditionally OVMF is supplied by the host owner rather than being part of the VM image supplied by the guest owner. The grub script that runs from the combined volume must still be trusted to either decrypt the root or reboot to avoid leaking the key. Although the host owner still supplies the combined OVMF, the measurement assures the guest owner of its correctness, which is why having a fairly invariant component is a good idea … so the guest owner doesn’t have potentially thousands of different measurements for approved firmware.

Plugging the Gaps: QEMU

The modifications to QEMU are fairly simple: it just needs to scan the OVMF file to determine the location for the injected secret and inject it correctly using a QMP command. Since secret injection is already upstream, this is a simple patch set that finds the location and makes specifying it optional.

Plugging the Gaps: Grub

Grub today only allows for the manual input of the cryptodisk password. However, in the cloud we can’t do it this way because there’s no guarantee of a secure tty channel to the VM. The solution, therefore, is to modify grub so that the cryptodisk can use secrets from a provider, in addition to the manual input. We then add a provider that can read the efi configuration tables and extract the secret table if it exists. The current incarnation of the proposed patch set is here and it allows cryptodisk to extract a secret from an efisecret provider. Note this isn’t quite the same as the form expected by the upstream OVMF patch in its grub.cfg because now the provider has to be named on the cryptodisk command line thus

cryptodisk -s efisecret

but in all other aspects, Grub/grub.cfg works. I also discovered several other deviations from the initial grub.cfg (like Fedora uses /boot/grub2 instead of /boot/grub like everyone else) so the current incarnation of grub.cfg is here. I’ll update it as it changes.

Putting it All Together

Once you have applied all the above patches and built your version of OVMF with grub inside, you’re ready to do a confidential computing encrypted boot. However, you still need to verify the measurement and inject the encrypted secret. As I said before, this isn’t easy because, due to replay defeat requirements, the secret bundle must be constructed on the fly for each VM boot. From this point on I’m going to be using only AMD SEV as the example because the Intel hardware doesn’t yet exist and AMD kindly gave IBM research a box to play with (Anyone with a new EPYC 7xx1 or 7xx2 based workstation can likely play along at home, but check here). The first thing you need to do is construct a launch bundle. AMD has a tool called sev-tool to do this for you, and its first step is obtaining the platform Diffie Hellman certificate (pdh.cert). The tool will extract this for you

sevtool --pdh_cert_export

Or it can be given to you by the cloud service provider (in this latter case you’ll want to verify the provenance using sevtool --validate_cert_chain, which contacts the AMD site to verify all the details). Once you have a trusted pdh.cert, you can use this to generate your own guest owner DH cert (godh.cert) which should be used only one time to give a semblance of ECDHE. godh.cert is used with pdh.cert to derive an encryption key for the launch bundle. You can generate this with

sevtool --generate_launch_blob <policy>

The gory details of policy are in the SEV manual chapter 3, but most guests use 1 which means no debugging. This command will generate the godh.cert, the launch_blob.bin and a tmp_tk.bin file which you must save and keep secure because it contains the Transport Encryption and Integrity Keys (TEK and TIK) which will be used to encrypt the secret. Figuring out the qemu command line options needed to launch and pause a SEV guest is a bit of a palaver, so here is mine. You’ll likely need to change things, like the QMP port and the location of your OVMF build and the launch secret.
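The policy argument is a bit field, so a small decoder makes it easier to see what a given value asks for. The sketch below is illustrative only: the bit names follow my reading of the guest policy table in the SEV API specification and should be checked against the manual, and decode_policy is a hypothetical helper, not part of sevtool.

```python
# Low-order guest policy bits, as I understand the SEV API specification.
# Treat this table as an assumption and confirm it against the manual.
POLICY_BITS = {
    0: "NODBG",   # debugging of the guest is disallowed
    1: "NOKS",    # sharing keys with other guests is disallowed
    2: "ES",      # SEV-ES is required
    3: "NOSEND",  # sending the guest to another platform is disallowed
    4: "DOMAIN",  # the guest must not leave the platform's domain
    5: "SEV",     # the guest may only be sent to SEV-capable platforms
}


def decode_policy(policy):
    """Return the names of the low-order policy bits set in `policy`."""
    return [name for bit, name in sorted(POLICY_BITS.items()) if policy & (1 << bit)]


print(decode_policy(1))
```

With the value 1 used in the text, only the no-debugging bit is reported.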

Finally you need to get the launch measure from QMP, verify it against the sha256sum of OVMF.fd and create the secret bundle with the correct GUID headers. Since this is really fiddly to do with sevtool, I wrote this Python script to do it all (note it requires qmp.py from the qemu git repository). You execute it as

sevsecret.py --passwd <disk passwd> --tiktek-file <location of tmp_tk.bin> --ovmf-hash <hash> --socket <qmp socket>

And it will verify the launch measure and encrypt the secret for the VM if the measure is correct and start the VM. If you got everything correct the VM will simply boot up without asking for a password (if you inject the wrong secret, it will still ask). And there you have it: you’ve booted up a confidential VM from an encrypted image file. If you’re like me, you’ll also want to fire up gdb on the qemu process just to show that the entire memory of the VM is encrypted …
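To make the measurement check less of a black box, here is a minimal sketch of the verification step on its own. It is not taken from sevsecret.py. It assumes the layout I understand the SEV API to use for the launch measurement (an HMAC-SHA256 keyed with the TIK over a context byte, the firmware version, the policy, the OVMF digest and a nonce) and that QMP hands back the measurement and nonce concatenated in base64. All values in the self-check, including the firmware version numbers, are made up.

```python
import base64
import hashlib
import hmac
import os
import struct


def expected_measure(tik, ovmf_sha256, mnonce, policy, api_major, api_minor, build):
    """HMAC-SHA256 over the launch context, keyed with the TIK (assumed layout)."""
    msg = bytes([0x04, api_major, api_minor, build])  # 0x04 is the assumed context tag
    msg += struct.pack("<I", policy)                  # 32-bit guest policy, little endian
    msg += ovmf_sha256                                # launch digest: sha256 of OVMF.fd
    msg += mnonce                                     # per-launch nonce from the firmware
    return hmac.new(tik, msg, hashlib.sha256).digest()


def verify_launch_measure(measure_b64, tik, ovmf_sha256, policy, api_major, api_minor, build):
    """Split the base64 blob into measurement and nonce, then recompute and compare."""
    blob = base64.b64decode(measure_b64)
    measure, mnonce = blob[:32], blob[32:48]
    want = expected_measure(tik, ovmf_sha256, mnonce, policy, api_major, api_minor, build)
    return hmac.compare_digest(measure, want)


# Self-check with made-up values standing in for the platform's answer.
tik = os.urandom(16)
ovmf_hash = hashlib.sha256(b"stand-in for the contents of OVMF.fd").digest()
nonce = os.urandom(16)
fake = base64.b64encode(expected_measure(tik, ovmf_hash, nonce, 1, 0, 17, 5) + nonce)
print(verify_launch_measure(fake, tik, ovmf_hash, 1, 0, 17, 5))
```

The nonce is the reason the check cannot be precomputed: it changes on every launch, so the guest owner has to be in the loop each time.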

Conclusions and Caveats

The above script should allow you to boot an encrypted VM anywhere: locally or in the cloud, provided you can access the QMP port (most clouds use libvirt which introduces yet another additional layering pain). The biggest drawback, if you refer to the diagram, is the yellow box: you must trust the secret element, which in both Intel and AMD is proprietary, in order to get confidential computing to work. Although there is hope that in future the secret element could be fully open source, it isn’t today.

The next annoyance is that launching a confidential VM is high touch requiring collaboration from both the guest owner and the host owner (due to the anti-replay nonce). For a single launch, this is a minor annoyance but for an autoscaling (launch VMs as needed) platform it becomes a major headache. The solution seems to be to have some Hardware Security Module (HSM), like the cloud uses today to store encryption keys securely, and have it understand how to measure and launch encrypted VMs on behalf of the guest owner.

The final conclusion to remember is that confidentiality is not security: your VM is as exploitable inside a confidential encrypted VM as it was outside. In many ways confidentiality and security are opposites, in that security in part requires reducing the trusted code and confidentiality requires pulling as much as possible inside. Confidential VMs do have an answer to the Cloud trust problem since the enterprise can now deploy VMs without fear of tampering by the cloud provider, but those VMs are as insecure in the cloud as they were in the Enterprise Data Centre. All of this argues that Confidential Computing, while an important milestone, is only one step on the journey to cloud security.

Patch Status

The OVMF patches are upstream (including modifications requested by Intel for TDX). The QEMU and grub patch sets are still on the lists.

December 31, 2020 10:40 PM

December 30, 2020

Paul E. McKenney: Parallel Programming: December 2020 Update

This release of Is Parallel Programming Hard, And, If So, What Can You Do About It? features numerous improvements:

 


  1. LaTeX and build-system upgrades (including helpful error checking and reporting), formatting improvements (including much nicer display of hyperlinks and of Quick Quizzes, polishing of numerous figures and tables, plus easier builds for A4 paper), refreshing of numerous broken URLs, an improved “make help” command (see below), improved FAQ-BUILD material, and a prototype index, all courtesy of Akira Yokosawa.
  2. A lengthy Quick Quiz on the relationship of half-barriers, compilers, CPUs, and locking primitives, courtesy of Patrick Yingxi Pan.
  3. Updated performance results throughout the book, courtesy of a large x86 system kindly provided by Facebook.
  4. Compiler tricks, RCU semantics, and other material from the Linux-kernel memory model added to the memory-ordering and tools-of-the-trade chapters.
  5. Improved discussion of non-blocking-synchronization algorithms.
  6. Many new citations, cross-references, fixes, and touchups throughout the book.
A number of issues were spotted by Motohiro Kanda in the course of his translation of this book to Japanese, and Borislav Petkov, Igor Dzreyev, and Junchang Wang also provided much-appreciated fixes.

The output of the aforementioned make help is as follows:
Official targets (Latin Modern Typewriter for monospace font):
  Full,              Abbr.
  perfbook.pdf,      2c:   (default) 2-column layout
  perfbook-1c.pdf,   1c:   1-column layout

Set env variable PERFBOOK_PAPER to change paper size:
   PERFBOOK_PAPER=A4: a4paper
   PERFBOOK_PAPER=HB: hard cover book
   other (default):   letterpaper

"make help-full" will show the full list of available targets.

The following excerpt of the make help-full command's output might be of interest to those who find Quick Quizzes distracting:
Experimental targets:
  Full,              Abbr.
  perfbook-qq.pdf,   qq:   framed Quick Quizzes
  perfbook-nq.pdf,   nq:   no inline Quick Quizzes (chapterwise Answers)

Thus, the make nq command creates a perfbook-nq.pdf with Quick Quizzes and their answers grouped at the end of each chapter, in the usual textbook style, while still providing PDF navigation from each Quick Quiz to the relevant portion of that chapter.

Finally, this release also happens to be the first release candidate for the long-awaited Second Edition, which should be available shortly.

December 30, 2020 05:33 AM

December 23, 2020

James Bottomley: Building Encrypted Images for Confidential Computing

With both Intel and AMD announcing confidential computing features to run encrypted virtual machines, IBM research has been looking into a new format for encrypted VM images. The first question is why a new format is needed: after all, qcow2 only recently deprecated its old encrypted image format in favour of luks. The problem is that in confidential computing, the guest VM runs inside the secure envelope but the host hypervisor (including the QEMU process) is untrusted and thus runs outside it. Unfortunately, even for the new luks format, the encryption of the image is handled by QEMU, so the encryption key would be outside the secure envelope. Thus, a new format is needed to keep the encryption key (and, indeed, the encryption mechanism) within the guest VM itself. Fortunately, encrypted boot of Linux systems has been around for a while, and this can be used as a practical template for constructing a fully confidential encrypted image format and maintaining that confidentiality within a hostile cloud environment. In this article, I’ll explore the state of the art in encrypted boot, constructing EFI encrypted boot images, and finally, in the follow on article, look at deploying an encrypted image into a confidential environment and maintaining key secrecy in the cloud.

Encrypted Boot State of the Art

Luks and the cryptsetup toolkit have been around for a while and recently (in 2018), the luks format was updated to version 2. However, actually booting a linux kernel from an encrypted partition has always been a bit of a systems problem, primarily because the bootloader (grub) must decrypt the partition to actually load the kernel. Fortunately, grub can do this, but unfortunately the current grub in most distributions (2.04) can only read the version 1 luks format. Secondly, the user must type the decryption passphrase into grub (so it can pull the kernel and initial ramdisk out of the encrypted partition to boot them), but grub currently has no mechanism to pass it on to the initial ramdisk for mounting root, meaning that either the user has to type their passphrase twice (annoying) or the initial ramdisk itself has to contain a file with the disk passphrase. This latter is the most commonly used approach and only has minor security implications when the system is in motion (the ramdisk and the key file must be root read only) and the password is protected at rest by the fact that the initial ramdisk is also on the encrypted volume. Even more annoying is the fact that there is no distribution standard way of creating the initial ramdisk. Debian (and Ubuntu) have the most comprehensive documentation on how to do this, so the next section will look at the much less well documented systemd/dracut mechanism.

Encrypted Boot for Systemd/Dracut

Part of the problem here seems to be less than stellar systems co-ordination between the two components. Additionally, the way systemd supports passphraseless encrypted volumes has been evolving for a while but changed again in v246 to mirror the Debian method. Since cloud images are usually pretty up to date, I’ll describe this new way. Each encrypted volume is referred to by UUID (which will be the UUID of the containing partition returned by blkid). To get dracut to boot from an encrypted partition, you must pass in

rd.luks.uuid=<UUID>

but you must also have a key file named

/etc/cryptsetup-keys.d/luks-<UUID>.key

And, since dracut hasn’t yet caught up with this, you usually need a cryptodisk.conf file in /etc/dracut.conf.d/ which contains

install_items+=" /etc/cryptsetup-keys.d/* "
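Since all three pieces hang off the same UUID, it can help to derive them in one place. This is only a sketch: encrypted_boot_settings is a hypothetical helper and the UUID in the example is made up; the strings it produces are the ones given above.

```python
import uuid


def encrypted_boot_settings(luks_uuid):
    """Derive the dracut/systemd settings for one encrypted volume from its UUID."""
    canonical = str(uuid.UUID(luks_uuid))  # raises ValueError on a malformed UUID
    return {
        # kernel command line argument that tells dracut which volume to unlock
        "kernel_arg": f"rd.luks.uuid={canonical}",
        # key file name that systemd looks for
        "key_file": f"/etc/cryptsetup-keys.d/luks-{canonical}.key",
        # line for a cryptodisk.conf file in /etc/dracut.conf.d/
        "dracut_conf": 'install_items+=" /etc/cryptsetup-keys.d/* "',
    }


# Made-up UUID, for illustration only.
for name, value in encrypted_boot_settings("0f3a5c1e-9b2d-4e7a-8c6f-1d2e3f4a5b6c").items():
    print(f"{name}: {value}")
```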

Grub and EFI Booting Encrypted Images

Traditionally grub is actually installed into the disk master boot record, but for EFI boot that changed and the disk (or VM image) must have an EFI System partition which is where the grub.efi binary is installed. Part of the job of the grub.efi binary is to find the root partition and source the /boot/grub/grub.cfg. When you install grub on an EFI partition a search for the root by UUID is actually embedded into the grub binary. Another problem is likely that your distribution customizes the location of grub and updates the boot variables to tell the system where it is. However, a cloud image can’t rely on the boot variables and must be installed in the default location (\EFI\BOOT\bootx64.efi). This default location can be achieved by adding the --removable flag to grub-install.

For encrypted boot, this becomes harder because the grub in the EFI partition must set up the cryptographic location by UUID. However, if you add

GRUB_ENABLE_CRYPTODISK=y

to /etc/default/grub, it will do the necessary in grub-install and grub-mkconfig. Note that on Fedora, where every other GRUB_ENABLE parameter is true/false, this must be ‘y’: unfortunately, grub-install looks for =y, not =true.

Putting it all together: Encrypted VM Images

Start by extracting the root of an existing VM image to a tar file. Make sure it has all the tools you will need, like cryptodisk and grub-efi. Create a two partition raw image file and loopback mount it (I usually like 4GB) with a small efi partition (p1) and an encrypted root (p2):

truncate -s 4GB disk.img
parted disk.img mklabel gpt
parted disk.img mkpart primary 1Mib 100Mib
parted disk.img mkpart primary 100Mib 100%
parted disk.img set 1 esp on
parted disk.img set 1 boot on

Now setup the efi and cryptosystem (I use ext4, but it’s not required). Note at this time luks will require a password. Use a simple one and change it later. Also note that most encrypted boot documents advise filling the encrypted partition with random numbers. I don’t do this because the additional security afforded is small compared with the advantage of converting the raw image to a smaller qcow2 one.

losetup -P -f disk.img          # assuming here it uses loop0
l=($(losetup -l|grep disk.img)) # verify with losetup -l
mkfs.vfat ${l}p1
blkid ${l}p1       # remember the EFI partition UUID
cryptsetup --type luks1 luksFormat ${l}p2 # choose temp password
blkid ${l}p2       # remember this as <UUID> you'll need it later 
cryptsetup luksOpen ${l}p2 cr_root
mkfs.ext4 /dev/mapper/cr_root
mount /dev/mapper/cr_root /mnt
tar -C /mnt -xpf <vm root tar file>
for m in run sys proc dev; do mount --bind /$m /mnt/$m; done
chroot /mnt

Create or modify /etc/fstab to have root as /dev/mapper/cr_root and the EFI partition by label under /boot/efi. Now set up grub for encrypted boot:

echo "GRUB_ENABLE_CRYPTODISK=y" >> /etc/default/grub
mount /boot/efi
grub-install --removable --target=x86_64-efi
grub-mkconfig -o /boot/grub/grub.cfg

For Debian, you’ll need to add an /etc/crypttab entry for the encrypted disk:

cr_root UUID=<UUID> none luks

And then re-create the initial ramdisk. For dracut systems, you’ll have to modify /etc/default/grub so the GRUB_CMDLINE_LINUX has a rd.luks.uuid=<UUID> entry. If this is a selinux based distribution, you may also have to trigger a relabel.

Now would also be a good time to make sure you have a root password you know or to install /root/.ssh/authorized_keys. You should unmount all the binds and /mnt and try EFI booting the image. You’ll still have to type the password a couple of times, but once the image boots you’re operating inside the encrypted envelope. All that remains is to create a fast boot high entropy low iteration password and replace the existing one with it and set the initial ramdisk to use it. This example assumes your image is mounted as SCSI disk sda, but it may be a virtual disk or some other device.

dd if=/dev/urandom bs=1 count=33|base64 -w 0 > /etc/cryptsetup-keys.d/luks-<UUID>.key
chmod 600 /etc/cryptsetup-keys.d/luks-<UUID>.key
cryptsetup --key-slot 1 luksAddKey /dev/sda2 # permanent recovery key
cryptsetup --key-slot 0 luksRemoveKey /dev/sda2 # remove temporary
cryptsetup --key-slot 0 --iter-time 1 luksAddKey /dev/sda2 /etc/cryptsetup-keys.d/luks-<UUID>.key

Note the “-w 0” is necessary to prevent the password from having a trailing newline which will make it difficult to use. For mkinitramfs systems, you’ll now need to modify the /etc/crypttab entry

cr_root UUID=<UUID> /etc/cryptsetup-keys.d/luks-<UUID>.key luks

For dracut you need the key install hook in /etc/dracut.conf.d as described above and for Debian you need the keyfile pattern:

echo "KEYFILE_PATTERN=\"/etc/cryptsetup-keys.d/*\"" >>/etc/cryptsetup-initramfs/conf-hook

You now rebuild the initial ramdisk and you should now be able to boot the cryptosystem using either the high entropy password or your rescue one and it should only prompt in grub and shouldn’t prompt again. This image file is now ready to be used for confidential computing.
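For anyone scripting the image build, the key generation step can also be done without dd and base64. The sketch below is my own equivalent, not part of the original procedure: make_luks_key and write_key_file are hypothetical names. It relies on 33 random bytes encoding to exactly 44 base64 characters with no padding, and on b64encode never appending a newline.

```python
import base64
import os


def make_luks_key(nbytes=33):
    """Return a base64 key with no trailing newline (like `base64 -w 0`)."""
    return base64.b64encode(os.urandom(nbytes))


def write_key_file(path, key):
    """Write the key so that only the owner can read it (like `chmod 600`)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as f:
        f.write(key)
    os.chmod(path, 0o600)  # also tighten permissions if the file already existed


key = make_luks_key()
print(len(key))
```

The resulting file would then be passed to cryptsetup luksAddKey exactly as in the commands above.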

December 23, 2020 06:10 PM

December 22, 2020

Michael Kerrisk (manpages): man-pages-5.10 is released

Starting with this release, Alejandro (Alex) Colomar has joined me as project comaintainer, and we've released man-pages-5.10. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

This release resulted from patches, bug reports, reviews, and comments from around 25 contributors. The release includes just over 150 commits that changed around 140 pages.

The most notable of the changes in man-pages-5.10 are the following:

December 22, 2020 09:56 AM

December 16, 2020

Pete Zaitcev: Google outage

It's very funny to hear about people who were unable to turn on their lights because their houses were "smart". Not a good look for Google Nest! But I had a real problem:

Google outage crashed my Thunderbird so good that the only fix is to delete the ~/.thunderbird and re-add all accounts.

Yes, really.

December 16, 2020 06:20 AM

November 13, 2020

Dave Airlie (blogspot): lavapipe: a *software* swrast vulkan layer FAQ

(project was renamed from vallium to lavapipe)

I had some requirements for writing a vulkan software rasterizer within the Mesa project. I took some time to look at the options and realised that just writing a vulkan layer on top of gallium's llvmpipe would be a good answer for this problem. However in doing so I knew people would ask why this wouldn't work for a hardware driver.

tl;dr DO NOT USE LAVAPIPE OVER A GALLIUM HW DRIVER.

What is lavapipe?

The lavapipe layer is a gallium frontend. It takes the Vulkan API and roughly translates it into the gallium API.

How does it do that?

Vulkan is a low-level API: it allows the user to allocate memory, create resources and record command buffers, amongst other things. When a hw vulkan driver is recording a command buffer, it is putting hw specific commands into it that will be run directly on the GPU. These command buffers are submitted to queues when the app wants to execute them.

Gallium is a context level API, i.e. like OpenGL/D3D10. The user has to create resources and contexts and the driver internally manages command buffers etc. The driver controls internal flushing and queuing of command buffers.
 
In order to bridge the gap, the lavapipe layer abstracts the gallium context into a separate thread of execution. When recording a vulkan command buffer it creates a CPU side command buffer containing an encoding of the Vulkan API. It passes that recorded CPU command buffer to the thread on queue submission. The thread then creates a gallium context, and replays the whole CPU recorded command buffer into the context, one command at a time.
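The record-then-replay shape described above can be sketched in a few lines. This is a toy model, not Mesa code: CommandBuffer, FakeContext and ExecQueue are hypothetical names, and the "context" just logs what it is asked to do.

```python
import queue
import threading


class CommandBuffer:
    """CPU-side recording: an encoding of the API calls, nothing is executed yet."""

    def __init__(self):
        self.commands = []

    def record(self, name, *args):
        self.commands.append((name, args))


class FakeContext:
    """Stand-in for a context-level API object; it only logs the calls it receives."""

    def __init__(self):
        self.log = []

    def execute(self, name, args):
        self.log.append((name, args))


class ExecQueue:
    """Owns the worker thread; the context is created and used only on that thread."""

    def __init__(self):
        self._q = queue.Queue()
        self.context = None
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        self.context = FakeContext()
        while True:
            cmdbuf = self._q.get()
            if cmdbuf is None:  # sentinel: shut the worker down
                break
            for name, args in cmdbuf.commands:  # replay one command at a time
                self.context.execute(name, args)

    def submit(self, cmdbuf):
        self._q.put(cmdbuf)

    def finish(self):
        self._q.put(None)
        self._thread.join()


cb = CommandBuffer()
cb.record("bind_pipeline", "gfx")
cb.record("draw", 3, 1)
q = ExecQueue()
q.submit(cb)
q.finish()
print(q.context.log)
```

Every recorded command is touched twice, once when recorded and once when replayed, which is the extra CPU cost the next sections discuss.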

That sounds horrible, isn't it slow?

Yes.

Why doesn't that matter for *software* drivers?

Software rasterizers are a very different proposition from an overhead point of view than real hardware. CPU rasterization is pretty heavy on the CPU load, so nearly always 90% of your CPU time will be in the rasterizer and fragment shader. Having some minor CPU overheads around command submission and queuing isn't going to matter in the overall profile of the user application. CPU rasterization is already slow; the Vulkan->gallium translation overhead isn't going to be the reason for making it much slower.

For real HW drivers which are meant to record their own command buffers in the GPU domain and submit them direct to the hw, adding in a CPU layer that just copies the command buffer data is a massive overhead and one that can't easily be removed from the lavapipe layer.

The lavapipe execution context is also pretty horrible, it has to connect all the state pieces like shaders etc to the gallium context, and disconnect them all at the end of each command buffer. There is only one command submission queue, one context to be used. A lot of hardware exposes more queues etc that this will never model.

I still don't want to write a vulkan driver, give me more reasons.

Pipeline barriers:

Pipeline barriers in Vulkan are essential to efficient driver hw usage. They are one of the most difficult to understand and hard to get right pieces of writing a vulkan driver. For a software rasterizer they are also mostly unneeded. When I get a barrier I just completely hardflush the gallium context because I know the sw driver behind it. For a real hardware driver this would be a horrible solution. You spend a lot of time trying to make anything optimal here.

Memory allocation:

Vulkan is built around the idea of separate memory allocation and objects binding to those allocations. Gallium is built around object allocation with the memory allocs happening implicitly. I've added some simple memory allocation objects to the gallium API for swrast. These APIs are in no way useful for hw drivers. There is no way to expose memory types or heaps from gallium usefully. The current memory allocation API works for software drivers because I know all they want is an aligned_malloc. There is no decent way to bridge this gap without writing a new gallium API that looks like Vulkan. (in which case just write a vulkan driver already).

Can this make my non-Vulkan capable hw run Vulkan?

No. If the hardware can't do virtual memory properly, or expose features for vulkan this can't be fixed with a software layer that just introduces overhead.


November 13, 2020 02:16 AM

November 12, 2020

Dave Airlie (blogspot): Linux graphics, why sharing code with Windows isn't always a win.

A recent article on phoronix has some commentary about sharing code between Windows and Linux, and how this seems to be a metric that Intel likes.

I'd like to explore this idea a bit and explain why I believe it's bad for Linux based distros and our open source development models in the graphics area.

tl;dr there is a big difference between open source released and open source developed projects in terms of sustainability and community.

The Linux graphics stack from a distro vendor point of view is made up of two main projects, the Linux kernel and Mesa userspace. These two projects are developed in the open with completely open source vendor agnostic practices. There is no vendor controlling either project and both projects have a goal of trying to maximise shared code and shared processes/coding standards across drivers from all vendors.

This cross-vendor synergy is very important to the functioning ecosystem that is the Linux graphics stack. The stack also relies in some places on the LLVM project, but again LLVM upstream is vendor agnostic and open source developed.

The value to distros is they have central places to pick up driver stacks with good release cycles and a minimal number of places they have to deal with to interact with those communities. Now usually hardware vendors don't see the value in the external communities as much as Linux distros do. From a hardware vendor internal point of view they see more benefit in creating a single stack shared between their Windows and Linux to maximise their return on investment, or make their orgchart prettier or produce less powerpoints about why their orgchart isn't optimal.

A shared Windows/Linux stack as such is a thing the vendors want more for their own reasons than for the benefit of the Linux community.

Why is it a bad idea?

I'll start by saying it's not always a bad idea. In theory it might be possible to produce such a stack with the benefits of open source development model, however most vendors seem to fail at this. They see open source as a release model, they develop internally and shovel the results over the fence into a github repo every X weeks after a bunch of cycles. They build products containing these open source pieces, but they never expend the time building projects or communities around them.

As an example take AMDVLK vs radv. I started radv because AMD had been promising the world an open source Vulkan driver for Linux that was shared with their Windows stack. Even when it was delivered it was open source released but internally developed. There was no avenue for community participation in the driver development. External contributors were never on the same footing as an AMD employee. Even AMD employees on different teams weren't on the same footing. Compare this to the radv project in Mesa where it allowed Valve to contribute the ACO backend compiler and provide better results than AMD vendor shared code could ever have done, with far less investment and manpower.

Intel have a non-mesa compiler called Intel Graphics Compiler mentioned in the article. This is fully developed by Intel internally; there is little info on project direction or how to get involved or where the community is. There doesn't seem to be much public review, patches seem to get merged to the public repo by igcbot which may mean they are being mirrored from some internal repo. They are not using github merge requests etc. Compare this to development of a Mesa NIR backend where lots of changes are reviewed and maximal common code sharing is attempted so that all vendors benefit from the code.

One area where it has mostly sort of worked out is with the AMD display code in the kernel. I believe this code to be shared with their Windows driver (but I'm not 100% sure). They do try to engage with community changes to the code, but the code is still pretty horrible and not really optimal on Linux. Integrating it with atomic modesetting and refactoring was a pain. So even in the best case it's not an optimal outcome even for the vendor. They have to work hard to make the shared code be capable of supporting different OS interactions.

How would I do it?

If I had to share Windows/Linux driver stack I'd (biased opinion) start from the most open project and bring that into the closed projects. I definitely wouldn't start with a new internal project that tries to disrupt both. For example if I needed to create a Windows GL driver, I could:

a) write a complete GL implementation and throw it over the wall every few weeks, and make Windows/Linux use it. Linux users lose out on the shared stack, distros lose out on having one dependency and instead have to build a stack of multiple per-vendor deps, Windows gains nothing really, but I'm so in control of my own destiny (communities don't matter).

b) use Mesa and upstream my driver to share with the Linux stack, add the Windows code to the Mesa stack. I get to share the benefits of external development by other vendors and Windows gains that benefit, and Linux retains the benefits to its ecosystem.

A warning then to anyone wishing for more vendor code sharing between OSes: it generally doesn't end with Linux being better off, it ends up with Linux being more fragmented, harder to support and in the long run unsustainable.


November 12, 2020 12:05 AM

November 03, 2020

Brendan Gregg: BPF binaries: BTF, CO-RE, and the future of BPF perf tools

Two new technologies, BTF and CO-RE, are paving the way for BPF to become a billion-dollar industry. Right now there are many BPF (eBPF) startups building networking, security, and performance products (and more in stealth), yet requiring customers to install the LLVM, Clang, and kernel header dependencies – which can consume over 100 Mbytes of storage – is an adoption drag. BTF and CO-RE eliminate these dependencies at runtime, not only making BPF more practical for embedded Linux environments, but for adoption everywhere.

These technologies are:

- BTF: BPF Type Format, which provides struct information to avoid needing Clang and kernel headers.
- CO-RE: BPF Compile-Once Run-Everywhere, which allows compiled BPF bytecode to be relocatable, avoiding the need for recompilation by LLVM.

Clang and LLVM are still required for compilation, but the result is a lightweight ELF binary that includes the precompiled BPF bytecode and can be run everywhere. The BCC project has a collection of these, called [libbpf tools]. As an example, I ported over my opensnoop(8) tool:

# ./opensnoop
PID    COMM              FD ERR PATH
27974  opensnoop         28   0 /etc/localtime
1482   redis-server       7   0 /proc/1482/stat
1657   atlas-system-ag    3   0 /proc/stat
[…]
This opensnoop(8) is an ELF binary that doesn't use libLLVM or libclang:
# file opensnoop
opensnoop: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/l, for GNU/Linux 3.2.0, BuildID[sha1]=b4b5320c39e5ad2313e8a371baf5e8241bb4e4ed, with debug_info, not stripped

# ldd opensnoop
	linux-vdso.so.1 (0x00007ffddf3f1000)
	libelf.so.1 => /usr/lib/x86_64-linux-gnu/libelf.so.1 (0x00007f9fb7836000)
	libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f9fb7619000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f9fb7228000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f9fb7c76000)

# ls -lh opensnoop opensnoop.stripped
-rwxr-xr-x 1 root root 645K Feb 28 23:18 opensnoop
-rwxr-xr-x 1 root root 151K Feb 28 23:33 opensnoop.stripped
... and stripped is only 151 Kbytes.

Now imagine a BPF product: instead of requiring customers install various heavyweight (and brittle) dependencies, a BPF agent may now be a single tiny binary that works on any kernel that has BTF.

How this works

It's not just a matter of saving the BPF bytecode in ELF and then sending it to any other kernel. Many BPF programs walk kernel structs that can change from one kernel version to another. Your BPF bytecode may still execute on different kernels, but it may be reading the wrong struct offsets and printing garbage output! opensnoop(8) doesn't walk kernel structs since it instruments stable tracepoints and their arguments, but many other tools do.

This is an issue of *relocation*, and both BTF and CO-RE solve this for BPF binaries. BTF provides type information so that struct offsets and other details can be queried as needed, and CO-RE records which parts of a BPF program need to be rewritten, and how. CO-RE developer Andrii Nakryiko has written long posts explaining this in more depth: [BPF Portability and CO-RE] and [BTF Type Information].

CONFIG_DEBUG_INFO_BTF=y

These new BPF binaries are only possible if this kernel config option is set. It adds about 1.5 Mbytes to the kernel image (this is tiny in comparison to DWARF debuginfo, which can be hundreds of Mbytes). Ubuntu 20.10 has already made this config option the default, and all other distros should follow. Note to distro maintainers: it requires pahole >= 1.16.

The future of BPF performance tools, BCC Python, and bpftrace

For BPF performance tools, you should start with running [BCC] and [bpftrace] tools, and then coding in bpftrace. The BCC tools should eventually be switched from Python to libbpf C under the hood, but will work the same. **Coding performance tools in BCC Python is now considered deprecated** as we move to libbpf C with BTF and CO-RE (although we still have library work to do, such as for USDT support, so the Python versions will be needed for a while). Note that there are other use cases of BCC that may continue to use the Python interface; BPF co-maintainer Alexei Starovoitov and I briefly discussed this on [iovisor-dev].

My [BPF Performance Tools] book focused on running BCC tools and coding in bpftrace, and that doesn't change. However, **Appendix C's Python programming examples are now considered deprecated.** Apologies for the inconvenience. Fortunately it's only 15 pages of appendix material out of the 880-page book.

What about bpftrace? It does support BTF, and in the future we're looking at reducing its installation footprint as well (it can currently get to [29 Mbytes], and we think it can go a lot smaller). Given an average libbpf program size of 229 Kbytes (based on the current libbpf tools, stripped), and an average bpftrace program size of 1 Kbyte (my book tools), a large collection of bpftrace tools plus the bpftrace binary may become a smaller installation footprint than the equivalent in libbpf. Plus the bpftrace versions can be modified on the fly. libbpf is better suited for more complex and mature tools that need custom arguments and libraries.

As screenshots, the future of BPF performance tools is this:
# ls /usr/share/bcc/tools /usr/sbin/*.bt
argdist       drsnoop         mdflush         pythongc     tclobjnew
bashreadline  execsnoop       memleak         pythonstat   tclstat
[...]
/usr/sbin/bashreadline.bt    /usr/sbin/mdflush.bt    /usr/sbin/tcpaccept.bt
/usr/sbin/biolatency.bt      /usr/sbin/naptime.bt    /usr/sbin/tcpconnect.bt
[...]
... and this:
# bpftrace -e 'BEGIN { printf("Hello, World!\n"); }'
Attaching 1 probe...
Hello, World!
^C
... and **not** this:
#!/usr/bin/python

from bcc import BPF
from bcc.utils import printb

prog = """
int hello(void *ctx) {
    bpf_trace_printk("Hello, World!\\n");
    return 0;
}
"""
[...]
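
The footprint comparison earlier in this post can be put into rough numbers. The following is a back-of-the-envelope sketch only: it uses the averages quoted above (229 Kbytes per stripped libbpf tool, 1 Kbyte per bpftrace tool, a 29 Mbyte bpftrace binary), and it assumes 1 Mbyte = 1024 Kbytes. The function names are made up for illustration.

```python
KBYTES_PER_MBYTE = 1024  # assumption: binary units


def libbpf_footprint_kb(n_tools, avg_tool_kb=229):
    """Total size of n standalone libbpf tool binaries, in Kbytes."""
    return n_tools * avg_tool_kb


def bpftrace_footprint_kb(n_tools, binary_mb=29, avg_tool_kb=1):
    """Size of the bpftrace binary plus n bpftrace scripts, in Kbytes."""
    return binary_mb * KBYTES_PER_MBYTE + n_tools * avg_tool_kb


def break_even_tools(binary_mb=29):
    """Smallest tool count at which the bpftrace install is the smaller one."""
    n = 1
    while libbpf_footprint_kb(n) <= bpftrace_footprint_kb(n, binary_mb):
        n += 1
    return n


if __name__ == "__main__":
    print(break_even_tools())    # with the 29 Mbyte binary quoted above
    print(break_even_tools(10))  # with a hypothetical 10 Mbyte binary
```

Under these assumptions the crossover sits at 131 tools, and it drops quickly if the bpftrace binary shrinks (45 tools for the hypothetical 10 Mbyte binary).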
Thanks to Yonghong Song (Facebook) for leading development of BTF, Andrii Nakryiko (Facebook) for leading development of CO-RE, and everyone else involved in making this happen.

[BPF Portability and CO-RE]: https://facebookmicrosites.github.io/bpf/blog/2020/02/19/bpf-portability-and-co-re.html
[BTF Type Information]: https://facebookmicrosites.github.io/bpf/blog/2018/11/14/btf-enhancement.html
[BPF Performance Tools]: /bpf-performance-tools-book.html
[29 Mbytes]: https://github.com/iovisor/bpftrace/issues/342
[iovisor-dev]: https://lists.iovisor.org/g/iovisor-dev/topic/future_of_bcc_python_tools/77827559?p=,,,20,0,0,0::recentpostdate%2Fsticky,,,20,2,0,77827559
[BCC]: https://github.com/iovisor/bcc
[bpftrace]: https://github.com/iovisor/bpftrace
[libbpf tools]: https://github.com/iovisor/bcc/tree/master/libbpf-tools

November 03, 2020 01:00 PM

November 02, 2020

Michael Kerrisk (manpages): man-pages-5.09 is released

I've released man-pages-5.09. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

This release resulted from patches, bug reports, reviews, and comments from more than 40 contributors. The release includes more than 500 commits that changed nearly 600 pages. Nine new pages were added in this release.

The most notable of the changes in man-pages-5.09 are the following:

As is probably clear, Alejandro Colomar owns this release. With 265 commits, he was by some margin the top contributor, and I'm very happy to report that he beat me into second place as a contributor to this release (something that happened only once before since I became maintainer).

November 02, 2020 05:55 AM

October 30, 2020

Dave Airlie (blogspot): llvmpipe is OpenGL 4.5 conformant.

(I just sent the below email to the mesa3d developer list).

Just to let everyone know, a month ago I submitted the 20.2 llvmpipe
driver for OpenGL 4.5 conformance under the SPI/X.org umbrella, and it
is now official[1].

Thanks to everyone who helped me drive this forward, and to all the
contributors both to llvmpipe and the general Mesa stack that enabled
this.

Big shout out to Roland Scheidegger for helping review the mountain of
patches I produced in this effort.

My next plans involve submitting lavapipe for Vulkan 1.0, it's at 99%
or so CTS, but there are line drawing, sampler accuracy and some snorm
blending failures I have to work out.
I also ran the OpenCL 3.0 conformance suite against clover/llvmpipe
yesterday and have some vague hopes of driving that to some sort of
completion.

(for GL 4.6 only texture anisotropy is really missing, I've got
patches for SPIR-V support, in case someone was feeling adventurous).

Dave.

[1] https://www.khronos.org/conformance/adopters/conformant-products/opengl#submission_272

October 30, 2020 08:25 PM

Andy Grover: Upgrading to Fedora 33: Removing Your Old Swap File on EFI Machine

Fedora 33 adds a compressed-memory-based swap device using zram. Cool! Now you can remove your old swap device, if you were a curmudgeon like me and even had one in the first place.

If you are NOT on an EFI system or not using LVM, be aware of this and make changes to these steps as needed. (Specifically, the path given in step 6 will be different.)

  1. After upgrading to Fedora 33, run free. Notice that swap size is the sum of the 4G zram device plus your previous disk-based swap device. Try zramctl and lsblk commands for more info.
  2. Stop swapping to the swap device we’re about to remove. If using LVM, expect the VG and LV names to be different.
    swapoff /dev/vg0/swap
  3. If LVM, remove the no-longer-needed logical volume.
    lvremove /dev/vg0/swap
  4. Edit /etc/fstab and remove (or comment out) the line for your swap device.
  5. Edit /etc/default/grub.
    In the GRUB_CMDLINE_LINUX line, remove the “resume=” part referring to the now-gone swap partition, and the “rd.lvm.lv=” part that also refers to it.
  6. Apply above changes to actual GRUB configuration:
    grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg

Reboot and your system should come back up. Enjoy using that reclaimed disk space for more useful things — it’s now unused space in the LVM volume group. If you want to actually use it, look into lvextend, and also resize2fs or xfs_growfs.

October 30, 2020 07:01 PM

October 29, 2020

Paul E. McKenney: Stupid RCU Tricks: Torturing RCU Fundamentally, Parts IV and V

Continuing further into the Linux-kernel Documentation/RCU/Design/Requirements/Requirements.rst file uncovers RCU's final two fundamental guarantees:

 

  1. The common-case RCU primitives are unconditional, and
  2. RCU users can perform a guaranteed read-to-write upgrade.

The first guarantee is trivially verified by inspection of the RCU API. The return types of rcu_read_lock(), rcu_read_unlock(), synchronize_rcu(), call_rcu(), and rcu_assign_pointer() are all void. These API members therefore have no way to indicate failure. Even primitives like rcu_dereference(), which do have non-void return types, will succeed any time a load of their pointer argument would succeed. That is, if you do rcu_dereference(*foop), where foop is a NULL pointer, then yes, you will get a segmentation fault. But this segmentation fault will be unconditional, as advertised!

The second guarantee is a consequence of the first four guarantees, and must be tested not within RCU itself, but rather within the code using RCU to carry out the read-to-write upgrade.

Thus for these last two fundamental guarantees there is no code in rcutorture. But maybe even rcutorture deserves a break from time to time! ;–)

October 29, 2020 11:27 PM

Paul E. McKenney: Stupid RCU Tricks: Torturing RCU Fundamentally, Part III

Even more reading of the Linux-kernel Documentation/RCU/Design/Requirements/Requirements.rst file encounters RCU's memory-barrier guarantees. These guarantees are a bit ornate, but roughly speaking guarantee that RCU read-side critical sections lapping over one end of a given grace period are fully ordered with anything past the other end of that same grace period. RCU's overall approach towards this guarantee is shown in the Linux-kernel Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst file, so one approach would be to argue that these guarantees are proven by a combination of this documentation along with periodic code inspection. Although this approach works well for some properties, the periodic code inspections require great attention to detail spanning a large quantity of intricate code. As such, these inspections are all too vulnerable to human error.

Another approach is formal verification, and in fact RCU's guarantees have been formally verified. Unfortunately, these formal-verification efforts, groundbreaking though they are, must be considered to be one-off tours de force. In contrast, RCU needs regular regression testing.

This leaves rcutorture, which has the advantage of being tireless and reasonably thorough, especially when compared to human beings. Except that rcutorture does not currently test RCU's memory-barrier guarantees.

Or at least it did not until today.

A new commit (which has since been accepted into Linux kernel v5.11) enlists the existing RCU readers. Each reader frequently increments a free-running counter, which can then be used to check memory ordering: If the counter appears to have counted backwards, something is broken. Each reader samples and records a randomly selected reader's counter, and assigns some other randomly selected reader to check for backwardsness. A flag is set at the end of each grace period, and once this flag is set, that other reader takes another sample of that same counter and compares them.
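
The bookkeeping behind this check can be sketched in a few lines. This is a single-threaded Python model with made-up names (Reader, run_check, check_ordering), not the rcutorture code: Python cannot exhibit the memory reordering the real test hunts for, so the sketch only shows what is sampled, when, and what comparison signals a failure.

```python
import random


class Reader:
    """Toy reader that owns a free-running counter."""

    def __init__(self):
        self.counter = 0

    def tick(self):
        # The real readers increment their counters frequently as they run.
        self.counter += 1


def check_ordering(first_sample, second_sample):
    """Return True if the counter appears to have counted backwards."""
    return second_sample < first_sample


def run_check(readers, rng, ticks=5):
    """One round: sample a counter, let a grace period pass, resample, compare."""
    target = rng.choice(readers)  # randomly selected reader whose counter is sampled
    first = target.counter        # sample recorded before the grace period ends
    for _ in range(ticks):        # the target keeps running in the meantime
        target.tick()
    grace_period_ended = True     # stands in for the flag set at the end of each grace period
    second = target.counter       # the checking reader takes its second sample
    return grace_period_ended and check_ordering(first, second)


if __name__ == "__main__":
    rng = random.Random(0)
    readers = [Reader() for _ in range(4)]
    print(run_check(readers, rng))  # a correctly ordered run reports no error
    print(check_ordering(10, 9))    # a hand-injected stale sample is flagged
```

In this model a failure can only be produced by injecting a stale second sample by hand, which is exactly the situation the in-kernel check is designed to catch.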

The test strategy for this particular fundamental property of RCU is more complex and likely less effective than the strategy for the memory-ordering property described earlier, but life is like that sometimes.

October 29, 2020 10:47 PM

October 14, 2020

Paul E. McKenney: Stupid RCU Tricks: Torturing RCU Fundamentally, Part II

Further reading of the Linux-kernel Documentation/RCU/Design/Requirements/Requirements.rst file encounters RCU's publish/subscribe guarantee. This guarantee ensures that RCU readers that traverse a newly inserted element of an RCU-protected data structure never see pre-initialization garbage in that element. In CONFIG_PREEMPT_NONE=y kernels, this guarantee combined with the grace-period guarantee permits RCU readers to traverse RCU-protected data structures using exactly the same sequence of instructions that would be used if these data structures were immutable. As always, free is a very good price!

However, some care is required to make use of this publish/subscribe guarantee. When inserting a new element, updaters must take care to first initialize everything that RCU readers might access and only then use an RCU primitive to carry out the insertion. Such primitives include rcu_assign_pointer() and list_add_rcu(), but please see The RCU API, 2019 edition or the Linux-kernel source code for the full list.

For their part, readers must use an RCU primitive to carry out their traversals, for example, rcu_dereference() or list_for_each_entry_rcu(). Again, please see The RCU API, 2019 edition or the Linux-kernel source code for the full list of such primitives.

Of course, rcutorture needs to test this publish/subscribe guarantee. It does this using yet another field in the rcu_torture structure:

struct rcu_torture {
  struct rcu_head rtort_rcu;
  int rtort_pipe_count;
  struct list_head rtort_free;
  int rtort_mbtest;
};

This additional field is ->rtort_mbtest, which is set to zero when a given rcu_torture structure is freed for reuse (see the rcu_torture_pipe_update_one() function), and then set to 1 just before that structure is made available to readers (see the rcu_torture_writer() function). For its part, the rcu_torture_one_read() function checks to see if this field is zero, and if so flags the error by atomically incrementing the global n_rcu_torture_mberror counter. As you would expect, any run ending with a non-zero value in this counter is considered to be a failure.
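
The flag protocol described above can be modeled in a few lines. This is a single-threaded Python sketch, not the kernel code: the RcuTorture and Stats classes and the three function names are stand-ins invented for illustration, and only the field name rtort_mbtest and the counter name n_rcu_torture_mberror come from the text.

```python
from dataclasses import dataclass


@dataclass
class RcuTorture:
    """Toy stand-in for struct rcu_torture, keeping only the fields used here."""
    rtort_pipe_count: int = 0
    rtort_mbtest: int = 0  # 0: freed or not yet initialized, 1: safe for readers


@dataclass
class Stats:
    """Stands in for the global n_rcu_torture_mberror counter."""
    n_rcu_torture_mberror: int = 0


def free_for_reuse(elem):
    """Updater: the element is being recycled, so clear the flag."""
    elem.rtort_mbtest = 0


def initialize_and_publish(elem):
    """Updater: finish initialization, set the flag, then hand out the element."""
    elem.rtort_pipe_count = 0
    elem.rtort_mbtest = 1
    return elem  # the kernel would publish with an RCU primitive at this point


def reader_check(elem, stats):
    """Reader: a reachable element whose flag is still zero counts as an error."""
    if elem.rtort_mbtest == 0:
        stats.n_rcu_torture_mberror += 1


if __name__ == "__main__":
    stats = Stats()
    elem = initialize_and_publish(RcuTorture())
    reader_check(elem, stats)
    print(stats.n_rcu_torture_mberror)  # any non-zero value would mean a failed run
```

The model shows only the invariant being checked. In the kernel, the interesting failures come from a reader observing the published pointer before the store to the flag, which is what the publish and subscribe primitives are there to prevent.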

Thus we have an important fundamental property of RCU that nevertheless happens to have a simple but effective test strategy. To the best of my knowledge, this was also the first aspect of Linux-kernel RCU that was subjected to an automated proof of correctness.

Sometimes you get lucky! ;–)

October 14, 2020 11:16 PM