Kernel Planet

September 23, 2023

Linux Plumbers Conference: Linux Plumbers Conference General Registration Reopened

Now that the MC selection process is finished, we’ve recovered enough passes to reopen general registration. If you still wish to register, please go to our Attend page.

Hopefully we recovered enough passes to keep registration open for a couple of weeks, if not longer, but please don’t wait …

September 23, 2023 06:57 PM

September 14, 2023

Rusty Russell: The HRF Bounty On Human-Readable Offers

At least two people have contacted me concerning the 2 BTC bounty:

2 BTC for a human-readable bolt 12 offer generator feature integrated into a popular iOS or android bitcoin wallet. “Human-readable” means something that can be used on feature phone without QR or copy/paste ability. For example, something that looks like LN address.

This, of course, is asking to solve Zooko’s Triangle, so one of decentralizationm, human readability, or security needs to compromise! Fortunately, the reference to LN address gives a hint on how we might proceed.

The scenario, presumably, is Bob wants to pay Alice, where Alice shows Bob a “Human Readable Offer” and Bob types it into his phone. Each one runs Phoenix, Greenlight, or (if their phone is too low-end) uses some hosted service, but any new third party trust should be minimized.

There are three parts we need here:

  1. Bob finds Alice’s node.
  2. Bob requests Alice’s node for invoice.
  3. If she wants, Alice can easily check Bob’s going to pay the right thing.

The Imagined Scenario

Consider the normal offer case: the offer encodes Alice’s nodeid and description (and maybe other info) about what’s on offer. Bob turns this into an invoice_request, sends an onion message to Alice’s node, which returns the (signed) invoice, which Bob pays. We need to encode that nodeid and extra information as compactly as we can.

Part 1: Finding Alice’s Node from a Human Readable Offer

The issue of “finding Alice’s node” has been drafted already for BOLT12, at https://github.com/rustyrussell/bolt12address (but it needs updating!). This means that if you say “rusty@blockstream.com” you can get a valid generic offer, either by contacting the webserver at “blockstream.com” or having someone else do it for you (important for privacy!), or even downloading a public list of common receivers.

Note that it’s easier to type * than @ on feature phones, so I suggest allowing both rusty@blockstream.com and RUSTY*BLOCKSTREAM.COM.

What’s Needed On The Server

  1. The BOLT 12 Address Format needs to be updated.
  2. It needs to be implemented for some Web server.
  3. Ideally, integrate it into BTC Payserver or the like.

Part 2: Getting the Invoice

Now, presumably, we want a specific invoice: it might be some default “donate to Alice”, but it could be a specific thing “$2 hot dog”. So you really want some (short!) short-code to indicate which invoice you want. I suggest a hash, followed by some randomly chosen alphanumeric string here (case-insensitive!): an implementation may choose to restrict themselves to numbers however, as that’s faster to enter on a feature phone.

What’s Needed On The Server

  1. We can put the short-code in the invreq_payer_note field in BOLT 12 or add a new odd field.
  2. We need to implement (presumably in Core Lightning):
    • A way to specify/assign a short-code for each offer.
    • A way of serving a particular invoice based on this short-code match.

Part 3: Checking the Invoice

So, did you even get the right node id? That’s the insecure part; you’re trusting blockstream.com! Checking the nodeid is hard: someone can grind out a nodeid with the same first 16 digits in a few weeks. But I think you can provide some assurance, by creating a 4-color “flag” using the node id and the latest bitcoin blocks: this will change every new block, and is comparable between Alice and Bob at a glance:

Example nodeid flag for block 807747

This was made using this hacky code which turns my node id 024b9a1fa8e006f1e3937f65f66c408e6da8e1ca728ea43222a7381df1cc449605 into an RGB color (by hashing the nodeid+blockhash).

For a moment, when a new block comes in, one image might be displaced, hence the number, but it’ll only be out by one.

Putting it All Together

What’s Needed On Alice’s Client

  1. Alice needs to configure her BOLT12 Address with some provider when she sets up the phone: it should check that it works!
  2. She should be able to choose an existing offer (may be a “donation” by default), or create a new one on the fly (with a new short code).
  3. Display the BOLT12-ADDRESS # SHORT-CODE, and the current nodeid flag.

What’s Needed On Bob’s Client

  1. It needs to be able to convert BOLT12-ADDRESS into a bolt12 address request:
    • Either via some service (to be implemented!), or by directly query (ideally over Tor).
  2. It needs to be able to produce an offer from the returns bolt12 address response, by putting the SHORT-CODE into the invreq_payer_note.
  3. It needs to be able to fetch an invoice for this offer.
  4. It needs to be able to display the current nodeid flag for the invoice’s node id.
  5. Allow Bob to confirm to send payment.

Is There Anything Else?

There are probably other ways of doing this, but this method has the advantage of driving maturity in several different areas which we want to see in Bitcoin:

  1. bolt12 address to support vendor field validation for offers.
  2. Simple name support for bootstrapping.
  3. Driving Bitcoin to be more accessible to everyone!

Feel free to contact me with questions!

September 14, 2023 02:30 PM

September 13, 2023

Matthew Garrett: Reconstructing an invalid TPM event log

TPMs contain a set of registers ("Platform Configuration Registers", or PCRs) that are used to track what a system boots. Each time a new event is measured, a cryptographic hash representing that event is passed to the TPM. The TPM appends that hash to the existing value in the PCR, hashes that, and stores the final result in the PCR. This means that while the PCR's value depends on the precise sequence and value of the hashes presented to it, the PCR value alone doesn't tell you what those individual events were. Different PCRs are used to store different event types, but there are still more events than there are PCRs so we can't avoid this problem by simply storing each event separately.

This is solved using the event log. The event log is simply a record of each event, stored in RAM. The algorithm the TPM uses to calculate the PCR values is known, so we can reproduce that by simply taking the events from the event log and replaying the series of events that were passed to the TPM. If the final calculated value is the same as the value in the PCR, we know that the event log is accurate, which means we now know the value of each individual event and can make an appropriate judgement regarding its security.

If any value in the event log is invalid, we'll calculate a different PCR value and it won't match. This isn't terribly helpful - we know that at least one entry in the event log doesn't match what was passed to the TPM, but we don't know which entry. That means we can't trust any of the events associated with that PCR. If you're trying to make a security determination based on this, that's going to be a problem.

PCR 7 is used to track information about the secure boot policy on the system. It contains measurements of whether or not secure boot is enabled, and which keys are trusted and untrusted on the system in question. This is extremely helpful if you want to verify that a system booted with secure boot enabled before allowing it to do something security or safety critical. Unfortunately, if the device gives you an event log that doesn't replay correctly for PCR 7, you now have no idea what the security state of the system is.

We ran into that this week. Examination of the event log revealed an additional event other than the expected ones - a measurement accompanied by the string "Boot Guard Measured S-CRTM". Boot Guard is an Intel feature where the CPU verifies the firmware is signed with a trusted key before executing it, and measures information about the firmware in the process. Previously I'd only encountered this as a measurement into PCR 0, which is the PCR used to track information about the firmware itself. But it turns out that at least some versions of Boot Guard also measure information about the Boot Guard policy into PCR 7. The argument for this is that this is effectively part of the secure boot policy - having a measurement of the Boot Guard state tells you whether Boot Guard was enabled, which tells you whether or not the CPU verified a signature on your firmware before running it (as I wrote before, I think Boot Guard has user-hostile default behaviour, and that enforcing this on consumer devices is a bad idea).

But there's a problem here. The event log is created by the firmware, and the Boot Guard measurements occur before the firmware is executed. So how do we get a log that represents them? That one's fairly simple - the firmware simply re-calculates the same measurements that Boot Guard did and creates a log entry after the fact[1]. All good.

Except. What if the firmware screws up the calculation and comes up with a different answer? The entry in the event log will now not match what was sent to the TPM, and replaying will fail. And without knowing what the actual value should be, there's no way to fix this, which means there's no way to verify the contents of PCR 7 and determine whether or not secure boot was enabled.

But there's still a fundamental source of truth - the measurement that was sent to the TPM in the first place. Inspired by Henri Nurmi's work on sniffing Bitlocker encryption keys, I asked a coworker if we could sniff the TPM traffic during boot. The TPM on the board in question uses SPI, a simple bus that can have multiple devices connected to it. In this case the system flash and the TPM are on the same SPI bus, which made things easier. The board had a flash header for external reprogramming of the firmware in the event of failure, and all SPI traffic was visible through that header. Attaching a logic analyser to this header made it simple to generate a record of that. The only problem was that the chip select line on the header was attached to the firmware flash chip, not the TPM. This was worked around by simply telling the analysis software that it should invert the sense of the chip select line, ignoring all traffic that was bound for the flash and paying attention to all other traffic. This worked in this case since the only other device on the bus was the TPM, but would cause problems in the event of multiple devices on the bus all communicating.

With the aid of this analyser plugin, I was able to dump all the TPM traffic and could then search for writes that included the "0182" sequence that corresponds to the command code for a measurement event. This gave me a couple of accesses to the locality 3 registers, which was a strong indication that they were coming from the CPU rather than from the firmware. One was for PCR 0, and one was for PCR 7. This corresponded to the two Boot Guard events that we expected from the event log. The hash in the PCR 0 measurement was the same as the hash in the event log, but the hash in the PCR 7 measurement differed from the hash in the event log. Replacing the event log value with the value actually sent to the TPM resulted in the event log now replaying correctly, supporting the hypothesis that the firmware was failing to correctly reconstruct the event.

What now? The simple thing to do is for us to simply hard code this fixup, but longer term we'd like to figure out how to reconstruct the event so we can calculate the expected value ourselves. Unfortunately there doesn't seem to be any public documentation on this. Sigh.

[1] What stops firmware on a system with no Boot Guard faking those measurements? TPMs have a concept of "localities", effectively different privilege levels. When Boot Guard performs its initial measurement into PCR 0, it does so at locality 3, a locality that's only available to the CPU. This causes PCR 0 to be initialised to a different initial value, affecting the final PCR value. The firmware can't access locality 3, so can't perform an equivalent measurement, so can't fake the value.

comment count unavailable comments

September 13, 2023 09:02 PM

September 03, 2023

Linux Plumbers Conference: Registration Currently Sold Out, We’re Trying to Add More Places

Linux Plumbers is now sold out and in-person registration is closed.

This year it happened not as fast as in 2022, but the registration is still sold out long before the event.

We are setting up a waitlist for in-person registration (virtual attendee places are still available). Please fill in this form and try to be clear about your reasons for wanting to attend. This year we’re giving waitlist priority to new attendees and people expected to contribute content.

September 03, 2023 10:00 AM

Linux Plumbers Conference: Containers and Checkpoint/Restore MC CFP

The Containers and Checkpoint/Restore micro-conference focuses on both userspace and kernel related work. The micro-conference targets the wider container ecosystem ideally with participants from all major container runtimes as well as init system developers.

The microconference will be discussing recent advancements in container technologies with some of the usual candidates being:

On the checkpoint/restore front, some of the potential topics include:

And quite likely a variety of other container and checkpoint/restore topics as things evolve between now and the event.

Past editions of this micro-conference have been the source of many developments in the Linux kernel, including:

Use LPC abstract submission page to submit your proposals and select “Containers and Checkpoint/Restart” track.

September 03, 2023 09:26 AM

September 01, 2023

Dave Airlie (blogspot): Talk about compute and community and where things are at.

 Sriram invited me to the oneAPI meetup, and I felt I hadn't summed up the state of compute and community development in a while. Enjoy 45 minutes of opinions!

https://www.youtube.com/watch?v=HzzLY5TdnZo




September 01, 2023 07:12 PM

August 31, 2023

Linux Plumbers Conference: Power Management and Thermal Control MC CFP

The Power Management and Thermal Control microconference focuses on power management and thermal control infrastructure, CPU and device power-management mechanisms, and thermal control methods.

In particular, we are interested in improving the thermal control infrastructure in the kernel to cover more use cases and utilizing energy-saving opportunities offered by modern hardware in new ways.

The goal is to facilitate cross-framework and cross-platform discussions that can help improve energy-awareness and thermal control in Linux.

The current list of topics proposed so far includes the following:

 

August 31, 2023 11:37 AM

August 29, 2023

Matthew Garrett: Unix sockets, Cygwin, SSH agents, and sadness

Work involves supporting Windows (there's a lot of specialised hardware design software that's only supported under Windows, so this isn't really avoidable), but also involves git, so I've been working on extending our support for hardware-backed SSH certificates to Windows and trying to glue that into git. In theory this doesn't sound like a hard problem, but in practice oh good heavens.

Git for Windows is built on top of msys2, which in turn is built on top of Cygwin. This is an astonishing artifact that allows you to build roughly unmodified POSIXish code on top of Windows, despite the terrible impedance mismatches inherent in this. One is that until 2017, Windows had no native support for Unix sockets. That's kind of a big deal for compatibility purposes, so Cygwin worked around it. It's, uh, kind of awful. If you're not a Cygwin/msys app but you want to implement a socket they can communicate with, you need to implement this undocumented protocol yourself. This isn't impossible, but ugh.

But going to all this trouble helps you avoid another problem! The Microsoft version of OpenSSH ships an SSH agent that doesn't use Unix sockets, but uses a named pipe instead. So if you want to communicate between Cygwinish OpenSSH (as is shipped with git for Windows) and the SSH agent shipped with Windows, you need something that bridges between those. The state of the art seems to be to use npiperelay with socat, but if you're already writing something that implements the Cygwin socket protocol you can just use npipe to talk to the shipped ssh-agent and then export your own socket interface.

And, amazingly, this all works? I've managed to hack together an SSH agent (using Go's SSH agent implementation) that can satisfy hardware backed queries itself, but forward things on to the Windows agent for compatibility with other tooling. Now I just need to figure out how to plumb it through to WSL. Sigh.

comment count unavailable comments

August 29, 2023 06:57 AM

Linux Plumbers Conference: Confidential Computing MC CFP

Confidential Computing is continuing to remain a popular topic in computing industry. From memory encryption to trusted I/O, hardware has been constantly improving and broadening. In the past years,  confidential computing microconferences have brought together developers working on various features in hypervisors, firmware, Linux kernel, low level userspace up to container runtimes. We have  discussed a broad range of topics, ranging from, hardware enablement to generic attestation workflows.

Just in the last year, we have seen support for Intel TDX and AMD SEV-SNP guests merged into Linux. Support for unaccepted memory has also landed in mainline. We have also had support for running as a CVM under Hyper-V partially merged into the kernel. However, there is still a long way to go before a complete Confidential Computing stack with open source software and Linux as the hypervisor becomes a reality. We invite contributions to this microconference to help make progress to that goal.

Topics of interest include

Please use the LPC CfP process to submit your proposals. Submissions can be made via the LPC abstract submission page. Make sure to select “Confidential Computing MC” as the track.

August 29, 2023 04:48 AM

August 23, 2023

Linux Plumbers Conference: IoT MC CFP

The IoT Microconference is a forum for developers to discuss all things IoT. Topics include tools, telemetry, device drivers, and protocols in not only the Linux kernel but also Real-Time Operating Systems such as Zephyr.

Since last year, there have been a number of new technical topics with significant updates.

Current Problems that require attention (stakeholders):

On a slightly less technical topic.

We are pleased to announce that the IoT Microconference is now accepting proposals!

If you are interested in presenting an IoT-related topic involving the Linux kernel, userspace tools, firmware, Zephyr, or frameworks, please upload your submission before September 15th.

Submissions can be made via the LPC Call for Proposals, by selecting Internet of Things MC for your track.

August 23, 2023 02:53 PM

August 21, 2023

Linux Plumbers Conference: VFIO/IOMMU/PCI MC CFP

On behalf of the PCI sub-system maintainers, we would like to invite everyone to join the VFIO/IOMMU/PCI micro-conference (MC) this year.

We are hoping to bring together, both in person and online, everyone interested in the VFIO, IOMMU, and PCI space to talk about the latest developments and challenges in these areas.

The PCI interconnect specification, the devices that implement it, and the system IOMMUs that provide memory and access control to them are nowadays a de-facto standard for connecting high-speed components, incorporating more and more features such as:

These features are aimed at high-performance systems, server and desktop computing, embedded and SoC platforms, virtualisation, and ubiquitous IoT devices.

The kernel code that enables these new system features focuses on coordination between the PCI devices, the IOMMUs they are connected to, and the VFIO layer used to manage them (for userspace access and device passthrough) with related kernel interfaces and userspace APIs to be designed in-sync and in a clean way for all three sub-systems.

The VFIO/IOMMU/PCI MC focuses on the kernel code that enables these new system features, often requiring coordination between the VFIO, IOMMU and PCI sub-systems.

Following the success of LPC 2017, 2019, 2020, 2021, and 2022 VFIO/IOMMU/PCI MC, the Linux Plumbers Conference 2023 VFIO/IOMMU/PCI track will focus on promoting discussions on the PCI core but also current kernel patches aimed at VFIO/IOMMU/PCI sub-systems with specific sessions targeting discussions requiring the three sub-systems coordination.

See the following video recordings from 2022: LPC 2022 – VFIO/IOMMU/PCI MC

Older recordings can be accessed through our official YouTube channel at @linux-pci and the archived LPC 2017 VFIO/IOMMU/PCI MC web page at Linux Plumbers Conference 2017, where the audio recordings from the MC track and links to presentation materials are available.

The tentative schedule will provide an update on the current state of VFIO/IOMMU/PCI kernel sub-systems, followed by a discussion of current issues in the proposed topics.

The following was a result of last year’s successful Linux Plumbers MC:

Tentative topics that are under consideration for this year include (but are not limited to):

If you are interested in participating in this MC and have topics to propose, please use the Call for Proposals (CfP) process.

Otherwise, join us to discuss helping Linux keep up with the new features added to the PCI interconnect specification. We hope to see you there!

Proposals can be submitted here here by selecting Track “VFIO/IOMMU/PCI MC

August 21, 2023 08:41 AM

August 17, 2023

Linux Plumbers Conference: Tracing MC CFP

The Linux kernel has grown in complexity over the years. Complete understanding of how it works via code inspection has become virtually impossible. Today, tracing is used to follow the kernel as it performs its complex tasks. Tracing is used today for much more than simply debugging. Its framework has become the way for other parts of the Linux kernel to enhance and even make possible new features. Live kernel patching is based on the infrastructure of function tracing, as well as BPF function hooks. It is now even possible to model the behavior and correctness of the system via runtime verification which attaches to trace points. There is still much more that is happening in this space, and this microconference will be the forum to explore current and new ideas.

Results and accomplishments from the last Tracing microconference (2021):

Possible ideas for topics for this year’s conference:

For more information, feel free to contact the MC Leads:
Steven Rostedt <rostedt@goodmis.org>
Masami Hiramatsu <mhiramat@kernel.org>
Please follow the suggestions from this BLOG post when submitting a CFP for this track.
Submissions are made via LPC submission system, selecting Track “Tracing MC”

August 17, 2023 01:09 PM

August 09, 2023

Linux Plumbers Conference: Kernel Testing & Dependability MC CFP

Once again The Kernel Testing & Dependability Micro-conference will be taking place at LPC 2023, to discuss testing and dependability related topics.

Please submit proposals for discussion via LPC submission system.

The Linux Plumbers 2023 Kernel Testing & Dependability track focuses on advancing the current state of testing of the Linux Kernel and its related infrastructure. The main purpose is to improve software quality and dependability for applications that require predictability and trust.

The goal of this micro-conference is making connections between folks working on similar projects, and help individual projects make progress.

This track is intended to promote collaboration between all the communities and people interested in the Kernel testing & dependability. This will help move the conversation forward from where we left off at the LPC 2022 Kernel Testing & Dependability MC.

We ask that any topic discussions focus on issues/problems they are facing and possible alternatives to resolving them. The Micro-conference is open to all topics related to testing on Linux, not necessarily in the kernel space.

Suggested topics:

List of accomplishments this past year after LPC 2022:

Proposals can be submitted here, by August 20th:

MC leads can be reached for question and further information::
Shuah Khan (shuah@kernel.org)
Sasha Levin <sashal@kernel.org>
Guillaume Tucker <guillaume.tucker@collabora.com>

August 09, 2023 03:40 PM

Linux Plumbers Conference: Live Patching MC CFP

After a three-year hiatus, the Live Patching Microconference is back for 2023.

Accomplishments post 2019 Microconference:

Discussion Topics

The following topics have been proposed:

These potential discussion topics were selected from on-going livepatching mailing list threads, but additional livepatching related topics are welcome for consideration as well. For ideas on what makes for an ideal Microconference topic, checkout this post.

August 09, 2023 03:38 PM

August 08, 2023

Matthew Garrett: Updating Fedora the unsupported way

I dug out a computer running Fedora 28, which was released 2018-04-01 - over 5 years ago. Backing up the data and re-installing seemed tedious, but the current version of Fedora is 38, and while Fedora supports updates from N to N+2 that was still going to be 5 separate upgrades. That seemed tedious, so I figured I'd just try to do an update from 28 directly to 38. This is, obviously, extremely unsupported, but what could possibly go wrong?

Running sudo dnf system-upgrade download --releasever=38 didn't successfully resolve dependencies, but sudo dnf system-upgrade download --releasever=38 --allowerasing passed and dnf started downloading 6GB of packages. And then promptly failed, since I didn't have any of the relevant signing keys. So I downloaded the fedora-gpg-keys package from F38 by hand and tried to install it, and got a signature hdr data: BAD, no. of bytes(88084) out of range error. It turns out that rpm doesn't handle cases where the signature header is larger than a few K, and RPMs from modern versions of Fedora. The obvious fix would be to install a newer version of rpm, but that wouldn't be easy without upgrading the rest of the system as well - or, alternatively, downloading a bunch of build depends and building it. Given that I'm already doing all of this in the worst way possible, let's do something different.

The relevant code in the hdrblobRead function of rpm's lib/header.c is:

int32_t il_max = HEADER_TAGS_MAX;
int32_t dl_max = HEADER_DATA_MAX;

if (regionTag == RPMTAG_HEADERSIGNATURES) {
il_max = 32;
dl_max = 8192;
}

which indicates that if the header in question is RPMTAG_HEADERSIGNATURES, it sets more restrictive limits on the size (no, I don't know why). So I installed rpm-libs-debuginfo, ran gdb against librpm.so.8, loaded the symbol file, and then did disassemble hdrblobRead. The relevant chunk ends up being:

0x000000000001bc81 <+81>: cmp $0x3e,%ebx
0x000000000001bc84 <+84>: mov $0xfffffff,%ecx
0x000000000001bc89 <+89>: mov $0x2000,%eax
0x000000000001bc8e <+94>: mov %r12,%rdi
0x000000000001bc91 <+97>: cmovne %ecx,%eax

which is basically "If ebx is not 0x3e, set eax to 0xffffffff - otherwise, set it to 0x2000". RPMTAG_HEADERSIGNATURES is 62, which is 0x3e, so I just opened librpm.so.8 in hexedit, went to byte 0x1bc81, and replaced 0x3e with 0xfe (an arbitrary invalid value). This has the effect of skipping the if (regionTag == RPMTAG_HEADERSIGNATURES) code and so using the default limits even if the header section in question is the signatures. And with that one byte modification, rpm from F28 would suddenly install the fedora-gpg-keys package from F38. Success!

But short-lived. dnf now believed packages had valid signatures, but sadly there were still issues. A bunch of packages in F38 had files that conflicted with packages in F28. These were largely Python 3 packages that conflicted with Python 2 packages from F28 - jumping this many releases meant that a bunch of explicit replaces and the like no longer existed. The easiest way to solve this was simply to uninstall python 2 before upgrading, and avoiding the entire transition. Another issue was that some data files had moved from libxcrypt-common to libxcrypt, and removing libxcrypt-common would remove libxcrypt and a bunch of important things that depended on it (like, for instance, systemd). So I built a fake empty package that provided libxcrypt-common and removed the actual package. Surely everything would work now?

Ha no. The final obstacle was that several packages depended on rpmlib(CaretInVersions), and building another fake package that provided that didn't work. I shouted into the void and Bill Nottingham answered - rpmlib dependencies are synthesised by rpm itself, indicating that it has the ability to handle extensions that specific packages are making use of. This made things harder, since the list is hard-coded in the binary. But since I'm already committing crimes against humanity with a hex editor, why not go further? Back to editing librpm.so.8 and finding the list of rpmlib() dependencies it provides. There were a bunch, but I couldn't really extend the list. What I could do is overwrite existing entries. I tried this a few times but (unsurprisingly) broke other things since packages depended on the feature I'd overwritten. Finally, I rewrote rpmlib(ExplicitPackageProvide) to rpmlib(CaretInVersions) (adding an extra '\0' at the end of it to deal with it being shorter than the original string) and apparently nothing I wanted to install depended on rpmlib(ExplicitPackageProvide) because dnf finished its transaction checks and prompted me to reboot to perform the update. So, I did.

And about an hour later, it rebooted and gave me a whole bunch of errors due to the fact that dbus never got started. A bit of digging revealed that I had no /etc/systemd/system/dbus.service, a symlink that was presumably introduced at some point between F28 and F38 but which didn't get automatically added in my case because well who knows. That was literally the only thing I needed to fix up after the upgrade, and on the next reboot I was presented with a gdm prompt and had a fully functional F38 machine.

You should not do this. I should not do this. This was a terrible idea. Any situation where you're binary patching your package manager to get it to let you do something is obviously a bad situation. And with hindsight performing 5 independent upgrades might have been faster. But that would have just involved me typing the same thing 5 times, while this way I learned something. And what I learned is "Terrible ideas sometimes work and so you should definitely act upon them rather than doing the sensible thing", so like I said, you should not do this in case you learn the same lesson.

comment count unavailable comments

August 08, 2023 05:54 AM

August 05, 2023

Linux Plumbers Conference: Build Systems MC CFP

In the Linux ecosystems, there are many ways to build all the software used to put together a running system. Whether it’s building all the binary packages for a binary Linux distribution, using a source-based distribution, or building an embedded system from scratch, there are a lot of shared challenges which each system solves in its own way.

This microconference is a way to get people who work on disparate build systems to discuss common problems and possible shared solutions across the entire problem space. The kinds of topics we want to discuss are the following:
Build Systems micorconference would like to gather representatives (developers and maintainers) from all the various build systems and related technologies. This is not a definitive list of possible attendees.
For more information, feel free to contact the MC Leads:
Behan Webster <behanw@converseincode.com>
Philip Balister <philip@balister.org>
Please follow the suggestions from this BLOG post when submitting a CFP for this track.
Submissions are made via LPC submission system, selecting Track “Build Systems MC”

August 05, 2023 09:48 AM

August 04, 2023

Dave Airlie (blogspot): nvk: the kernel changes needed

The initial NVK (nouveau vulkan) experimental driver has been merged into mesa master[1], and although there's lots of work to be done before it's application ready, the main reason it was merged was because the initial kernel work needed was merged into drm-misc-next[2] and will then go to drm-next for the 6.6 merge window. (This work is separate from the GSP firmware enablement required for reclocking, that is a parallel development, needed to make nvk useable). Faith at Collabora will have a blog post about the Mesa side, this is more about the kernel journey.

What was needed in the kernel?

The nouveau kernel API was written 10 years or more ago, and was designed around OpenGL at the time. There were two major restrictions in the current uAPI that made it unsuitable for Vulkan.

  1. buffer objects (physical memory allocations) were allocated 1:1 with virtual memory allocations for a file descriptor. This meant the kernel managed the virtual address space. For proper Vulkan support, the bo allocation and vm allocation have to be separate, and userspace should control the virtual address space.
  2. Command submission didn't use sync objects. The nouveau command submission wasn't wired up to the modern sync objects. These are pretty much a requirement for Vulkan fencing and semaphores to work properly.

How to implement these?

When we kicked off the nvk idea I made a first pass at implementing a new user API, to allow the above features. I took at look at how the GPU VMA management was done in current drivers and realized that there was a scope for a common component to manage the GPU VA space. I did a hacky implementation of some common code and a nouveau implementation. Luckily at the time, Danilo Krummrich had joined my team at Red Hat and needed more kernel development experience in GPU drivers. I handed my sketchy implementation to Danilo and let him run with it. He spent a lot of time learning and writing copious code. His GPU VA manager code was merged into drm-misc-next last week and his nouveau code landed today.

What is the GPU VA manager?

The idea behind the GPU VA manager is that there is no need for every driver to implement something that should essentially not be a hardware specific problem. The manager is designed to track VA allocations from userspace, and keep track of what GEM objects they are currently bound to. The implementation went through a few twists and turns and experiments. 

For a long period we considered using maple tree as the core of it, but we hit a number of messy interactions between the dma-fence locking and memory allocations required to add new nodes to the maple tree. The dma-fence critical section is a hard requirement to make others deal with. In the end Danilo used an rbtree to track things. We will revisit if we can deal with maple tree again in the future. 

We had a long discussion and a couple of implement it both ways and see, on whether we needed to track empty sparse VMA ranges in the manager or not,  nouveau wanted these but generically we weren't sure they were helpful, but that also affected the uAPI as it needed explicit operations to create/drop these. In the end we started tracking these in the driver and left the core VA manager cleaner.

Now the code is in tree we will start to push future drivers to use it instead of spinning their own.

What changes are needed for nouveau?

Now that the VAs are being tracked, the nouveau API needed two new entrypoints. Since BO allocation will no longer create a VM, a new API is needed to bind BO allocations with VM addresses. This is called the VM_BIND API. It has two variants

  1. a synchronous version that immediately maps a BO to a VM and is used for the common allocation paths.
  2. an asynchronous version that is modeled after the Vulkan sparse API, and takes in/out sync objects, which use the drm scheduler to schedule the vm/bo binding.
The VM BIND backend then does all the page table manipulation required.
 
The second API added was an EXEC call. This takes in/out sync objects and a set of addresses that point to command buffers to execute. This uses the drm scheduler to deal with the synchronization and hands the firmware the command buffer address to execute.
Internally for nouveau this meant having to add support for the drm scheduler, adding new internal page table manipulation APIs, and wiring up the GPU VA. 

Shoutouts:

My input was the sketchy sketch at the start, and doing the userspace changes to the nvk codebase to allow testing.

The biggest shoutout to Danilo, who took a sketchy sketch of what things should look like, created a real implementation, did all the experimental ideas I threw at him, and threw them and others back at me, negotiated with other drivers to use the common code, and built a great foundational piece of drm kernel infrastructure.

Faith at Collabora who has done the bulk of the work on nvk did a code review at the end and pointed out some missing pieces of the API and the optimisations it enables.

Karol at Red Hat on the main nvk driver and Ben at Red Hat for nouveau advice on how things worked, while he smashed away at the GSP rock.

(and anyone else who has contributed to nvk, nouveau and even NVIDIA for some bits :-)

[1] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24326

[2] https://cgit.freedesktop.org/drm-misc/log/

August 04, 2023 10:26 PM

August 02, 2023

Linux Plumbers Conference: Linux Plumbers Refereed Track – Upcoming Deadline – August 6, 2023

August is now upon us, and the deadline for refereed track submissions is August 6, which is right around the corner. We have already received some excellent submissions, for which we gratefully thank our submitters!

For those thinking about submitting, please polish off your ideas, and point your browsers at the call-for-proposals page. Looking forward to your submissions.

Reminder: we’ve got a tight deadline to prepare the submissions for the LPC program committee to review, so, as communicated last year, we will not be extending the deadline this year, please submit by August 6th, anywhere on earth.

August 02, 2023 04:03 PM

July 30, 2023

Linux Plumbers Conference: Rust MC CFP

LPC 2023 will host the second edition of the Rust MC. This microconference intends to cover talks and discussions on both Rust for Linux as well as other non-kernel Rust topics. Proposals can be submitted via LPC submission system, selecting the Rust MC track.

Rust is a systems programming language that is making great strides in becoming the next big one in the domain. Rust for Linux is the project adding support for the Rust language to the Linux kernel.

Rust has a key property that makes it very interesting as the second language in the kernel: it guarantees no undefined behavior takes place (as long as unsafe code is sound). This includes no use-after-free mistakes, no double frees, no data races, etc. It also provides other important benefits, such as improved error handling, stricter typing, sum types, pattern matching, privacy, closures, generics, etc.

Possible Rust for Linux topics:

Possible Rust topics:

Last year was the first edition of the Rust MC and the focus was on showing the ongoing efforts by different parties (compilers, Rust for Linux, CI, eBPF…). Shortly after the Rust MC, Rust got merged into the Linux kernel. Abstractions are getting upstreamed, with the first major drivers looking to be merged soon: Android Binder, the Asahi GPU driver and the NVMe driver (presented in that MC).

July 30, 2023 07:05 AM

July 28, 2023

Daniel Vetter: EOSS Prague: Kernel Locking Engineering

EOSS in Prague was great, lots of hallway track, good talks, good food, excellent tea at meetea - first time I had proper tea in my life, quite an experience. And also my first talk since covid, pack room with standing audience, apparently one of the top ten most attended talks per LF’s conference report.

The video recording is now uploaded, I’ve uploaded the fixed slides, including the missing slide that I accidentally cut in a last-minute edit. It’s the same content as my blog posts from last year, first talking about locking engineering principles and then the hierarchy of locking engineering patterns.

July 28, 2023 12:00 AM

July 27, 2023

Linux Plumbers Conference: Android MC CFP

The Android Microconference brings the upstream community and Android systems developers together to discuss issues and changes to the Android platform and their dependencies and interactions with the Linux  kernel, allowing for collaboration on solutions for upstream.

Since last year’s conference, there has been quite a bit of progress, specifically around:

Currently planned discussion topics for this year include:

People are encouraged to submit topics related to new  Android functionality as well as issues in getting that functionality upstream.

Please consider that the goal is to discuss open problems, preferably with patch set submissions already in discussion on LKML. The slots are very short (10-15 mins), and the main portion of the time should be given to the debate – thus, the importance of having an open and relevant problem, with people in the community engaged in the solution.

The CFP for the Android Micro-conference closes on Aug 15th, so get your topics in early!

Additionally, we already have a busy tentative schedule, but please submit your topics, and should it not fit, we hope to have additional discussion space in a follow-on BoF.

July 27, 2023 05:12 AM

July 23, 2023

Linux Plumbers Conference: All microconferences are now accepting topics!

Here are the list of microconferences at the 2023 Linux Plumbers Conference:

Some of the above already have a blog describing them in detail, and blogs for the rest will be coming shortly. If you plan on submitting a topic to one of these microconferences, please read the blog on what an ideal microconference topic submission is. After that, submit your topic and make sure that you select the appropriate track that you are submitting for (they are all listed under LPC Microconference and end with MC).

July 23, 2023 08:15 PM

July 21, 2023

Linux Plumbers Conference: Compute Express Link (CXL) MC CFP

We are pleased to announce that we will have a CXL MC this year at Plumbers, and hereby invite the community in our call for participation.

Compute Express Link is a cache coherent fabric that in recent years has been gaining momentum in the industry. CXL 3.0 launched just before Plumbers 2022 (where very early discussions took place),  bringing new challenges such as dynamic capacity devices and large scale fabrics, two features that bring significant challenges to Linux. There also has been controversy and confusion in the Linux kernel   community about the state and future of CXL, regarding its usage and integration into, for example, the core memory management subsystem. Many concerns have been put to rest through proper clarification and setting of expectations.

The Compute Express Link microconference focuses on how to evolve the Linux CXL kernel driver and userspace components for support of the CXL 2.0 spec (and beyond). The microconference provides a  pace to open the discussion, incorporate more perspectives, and grow the CXL community with a goal that the CXL Linux plumbing serves the needs of the CXL ecosystem while balancing the needs of the  Linux project. Specifically, this microconference welcomes submissions detailing industry and academia use cases in order to develop usage model scenarios. Finally, it will be a good opportunity to have  existing upstream CXL developers available in a forum to discuss current CXL support and to communicate areas that need additional involvement.

Suggested topics:

Proposals can be submitted here, by September 1st:

https://lpc.events/event/17/abstracts/

For more information, feel free to contact the Compute Express Link MC Leads:
Davidlohr Bueso <dave@stgolabs.net>
Jonathan Cameron <Jonathan.Cameron@Huawei.com>
Adam Manzanares <a.manzanares@samsung.com>
Dan Williams <dan.j.williams@intel.com>

July 21, 2023 07:33 AM

July 14, 2023

Dave Airlie (blogspot): tinygrad + rusticl + aco: why not?

I recently came across tinygrad as a small powerful nn framework that had an OpenCL backend target and could run LLaMA model.

I've been looking out for rusticl workloads, and this seemed like a good one, and I could jump on the AI train, and run an LLM in my house!

I started it going on my Radeon 6700XT with the latest rusticl using radeonsi with the LLVM backend, and I could slowly interrogate a model with a question, and it would respond. I've no idea how performant it is vs ROCm yet which seems to be where tinygrad is more directed, but I may get to that next week.

While I was there though I decided to give the Mesa ACO compiler backend a go, it's been tied into radeonsi recently, and I done some hacks before to get compute kernels to run. I reproduced said hacks on the modern code and gave it a run.

tinygrad comes with a benchmark script called benchmark_train_efficientnet so I started playing with it to see what low hanging fruit I could find in an LLVM vs ACO shootout.

The bench does 10 runs, the first is where lots of compilation happens, the last is well primed cache wise. There are the figures from the first and last runs with a release build of llvm and mesa. (and the ACO hacks).

LLVM:

215.78 ms cpy,  12245.04 ms run,  120.33 ms build, 12019.45 ms realize,  105.26 ms CL,   -0.12 loss,  421 tensors, 0.04 GB used,      0.94 GFLOPS

10.25 ms cpy,   221.02 ms run,   83.50 ms build,   36.25 ms realize,  101.27 ms CL,   -0.01 loss,  421 tensors, 0.04 GB used,     52.11 GFLOPS

ACO:

71.10 ms cpy,  3443.04 ms run,  112.58 ms build, 3214.13 ms realize,  116.34 ms CL,   -0.04 loss,  421 tensors, 0.04 GB used,      3.35 GFLOPS
10.36 ms cpy,   234.90 ms run,   84.84 ms build,   36.51 ms realize,  113.54 ms CL,    0.05 loss,  421 tensors, 0.04 GB used,     49.03 GFLOPS

So ACO is about 4 times faster to compile but produces binaries that are less optimised.

The benchmark produces 148 shaders:

LLVM:

126 Max Waves: 16 
  6 Max Waves: 10
  5 Max Waves: 9
  6 Max Waves: 8
  5 Max Waves: 4


ACO:

 96 Max Waves: 16
 36 Max Waves: 12
  2 Max Waves: 10
 10 Max Waves: 8
  4 Max Waves: 4

So ACO doesn't quite get the optimal shaders for a bunch of paths, even with some local hackery I've done to make it do better.[1]

I'll investigate ROCm next week maybe, got a bit of a cold/flu, and large GPU stacks usually make me want to wipe the machine after I test them :-P

[1] https://gitlab.freedesktop.org/airlied/mesa/-/commits/radeonsi-rusticl-aco-wip


July 14, 2023 04:30 AM

July 11, 2023

Matthew Garrett: Roots of Trust are difficult

The phrase "Root of Trust" turns up at various points in discussions about verified boot and measured boot, and to a first approximation nobody is able to give you a coherent explanation of what it means[1]. The Trusted Computing Group has a fairly wordy definition, but (a) it's a lot of words and (b) I don't like it, so instead I'm going to start by defining a root of trust as "A thing that has to be trustworthy for anything else on your computer to be trustworthy".

(An aside: when I say "trustworthy", it is very easy to interpret this in a cynical manner and assume that "trust" means "trusted by someone I do not necessarily trust to act in my best interest". I want to be absolutely clear that when I say "trustworthy" I mean "trusted by the owner of the computer", and that as far as I'm concerned selling devices that do not allow the owner to define what's trusted is an extremely bad thing in the general case)

Let's take an example. In verified boot, a cryptographic signature of a component is verified before it's allowed to boot. A straightforward implementation of a verified boot implementation has the firmware verify the signature on the bootloader or kernel before executing it. In this scenario, the firmware is the root of trust - it's the first thing that makes a determination about whether something should be allowed to run or not[2]. As long as the firmware behaves correctly, and as long as there aren't any vulnerabilities in our boot chain, we know that we booted an OS that was signed with a key we trust.

But what guarantees that the firmware behaves correctly? What if someone replaces our firmware with firmware that trusts different keys, or hot-patches the OS as it's booting it? We can't just ask the firmware whether it's trustworthy - trustworthy firmware will say yes, but the thing about malicious firmware is that it can just lie to us (either directly, or by modifying the OS components it boots to lie instead). This is probably not sufficiently trustworthy!

Ok, so let's have the firmware be verified before it's executed. On Intel this is "Boot Guard", on AMD this is "Platform Secure Boot", everywhere else it's just "Secure Boot". Code on the CPU (either in ROM or signed with a key controlled by the CPU vendor) verifies the firmware[3] before executing it. Now the CPU itself is the root of trust, and, well, that seems reasonable - we have to place trust in the CPU, otherwise we can't actually do computing. We can now say with a reasonable degree of confidence (again, in the absence of vulnerabilities) that we booted an OS that we trusted. Hurrah!

Except. How do we know that the CPU actually did that verification? CPUs are generally manufactured without verification being enabled - different system vendors use different signing keys, so those keys can't be installed in the CPU at CPU manufacture time, and vendors need to do code development without signing everything so you can't require that keys be installed before a CPU will work. So, out of the box, a new CPU will boot anything without doing verification[4], and development units will frequently have no verification.

As a device owner, how do you tell whether or not your CPU has this verification enabled? Well, you could ask the CPU, but if you're doing that on a device that booted a compromised OS then maybe it's just hotpatching your OS so when you do that you just get RET_TRUST_ME_BRO even if the CPU is desperately waving its arms around trying to warn you it's a trap. This is, unfortunately, a problem that's basically impossible to solve using verified boot alone - if any component in the chain fails to enforce verification, the trust you're placing in the chain is misplaced and you are going to have a bad day.

So how do we solve it? The answer is that we can't simply ask the OS, we need a mechanism to query the root of trust itself. There's a few ways to do that, but fundamentally they depend on the ability of the root of trust to provide proof of what happened. This requires that the root of trust be able to sign (or cause to be signed) an "attestation" of the system state, a cryptographically verifiable representation of the security-critical configuration and code. The most common form of this is called "measured boot" or "trusted boot", and involves generating a "measurement" of each boot component or configuration (generally a cryptographic hash of it), and storing that measurement somewhere. The important thing is that it must not be possible for the running OS (or any pre-OS component) to arbitrarily modify these measurements, since otherwise a compromised environment could simply go back and rewrite history. One frequently used solution to this is to segregate the storage of the measurements (and the attestation of them) into a separate hardware component that can't be directly manipulated by the OS, such as a Trusted Platform Module. Each part of the boot chain measures relevant security configuration and the next component before executing it and sends that measurement to the TPM, and later the TPM can provide a signed attestation of the measurements it was given. So, an SoC that implements verified boot should create a measurement telling us whether verification is enabled - and, critically, should also create a measurement if it isn't. This is important because failing to measure the disabled state leaves us with the same problem as before; someone can replace the mutable firmware code with code that creates a fake measurement asserting that verified boot was enabled, and if we trust that we're going to have a bad time.

(Of course, simply measuring the fact that verified boot was enabled isn't enough - what if someone replaces the CPU with one that has verified boot enabled, but trusts keys under their control? We also need to measure the keys that were used in order to ensure that the device trusted only the keys we expected, otherwise again we're going to have a bad time)

So, an effective root of trust needs to:

1) Create a measurement of its verified boot policy before running any mutable code
2) Include the trusted signing key in that measurement
3) Actually perform that verification before executing any mutable code

and from then on we're in the hands of the verified code actually being trustworthy, and it's probably written in C so that's almost certainly false, but let's not try to solve every problem today.

Does anything do this today? As far as I can tell, Intel's Boot Guard implementation does. Based on publicly available documentation I can't find any evidence that AMD's Platform Secure Boot does (it does the verification, but it doesn't measure the policy beforehand, so it seems spoofable), but I could be wrong there. I haven't found any general purpose non-x86 parts that do, but this is in the realm of things that SoC vendors seem to believe is some sort of value-add that can only be documented under NDAs, so please do prove me wrong. And then there are add-on solutions like Titan, where we delegate the initial measurement and validation to a separate piece of hardware that measures the firmware as the CPU reads it, rather than requiring that the CPU do it.

But, overall, the situation isn't great. On many platforms there's simply no way to prove that you booted the code you expected to boot. People have designed elaborate security implementations that can be bypassed in a number of ways.

[1] In this respect it is extremely similar to "Zero Trust"
[2] This is a bit of an oversimplification - once we get into dynamic roots of trust like Intel's TXT this story gets more complicated, but let's stick to the simple case today
[3] I'm kind of using "firmware" in an x86ish manner here, so for embedded devices just think of "firmware" as "the first code executed out of flash and signed by someone other than the SoC vendor"
[4] In the Intel case this isn't strictly true, since the keys are stored in the motherboard chipset rather than the CPU, and so taking a board with Boot Guard enabled and swapping out the CPU won't disable Boot Guard because the CPU reads the configuration from the chipset. But many mobile Intel parts have the chipset in the same package as the CPU, so in theory swapping out that entire package would disable Boot Guard. I am not good enough at soldering to demonstrate that.

comment count unavailable comments

July 11, 2023 07:58 AM

July 08, 2023

Rusty Russell: Covenants in Bitcoin: A Useful Review Taxonomy

Covenants are a construction to allow introspection: a transaction output can place conditions on the transaction which spends it (beyond the specific “must provide a valid signature of itself and a particular pubkey”).

This power extends script in useful ways, such as allow creation of force spending paths (such as vaults which force spending delays), and rebindable inputs (such as required for Lightning “state fixup” proposals, aka LN-Symmetry). But when we discuss specific proposals (such as OP_TX, OP_TXHASH or OP_CHECKTEMPLATEVERIFY) it’s been difficult to nail down the exact trade-offs made for each one. So I want to describe the landscape (or taxonomy) which we can use to categorize and assess covenants.

The Simplest Covenant (Which Doesn’t Quite Work!)

Firstly, consider the simplest covenant: OP_TXIDVERIFY. This would check the txid of the spending transaction is equal to the given txid. This is easy to implement both in existing script (replacing OP_NOP3), and in tapscript. It doesn’t actually work, since it makes a commitment circle (the txid contains the txid of the input, which contains the txid…; thanks Jeremy Rubin), but it’s a useful thought experiment.

Fully Complete Covenants

Now, consider the most complete covenant proposal: OP_TX. The idea is to push some specified field of the spending transaction onto the stack. For efficiency, it would take a bitmap to push multiple fields at once and (because we don’t have OP_CAT) have an option to concatenate them all as push them as one element. There are details here which matter, such as how primitive Bitcoin script is when dealing with numbers, and stack space limits for large transactions, but the idea is simple.

This allows you to do things like “output amount must be > 100000 sats”.

Equality Covenants

On the spectrum from simplest to most complete, is Russell O’Connor’s OP_TXHASH (which I generalized into the OP_TX proposal), which takes a bitmap from the stack and hashes those fields together, then pushes the resulting hash onto the stack. Of course, you can have multiple OP_IF branches allowing different equalities, but only a handful: you’ll run out of scripts space quite fast. With OP_CAT you could extend this further to assemble a template to compare against at runtime, but we don’t have that so I’ll ignore that for now.

This allows for simple equality tests, such as “output amount must be 100000 sats”.

OP_CHECKTEMPLATEVERIFY Covenant

OP_CHECKTEMPLATEVERIFY is a further restriction on basic equality covenants: it’s like OP_TXHASH with a fixed bitmap. It’s an opinionated subset though, which makes it more powerful than OP_TXIDVERIFY (and usable!): in particular it doesn’t commit to inputs at all (except the input number), but commits to all the outputs; this means you can theoretically add fees and still match, but you can’t have a change output. It’s also usable outside tapscript, since it’s written in the old “don’t-touch-the-stack” script soft-fork style.

Taproot Allows Us To Design, Then Restrict

Designing in the second half of 2023, I think it’s reasonable to assume covenants are only relevant inside taproot.

This means we have the ability to easily limit it in a way which can be unlimited in stages later via future soft-forks:

As an example, let’s turn OP_TX into OP_TXIDVERIFY. We only define one bit: OP_TX_BIT_TXID. That bit means “push the txid on the stack”, and if anything else is set, OP_TX is interpreted as OP_SUCCESS:

01 OP_TX_BIT_TXID OP_TX <txid> OP_EQUALVERIFY

Similarly, if we want OP_CHECKTEMPLATEVERIFY, we require the following OP_TX bits to be defined:

  1. OP_TX_BIT_COMBINE (meaning to concatenate onto one stack element)
  2. OP_TX_BIT_NVERSION
  3. OP_TX_BIT_NLOCKTIME
  4. OP_TX_BIT_SCRIPTSIG_HASHES
  5. OP_TX_BIT_INPUT_INDEX
  6. OP_TX_BIT_NSEQUENCES
  7. OP_TX_BIT_NUM_INPUTS
  8. OP_TX_BIT_OUTPUTS_AMOUNT
  9. OP_TX_BIT_OUTPUTS_SCRIPT
  10. OP_TX_BIT_NUM_OUTPUTS

i.e: (assuming they’re assigned bits from 0 to 10)

02 b1111111111 OP_TX OP_SHA256 <hash> OP_EQUALVERIFY

There are some differences in how fields are hashed, and perhaps their order, but these are cosmetic not functional differences.

Extending this in future simply means defining other fields, and what combinations are allowed.

The Recursion Distraction

There are many ways we can argue about how to clip covenants’ wings. But I want to address (and dismiss) one specifically: the idea of restricting recursive covenants which restrict all future descendants.

Any covenant system listed here can restrict outputs. That means I can require that the spending transaction spend to a spending transaction that spends to a spending transaction that spends to…. 100 million transactions later… an output to me. You could prevent this by requiring that any covenant-spending tx itself is not allowed to use covenants at all, but that adds complexity and reduces usefulness.

Mathematically, there’s a difference between being able to restrict transactions to arbitrary depth and to infinite depth. Nobody else cares: either way, there are far better ways to render your coins useless than placing them in a giant chain or loop.

Covenants by the Back Door

I wrote a previous post on Covenants via Signatures which noted that signatures with BIP-118 can be used to make covenants. This is not a neat design, it’s more like “we have a jackhammer, we can use it to knock in a nail”. On the covenant spectrum, it’s an Equality Covenant between OP_TXHASH and OP_CHECKTEMPLATEVERIFY, in that it can be used with several different field bitmaps, according to the SIGHASH flags used on the signature.

Introspection Is Not All We Want

It’s worth noting that Bitcoin’s OP_CHECKSIG (and family) do three things:

  1. Assemble parts of the current transaction.
  2. Hash it.
  3. Check the hash is signed with a given key.

OP_TX implements the first, and we already have various OP_SHA256 and similar operations for the second. OP_TXHASH and OP_CHECKTEMPLATEVERIFY combine the first two.

It’s logical to want a separate operation for the third one, hence the proposal to be able to check a signature signs a given hash: OP_CHECKSIGFROMSTACK. This would let you simulate any OP_CHECKSIG variation (depending on what OP_TX/OP_TXHASH flags were enabled):

02 <flags> OP_TX OP_SHA256 <pubkey> OP_CHECKSIGFROMSTACK

Summary

We should enable ANYPREVOUT. This will enable LN-symmetry which makes Lightning simpler (and thus more robust!), which has already been implemented. It will also enable covenants, though with a weird requirement for a signature-in-output, which makes them less efficient than they could be, but enables real uses and experimentation to inform future soft forks.

For future covenant soft forks, we should look at complete designs like OP_TX, then clip their wings as desired so we can enable the full functionality later. This may well end up looking like OP_CHECKTEMPLATEVERIFY!

Meanwhile, Greg Sanders, who both refined the OP_VAULT proposal and implemented LN-Symmetry (nee Eltoo), expressed the opinion that we’re fast approaching the edge of Bitcoin Script usability, and he now was firmly of the opinion that a soft fork to introduce Simplicity would be better. Perhaps that will happen instead of OP_TX or the like?

July 08, 2023 02:30 PM

July 07, 2023

Rusty Russell: Covenants Via Signatures

Covenants are in Bitcoin already, almost?

[EDIT: An earlier version claimed we do covenants already. Oops! Thanks Ruben Somsen and Jimmy Song.]

In Bitcoin, covenants refers to restricting how an output is spent: this is from the legal term where conditions on property persist beyond sale. This is usually defined to exclude the “obvious” common requirement that the spending transaction be signed by a given key.

But to step back, there are logically three things OP_CHECKSIG (and friends) do:

  1. Assemble parts of the spending transaction (which parts depends on the SIGHASH flags).
  2. Hash that.
  3. Validates the hash has been signed by the given key.

For covenants, you really just want the first part: the ability to introspect so you can check whatever feature of the transaction you care about (this is the basis for OP_TX, which takes a bitmap telling it what about the spending transaction to push onto the stack so you can test it). But if you only care about equality, you can get away with something that does the first two things (this insight was the basis for Russell O’Connor’s OP_TXHASH, like OP_TX but always hashes before putting on the stack), and just check the hash is what you expected.

But, Burak points out that if you only care about equality, and are happy to specify all the fields that are covered by signatures (with the various SIGHASH variants), you can simply put a pubkey with pre-made signature and an OP_CHECKSIG into the output script! That constrains the transaction’s fields to hash to whatever that OP_CHECKSIG expects.

This, of course, is overkill: you don’t actually care about the signature operation, you’re just using it test hashes to create covenants. But unfortunately (as pointed out when I posted on Twitter, all excited!), it doesn’t quite work yet, because your signature has to commit to the script which contains the signature: a circular dependency.

But BIP-118 (a.k.a. ANYPREVOUT) proposes new SIGHASH flags which allow you not to commit to the input script, so this covenant-via-signature is possible, and Rearden Code has a tweak which makes this more powerful. I also suspect that you can probably use a simple 0x1 as the pubkey in some cases, since BIP-118 defines that to mean the taproot internal key, saving 32 bytes.

July 07, 2023 02:30 PM

July 01, 2023

Linux Plumbers Conference: Linux Kernel Debugging MC CFP

We are pleased to announce the first ever Linux Kernel Debugging Microconference, and we are now accepting proposals and problem statements.

Kernel debugging can be done in many ways with many purpose-built tools, from printk to Crash, Drgn, KDB/KGDB, and more. These tools are built on layers of standards, formats, implicit standards, and undocumented assumptions that make everything tick. When things work well, the tools stay out of your way and help you resolve your bug. But when things don’t work so well, you’re left debugging your debugger.

The Linux Kernel Debugging Microconference aims to bring together the developers and users of these tools to discuss the shared problems we face. We hope to discuss ongoing work that will improve the state of kernel debuggers, as well as new ideas that will require coordinated development across projects. Some possible topics might include:

Topics outside this narrow list are welcomed: we welcome any topic that would improve the debugging experience, or merits the attention of the developers of these tools & kernel subsystems. The best submissions will describe active work or open problems, and they will welcome debate, discussion, and community consensus.

Submissions can be made via the LPC Call for Proposals, by selecting Linux Kernel Debugging MC for your track.

July 01, 2023 04:22 PM

June 26, 2023

Linux Plumbers Conference: The Ideal Microconference Topic Session

The Linux Plumbers’ microconference is a three and a half hour session focused on one general focus area. It can be on Android, power management, tracing, real-time or any of the other many subsystems in the Linux ecosystem. These sessions are broken up into smaller topics that are highly focused work meetings with the goal of accomplishing something during the brief discussions that happen during that time. A topic session ranges from 15 to 30 minutes in length, where no more than half the time is a presentation to bring everyone in the room (or online) up to speed about the issues that need to be discussed, and the rest of the time is spent on brainstorming ideas with the audience on how to accomplish solving the problems at hand. The problem does not need to be solved in this short time, but when time is up, the audience should understand what is at stake well enough to be productive offline in mailing lists and chat rooms.

Submitting a microconference topic

A microconference topic submission should be considered a problem statement and not an abstract. The submission should explain what the issue is that the submitter is struggling with, what has currently been done to try to solve it, and sometimes that means showing multiple solutions where there are pros and cons to each solution and the submitter wants to discuss which is better with the audience. There is the possible chance that the audience may even come up with a new solution that is better than what is being presented. The topic should be focused on what is currently being worked on and not about what was already done, unless the submitter wants to talk about what new can be done with what was already done.

Presenting the topic

The topic should start off with a presentation. The goal of the session is to come up with answers to the problem at hand. If the audience does not know the details of the issue, they are highly unlikely to come up with any productive input. The more the audience understands the problem, the likelier they will be able to help out. Due to the short time of the microconference topic session, it is imperative that the presentation is extremely focused on a need to know basis. That is, only present what is critical knowledge to understand the problem at hand. The quicker the audience can come up to speed, the more time there will be to have a productive discussion with them. There is no limit to the number of slides, but the focus should be on the time spent on the presentation.

Another difference between a microconference topic session and a normal presentation, is that there is no Q and A, but only discussions. A Q and A in presentations is where the audience asks the presenter questions and the presenter answers them. In a microconference topic session, the presenter starts with asking the audience questions and then there should be a back and forth between the audience and the presenter as well as between different members of the audience.

General information topics

One exception to the above is if the general focus area requires an understanding of a specific topic that all the other topics depend on. Some examples of this include RISC-V coming out with a new specification. The first topic in the microconference may be a 30 minute presentation about what details the new specification has that will impact further development. This is required information for the rest of the microconference to know in order to have proper decision making. The Android microconference had a similar case where the presentations were required for the other topics to be discussed. The general rule of thumb is that if a presentation is needed to have productive discussions then it is allowed. Due to the short time of a microconference, it is encouraged to have few of these types of presentations and better yet to have people do their homework before attending the microconference.

Attendee preparation

The focus of a microconference is to solve problems that exist today and come up with further innovations of tomorrow. The time constraint requires that everyone involved should be well prepared for the discussions that are to take place. The topics descriptions should include links to patch discussions on mailing lists, to wiki pages that describe the general focus area, or to anything that is not common knowledge to those not directly involved in the work. Linux Plumbers is about getting other experts outside the field to give input with a different perspective. Attendees should make an effort to read through the topics of all the microconferences and if there’s a topic of interest, they should read the links and familiarize themselves with the discussions that will take place. This will allow the attendees to be more productive than if they just come in without the understanding of the general focus area.

By following these general guidelines, Linux Plumbers will remain the most productive technical conference that one can attend.

June 26, 2023 12:02 PM

June 23, 2023

Linux Plumbers Conference: RISC-V Microconference CFP

We’re holding another edition of the RISC-V microconference for Plumbers  at 2023. Broadly speaking anything related to both Linux and RISC-V is  on topic, but discussions tend to involve the following categories:

Accomplishments post 2022 Microconference

All the talks at the 2022 Plumbers microconference have made at least some progress, with many of them resulting in big chunks of merged code.
Specifically:

Likely Topics for Discussion Sections

The actual list of topics tends to be hard to pin down this early, but here’s a few topics that have been floating around the mailing lists and may be easier to resolve in real-time:

Submissions are made via LPC submission systems, selecting Track RISC-V MC

June 23, 2023 09:34 PM

June 20, 2023

Linux Plumbers Conference: Real-time and Scheduling Microconference CFP

The real-time and scheduling micro-conference joins these two intrinsically connected communities to discuss the next steps together.

Over the past decade, many parts of PREEMPT_RT have been included in the official Linux codebase. Examples include real-time mutexes, high-resolution timers, lockdep, ftrace, RCU_PREEMPT, threaded interrupt handlers, and more. The number of patches that need integration has been significantly reduced, and the rest is mature enough to make their way into mainline Linux.

The scheduler is at the core of Linux performance. With different topologies and workloads, giving the user the best experience possible is challenging, from low latency to high throughput and from small power-constrained devices to HPC, where CPU isolation is critical.

The following accomplishments have been made as a result of last year’s micro-conference:

Ideas of topics to be discussed include (but are not limited to):

It is fine if you have a new topic that is not on the list. People are encouraged to submit any topic related to real-time and scheduling.

Please consider that the goal is to discuss open problems, preferably with patch set submissions already in discussion on LKML. The presentations are very short, and the main portion of the time should be given to the debate – thus, the importance of having an open and relevant problem, with people in the community engaged in the solution.

Submissions are made via LPC submission systems, selecting Track Real-time and Scheduling MC

June 20, 2023 08:59 PM

June 16, 2023

Linux Plumbers Conference: Registration for LPC 2023 is open

We’re happy to announce that registration for LPC 2023 is now open. To register please go to our attend page.

To try to prevent the instant sellout we had last year we’ve updated our cancellation policy to no refunds only transfers of registrations. You will find more details during the registration process. LPC 2023 follows the Linux Foundation’s health & safety policy.

As usual we expect to sell our rather quickly so don’t delay your registration for too long!

June 16, 2023 03:00 PM

June 14, 2023

Linux Plumbers Conference: Registration for LPC 2023 is almost here

Registration for LPC 2023 will be opened soon. Past experience told us that in-person registration would be sold out very fast. If you plan to join us in Richmond, please follow our blog and social media for the announcements about the registration!

June 14, 2023 09:04 AM

June 12, 2023

Paul E. Mc Kenney: Parallel Programming: June 2023 Update

The v2023.06.11a release of Is Parallel Programming Hard, And, If So, What Can You Do About It? is now available! The double-column version is also available from arXiv.org.

This release contains a new section on thermal throttling (along with a new cartoon), improvements to the memory-ordering chapter (including intuitive subsets of the Linux-kernel memory model), fixes to the deferred-processing chapter, additional clocksource-deviation material to the "What Time Is It?" section, and numerous fixes inspired by questions and comments from readers.  Discussions with Yariv Aridor were especially fruitful.  Akira Yokosawa contributed some quick quizzes and other upgrades of the technical discussions, along with a great many improvements to grammar, glossaries, epigraphs, and the build system.  Leonardo Bras also provided some much-appreciated build-system improvements, and also started up continuous integration for some of the code samples.

Elad Lahav, Alan Huang, Zhouyi Zhou, and especially SeongJae Park contributed numerous excellent fixes for grammatical and typographical errors.  SeongJae's fixes were from his Korean translation of this book.

Elad Lahav, Alan Huang, and Patrick Pan carried out some much-needed review of the code samples and contributed greatly appreciated fixes and improvements.  In some cases, they drug the code kicking and screaming into the 2020s.  :-)

June 12, 2023 04:40 PM

May 22, 2023

Dave Airlie (blogspot): lavapipe and sparse memory bindings: part two

 Thanks for all the suggestions, on here, and on twitter and on mastodon, anyway who noted I could use a single fd and avoid all the pain was correct!

I hacked up an ever growing ftruncate/madvise memfd and it seemed to work fine. In order to use it for sparse I have to use it for all device memory allocations in lavapipe which means if I push forward I probably have to prove it works and scales a bit better to myself. I suspect layering some of the pb bufmgr code on top of an ever growing fd might work, or maybe just having multiple 2GB buffers might be enough.

Not sure how best to do shaderResourceResidency, userfaultfd might be somewhat useful, mapping with PROT_NONE and then using write(2) to get a -EFAULT is also promising, but I'm not sure how best to avoid segfaults for read/writes to PROT_NONE regions.

Once I got that going, though I ran headfirst into something that should have been obvious to me, but I hadn't thought through.

llvmpipe allocates all it's textures linearly, there is no tiling (even for vulkan optimal). Sparse textures are incompatible with linear implementations. For sparseImage2D you have to be able to give the sparse tile sizes from just the image format. This typically means you have to work out how large the tile that fits into a hw page is in w/h. Of course for a linear image, this would be dependent on the image stride not just the format, and you just don't have that information.

I guess it means texture tiling in llvmpipe might have to become a thing, we've thought about it over the years but I don't think there's ever been a solid positive for implementing it.

Might have to put sparse support on the back burner for a little while longer.

May 22, 2023 03:12 AM

May 17, 2023

Dave Airlie (blogspot): lavapipe and sparse memory bindings

Mike nerdsniped me into wondering how hard sparse memory support would be in lavapipe.

The answer is unfortunately extremely.

Sparse binding essentially allows creating a vulkan buffer/image of a certain size, then plugging in chunks of memory to back it in page-size multiple chunks.

This works great with GPU APIs where we've designed this, but it's actually hard to pull off on the CPU.

Currently lavapipe allocates memory with an aligned malloc. It allocates objects with no backing and non-sparse bindings connect objects to the malloced memory.

However with sparse objects, the object creation should allocate a chunk of virtual memory space, then sparse binding should bind allocated device memory into the virtual memory space. Except Linux has no interfaces for doing this without using a file descriptor.

You can't mmap a chunk of anonymous memory that you allocated with malloc to another location. So if I malloc backing memory A at 0x1234000, but the virtual memory I've used for the object is at 0x4321000, there's no nice way to get the memory from the malloc to be available at the new location (unless I missed an API).

However you can do it with file descriptors. You can mmap a PROT_NONE area for the sparse object, then allocate the backing memory into file descriptors, then mmap areas from those file descriptors into the correct places.

But there are limits on file descriptors, you get 1024 soft, or 4096 hard limits by default, which is woefully low for this. Also *all* device memory allocations would need to be fd backed, not just ones going to be used in sparse allocations.

Vulkan has a limit maxMemoryAllocationCount that could be used for this, but setting it to the fd limit is a problem because some fd's are being used by the application and just in general by normal operations, so reporting 4096 for it, is probably going to explode if you only have 3900 of them left.

Also the sparse CTS tests don't respect the maxMemoryAllocationCount anyways :-)

I shall think on this a bit more, please let me know if anyone has any good ideas!

May 17, 2023 07:28 AM

May 11, 2023

Matthew Garrett: Twitter's e2ee DMs are better than nothing

(Edit 2023-05-10: This has now launched for a subset of Twitter users. The code that existed to notify users that device identities had changed does not appear to have been enabled - as a result, in its current form, Twitter can absolutely MITM conversations and read your messages)

Elon Musk appeared on an interview with Tucker Carlson last month, with one of the topics being the fact that Twitter could be legally compelled to hand over users' direct messages to government agencies since they're held on Twitter's servers and aren't encrypted. Elon talked about how they were in the process of implementing proper encryption for DMs that would prevent this - "You could put a gun to my head and I couldn't tell you. That's how it should be."

tl;dr - in the current implementation, while Twitter could subvert the end-to-end nature of the encryption, it could not do so without users being notified. If any user involved in a conversation were to ignore that notification, all messages in that conversation (including ones sent in the past) could then be decrypted. This isn't ideal, but it still seems like an improvement over having no encryption at all. More technical discussion follows.

For context: all information about Twitter's implementation here has been derived from reverse engineering version 9.86.0 of the Android client and 9.56.1 of the iOS client (the current versions at time of writing), and the feature hasn't yet launched. While it's certainly possible that there could be major changes in the protocol between now launch, Elon has asserted that they plan to launch the feature this week so it's plausible that this reflects what'll ship.

For it to be impossible for Twitter to read DMs, they need to not only be encrypted, they need to be encrypted with a key that's not available to Twitter. This is what's referred to as "end-to-end encryption", or e2ee - it means that the only components in the communication chain that have access to the unencrypted data are the endpoints. Even if the message passes through other systems (and even if it's stored on other systems), those systems do not have access to the keys that would be needed to decrypt the data.

End-to-end encrypted messengers were initially popularised by Signal, but the Signal protocol has since been incorporated into WhatsApp and is probably much more widely used there. Millions of people per day are sending messages to each other that pass through servers controlled by third parties, but those third parties are completely unable to read the contents of those messages. This is the scenario that Elon described, where there's no degree of compulsion that could cause the people relaying messages to and from people to decrypt those messages afterwards.

But for this to be possible, both ends of the communication need to be able to encrypt messages in a way the other end can decrypt. This is usually performed using AES, a well-studied encryption algorithm with no known significant weaknesses. AES is a form of what's referred to as a symmetric encryption, one where encryption and decryption are performed with the same key. This means that both ends need access to that key, which presents us with a bootstrapping problem. Until a shared secret is obtained, there's no way to communicate securely, so how do we generate that shared secret? A common mechanism for this is something called Diffie Hellman key exchange, which makes use of asymmetric encryption. In asymmetric encryption, an encryption key can be split into two components - a public key and a private key. Both devices involved in the communication combine their private key and the other party's public key to generate a secret that can only be decoded with access to the private key. As long as you know the other party's public key, you can now securely generate a shared secret with them. Even a third party with access to all the public keys won't be able to identify this secret. Signal makes use of a variation of Diffie-Hellman called Extended Triple Diffie-Hellman that has some desirable properties, but it's not strictly necessary for the implementation of something that's end-to-end encrypted.

Although it was rumoured that Twitter would make use of the Signal protocol, and in fact there are vestiges of code in the Twitter client that still reference Signal, recent versions of the app have shipped with an entirely different approach that appears to have been written from scratch. It seems simple enough. Each device generates an asymmetric keypair using the NIST P-256 elliptic curve, along with a device identifier. The device identifier and the public half of the key are uploaded to Twitter using a new API endpoint called /1.1/keyregistry/register. When you want to send an encrypted DM to someone, the app calls /1.1/keyregistry/extract_public_keys with the IDs of the users you want to communicate with, and gets back a list of their public keys. It then looks up the conversation ID (a numeric identifier that corresponds to a given DM exchange - for a 1:1 conversation between two people it doesn't appear that this ever changes, so if you DMed an account 5 years ago and then DM them again now from the same account, the conversation ID will be the same) in a local database to retrieve a conversation key. If that key doesn't exist yet, the sender generates a random one. The message is then encrypted with the conversation key using AES in GCM mode, and the conversation key is then put through Diffie-Hellman with each of the recipients' public device keys. The encrypted message is then sent to Twitter along with the list of encrypted conversation keys. When each of the recipients' devices receives the message it checks whether it already has a copy of the conversation key, and if not performs its half of the Diffie-Hellman negotiation to decrypt the encrypted conversation key. One it has the conversation key it decrypts it and shows it to the user.

What would happen if Twitter changed the registered public key associated with a device to one where they held the private key, or added an entirely new device to a user's account? If the app were to just happily send a message with the conversation key encrypted with that new key, Twitter would be able to decrypt that and obtain the conversation key. Since the conversation key is tied to the conversation, not any given pair of devices, obtaining the conversation key means you can then decrypt every message in that conversation, including ones sent before the key was obtained.

(An aside: Signal and WhatsApp make use of a protocol called Sesame which involves additional secret material that's shared between every device a user owns, hence why you have to do that QR code dance whenever you add a new device to your account. I'm grossly over-simplifying how clever the Signal approach is here, largely because I don't understand the details of it myself. The Signal protocol uses something called the Double Ratchet Algorithm to implement the actual message encryption keys in such a way that even if someone were able to successfully impersonate a device they'd only be able to decrypt messages sent after that point even if they had encrypted copies of every previous message in the conversation)

How's this avoided? Based on the UI that exists in the iOS version of the app, in a fairly straightforward way - each user can only have a single device that supports encrypted messages. If the user (or, in our hypothetical, a malicious Twitter) replaces the device key, the client will generate a notification. If the user pays attention to that notification and verifies with the recipient through some out of band mechanism that the device has actually been replaced, then everything is fine. But, if any participant in the conversation ignores this warning, the holder of the subverted key can obtain the conversation key and decrypt the entire history of the conversation. That's strictly worse than anything based on Signal, where such impersonation would simply not work, but even in the Twitter case it's not possible for someone to silently subvert the security.

So when Elon says Twitter wouldn't be able to decrypt these messages even if someone held a gun to his head, there's a condition applied to that - it's true as long as nobody fucks up. This is clearly better than the messages just not being encrypted at all in the first place, but overall it's a weaker solution than Signal. If you're currently using Twitter DMs, should you turn on encryption? As long as the limitations aren't too limiting, definitely! Should you use this in preference to Signal or WhatsApp? Almost certainly not. This seems like a genuine incremental improvement, but it'd be easy to interpret what Elon says as providing stronger guarantees than actually exist.

comment count unavailable comments

May 11, 2023 12:40 AM

May 08, 2023

Linux Plumbers Conference: Microconference proposals are being published

After some hiccups with Indico we’ve finally set up a page that lists submitted microconference proposals. Along with seasoned veterans like Containers and Checkpoint/Restore and RISC-V we are glad to see Live Patching microconference returning after a long break and a brand new Linux Kernel Debugging microconference.

The Proposed microconfences page will be updated from time until the CFP for microconference proposals will be closed on June, 1.

Be sure not to miss the deadline and submit your microconference!

 

May 08, 2023 05:20 PM

April 28, 2023

James Bottomley: Fixing our Self Defeating Licence Compatibility Problems in Open Source

Much angst (and discussion ink) is wasted in open source over whether pulling in code from one project with a different licence into another is allowable based on the compatibility of the two licences. I call this problem self defeating because it creates sequestered islands of incompatibly licensed but otherwise fully open source code that can never ever meet in combination. Everyone from the most permissive open source person to the most ardent free software one would agree this is a problem that should be solved, but most of the islands would only agree to it being solved on their terms. Practically, we have got around this problem by judicious use of dual licensing but that requires permission from the copyright holders, which can sometimes be hard to achieve; so dual licensing is more a band aid than a solution.

In this blog post, I’m going to walk you through the reasons behind cone the most intractable compatibility disputes in open source: Apache-2 vs GPLv2. However, before we get there, I’m first going to walk through several legal issues in general contract and licensing law and then get on to the law and politics of open source licensing.

The Law of Contracts and Licences

Contracts and Licences come from very similar branches of the law and concepts that apply to one often apply to the other. For this legal tour we’ll begin with materiality in contracts followed by licences then look at repairable and irreparable legal harms and finally the conditions necessary to take court action.

Materiality in Contracts

This is actually a well studied and taught bit of the law. The essence is that every contract has a “heart” or core set of clauses which really represent what the parties want from each other and often has a set of peripheral clauses which don’t really affect the “heart” of the contract if they’re not fulfilled. Not fulfilling the latter are said to cause non-material breaches of the contract (i.e. breaches which don’t terminate the contract if they happen, although a party may still have an additional legal claim for the breach if it caused some sort of harm). A classic illustration, often used in law schools, is a contract for electrical the electrical wiring of a house that specifies yellow insulation. The contractor can’t find yellow, so wires the house with blue insulation. The contract doesn’t suffer a material breach because the wires are in the wall (where no-one can see) and there’s no safety issue with the colour and the heart of the contract was about wiring the house not about wire colour.

Materiality in Licensing

This is actually much less often discussed, but it’s still believed that licences are subject to the same materiality constraints as contracts and for this reason, licences often contain “materiality clauses” to describe what the licensor considers to be material to it. So for the licensing example, consider a publisher wishing to publish a book written by a famous author known as the “Red Writer”. A licence to publish for per copy royalties of 25% of the purchase price of the book is agreed but the author inserts a clause specifying by exact pantone number the red that must be the predominant colour of the binding (it’s why they’re known as the “Red Writer”) and also throws in a termination of copyright licence for breaches clause. The publisher does the first batch of 10,000 copies, but only after they’ve been produced discovers that the red is actually one pantone shade lighter than that specified in the licence. Since the cost of destroying the batch and reprinting is huge, the publisher offers the copies for sale knowing they’re out of spec. Some time later the “Red Writer” comes to know of the problem, decides the licence is breached and therefore terminated, so the publisher owes statutory damages (yes, they’ve registered their copyright) per copy on 10,000 books (about $300 million maximum), would the author win?

The answer of course is that no court is going to award the author $300 million. Most courts would take the view that the heart of the contract was about money and if the author got their royalties per book, there was no material breach and the licence continues in force for the publisher. The “Red Writer” may have a separate tort claim for reputational damage if any was caused by the mis-colouring of the book, but that’s it.

Open Source Enforcement and Harm

Looking at the examples above, you can see that most commercial applications of the law eventually boil down to money: you go to court alleging a harm, the court must agree and then assess the monetary compensation for the harm which becomes damages. Long ago in community open source, we agreed that money could never compensate for a continuing licence violation because if it could we’d have set a price for buying yourself out of the terms of the licence (and some Silicon Valley Rich Companies would actually be willing to pay it, since it became the dual licence business model of companies like MySQL)1. The principle that mostly applies in open source enforcement actions is that the harm is to the open source ecosystem and is caused by non-compliance with the licence. Since such harm can only be repaired by compliance that’s the essence of the demand. Most enforcement cases have been about egregious breaches: lack of any source code rather than deficiencies in the offer to provide source code, so there’s actually very little in court records with regard to materiality of licence breaches.

One final thing to note about enforcement cases is there must always be an allegation of material harm to someone or something because you can’t go into court and argue on abstract legal principles (as we seem to like to do in various community mailing lists), you must show actual consequences as well. In addition to consequences, you must propose a viable remedy for the harm that a court could impose. As I said above in open source cases it’s often about harms to the open source ecosystem caused by licence breaches, which is often accepted unchallenged by the defence because the case is about something obviously harmful to open source, like failure to provide source code (and the remedy is correspondingly give us the source code). However, when considering about the examples below it’s instructive to think about how an allegation of harm around a combination of incompatible open source licences would play out. Since the source code is available, there would be much more argument over what the actual harm to the ecosystem, if any, was and even if some theoretical harm could be demonstrated, what would the remedy be?

Applying this to Apache-2 vs GPLv2

The divide between the Apache Software Foundation (ASF) and the Free Software Foundation (FSF) is old and partly rooted in politics. For proof of this notice the FSF says that the two licences (GPLv2 and Apache-2) are legally incompatible and in response the ASF says no-one should use any GPL licences anyway. The purpose of this section is to guide you through the technicalities of the incompatibility and then apply the materiality lessons from above to see if they actually matter.

Why GPLv2 is Incompatible with Apache-2

The argument is that Apache-2 contains two incompatible clauses: the patent termination clause (section 3) which says that if you launch an action against anyone alleging the licensed code infringes your patent then all your rights to patents in the code under the Apache-2 licence terminate; and the Indemnity clause (Section 9) which says that if you want to offer an a warranty you must indemnify every contributor against any liability that warranty might incur. By contrast, GPLv2 contains an implied patent licence (Section 7) and a No Warranty clause (Section 11). Licence scholars mostly agree that the patent and indemnity terms in GPLv2 are weaker than those in Apache-2.

The incompatibility now occurs because GPLv2 says in Section 2 that the entire work after the combination must be shipped under GPLv2, which is possible: Apache is mostly permissive except for the stronger patent and indemnity clauses. However, it is arguable that without keeping those stronger clauses on the Apache-2 code, you’ve violated the Apache-2 licence and the GPLv2 no additional restrictions clause (Section 6) prevents you from keeping the stronger licensing and indemnity clauses even on the Apache-2 portions of the code. Thus Apache-2 and GPLv2 are incompatible.

Materiality and Incompatibility

It should be obvious from the above that it’s hard to make a materiality argument for dropping the stronger apache2 provisions because someone, somewhere might one day get into a situation where they would have helped. However, we can look at the materiality of the no additional restrictions clause in GPLv2. The FSF has always taken the absolutist position on this, which is why they think practically every other licence is GPLv2 incompatible: when you dig at least one clause in every other open source licence can be regarded as an additional restriction. We also can’t take the view that the whole clause is not material: there are obviously some restrictions (like you must pay me for every additional distribution of the code) that would destroy the open source nature of the licence. This is the whole point of the no additional restrictions clause: to prevent the downstream addition of clauses incompatible with the free software goal of the licence.

I mentioned in the section on Materiality in Licences that some licences have materiality clauses that try to describe what’s important to the licensor. It turns out that GPLv2 actually does have a materiality clause: the preamble. We all tend to skip the preamble when analysing the licence, but there’s no denying it’s 7 paragraphs of justification for why the licence looks like it does and what its goals are.

So, to take the easiest analysis first, does the additional indemnity Apache-2 requires represent a material additional restriction. The preamble actually says “for each author’s protection and ours, we want to make certain that everyone understands that there is no warranty for this free software. If the software is modified by someone else and passed on, we want its recipients to know that what they have is not the original, so that any problems introduced by others will not reflect on the original authors’ reputations.” Even on a plain reading an additional strengthening of that by providing an indemnity to the original authors has to be consistent with the purpose as described, so the indemnity clause can’t be regarded as a material additional restriction (a restriction which would harm the aims of the licence) when read in combination with the preamble.

Now the patent termination clause. The preamble has this to say about patents “Finally, any free program is threatened constantly by software patents. We wish to avoid the danger that redistributors of a free program will individually obtain patent licenses, in effect making the program proprietary. To prevent this, we have made it clear that any patent must be licensed for everyone’s free use or not licensed at all.” So giving licensees the ability to terminate the patent rights for patent aggressors would appear to be an additional method of fulfilling the last sentence. And, again, the patent termination clause seems to be consistent with the licence purpose and thus must also not be a material additional restriction.

Thus the final conclusion is that while the patent and indemnity clauses of Apache-2 do represent additional restrictions, they’re not material additional restrictions according to the purpose of the licence as outlined by its materiality clause and thus the combination is permitted. This doesn’t mean the combination is free of consequences: the added code still carries the additional restrictions and you must call that out to the downstream via some mechanism like licensing tags, but it can be done.

Proving It

The only way to prove the above argument is to win in court on it. However, here lies the another good reason why combining Apache-2 and GPLv2 is allowed: there’s no real way to demonstrate harm to anything (either the copyright holder who agreed to GPLv2 or the Community) and without a theory of actual Harm, no-one would have standing to get to court to test the argument. This may look like a catch-22, but it’s another solid reason why, even in the absence of the materiality arguments, this would ultimately be allowed (if you can’t prevent it, it must be allowable, right …).

Community Problems with the Materiality Approach

The biggest worry about the loosening of the “no additional restrictions” clause of the GPL is opening the door to further abuse of the licence by unscrupulous actors. While I agree that this should be a concern, I think it is adequately addressed by rooting the materiality of the licence in the preamble or in provable harm to the open source community. There is also the flip side of this: licences are first and foremost meant to serve the needs of their development community rather than become inflexible implements for a group of enforcers, so even if there were some putative additional abuse in this approach, I suspect it would be outweighed by the licence compatibility benefit to the development communities in general.

Conclusion

The first thing to note is that Open Source incompatible licence combination isn’t as easy as simply combining the code under a single licence: You have to preserve the essential elements of both licences in the code which is combined (although not necessarily the whole project), so for an Apache-2/GPLv2 combination, you’ll need a note on the files saying they follow the stronger Apache patent termination and indemnity even if they’re otherwise GPLv2. However, as long as you’re careful the combination works for either of two reasons: because the Apache-2 restrictions aren’t material additional restrictions under the GPLv2 preamble or because no-one was actually harmed in the making of the combination (or both).

One can see from the above that similar arguments can be applied to various other supposedly incompatible licence combinations (exercise for the reader: try it with BSD-4-Clause and GPLv2). One final point that should be made is that licences and contracts are also all about what was in the minds of the parties, so for open source licences on community code, the norms and practices of the community matter in addition to what the licence actually says and what courts have made of it. In the final analysis, if the community norm of, say, a GPLv2 project is to accept Apache-2 code allowing for the stronger patent and indemnity clauses, then that will become the understood basis for interpreting the GPLv2 licence in that community.

For completeness, I should point out I’ve used the no harm no foul reasoning before when arguing that CDDL and GPLv2 are compatible.

April 28, 2023 01:27 PM

Brendan Gregg: eBPF Observability Tools Are Not Security Tools

eBPF has many uses in improving computer security, but just taking eBPF observability tools as-is and using them for security monitoring would be like driving your car into the ocean and expecting it to float. Observability tools are designed have the lowest overhead possible so that they are safe to run in production while analyzing an active performance issue. Keeping overhead low can require tradeoffs in other areas: tcpdump(8), for example, will drop packets if the system is overloaded, resulting in incomplete visibility. This creates an obvious security risk for tcpdump(8)-based security monitoring: An attacker could overwhelm the system with mostly innocent packets, hoping that a few malicious packets get dropped and are left undetected. Long ago I encountered systems which met strict security auditing requirements with the following behavior: If the kernel could not log an event, it would immediately **halt**! While this was vulnerable to DoS attacks, it met the system's security auditing non-repudiation requirements, and logs were 100% complete. There are ways to evade detection in other tools as well, like top(1) (since it samples processes and relies on its comm field) and even ls(1) (putting escape characters in files). Rootkits do this. These techniques have been known in the industry for decades and haven't been "fixed" because they aren't "broken." They are cars, not boats. Similar methods can be used to evade detection in the eBPF bcc and bpftrace observability tools as well: overwhelming them with events, doing time-of-check-time-of-use attacks (TOCTOU), escape characters, etc. When will the eBPF community "fix" these tools? Well, when will Tesla fix my Model 3 so I can drive it under the Oakland bridge instead of over it? (I joke, and I don't drive a Tesla.) What you actually want is a security monitoring tool that meets a different set of requirements. Trying to adapt observability tools into security tools generally increases overhead (e.g., adding extra probes) which negates the main reason I developed these using eBPF in the first place. That would be like taking the wheels off a car to help make it float. There are other issues as well, like decreasing maintainability when moving probes from stable tracepoints to unstable inner workings for TOU tracing. Had I written these as security tools to start with, I would have done them differently: I'd start with LSM hooks, use a plugin model instead of standalone CLI tools, support configurable policies for event drop behavior, optimize event logging (which we still haven't [done](https://github.com/iovisor/bcc/issues/1033)), and lots more. None of this should be news to experienced security engineers. I'm writing this post because others see the tools and examples I've shared and believe that, with a bit of shell scripting, they could have a good security monitoring product. I get that it looks that way, but in reality there's a bunch of work to do. Ideally I'd link to an example in bcc for security monitoring (we could create a subdirectory for them) but that currently doesn't exist. In the meantime my best advice is: If you are making a security monitoring product, hire a good security engineer (e.g., someone with solid pen-testing experience). BPF for security monitoring was first explored by myself and a Netflix security engineer, Alex Maestretti, in a [2017 BSides talk] \(some slides below). Since then I've worked with other security engineers on the topic (hi Michael, Nabil, Sargun, KP). (I also did security work many years ago, so I'm not completely new to the topic.)



BSidesSF2017 BPF security monitoring: Alex Maesretti, Brendan Gregg
There is potential for an awesome eBPF security product, and it's not just the visibility that's valuable (all those arrows) it's also the low overhead. These slides included our [overhead evaluation] showing bcc/eBPF was far more efficient than auditd or go-audit. (It was pioneering work, but unfortunately the slides are all we have: Alex, I, and others left Netflix before open sourcing it.) There are now other eBPF security products, including open source projects (e.g., [tetragon]), but I don't know enough about them all to have a recommendation. Note that I'm talking about the observability tools here and not the eBPF kernel runtime itself, which has been designed as a secure sandbox. Nor am I talking about privilege escalation, since to run the tools you already need root access (that car has sailed!). [2017 BSides talk]: https://www.brendangregg.com/Slides/BSidesSF2017_BPF_security_monitoring [overhead evaluation]: https://www.brendangregg.com/Slides/BSidesSF2017_BPF_security_monitoring/#17 [tetragon]: https://github.com/cilium/tetragon

April 28, 2023 12:00 AM

April 25, 2023

Kernel Podcast: S2E4 – 2023/04/24

Season 2 – Episode 4 – 2023/04/24

Summary

The latest stable kernel is Linux 6.3, released by Linus Torvalds on Sunday, April 23rd, 2023.

The latest mainline (development) kernel is 6.3. The Linux 6.4 “merge window” is open.

Linux 6.3

Linus Torvalds announced the release of Linux 6.3, noting, “It’s been a calm release this time around, and the last week was really no different. So here we are, right on schedule”. As usual, the KernelNewbies website has a summary of Linux 6.3, including links to the appropriate LWN (Linux Weekly News) articles with deep dives for each new feature (if you like this podcast and want to support Linux Kernel journalism, please subscribe to Linux Weekly News).

Linux 6.3 includes additional support for the Rust programming language, a new red-black tree data structure for BPF programs, and the removal of a large number of legacy Arm systems. 

With the release of Linux 6.3 comes the opening of the “merge window” (period of time during which disruptive changes are allowed to be merged into the kernel source code) for what will be Linux 6.4 in another couple of months. The next podcast release will include a full summary.

Thorsten Leemhuis has been doing his usual excellent work tracking regressions. He posted multiple updates during the Linux 6.3 development cycle as usual, at one point saying that “The list of regressions from the 6.3 cycle I track is still quite short”. Most seemed to relate to build problems that had stalled for fixes. He had been concerned that there “are two regressions from the 6.2 cycle still not fixed”. These included that “Wake-on-lan (WOL) apparently is broken for a huge number of users” and “a huge number of DISCARD request on NVME devices with Btrfs” causing “a performance regression for some users”. With the final release of Linux 6.3, he has “nothing much to report”, with just “two regression from the 6.3 cycle…worth mentioning”.

Sebastian Andrej Siewior announced pre-empt RT (Real Time) patch v6.3-rc5-rt8.

Shuah Khan posted a summary of complaints addressed by the Linux Kernel Code of Conduct Committee between October 1, 2022 through March 31, 2023. During that time, they received reports of “Unacceptable behavior of comments in email” 6 times. Most were resolved with “Clarification on the Code of Conduct related to maintainer rights and responsibility to reject code”. Overall “The reports were about the decisions made in rejecting code and these actions are not viewed as violations of the Code of Conduct”.

Russia

It cannot have escaped anyone’s attention that there is an active military conflict ongoing in Europe. I try to keep politics out of this podcast. We are, after all, not lacking for other places in which to debate our opinions. Similarly, for the most part, it can be convenient as Open Source developers to attempt to live in an online world devoid of politics and physical boundaries, but the real world very much continues to exist, and in the real world there are consequences (in the form of sanctions) faced by those who invade other sovereign nations. Those consequences can be imposed by governments, but also by fellow developers. The latter was the case over the past month with a patch posted to the Linux “netdev” networking development list.

An engineer from (sanctioned) Russian company Baikal Electronics attempted to post some network patches. His post was greeted by a terse response from one of the maintainers: “We don’t feel comfortable accepting patches from or relating to hardware produced by your organization. Please withhold networking contributions until further notice”. Baikal is known for its connections to the Russian state. The question of official policy was subsequently raised by James Harkonnen, citing a message allegedly from Linus in which he reportedly said “I will not stop any kernel developer I trust from taking patches from Russian sources that they in turn trust, but at the same time I will also not override anybody who goes “I don’t want to have anything to do with this” and doesn’t want to work with Russian companies”. James wanted a clarification as to any official position. As of this date no follow up discussion appears to have taken place, and there does not appear to be an official kernel-wide policy on Russian patches.

Introducing Bugbot

Konstantin Ryabitsev, who is responsible for running kernel.org on behalf of Linux Foundation, posted “Introducing bugbot”, in which he described a new tool that aims to be “a bridge between bugzilla [as in bugzilla.kernel.org] and public-inbox (the mailing list). The tool is “still a very early release” but it is able to “Create bugs from mailing list discussions, with full history”, and “Start mailing list threads from pre-triaged bugzilla bugs”. He closed (presciently) with “bugbot is very young and probably full of bugs, so it will still see a lot of change and will likely explode a couple of times”. True to the prediction, bugbot saw that it was summoned by the announcement of its existence and it replied to the thread, which Konstantin used as an example of the “may explode” comment he had made. Generally feedback to the new tool was positive.

Ongoing Development

Anjali Kulkarni posted version 3 of “Process connector bug fixes & enhancements”, a patch series to improve the performance of monitoring the exit of dependent threads. According to Anjali, “Oracle DB runs on a large scale with 100000s of short lived processes, starting up and exiting quickly. A process monitoring DB daemon which tracks and cleans up after processes that have died without a proper exit needs notifications only when a process died with a non-zero exit code (which should be rare)”. The patches allow a “client [to] register to listen for only exit or fork or a mix of all events. This greatly enhances performance”.

Vlastimil Babka posted “remove SLOB and allow kfree() with kmem_cache_alloc()”. In the patch posted, Vlastimil notes that “The SLOB allocator was deprecated in 6.2 so I think we can start exposing the complete removal in for-next and aim at 6.4 if there are no complaints”.

Thorsten Leemhuis (“the Linux kernel’s regression tracker”) poked an older thread about a 20% UDP performance degradation that Tariq Toukan (NVIDIA) had reported a few months ago. The report observed that a specific CFS (Completely Fair Scheduler, the current default Linux scheduler) patch was the culprit, but that the team discovering it “couldn’t come up with a good explanation how this patch causes this issue”. Thorsten tagged the mail for followup tracking.

Lukas Bulwahn posted “Updating information on lanana.org”. Lanana was setup to be “The Linux Assigned Names and Numbers Authority”, a play on organizations like the IANA: Internet Assigned Numbers Authority, that assigns e.g. IP addresses on the internet. As the patches note, “As described in Documentation/admin-guide/devices.rst, the device number register (or linux device list) is at Documentation/admin-guide/devices.txt and no longer maintained at lanana.org”. Lanana still technically hosts some of the LSB (Linux Standard Base) IDs.

On the Rust front, Asahi Lina posted “rust: add uapi crate” that “introduce[s] a new ‘uapi’ crate that will contain only these [uapi] publicly usable definitions” for use by userspace APIs.

Marcelo Tosatti posted “fold per-CPU vmstats remotely”, a patch that notes a (Red Hat) customer had encountered a system in which 48 out of 52 CPUs were in a “nohz_full” state (i.e. completely idle with the idle “tick” interrupt stopped), where a process on the system was “trapped in throttle_direct_reclaim” (a low memory “reclaim” codepath) but was not making progress because the counters the reclaim code wanted to use were stale (coming from a completely idle CPU) and not updating. The patch series causes the “vmstat_shepered” kernel thread to “flush the per-CPU counters to the global counters from remote [other] CPUs”.

Reinette Chatre posted “vfio/pci: Support dynamic allocation of MSI-X interrupts”. MSIs are “Message Signaled Interrupts”, typically used by modern buses, such as PCIe, in which an interrupt is not signaled using a traditional wiggling of a wire, but instead by a memory write to a special magic address that subsequently causes an actual hard-wired interrupt to be asserted. In the patch posting, Reinette noted that “Qemu allocates interrupts incrementally at the time the guest unmasks an interrupt, for example each time a Linux guest runs request_irq(). Dynamic allocation of MSI-X interrupts was not possible until v6.2. This prompted Qemu to, when allocating a new interrupt, first release a previously allocated interrupts (including disable of MSI-X) followed by re-allocation of all interrupts that includes the new interrupt”. This of course may not be possible while a device or accelerator is running. The patches are marked as RFC (Request For Comments) because “vfio support for dynamic MSI-X needs to work with existing user space as well as upcoming user space that takes advantage of this feature”. Reinette adds, “I would appreciate guidance on the expectations and requirements surrounding error handling when considering existing user space”. She provides several scenarios to consider.

Tejun Heo posted version 3 of “sched: Implement BPF extensible scheduler class”, which “proposed a new scheduler class called ‘ext_sched_class’, or sched_ext, which allows scheduling policies to be implemented as BPF programs”. BPF (Berkeley Packet Filter) programs are small specially processed “bytecode” programs that can be loaded into the kernel and run within a special form of sandbox. They are commonly used to implement certain tracing logic and come with restrictions (for obvious reasons) on the nature of the modifications they can make to a running kernel. Due to their complexity, and potential intrusiveness of allowing scheduling algorithms to be implemented in BPF programs, the patches come with a (lengthy) “Motivation” section, describing the “Ease of experimentation and exploration”, among other reasons for allowing BPF extension of the scheduler instead of requiring traditional patches. An example provided includes that of implementing an L1TF (L1 Terminal Fault, a speculation execution security side-channel bug in certain x86 CPUs) aware scheduler that performs co-scheduling of (safe to pair) peer threads using sibling hyperthreads using BPF.

Joel Fernandes sent a patch adding himself as a maintainer for RCU, noting “I have spent years learning / contributing to RCU with several features, talks and presentations, with my most recent work being on Lazy-RCU. Please consider me for M[aintainer], so I can tell my wife why I spend a lot of my weekends and evenings on this complicated and mysterious thing — which is mostly in the hopes of preventing the world from burning down because everything runs on this one way or another”. RCU (Read-Copy-Update) is a notoriously difficult subsystem to understand yet it is a feature of certain modern Operating Systems that allows them to gain significant performance enhancements from the fundamental notion of having different views into the same data, based upon point-in-time producers and consumers that come and go. Joel later followed up with “Core RCU patches for 6.4”, including the shiny new MAINTAINERS change and several other fixes.

Separately, Paul McKenney (the original RCU author, and co-inventor) posted assorted updates to sleepable RCU (SRCU) reducing cache footprint and marking it non-optional in Kconfig (kernel build configuration), “courtesy of new-age printk() requirements”.

Mike Kravetz raised a concern about THP (Transparent Huge Page) “backed thread stacks”. In his mail, he cited a “product team” that had “recently experienced ‘memory bloat’ in their environment” due to the alignment of the allocations they had used for thread local stacks within the Java Virtual Machine (JVM) runtime. Mike questioned whether stacks should always be THP given that “Stacks by their very nature grow in somewhat unpredictable ways over time”. Most replies were along the lines that the JVM should alter how it does allocations to use the MADV_NOHUGEPAGE parameter to madvise when allocating space for thread stacks.

Carlos Llamas posted “Using page-fault handler in binder” about “trying to remove the current page handling in [Android’s userspace IPC] binder and switch to using ->fault() and other mm/ infrastructure”. He was seeking pointers and input on the direction from other developers.

Mike Rapoport posted a patch series that “move[s] core MM initialization to mm/mm_init.c”.

Randy Dunlap noted that uclinux.org was dead and requested references to it be removed from the Linux kernel MAINTAINERS file.

Jonathan Corbet (of LWN) posted various cleanups to the kernel documentation (which he maintains), including an “arch reorg” to clean up architecture specific docs.

Architectures

Arm

Lukasz Luba posted “Introduce runtime modifiable Energy Model”, a patch set that “adds a new feature which allows to modify Energy Model (EM) power values at runtime. It will allow to better reflect power model of a recent SoCs and silicon. Different characteristics of the power usage can be leverages and thus better decisions made during task placement”. Thus, the kernel’s (CFS) scheduler can (with this patch) make a decision about where to schedule (place, or migrate) a running process (known as a task within the kernel) according to the power usage that the silicon knows will vary according to nature of the workload, and its use of hardware. For example, heavy GPU use will cause a GPU to heat up and alter a chip’s (SoC’s) thermal properties in a manner that may make it better to migrate other tasks to a different core.

Itanium

Reports of Itanium’s demise may not have been greatly exaggerated, but when it comes to the kernel they may have been a little premature by a month or two. Florian Weimer followed up to “Retire IA64/Itanium support” with a question, “Is this still going ahead? In userspace, ia64 is of course full of special cases, too, so many of us really want to see it gone, but we can’t really start the removal process while there is still kernel support”.

LoongArch

Tianrui Zhao posted version 5 of “Add KVM LoongArch support”.

Huacai Chen posted a patch, “LoongArch: Make WriteCombine configurable for ioremap()” that aims to work around a PCIe protocol violation in the implementation of the LS7A chipset.

Separately, Huacai also posted a patch enabling the kernel itself to use FPU (Floating Point Unit) functions. Quoting the patch, “They can be used by some other kernel components, e.g. the AMDGPU graphic driver for DCN”.

WANG Xuerui posted “LongArch: Make bounds-checking instructions useful”, referring to “BCE” (Bounds Checking Error) instructions, similar to those of other architectures, such as x86_64.

POWER

Laurent Dufour posted “Online new threads according to the current SMT level”, which aims to balance a hotplugged CPU’s SMT level against the current one used by the overall system. For example, a system capable of SMT8 but booted in SMT4 will currently nonetheless online all 8 SMT threads of a subsequently added CPU, rather than only 4 (to match the system).

RISC-V

Evan Green posted the fourth version of “RISC-V Hardware Probing User Interface”, which aims to handle the number of (potentially incompatible) ISA extensions present in implementations of the RISC-V architecture. The basic idea is to provide a vDSO (virtual Dynamic Shared Object – a kind of library that appears in userspace and is fast to link against, but is owned by the kernel) and backing syscall (for fallback use by the vDSO in certain cases) that can quickly hand an application key/value pairs representative of potential ISA features present on a system. The previous attempts had experienced pushback, so this time Evan came with performance numbers showing the (many) orders of magnitude differences in performance between using a vDSO/syscall approach vs. the sysfs file interface originally counter proposed by Greg KH (Greg Kroah-Hartman). Greg had preferred an application perform many open calls to parse sysfs files in order to determine the capabilities of a system, but this would be expensive for every binary. This patch series was later merged by Palmer Dabbelt (the RISC-V kernel maintainer) and should therefore make its way into the Linux 6.4 kernel series in the next couple of months.

Sia Jee Heng posted version 5 of a patch series implementing hibernation support for RISC-V. According to the posting, “This series adds RISC-V Hibernation/suspend to disk support. Low level Arch functions were created to support hibernation”. The cover letter explains how e.g. swsusp_arch_resume “creates a temporary page table that [covering only] the linear map. It copies the restore code to a ‘safe’ page, then [start] restore the memory image”.

Heiko Stuebner posted “RISC-V: support some cryptography accelerations”. These rely on version 14 of a previous patch series adding experimental support for the “v” (vector) extension, which has not been ratified (made official) by the RISC-V International organization yet. And speaking of this, a recent discussion of the non-standard implementation of the RISC-V vector extension in the “T-Head C9xx” cores suggests describing those as an “errata” implementation.

The PINE64 project recently began shipping a RISC-V development board known as “Star64”. This board uses the StarFive JH7110 SoC for which Samin Guo recently posted an updated ethernet driver, apparently based on the DesignWare MAC from Synopsys. Separately, Walker Chen posted a DMA driver for the same SoC, and Mason Huo posted cpufreq support (which included enabling “the axp15060 pmic for the cpu power source”). Seems an effort is underway to upstream support for this low-cost “Raspberry Pi”-like alternative in the RISC-V ecosystem.

Greg Ungerer posted “riscv: support ELF format binaries in nommu mode” which does what it says on the tin: “add the ability to run ELF format binaries when running RISC-V in nommu mode. That support is actually part of the ELF-FDPIC loader, so these changes are all about making that work on RISC-V”. Greg notes, “These changes have not been used to run actual ELF-FDPIC binaries. It is used to load and run normal ELF – compiled -pie format. Though the underlying changes are expected to work with full ELF-FDPIC binaries if or when that is supported on RISC-V in gcc”.

Anup Patel posted version 18 of “RISC-V IPI Improvements” which aims to teach RISC-V (on suitable hardware) how to use “normal per-CPU interrupts” to send IPIs (Inter-Processor Interrupts), as well as remote TLB (Translation Lookaside Buffer) flushes and cache maintenance operations without having to resort to calls into “M” mode firmware.

x86 (x86_64)

Rick Edgecombe posted version 8 of “Shadow stacks for userspace”, to which Borislav Petkov replied “Yes, finally! That was loooong in the making. Thanks for the persistence and patience”. He signed off as having reviewed the patches.

Ian Rogers posted “Event updates for GNR, MTL and SKL”. Apparently these perf events are generated automatically using a script on Intel’s github (that’s pretty sweet).

Usama Arif posted version 15 of “Parallel CPU bringup for x86_64”. This is about doing parallel calls to INIT/SIPI/SIPI (the initialization sequences used by x86 CPUs to bring them up) rather than the single threaded process that previously was used by the Linux kernel.

Tony Luck posted version 2 of “Handle corrected machine check interrupt storms”, which includes additional patches from Smita Koralahalli that “Extend the logic of handling Intel’s corrected machine check interrupt storms to AMD’s threshold interrupts”. 

Yi Liu posted “iommu: Add nested domain support”, which “Introduce[s] a new domain type for a user space I/O address, which is nested on top of another address space address represented by a UNMANAGED domain”.

Kirill A. Shutemov posted version 16 of “Linear Address Masking enabling”. As he noted, “(LAM) modifies the checking that is applied to 64-bit linear addresses, allowing software to use of the untranslated address bits for metadata. The capability can be used for efficient address sanitizers (ASAN) implementation and for optimizations in JITs and virtual machines”. It’s also been present in architectures such as Arm for many, many years as TBI (Top Byte Ignore), etc.

Kuppuswamy Sathyanarayanan posted “TDX Guest Quote generation support”, which enables “TDX” (Trusted Domain Extensions – aka Confidential Compute) guests to attest to their “trustworthiness to other entities before provisioning secrets to the guest”. The patch describes a two step process including a “TDREPORT generation” and a “Quote generation”. The report captures measurements while the report is sent to a “Quoting Enclave” (QE) that generates a “remotely verifiable Quote”. A special conduit is provided for guests to send these quotes.

Shan Kang posted some benchmark results from KVM for Intel’s “FRED” (Flexible Return and Event Delivery) new syscall/sysenter enhanced architecture.

Mario Limonciello posted “Add vendor agnostic mechanism to report hardware sleep”, noting that “An import part of validating that S0ix [an SoC level idle power state] worked properly is to check how much of a cycle was spent in a hardware sleep state”.

April 25, 2023 12:11 AM

April 24, 2023

Linux Plumbers Conference: eBPF & Networking Track

Linux Plumbers Conference 2023 is pleased to host the eBPF & Networking Track!

For the fourth year in a row, the eBPF & Networking Track is going to bring together developers, maintainers, and other contributors from all around the globe to discuss improvements to the Linux kernel’s networking stack as well as BPF subsystem and their surrounding user space ecosystems such libraries, loaders, compiler backends, and other related system tooling.

The gathering is designed to foster collaboration and face to face discussion of ongoing development topics as well as to encourage bringing new ideas into the development community for the advancement of both subsystems.

Proposals can cover a wide range of topics related to Linux networking and BPF covering improvements in areas such as (but not limited to) core networking, protocols, routing, performance, tunneling, drivers, BPF infrastructure and its use in tracing, security, networking, scheduling and beyond, as well as non-kernel components like libraries, compilers, testing infra and tools.

Please come and join us in the discussion. We hope to see you there!

April 24, 2023 08:49 AM

Dave Airlie (blogspot): Fedora 38 LLVM vs Team Fortress 2 (TF2)

F38 just released and seeing a bunch of people complain that TF2 dies on AMD or other platforms when lavapipe is installed. Who's at fault? I've no real idea. How to fix it? I've no real idea.

What's happening?

AMD OpenGL drivers use LLVM as the backend compiler. Fedora 38 updated to LLVM 16. LLVM 16 is built with c++17 by default. C++17 introduces new "operator new/delete" interfaces[1].

TF2 ships with it's own libtcmalloc_minimal.so implementation, tcmalloc expects to replace all the new/delete interfaces, but the version in TF2 must not support or had incorrect support for the new align interfaces.

What happens is when TF2 probes OpenGL and LLVM is loaded, when DenseMap initializes, one "new" path fails to go into tcmalloc, but the "delete" path does, and this causes tcmalloc to explode with

"src/tcmalloc.cc:278] Attempt to free invalid pointer"

Fixing it?

I'll talk to Valve and see if we can work out something, LLVM 16 doesn't seem to support building with C++14 anymore. I'm not sure if static linking libstdc++ into LLVM might avoid the tcmalloc overrides, it might not also be acceptable to the wider Fedora community.

[1] https://www.cppstories.com/2019/08/newnew-align/

April 24, 2023 03:29 AM

April 19, 2023

Dave Airlie (blogspot): nouveau/gsp + kernel module firmware selection for initramfs generation

There are plans for nouveau to support using the NVIDIA supplied GSP firmware in order to support new hardware going forward

The nouveau project doesn't have any input or control over the firmware. NVIDIA have made no promises around stable ABI or firmware versioning. The current status quo is that NVIDIA will release versioned signed gsp firmwares as part of their driver distribution packages that are version locked to their proprietary drivers (open source and binary). They are working towards allowing these firmwares to be redistributed in linux-firmware.

The NVIDIA firmwares are quite large. The nouveau project will control the selection of what versions of the released firmwares are to be supported by the driver, it's likely a newer firmware will only be pulled into linux-firmware for:

  1. New hardware support (new GPU family or GPU support)
  2. Security fix in the firmware
  3. New features that is required to be supported

This should at least limit the number of firmwares in the linux-firmware project.

However a secondary effect of the size of the firmwares is that having the nouveau kernel module at more and more MODULE_FIRMWARE lines for each iteration will mean the initramfs sizes will get steadily larger on systems, and after a while the initramfs will contain a few gsp firmwares that the driver doesn't even need to run.

To combat this I've looked into adding some sort of module grouping which dracut can pick one out off.

It currently looks something like:

MODULE_FIRMWARE_GROUP_ONLY_ONE("ga106-gsp");
MODULE_FIRMWARE("nvidia/ga106/gsp/gsp-5258902.bin");
MODULE_FIRMWARE("nvidia/ga106/gsp/gsp-5303002.bin");
MODULE_FIRMWARE_GROUP_ONLY_ONE("ga106-gsp"); 

This group only one will end up in the module info section and dracut will only pick one module from the group to install into the initramfs. Due to how the module info section is constructed this will end up picking the last module in the group first.

The dracut MR is: 

https://github.com/dracutdevs/dracut/pull/2309

The kernel one liner is:

https://lore.kernel.org/all/20230419043652.1773413-1-airlied@gmail.com/T/#u


 

April 19, 2023 05:16 AM

April 18, 2023

Matthew Garrett: PSA: upgrade your LUKS key derivation function

Here's an article from a French anarchist describing how his (encrypted) laptop was seized after he was arrested, and material from the encrypted partition has since been entered as evidence against him. His encryption password was supposedly greater than 20 characters and included a mixture of cases, numbers, and punctuation, so in the absence of any sort of opsec failures this implies that even relatively complex passwords can now be brute forced, and we should be transitioning to even more secure passphrases.

Or does it? Let's go into what LUKS is doing in the first place. The actual data is typically encrypted with AES, an extremely popular and well-tested encryption algorithm. AES has no known major weaknesses and is not considered to be practically brute-forceable - at least, assuming you have a random key. Unfortunately it's not really practical to ask a user to type in 128 bits of binary every time they want to unlock their drive, so another approach has to be taken.

This is handled using something called a "key derivation function", or KDF. A KDF is a function that takes some input (in this case the user's password) and generates a key. As an extremely simple example, think of MD5 - it takes an input and generates a 128-bit output, so we could simply MD5 the user's password and use the output as an AES key. While this could technically be considered a KDF, it would be an extremely bad one! MD5s can be calculated extremely quickly, so someone attempting to brute-force a disk encryption key could simply generate the MD5 of every plausible password (probably on a lot of machines in parallel, likely using GPUs) and test each of them to see whether it decrypts the drive.

(things are actually slightly more complicated than this - your password is used to generate a key that is then used to encrypt and decrypt the actual encryption key. This is necessary in order to allow you to change your password without having to re-encrypt the entire drive - instead you simply re-encrypt the encryption key with the new password-derived key. This also allows you to have multiple passwords or unlock mechanisms per drive)

Good KDFs reduce this risk by being what's technically referred to as "expensive". Rather than performing one simple calculation to turn a password into a key, they perform a lot of calculations. The number of calculations performed is generally configurable, in order to let you trade off between the amount of security (the number of calculations you'll force an attacker to perform when attempting to generate a key from a potential password) and performance (the amount of time you're willing to wait for your laptop to generate the key after you type in your password so it can actually boot). But, obviously, this tradeoff changes over time - defaults that made sense 10 years ago are not necessarily good defaults now. If you set up your encrypted partition some time ago, the number of calculations required may no longer be considered up to scratch.

And, well, some of these assumptions are kind of bad in the first place! Just making things computationally expensive doesn't help a lot if your adversary has the ability to test a large number of passwords in parallel. GPUs are extremely good at performing the sort of calculations that KDFs generally use, so an attacker can "just" get a whole pile of GPUs and throw them at the problem. KDFs that are computationally expensive don't do a great deal to protect against this. However, there's another axis of expense that can be considered - memory. If the KDF algorithm requires a significant amount of RAM, the degree to which it can be performed in parallel on a GPU is massively reduced. A Geforce 4090 may have 16,384 execution units, but if each password attempt requires 1GB of RAM and the card only has 24GB on board, the attacker is restricted to running 24 attempts in parallel.

So, in these days of attackers with access to a pile of GPUs, a purely computationally expensive KDF is just not a good choice. And, unfortunately, the subject of this story was almost certainly using one of those. Ubuntu 18.04 used the LUKS1 header format, and the only KDF supported in this format is PBKDF2. This is not a memory expensive KDF, and so is vulnerable to GPU-based attacks. But even so, systems using the LUKS2 header format used to default to argon2i, again not a memory expensive KDFwhich is memory strong, but not designed to be resistant to GPU attack (thanks to the comments pointing out my misunderstanding here). New versions default to argon2id, which is. You want to be using argon2id.

What makes this worse is that distributions generally don't update this in any way. If you installed your system and it gave you pbkdf2 as your KDF, you're probably still using pbkdf2 even if you've upgraded to a system that would use argon2id on a fresh install. Thankfully, this can all be fixed-up in place. But note that if anything goes wrong here you could lose access to all your encrypted data, so before doing anything make sure it's all backed up (and figure out how to keep said backup secure so you don't just have your data seized that way).

First, make sure you're running as up-to-date a version of your distribution as possible. Having tools that support the LUKS2 format doesn't mean that your distribution has all of that integrated, and old distribution versions may allow you to update your LUKS setup without actually supporting booting from it. Also, if you're using an encrypted /boot, stop now - very recent versions of grub2 support LUKS2, but they don't support argon2id, and this will render your system unbootable.

Next, figure out which device under /dev corresponds to your encrypted partition. Run

lsblk

and look for entries that have a type of "crypt". The device above that in the tree is the actual encrypted device. Record that name, and run

sudo cryptsetup luksHeaderBackup /dev/whatever --header-backup-file /tmp/luksheader

and copy that to a USB stick or something. If something goes wrong here you'll be able to boot a live image and run

sudo cryptsetup luksHeaderRestore /dev/whatever --header-backup-file luksheader

to restore it.

(Edit to add: Once everything is working, delete this backup! It contains the old weak key, and someone with it can potentially use that to brute force your disk encryption key using the old KDF even if you've updated the on-disk KDF.)

Next, run

sudo cryptsetup luksDump /dev/whatever

and look for the Version: line. If it's version 1, you need to update the header to LUKS2. Run

sudo cryptsetup convert /dev/whatever --type luks2

and follow the prompts. Make sure your system still boots, and if not go back and restore the backup of your header. Assuming everything is ok at this point, run

sudo cryptsetup luksDump /dev/whatever

again and look for the PBKDF: line in each keyslot (pay attention only to the keyslots, ignore any references to pbkdf2 that come after the Digests: line). If the PBKDF is either "pbkdf2" or "argon2i" you should convert to argon2id. Run the following:

sudo cryptsetup luksConvertKey /dev/whatever --pbkdf argon2id

and follow the prompts. If you have multiple passwords associated with your drive you'll have multiple keyslots, and you'll need to repeat this for each password.

Distributions! You should really be handling this sort of thing on upgrade. People who installed their systems with your encryption defaults several years ago are now much less secure than people who perform a fresh install today. Please please please do something about this.

comment count unavailable comments

April 18, 2023 06:35 PM

April 17, 2023

Matthew Garrett: Booting modern Intel CPUs

CPUs can't do anything without being told what to do, which leaves the obvious problem of how do you tell a CPU to do something in the first place. On many CPUs this is handled in the form of a reset vector - an address the CPU is hardcoded to start reading instructions from when power is applied. The address the reset vector points to will typically be some form of ROM or flash that can be read by the CPU even if no other hardware has been configured yet. This allows the system vendor to ship code that will be executed immediately after poweron, configuring the rest of the hardware and eventually getting the system into a state where it can run user-supplied code.

The specific nature of the reset vector on x86 systems has varied over time, but it's effectively always been 16 bytes below the top of the address space - so, 0xffff0 on the 20-bit 8086, 0xfffff0 on the 24-bit 80286, and 0xfffffff0 on the 32-bit 80386. Convention on x86 systems is to have RAM starting at address 0, so the top of address space could be used to house the reset vector with as low a probability of conflicting with RAM as possible.

The most notable thing about x86 here, though, is that when it starts running code from the reset vector, it's still in real mode. x86 real mode is a holdover from a much earlier era of computing. Rather than addresses being absolute (ie, if you refer to a 32-bit address, you store the entire address in a 32-bit or larger register), they are 16-bit offsets that are added to the value stored in a "segment register". Different segment registers existed for code, data, and stack, so a 16-bit address could refer to different actual addresses depending on how it was being interpreted - jumping to a 16 bit address would result in that address being added to the code segment register, while reading from a 16 bit address would result in that address being added to the data segment register, and so on. This is all in order to retain compatibility with older chips, to the extent that even 64-bit x86 starts in real mode with segments and everything (and, also, still starts executing at 0xfffffff0 rather than 0xfffffffffffffff0 - 64-bit mode doesn't support real mode, so there's no way to express a 64-bit physical address using the segment registers, so we still start just below 4GB even though we have massively more address space available).

Anyway. Everyone knows all this. For modern UEFI systems, the firmware that's launched from the reset vector then reprograms the CPU into a sensible mode (ie, one without all this segmentation bullshit), does things like configure the memory controller so you can actually access RAM (a process which involves using CPU cache as RAM, because programming a memory controller is sufficiently hard that you need to store more state than you can fit in registers alone, which means you need RAM, but you don't have RAM until the memory controller is working, but thankfully the CPU comes with several megabytes of RAM on its own in the form of cache, so phew). It's kind of ugly, but that's a consequence of a bunch of well-understood legacy decisions.

Except. This is not how modern Intel x86 boots. It's far stranger than that. Oh, yes, this is what it looks like is happening, but there's a bunch of stuff going on behind the scenes. Let's talk about boot security. The idea of any form of verified boot (such as UEFI Secure Boot) is that a signature on the next component of the boot chain is validated before that component is executed. But what verifies the first component in the boot chain? You can't simply ask the BIOS to verify itself - if an attacker can replace the BIOS, they can replace it with one that simply lies about having done so. Intel's solution to this is called Boot Guard.

But before we get to Boot Guard, we need to ensure the CPU is running in as bug-free a state as possible. So, when the CPU starts up, it examines the system flash and looks for a header that points at CPU microcode updates. Intel CPUs ship with built-in microcode, but it's frequently old and buggy and it's up to the system firmware to include a copy that's new enough that it's actually expected to work reliably. The microcode image is pulled out of flash, a signature is verified, and the new microcode starts running. This is true in both the Boot Guard and the non-Boot Guard scenarios. But for Boot Guard, before jumping to the reset vector, the microcode on the CPU reads an Authenticated Code Module (ACM) out of flash and verifies its signature against a hardcoded Intel key. If that checks out, it starts executing the ACM. Now, bear in mind that the CPU can't just verify the ACM and then execute it directly from flash - if it did, the flash could detect this, hand over a legitimate ACM for the verification, and then feed the CPU different instructions when it reads them again to execute them (a Time of Check vs Time of Use, or TOCTOU, vulnerability). So the ACM has to be copied onto the CPU before it's verified and executed, which means we need RAM, which means the CPU already needs to know how to configure its cache to be used as RAM.

Anyway. We now have an ACM loaded and verified, and it can safely be executed. The ACM does various things, but the most important from the Boot Guard perspective is that it reads a set of write-once fuses in the motherboard chipset that represent the SHA256 of a public key. It then reads the initial block of the firmware (the Initial Boot Block, or IBB) into RAM (or, well, cache, as previously described) and parses it. There's a block that contains a public key - it hashes that key and verifies that it matches the SHA256 from the fuses. It then uses that key to validate a signature on the IBB. If it all checks out, it executes the IBB and everything starts looking like the nice simple model we had before.

Except, well, doesn't this seem like an awfully complicated bunch of code to implement in real mode? And yes, doing all of this modern crypto with only 16-bit registers does sound like a pain. So, it doesn't. All of this is happening in a perfectly sensible 32 bit mode, and the CPU actually switches back to the awful segmented configuration afterwards so it's still compatible with an 80386 from 1986. The "good" news is that at least firmware can detect that the CPU has already configured the cache as RAM and can skip doing that itself.

I'm skipping over some steps here - the ACM actually does other stuff around measuring the firmware into the TPM and doing various bits of TXT setup for people who want DRTM in their lives, but the short version is that the CPU bootstraps itself into a state where it works like a modern CPU and then deliberately turns a bunch of the sensible functionality off again before it starts executing firmware. I'm also missing out the fact that this entire process only kicks off after the Management Engine says it can, which means we're waiting for an entirely independent x86 to boot an entire OS before our CPU even starts pretending to execute the system firmware.

Of course, as mentioned before, on modern systems the firmware will then reprogram the CPU into something actually sensible so OS developers no longer need to care about this[1][2], which means we've bounced between multiple states for no reason other than the possibility that someone wants to run legacy BIOS and then boot DOS on a CPU with like 5 orders of magnitude more transistors than the 8086.

tl;dr why can't my x86 wake up with the gin protected mode already inside it

[1] Ha uh except that on ACPI resume we're going to skip most of the firmware setup code so we still need to handle the CPU being in fucking 16-bit mode because suspend/resume is basically an extremely long reboot cycle

[2] Oh yeah also you probably have multiple cores on your CPU and well bad news about the state most of the cores are in when the OS boots because the firmware never started them up so they're going to come up in 16-bit real mode even if your boot CPU is already in 64-bit protected mode, unless you were using TXT in which case you have a different sort of nightmare that if we're going to try to map it onto real world nightmare concepts is one that involves a lot of teeth. Or, well, that used to be the case, but ACPI 6.4 (released in 2021) provides a mechanism for the OS to ask the firmware to wake the CPU up for it so this is invisible to the OS, but you're still relying on the firmware to actually do the heavy lifting here

comment count unavailable comments

April 17, 2023 06:54 AM

April 11, 2023

Linux Plumbers Conference: CFP Open – Microconferences

We are pleased to announce the Call for Papers (CFP) for Microconferences at the Linux Plumbers Conference (LPC) 2023.

LPC 2023 is currently planned to take place in Richmond, VA, USA from 13 November to 15 November. For details about the location, co-location with other events see our website and social media for updates.

Like in 2022, Linux Plumbers Conference will be a hybrid event but still, ideally microconference runners should be willing and able to attend in person.

As the name suggests, LPC is about Linux plumbing encompassing topics from kernel and userspace. A microconference is a set of sessions organized around a particular topic. The topic can be a kernel subsystem or a specific problem area in either kernel or userspace.

A microconference is supposed to be research and development in action and an abstract for a Microconference should be thought of as a set of research questions and problem statements.

The sessions in each microconference are expected to address specific problems and should generate new ideas, solutions, and patches. Sessions should be focused on discussion. Presentations should always aim to aid or kick off a discussion. If your presentation feels like a talk we would recommend to consider submitting to the LPC refereed track.

In the past years microconferences were organized around topics such as security, scalability, energy efficiency, toolchains, containers, printing, system boot, Android, scheduling, filesystems, tracing, or real-time. The LPC microconference track is open to a wide variety of topics as long as it is focused, concerned with interesting problems, and is related to open source and the wider Linux ecosystem. We are happy about a wide range of topics!

A microconference submission should outline the overall topic and list key people and problems which can be discussed. The list of problems and specific topics in a microconference can be continuously updated until fairly late. This will allow microconferences to cover topics that pop up after submission and to address new developments or problems.

Microconferences that have been at previous LPCs should list results and accomplishments in the submission and should make sure to cover follow-up work and new topics.

Submissions are due on or before 11:59PM UTC on Sunday, June 1, 2023.

April 11, 2023 07:43 AM

April 05, 2023

Pete Zaitcev: MinIO in the news

The company we mentioned previously is in the news. Seen at Blocks And Files:

If MinIO want to discourage software developers from including its code in their products it is managing this very well.

April 05, 2023 12:15 AM

April 04, 2023

Linux Plumbers Conference: CFP Open – Refereed Track Presentations

The Call for Refereed Presentation Proposals for the 2023 edition of the Linux Plumbers Conference (LPC) is now open. We plan to hold LPC in Richmond Virginia on November 13-15, 2023.

Submitters should ideally be able to give their presentation in person, although presenting remotely will be possible if necessary. Expectation is that the presentation will be done live in either case, to maximize audience interaction. Please see our website or social media for regular updates.

Refereed Presentations are 45 minutes in length and should focus on a specific aspect of the “plumbing” in a Linux system. Examples of Linux plumbing include core kernel subsystems, init systems, core libraries, windowing systems, management tools, device support, media creation/playback, and so on. The best presentations are not about finished work, but rather problem statements, proposals, or proof-of-concept solutions that require face-to-face discussions and debate.
The Refereed Presentations track will be running throughout all three days of the conference.

Linux Plumbers Conference Program Committee members will be reviewing all submitted proposals. High-quality submissions that cannot be accepted due to the limited number of slots will be forwarded to organizers of suitable Linux Plumbers Microconferences for further consideration.
Submissions are due on or before 11:59PM UTC on Sunday, August 6, 2023.

April 04, 2023 07:38 AM

March 24, 2023

James Bottomley: Converting Engines to OpenSSL-3 Providers

Engines in OpenSSL have a long history of providing new algorithms (Russian GOST hash/signature etc) but they can also be used to interface external crypto tokens (pkcs#11) or even key managers like my own TPM engine. I’ve actually been using my TPM2 engine for nearly a decade so that I no longer have to have an unprotected private keys anywhere on my laptops (including for ssh). The purpose of this post is to look at the differences between Providers and Engines and give advice on the minimum necessary Provider implementation to give back all the Engine functionality. So this post is aimed at Engine developers who wish to convert to Providers rather than giving user advice for either.

TPMs and Engines

TPM2 actually has a remarkable number of algorithms: hashing, symmetric encryption, asymmetric signatures, key derivation, etc. However, most TPMs are connected to the host over very slow busses (usually serial), which means that no-one in their right mind would use a TPM for bulk data operations (like hashing or symmetric encryption) since it will take orders of magnitude longer than if the native CPU did it. Thus from an Engine point of view, the TPM is really only good for guarding private asymmetric keys and doing sign or decrypt operations on them, which are the only capabilities the TPM engine has.

Hashes and Signatures

Although I said above we don’t use the TPM for doing hashes, the TPM2_Sign() routines insist on knowing which hash they’re signing. For ECDSA signatures, this is irrelevant, since the hash type plays no part in the signature (it’s always truncated to key length and converted to a bignum) but for RSA the ASN.1 form of the hash description is part of the toBeSigned data. The problem now is that early TPM2’s only had two hash algorithms (sha1 and sha256) and the engine wanted to be able to use larger hash sizes. The solution was actually easy: lie about the hash size for ECDSA, so always give the hash that’s the width of the key (sha256 for NIST P-256 and sha384 for NIST P-384) and left truncate the passed in hash if larger or left zero pad if smaller.

For RSA, the problem is more acute, since TPM2_Sign() actually takes a raw digest and adds the hash description but the engine code sends down the fully described hash which merely needs to be padded if PKCS1 (PSS data is fully padded when sent down) and encrypted with the private key. The solution to this taken years ago was not to bother with TPM2_Sign() at all for RSA keys but instead to do a Decrypt operation1. This also means that TPM RSA engine keys are marked as decryption keys, not signing keys.

The Engine Itself

Given that the TPM is really only guarding the private keys, it only makes sense to substitute engine functions for the private key operations. Although the TPM can do public key operations, the core OpenSSL routines do them much faster and no information is leaked about the private key by doing them through OpenSSL, so Engine keys were constructed from standard OpenSSL keys by substituting a couple of private key methods from the underlying key types. One thing Engines were really bad at was passing additional parameters at key creation time and doing key wrapping. The result is that most Engines already have a separate tool to create engine keys (create_tpm2_key for the TPM2 engine) because complex arguments are needed for TPM specific things like key policy.

TPM keys are really both public and private keys combined and the public part of the key can be accessed without a password (unlike OpenSSL keys) or even access to the TPM that created the key. However, the engine code doesn’t usually know when only the public part of the key will be required and password prompting is done in OpenSSL at key loading (the TPM doesn’t need a password until key use), so usually after a TPM key is created, the public key is also separately derived using a pkey operation and used as a normal public key.

The final, and most problematic Engine feature, is key loading. Engine keys must be loaded using a special API (ENGINE_load_private_key). OpenSSL built in applications require you to specify the key type (-keyform option) but most well written OpenSSL applications simply try loading the PEM key first, then the DER key then the Engine key (since they all have different APIs), but frequently the Engine key is forgotten leading to the application having to be patched if you want to use them with any engine.

Converting Engines to Providers

The provider API has several pieces which apply to asymmetric key handling: Store, Encode/Decode, Key Management, Signing and Decryption (plus many more if you provide hashes or symmetric algorithms). One thing to remember about the store API is that if you only have file based keys, you should use the generic file store instead. Implementing your own store is only necessary if you also have a URI based input (like PKCS#11). In fact the TPM Engine has a URI for persistent keys, so the TPM store implementation will be dealt with later.

Provider Basics

If a provider is specified on the OpenSSL command line, it will become the sole provider of every algorithm. That means that providers like the TPM2 one, which only fill in a subset of functions cannot operate on their own and must always be used with another provider (usually the default one). After initialization (see below) all provider actions are governed by algorithm tables. One of the key questions for any provider is what to do about algorithm names and properties. Because the TPM2 provider relies on external providers for other algorithms, it must use consistent key names (so “EC” for Elliptic curve and “RSA” for RSA), even though it has only a single key type. There are also elements of the provider key managements, like the way Elliptic Curve keys change name to “ECDSA” for signing and “ECDH” for derivation, which is driven by the key management query operation function. As far as I can tell, this provides no benefit and merely serves to add complexity to the provider, so my provider doesn’t implement these functions and uses the same key names throughout.

The most mysterious string of all is the algorithm property one. The manual gives very little clue as to what should be in it besides “provider=<provider name>”. Empirically it seems to have input, output and structure elements, which are primarily used by encoders and decoders: input can be either der or pem and structure must be the same as the OSSL_OBJECT_PARAM_DATA_STRUCTURE string produced by the der decoder (although you are free to choose any name for this). output is even more varied and the best current list is provided by the source; however the only encoder the TPM2 provider actually provides is the text one.

One of the really nice things about providers is that when OpenSSL is presented with a key to load, every provider will be tried (usually in the order they’re specified on the command line) to decode and load the key. This completely fixes the problem with missing ENGINE_load_private_key() functions is applications because now all applications can use any provider key. This benefit alone is enough to outweigh all the problems of doing the actual conversion to a provider.

Replacing Engine Controls

Engine controls were key/value pairs passed into engines. The TPM2 engine has two: “PIN” for the parent authority and “NVPREFIX” for the prefix which identifies a non-volatile key. Although these can be passed in with the ENGINE_ctrl() functions, they were mostly set in the configuration file. This latter mechanism can be replaced with the provider base callback core_get_params(). Most engine controls actually set global variables and with the provider, they could be placed into the provider context. However, for code sharing it’s easier simply to keep the current globals mechanism.

Initialization and Contexts

Every provider has to have an OSSL_provider_init() routine which fills in a dispatch table and allocates a core context, which is passed in to every other context routine. For a provider, there’s really only one instance, so storing variables in the provider context is really no different (except error handling and actually getting destructors) from using static variables and since the engine used static variables, that’s what we’ll stick with. However, pretty much every routine will need an allocated library context, so it’s easiest to allocate at provider init time and pass it through as the provider context. The dispatch routine must contain a query_operation function, and probably needs a teardown function if you need to use a destructor, but nothing else.

All provider function groups require a newctx() and freectx() call. This is not optional because the current OpenSSL code calls them without checking so they cannot be NULL. Thus for function groups (like encoders and key management) where new contexts aren’t really required it makes sense to use pass through context functions that simply pass through the provider context for newctx() and do nothing for freectx().

The man page implies it is necessary to pick a load of functions from the in argument, but it seems unnecessary for those which the OpenSSL library already provides. I assume it’s something to do with a provider not requiring OpenSSL symbols, but it’s impossible to implement a provider today without relying on other OpenSSL functions than those which can be picked out of the in argument.

Decoders

Decoders are used to convert a read file from PEM to DER (this is essentially the same conversion for every provider, so it is strange you have to do this rather than it being done in the core routines) and then DER to an internal key structure. The remaining decoders take DER in and output a labelled key structure (which is used as a component of the EVP_PKEY), if you do both RSA and EC keys, you need one for each key type and, unfortunately, they must be provided and may not cross decode (the RSA decoder must reject EC keys and vice versa). This is actually required so the OpenSSL core can tell what type of key it has but is a royal pain for things like the TPM where the key DER is identical regardless of key type:

const OSSL_ALGORITHM decoders[] = {
	{ "DER", "provider=tpm2,input=pem", decode_pem_fns },
	{ "RSA", "provider=tpm2,input=der,structure=TPM2", decode_rsa_fns },
	{ "EC", "provider=tpm2,input=der,structure=TPM2", decode_ec_fns },
	{ NULL, NULL, NULL }
};

The decode_pem_fns can be cut and pasted from any provider with the sole exception that you probably have a different PEM guard string that you need to check for.

Then a sample decoder function set looks like:

static const OSSL_DISPATCH decode_rsa_fns[] = {
	{ OSSL_FUNC_DECODER_NEWCTX, (void (*)(void))tpm2_passthrough_newctx },
	{ OSSL_FUNC_DECODER_FREECTX, (void (*)(void))tpm2_passthrough_freectx },
	{ OSSL_FUNC_DECODER_DECODE, (void (*)(void))tpm2_rsa_decode },
	{ 0, NULL }
};

The main job of the DECODER_DECODE function is to take the DER form of the key and convert it to an internal PKEY and send that PKEY up by reference so it can be consumed by a key management load.

Encoders

By and large, engines all come with creation tools for key files, which means that while you could now use the encoder routines to create key files, it’s probably better off to stick with what you have (especially for things like the TPM that can have complex policy statements attached to keys), so you can omit providing any encoder functions at all. The only possible exception is if you want the keys pretty printing, you might consider a text output encoder:

const OSSL_ALGORITHM encoders[] = {
	{ "RSA", "provider=tpm2,output=text", encode_text_fns },
	{ "EC", "provider=tpm2,output=text", encode_text_fns },
	{ NULL, NULL, NULL }
};

Which largely follows the format for decoders:

static const OSSL_DISPATCH encode_text_fns[] = {
	{ OSSL_FUNC_ENCODER_NEWCTX, (void (*)(void))tpm2_passthrough_newctx },
	{ OSSL_FUNC_ENCODER_FREECTX, (void (*)(void))tpm2_passthrough_freectx },
	{ OSSL_FUNC_ENCODER_ENCODE, (void (*)(void))tpm2_encode_text },
	{ 0, NULL }
};

Note: there are many more encode/decode function types you could supply, but the above are the essential ones.

Key Management

Nothing in the key management functions requires the underlying key object to be reference counted since it belongs to an already reference counted EVP_PKEY structure in the OpenSSL generic routines. However, the signature operations can’t be implemented without context duplication and the signature context must contain a reference to the provider key so, depending on how the engine implements keys, duplicating via reference might be easier than duplicating via copy. The minimum functionality to implement is LOAD, FREE and HAS. If you are doing Elliptic Curve derive or reference counting your engine keys, you will also need NEW. You also have to provide both GET_PARAMS and GETTABLE_PARAMS (many key management functions have to implement pairs like this) for at least the BITS, SECURITY_BITS and SIZE properties)2.

You must also implement the EXPORT (and EXPORT_TYPES, which must be provided but has no callers) so that you can convert your engine key to an external public key. Note the EXPORT function must fail if asked to export the private key otherwise the default provider will try to do the private key operations via the exported key as well.

If you need to do Elliptic Curve key derivation you must also implement IMPORT (and IMPORT_TYPES) because the creation of the peer key (even though it’s a public one) will necessarily go through your provider key managment functions.

The HAS function can be problematic because OpenSSL doesn’t assume the interchangeability of public and private keys, even if it is true of the engine. Thus the engine must remember in the decode routines what key selector was used (public, private or both) and make sure to condition HAS on that value.

Signatures

This is one of the most confusing areas for simple signing devices (which don’t do hashing) because you’d assume you can implement NEWCTX, FREECTX, SIGN_INIT and SIGN and be done. Unfortunately, in spite of the fact that all the DIGEST_SIGN_… functions can be implemented in terms of the previous functions and generic hashing, they aren’t, so all providers are required to duplicate hashing and signing functions including constructing the binary ASN.1 for the certificate signature function (via GET_CTX_PARAMS and its pair GETTABLE_CTX_PARAMS). Another issue a sign only token will get into is padding: OpenSSL supports a variety of padding schemes (for RSA) but is deprecating their export, so if your token doesn’t do an expected form of padding, you’ll need to implement that in your provider as well. Recalling that the TPM2 provider uses RSA Decryption for signatures means that the TPM2 provider implementation is entirely responsible for padding all signatures. In order to try to come up with a common solution, I added an opensslmissing directory to my provider under the MIT licence that anyone is free to incorporate into their provider if they end up having the same digest and padding problems I did.

Decryption and Derivation

The final thing a private key provider needs to do is decryption. This is a very different operation between Elliptic Curve and RSA keys, so you need two different operations for each (OSSL_OP_ASYM_CIPHER for RSA and OSSL_OP_KEYEXCH for EC). Each ends up being a slightly special snowflake: RSA because it may need OAEP padding (which the TPM does) but with the most usual cipher being md5 (so OAEP padding with arbitrary mask and hash function is also in opensslmissing), which the TPM doesn’t do. and EC because it requires derivation from another public key. The problem with this latter operation is that because of the way OpenSSL works, the public key must be imported into the provider before it can be used, so you must provide NEW, IMPORT and IMPORT_TYPES routines for key management for this to happen.

Store

The store functions only need to be used if you have to load keys that aren’t file based (for file based keys the default provider file store will load them). For the TPM there are a set of NV Keys with 0x81 MSO prefix that aren’t file based. We load these in the engine with //nvkey:<hex> as the designator (and the //nvkey: prefix is overridable in the config file). To get this to work in the Provider is slightly problematic because the scheme (the //nvkey: prefix) must be specified as the provider algorithm_name which is usually a constant in a static array. This means that the stores actually can’t be static and must have the configuration defined name poked into it before the store is used, but this is relatively easy to arrange this in the OSSL_provider_init() function. Once this is done, it’s relatively easy to create a store. The only really problematic function is the STORE_EOF one, which is designed around files but means you have to keep an eof indicator in the context and update it to be 1 once the load function has complete.

The Provider Recursion Problem

This doesn’t seem to be discussed anywhere else, but it can become a huge issue if your provider depends on another library which also uses OpenSSL. The TPM2 provider depends on either the Intel or IBM TSS libraries and both of those use OpenSSL for cryptographic operations around TPM transport security since both of them use ECDH to derive a seed for session encryption and HMAC. The problem is that ordinarily the providers are called in the order they’re listed, so you always have to specify –provider default –provider tpm2 to make up for the missing public key operations in the TPM2 provider. However, the OpenSSL core operates a cache for the provider operations it has previously found and searches the cache first before doing any other lookups, so if the EC key management routines are cached (as they are if you input a TPM format key) and the default ones aren’t (because inputting TPM format keys requires no public key operations), the next attempt to generate an ephemeral EC key for the ECDH security derivation will find the TPM2 provider first. So say you are doing a signature which requires HMAC security to guard against interposer tampering. The use of ECDH in the HMAC seed derivation will then call back into the provider to do an ECDH operation which also requires session security and will thus call back again into the provider ad infinitum (or at least until stack overflow). The only way to break out of this infinite recursion is to try to prime the cache with the default provider as well as the TPM2 provider, so the tss library functions can find the default provider first. The (absolutely dirty) hack I have to do this is inside the pkey decode function as

	if (alg == TPM_ALG_ECC) {
		EVP_PKEY_CTX *ctx = EVP_PKEY_CTX_new_id(EVP_PKEY_EC, NULL);
		EVP_PKEY_CTX_free(ctx);
	}

Which currently works to break the recursion loop. However it is an unreliable hack because internally the OpenSSL hash bucket implementation orders the method cache by provider address and since the TPM2 provider is dynamically loaded it has a higher address than the OpenSSL default one. However, this will not survive security techniques like Address Space Layout Randomization.

Conclusions

Hopefully I’ve given a rapid (and possibly useful) overview of converting an engine to a provider which will give some pointers about provider conversion to all the engine token implementations out there. Please feel free to repurpose my opensslmissing routines under the MIT licence without any obligations to get them back upstream (although I would be interested in hearing about bugs and feature enhancements). In the end, it was only 1152 lines of C to implement the TPM2 provider (additive on top of the common shared code base with the existing Engine) and 681 lines in opensslmissing, showing firstly that there is still an need for OpenSSL itself to do the missing routines as a provider export and secondly that it really takes a fairly small amount of provider code to wrapper an existing engine implementation provided you’re discriminating about what functions you actually provide. As a final remark I should note that the openssl_tpm2_engine has a fairly extensive test suite which all now pass with the provider implementation as well.

March 24, 2023 08:48 PM

Matthew Garrett: We need better support for SSH host certificates

Github accidentally committed their SSH RSA private key to a repository, and now a bunch of people's infrastructure is broken because it needs to be updated to trust the new key. This is obviously bad, but what's frustrating is that there's no inherent need for it to be - almost all the technological components needed to both reduce the initial risk and to make the transition seamless already exist.

But first, let's talk about what actually happened here. You're probably used to the idea of TLS certificates from using browsers. Every website that supports TLS has an asymmetric pair of keys divided into a public key and a private key. When you contact the website, it gives you a certificate that contains the public key, and your browser then performs a series of cryptographic operations against it to (a) verify that the remote site possesses the private key (which prevents someone just copying the certificate to another system and pretending to be the legitimate site), and (b) generate an ephemeral encryption key that's used to actually encrypt the traffic between your browser and the site. But what stops an attacker from simply giving you a fake certificate that contains their public key? The certificate is itself signed by a certificate authority (CA), and your browser is configured to trust a preconfigured set of CAs. CAs will not give someone a signed certificate unless they prove they have legitimate ownership of the site in question, so (in theory) an attacker will never be able to obtain a fake certificate for a legitimate site.

This infrastructure is used for pretty much every protocol that can use TLS, including things like SMTP and IMAP. But SSH doesn't use TLS, and doesn't participate in any of this infrastructure. Instead, SSH tends to take a "Trust on First Use" (TOFU) model - the first time you ssh into a server, you receive a prompt asking you whether you trust its public key, and then you probably hit the "Yes" button and get on with your life. This works fine up until the point where the key changes, and SSH suddenly starts complaining that there's a mismatch and something awful could be happening (like someone intercepting your traffic and directing it to their own server with their own keys). Users are then supposed to verify whether this change is legitimate, and if so remove the old keys and add the new ones. This is tedious and risks users just saying "Yes" again, and if it happens too often an attacker can simply redirect target users to their own server and through sheer fatigue at dealing with this crap the user will probably trust the malicious server.

Why not certificates? OpenSSH actually does support certificates, but not in the way you might expect. There's a custom format that's significantly less complicated than the X509 certificate format used in TLS. Basically, an SSH certificate just contains a public key, a list of hostnames it's good for, and a signature from a CA. There's no pre-existing set of trusted CAs, so anyone could generate a certificate that claims it's valid for, say, github.com. This isn't really a problem, though, because right now nothing pays attention to SSH host certificates unless there's some manual configuration.

(It's actually possible to glue the general PKI infrastructure into SSH certificates. Please do not do this)

So let's look at what happened in the Github case. The first question is "How could the private key have been somewhere that could be committed to a repository in the first place?". I have no unique insight into what happened at Github, so this is conjecture, but I'm reasonably confident in it. Github deals with a large number of transactions per second. Github.com is not a single computer - it's a large number of machines. All of those need to have access to the same private key, because otherwise git would complain that the private key had changed whenever it connected to a machine with a different private key (the alternative would be to use a different IP address for every frontend server, but that would instead force users to repeatedly accept additional keys every time they connect to a new IP address). Something needs to be responsible for deploying that private key to new systems as they're brought up, which means there's ample opportunity for it to accidentally end up in the wrong place.

Now, best practices suggest that this should be avoided by simply placing the private key in a hardware module that performs the cryptographic operations, ensuring that nobody can ever get at the private key. The problem faced here is that HSMs typically aren't going to be fast enough to handle the number of requests per second that Github deals with. This can be avoided by using something like a Nitro Enclave, but you're still going to need a bunch of these in different geographic locales because otherwise your front ends are still going to be limited by the need to talk to an enclave on the other side of the planet, and now you're still having to deal with distributing the private key to a bunch of systems.

What if we could have the best of both worlds - the performance of private keys that just happily live on the servers, and the security of private keys that live in HSMs? Unsurprisingly, we can! The SSH private key could be deployed to every front end server, but every minute it could call out to an HSM-backed service and request a new SSH host certificate signed by a private key in the HSM. If clients are configured to trust the key that's signing the certificates, then it doesn't matter what the private key on the servers is - the client will see that there's a valid certificate and will trust the key, even if it changes. Restricting the validity of the certificate to a small window of time means that if a key is compromised an attacker can't do much with it - the moment you become aware of that you stop signing new certificates, and once all the existing ones expire the old private key becomes useless. You roll out a new private key with new certificates signed by the same CA and clients just carry on trusting it without any manual involvement.

Why don't we have this already? The main problem is that client tooling just doesn't handle this well. OpenSSH has no way to do TOFU for CAs, just the keys themselves. This means there's no way to do a git clone ssh://git@github.com/whatever and get a prompt asking you to trust Github's CA. Instead, you need to add a @cert-authority github.com (key) line to your known_hosts file by hand, and since approximately nobody's going to do that there's only marginal benefit in going to the effort to implement this infrastructure. The most important thing we can do to improve the security of the SSH ecosystem is to make it easier to use certificates, and that means improving the behaviour of the clients.

It should be noted that certificates aren't the only approach to handling key migration. OpenSSH supports a protocol for key rotation, basically by allowing the server to provide a set of multiple trusted keys that the client can cache, and then invalidating old ones. Unfortunately this still requires that the "new" private keys be deployed in the same way as the old ones, so any screwup that results in one private key being leaked may well also result in the additional keys being leaked. I prefer the certificate approach.

Finally, I've seen a couple of people imply that the blame here should be attached to whoever or whatever caused the private key to be committed to a repository in the first place. This is a terrible take. Humans will make mistakes, and your systems should be resilient against that. There's no individual at fault here - there's a series of design decisions that made it possible for a bad outcome to occur, and in a better universe they wouldn't have been necessary. Let's work on building that better universe.

comment count unavailable comments

March 24, 2023 05:51 PM

March 13, 2023

Dave Airlie (blogspot): vulkan video vp9 decode - radv update

While going over the AV1 a few people commented on the lack of VP9 and a few people said it would be an easier place to start etc.

Daniel Almeida at Collabora took a first pass at writing the spec up, and I decided to go ahead and take it to a working demo level.

Lynne was busy, and they'd already said it should take an afternoon, so I decided to have a go at writing the ffmpeg side for it as well as finish off Daniel's radv code.

About 2 mins before I finished for the weekend on Friday, I got a single frame to decode, and this morning I finished off the rest to get at least 2 test videos I downloaded to work.

Branches are at [1] and [2]. There is only 8-bit support so far and I suspect some cleaning up is required.

[1] https://github.com/airlied/FFmpeg/tree/vulkan-vp9-decode

[2] https://gitlab.freedesktop.org/airlied/mesa/-/commits/radv-vulkan-video-decode-mesa-vp9


March 13, 2023 04:14 AM

March 10, 2023

Kernel Podcast: S2E3 – 2023/03/09

Season 2 – Episode 3 – 2023/03/09

Summary

The latest stable kernel is Linux 6.2.2, released by Greg Kroah-Hartman on March 3rd 2023. The latest mainline (development) kernel is 6.3-rc1, released by Linus on March 5th 2023.

Mathieu Desnoyers has announced Userspace RCU release 0.14.0 which adopts a baseline requirement of C99 and C++11, and introduces new APIs for C++.

Alejandro Colomar announced man-pages-6.03 is released. Among the “most notable changes” is “We now have a hyperlinked PDF book of the Linux man-pages”.

Junio C Hamano announced that “A release candidate Git v2.40.0-rc2 is now available”.

Takashi Sakamoto has stepped up to become the owner of the FireWire subsystem.

Linux 6.2 released

Linux 6.2 was released “right on (the extended) schedule” on February 19th following an extra RC (Release Candidate) motivated by the end of year holidays. Linus noted in his release announcement that “Nothing unexpected happened” toward the end of the cycle but there were a “couple of small things” on the regression side that Thorsten Leemhuis is tracking. Since “they weren’t actively pushed by maintainers…they will have to show up for stable [kernel releases]”.

Thorsten diligently followed up with his summary of regressions, noting that “There are still quite a few known issues from this cycle mentioned below. Afaics none of them affect a lot of people”. He also recently posted “docs: describe how to quickly build a trimmed kernel” as “that’s something users will often have to do when they want to report an issue or test proposed fixes”.

Among the fixes that will come in for stable is a build fix for those running Linux 6.2 on a Talos II (IBM POWER9) machine, who may notice an “undefined reference to ‘hash__tlb_flush” during kernel compilation. A fix is being tracked for backport to stable.

Speaking of regressions, Nick Bowler identified an older regression beginning in Linux 6.1 that caused “random crashes” on his SPARC machine. Peter Xu responded that it was likely a THP (Transparent Huge Page) problem, perhaps showing up because THP was disabled (which it was in Nick’s configuration). Nick tested a fix from Peter that seemed to address the issue.

As you’ll see below, ongoing discussions are taking place about the removal of various legacy architectures from the kernel. Another proposal recently made (by Christoph Hellwig) is to “orphan JFS” (the “Journalling File System”). Stefan Tibus was among those who stood up and claimed to still be “a happy user of JFS from quite early on all my Linux installations”.

Linux 6.3-rc1

Linus announced the closure of the merge window (the period of time during which disruptive changes are allowed to be merged into the kernel) with the release of Linux 6.3-rc1, noting, “So after several releases where the merge windows had something odd going on, we finally had just a regular “two weeks of just merge window”. It was quite nice. In fact, it was quite nice in a couple of ways: not only didn’t I have a huge compressed merge window where I felt I had to cram as much as possible into the first few days, but the fact that we _have_ had a couple of merge windows where I really asked for people to have everything ready when the merge window opened seems to have set a pattern: the bulk of everything really did come in early”.

As usual, Linux Weekly News has an excellent summary of part 1 and part 2 of the merge window (across two weeks). I encourage you to subscribe and read it for a full breakdown.

Ongoing Development

Linux 6.2 brought with it initial support for the Rust programming language. Development continues apace upsteam, with proposed patches extending the support to include new features. Miguel Ojeda (the Rust for Linux maintainer) posted a pull request for Linux 6.3, including support for various new types. Daniel Almeida recently posted “rust: virtio: add virtio support”, which “adds virtIO support to the rust crate. This includes the capability to create a virtIO driver (through the module_virtio_driver macro and the respective Driver trait)”.

And the work extends to the architectural level also, with Conor Dooley recently posting “RISC-V: enable rust”, which he notes is a “somewhat blind (and maybe foolish) attempt at enabling Rust for RISC-V. I’ve tested this on Icicle [a prominent board], and the modules seem to work. I’d like to play around with Rust on RISC-V, but I’m not interested in using downstream kernels, so figured I should try and see what’s missing…”.

But probably the most interesting development in Rust language land has noting to do with Rust as a language at all. Instead, it is a patch series titled “Rust DRM subsystem abstractions (& preview AGX driver)” from Asahi Lina. In the patch, Lina notes “This is my first take on the Rust abstractions from the DRM [graphics] subsystem. It includes the abstractions themselves, some minor prerequisite changes to the C side, as well as drm-asahi GPU driver (for reference on how the abstractions are used, but not necessarily intended to land together)”. It’s that last part, patch 18, the one titled “drm/asahi: Add the Asahi driver for Apple AGX GPUs” which we refer to here. In it, Lina implements support for the GPUs used by the Apple M1, M1 Pro, M1 Max, M1 Ultra, and the Apple M2 silicon. This is not a small driver and an interesting demonstration of the level of capability already being reached in terms of Linux upstream Rust language support.

Lokesh Gidra posted an “RFC for new feature to move pages from one vma to another without split” which allows one “anonymous” (not file backed) page (the fundamental granule size by which memory is managed and accounted) to be moved from part of a runtime heap (VMA) to another without otherwise impacting the state of the overall heap. The intended benefit is to managed runtimes with garbage collection, allowing for simplified “coarse-grained page-level compaction” garbage collection algorithms “wherein pages containing live objects are slid next to each other without touching them, while reclaiming in-between pages which contain only garbage”. The patch posting includes a lengthy writeup explaining the details.

Alison Schofield posted patches titled “CXL Poison List Retrieval & Tracing” targeting the CXL 3.0 specification, which allows OS management software to obtain a list of memory locations that have been poisoned (corrupted due to a RAS event, such as an ECC failure), for example in a “CXL.mem” DDR memory device attached to a system using the serial CXL interconnect.

Dexuan Cui noted that “earlyprintk=ttyS0” was broken on AMD SNP (Confidential Compute) guests running under KVM. This turned out to be due to a particular code branch taken during initialization that varied based upon whether a kernel was entered in 64-bit mode via EFI or through a direct (e.g. kexec/qemu KVM device modeling userspace) type of load.

Zhangjin Wu posted “Add dead syscalls elimination support” intended to remove support from the kernel for “dead” syscalls “which are not used in target system”. Presumably this is to benefit deeply embedded architectures where any excess memory used by the kernel is precious.

Nick Alcock posted “MODULE_LICENSE removals, first tranche” intended to “remove the LICENSE_MODULE usage from files/objects that are not tristate” [meaning that they are not actually setup to be used as modules to begin with].

Bobby Eshleman posted “vsock: add support for sockmap”. Bytedance are apparently “testing usage of vsock as a way to redirect guest-local UDS [Unix Domain Socket] requests to the host and this patch series greatly improves the performance of such a setup”. By 121% throughput.

Chih-En Lin posted version 4 of a patch series “Introduce Copy-On-Write to Page Table” which aims to add support for COW to the other half of the equation. Copy-on-Write is commonly used as an optimization whereby a cloned process (for example, during a fork used to exec a new program) doesn’t actually get a copy of the entire memory used by the original process. Instead, the tracking structures (page tables) are modified to mark all the pages in the new process as read only. Only when it attempts to write to the memory are the actual pages copied. The COW page table patches aim to do the same for the page tables themselves, so that full copies are not needed until the new address space is modified. Pulling off this trick requires that some of the tables are copied, but not the leaf (PTE) entries themselves, which are shared between the two processes. David Hildenbrand thanked Chih-En for the work, and the measurements, but expressed concern about “how intrusive even this basic deduplication approach already is”.

On the subject of page tables, Matthew (Willy) Wilcox posted version 3 of “New page table range API” that allows for setting up multiple page table entries at once, noting “The point of all this is better performance, and Fenwei Yin has measured improvement on x86”.

Architectures

Arm

Kristina Martsenko posted “arm64: support Armv8.8 memcpy instructions in userspace”, which adds support for (you guessed it) the memcpy instructions that were added in Armv8.8. These are described by the FEAT_MOPS documentation in the Arm ARM. As Kristina puts it, “The aim is to avoid having many different performance-optimal memcpy implementations in software (tailored to CPU model and copy size) and the overhead of selecting between them. The new instructions are intended to be at least as fast as any alternative instruction sequence”.

Various Apple Silicon patches have been posted. As the Asahi Linux project noted recently in “an update and reality check”, Linux “6.2 notably adds device trees and basic support for M1 Pro/Max/Ultra machines. However, there is still a long road before upstream kernels are usable on laptops”. Nonetheless, patches continue to fly, with the latest including “Apple M2 PMU support” from Janne Grunau, which notes that “The PMU itself appears to work in the same way as o[n] M1”, and support for the Broadcom BCM4387 WiFi chip used by Apple M1 platforms from Hector Martin. Hector also posted “Apple T2 platform support” patches.

Itanium

The Intel Itanium architecture, also known as “IA-64”, was originally announced on October 4th 1999. It was intended as the successor to another legacy architecture that Intel had previously introduced back in 1978. That legacy architecture (known as “x86”) had a number of design challenges that could limit its future scalability, but it was also quite popular, and there were a relatively large number of systems deployed. Nonetheless, Intel was determined to replace x86 with a modern architecture designed with the future in mind. Itanium was co-designed with Hewlett-Packard, who created the original ISA specification. It featured 128 64-bit general purpose registers, 128 floating point registers, 64-bit predicate registers, and more besides.

Itanium was a VLIW (Very Long Instruction Word) machine that leveraged fixed-width “bundles” of instructions that are each individually 41-bits, plus a 5-bit template describing which type of instructions are present in the bundle. The Itanium implementation of VLIW is referred to as “EPIC” (Explicitly Parallel Instruction Computing) – which one must be careful not to confuse with the highly successful x86 architecture implementation from AMD known as “EPYC”. In Itanium, modern high performance microprocessor innovations such as hardware speculative and Out-of-Order execution take a back seat to software managed speculation, requiring an extremely complicated compiler toolchain that took many years to develop. Even then, it was clear early on that software management of dependencies and speculation could not compete with a hardware implementation, such as that used by contemporary x86 and RISC CPUs.

Intel Itanium processors were officially discontinued in January of 2020. As Ard Biesheuvel noted across several patch postings attempting to remove or mark IA-64 as broken, various support for Itanium has already been removed from dependent projects (such as upstream Tianocore – the EFI implementation needed to boot such systems, from which Intel itself removed such support in 2018), “QEMU no longer implements support for it”, and given the lack of systems and ongoing firmware maintenance, “there is zero test coverage using actual hardware” (“beyond a couple of machines used by distros to churn out packages”). Even this author has long since decommissioned his Itanium system (named “Hamartia” after the tragic hero, which he acquired during the upstreaming of PCI support for Arm as both of Itanium’s users had expressed concern that Arm support for PCI might break Itanium and it thus seemed important to be able to test that this mission-critical architecture was not broken in the process).

As of this writing support for Itanium has not (yet) been removed from the kernel.

LoongArch

A lot of work is going into the LoongArch [aside: could someone please let me know how to pronounce it properly?]. Recent patches include a patch from Youling Tang (“Add support for kernel relocation”) that “allows to compile [the] kernel as PIE and to relocated it at any virtual address at runtime” (to “pave the way to KASLR”, added in a later patch). Another patch “Add hardware breakpoints/watchpoints support” does what it says on the tin. Finally, Tianrui Zhao posted “Add KVM LoongArch support”, which adds KVM support noting that the Loongson (the company behind the architecture) “3A5000” chip “supports hardware assisted virtualization”.

RISC-V

Evan Green posted “RISC-V: Add a syscall for HW probing” which started an extremely long discussion about the right (and wrong) ways to handle the myriad (sometimes mutually incompatible) extensions supported by the RISC-V community. Traditionally, architectures were quite standardized with a central authority providing curation. But while RISC-V does have the RISC-V International organization, and the concept of ratification for extensions with a standard set of extensions defined in various profiles, the practical reality is somewhat less rigid than folks may be used to. As a result, there are in fact a very wide range of implementations, and the kernel needs to somehow be able to handle all of the hundreds of permutations.

Most architectures handle minor variation between implementations using the “HWCAP” infrastructure and the “Auxiliary Vectors” which are special environment variables exported into every running process. This allows (e.g.) userspace software to quickly determine whether a particular feature is supported or not. For example, the feature might be some novel atomic or vector support that isn’t present in older processors. But when it comes to RISC-V this approach isn’t as easy. As Evan said in his posting, “We don’t have enough space for these all in ELF_HWCAP and there’s no system call that quite does this, so let’s just provide an arch-specific one to probe for hardware capabilities. This currently just provides m{arch,imp,vendor}id, but with the key-value pairs we can pass more in the future”.

The response was swift, and negative, with Greg Kroah-Hartman responding, “Ick, this is exactly what sysfs is designed to export in a sane way. Why not just use that instead? The “key” would be the filename, and the value the value read from the filename”. The response was that this would slow down future RISC-V systems because of the large number of file operations that every process would need to perform on startup in order for the standard libraries to figure out what features were supported or not. Worse, some of the infrastructure for file operations might not be available at the time when it would be needed. This situation is a good reminder of the importance of standardization and the value that it can bring to any modern architecture.

Speaking of standardization, several rounds of patches were posted titled “Add basic ACPI support for RISC-V” which “enables the basic ACPI infrastructure for RISC-V”. According to Sunil V L, who posted the patch series, “Supporting external interrupt controllers is in progress and hence it is tested using poll based HVC SBI console and RAM disk”.

Other patches recently posted for RISC-V include “Introduce virtual kernel mapping KASLR”. The patches note that “The seed needed to virtually move the kernel is taken from the device tree, so we rely on the bootloader to provide the correct seed”. Later patches may add support for the RISC-V “Zkr” random extension so that this can be provided by hardware instead. As a dependent patch, Alexandre Ghiti posted “Introduce 64b relocatable kernel”.

Deepak Gupta posted “riscv control-flow integrity for U mode” in which he notes he has “been working on linux support for shadow stack and landing pad instruction on riscv for a while. These are still RFC quality. But at least they’re in a shape which can start a discussion”. The RISC-V extension adding support for control flow integrity is called Zisslpcfi, which rolls off the tongue just as easily as all of the other extension names, chosen by cats falling on keyboards.

Jesse Taube posted “Add RISC-V 32 NOMMU support”, noting, “This patch-set aims to add NOMMU support to RV32. Many people want to build simple emulators or HDL models of RISC-V. [T]his patch makes it possible to run linux on them”.

Returning to the topic of incompatible vendor extensions, Heiko Stuebner posted “RISC-V: T-Head vector handling”, which notes “As is widely known, the T-Head C9xx cores used for example in the Allwinner D1 implement an older non-ratifed variant of the vector spec. While userspace will probably have a lot more problems implementing support for both, on the kernel side the needed changes are actually somewhat small’ish and can be handled via alternatives somewhat nicely. With this patchset I could run the same userspace program (picked from some riscv-vector-test repository) that does some vector additions on both qemu and a d1-nezha board. On both platforms it ran successfully and even produced the same results”.

Super-H

Returning to the subject of dying architectures once again, an attempt was made by Christoph Hellwig to “Drop arch/sh and everything that depends on it” since “all of the support has been barely maintained for almost 10 years, and not at all for more than 1 year”. Geert Uytterhoeven noted that “The main issue is not the lack of people sending patches and fixes, but those patches never being applied by the maintainers. Perhaps someone is willing to stand up to take over maintainership?” This caused John Paul Adrian Glaubitz to raise his hand and say he “actually would be silling to do it but I’m a bit hesitant as I’m not 100% sure my skills are sufficient”. Rob Landley offered to help out too. It seems sh might survive this round.

x86

Mathieu Desnoyers was interested in formal documentation from Intel concerning concurrent modification of code while it is executing (specifically, updating instructions to patch them as calling a debug handler via “INT3”). He wrote to Peter Anvin saying “I have emails from you dating from a few years back unofficially stating that it’s OK to update the first byte of an instruction with a single-byte int3 concurrently…Olivier Dion is working on the libpatch project aiming to use this property for low-latency/low-overhead live code patching in user-space as well, but we cannot find an official statement from Intel that guarantees this breakpoint-bypass technique is indeed OK without stopping the world while patching”. Steven Rostedt was among those who noted “The fact that we have been using it for over 10 years without issue should be a good guarantee”. Mathieu was able to find comprehensive documentation in the AMD manual that allows it, but noted again “I cannot find anything with respect to asynchronous cross-modification of code stated as clearly in Intel’s documentation”. Anyone want to help him?

Development continues toward implementing support for “Flexible Return and Event Delivery” aka “FRED” on Intel architecture. Among the latest patches, Ammar Faizi includes a fix to the “sysret_rip” selftest that handles the fact that FRED’s “syscall” instruction (to enter the kernel from userspace) no longer clobbers (overwrites) the x86 “rcx” and “r11” registers. On the subject of tests, Mingwei Zhang posted patches updated the “amx_test” suite to add support for several of the “new entities” that are present in Intel’s AMX (matrix extension) architecture.

Sean Christopherson posted “KVM: x86: Add “governed” X86_FEATURE framework”, which is intended to “manage and cache KVM-governed features, i.e. CPUID based features that require explicit KVM enabling and/or need to be queried semi-frequently by KVM”. According to Sean, “The idea originally came up in the context of the architectural LBRs [Last Branch Record, a profiling mechanism to record precisely the last N branched] series as a way to avoid querying guest CPUID in hot paths without needing a dedicated flag, but as evidenced by the shortlog, the most common usage is to handle the ever-growing list of SVM [Shared Virtual Memory] that are exposed to L1”. Reducing calls to CPUID is generally a good thing since it results in a (possibly lengthy) trap into microcode, and is also a context serializing instruction.

Paolo Bonzini posted “Cross-Thread Return Address Predictions vulnerability”, noting that “Certain AMD processors are vulnerable to a cross-thread return address predictions bug. When running in SMT [Simultaneous Multi-Threading] mode and one of the sibling threads transitions out of C0 state, the other thread gets access to twice as many entries in the RSB [Return Stack Buffer], but unfortunately the predictions of the now-halter logical processor are not purged”. Paolo is referring to the fact that x86 processors include two logical “threads” (which Intel calls “Hyperthreads” – a trademarked name – and more generally are known as SMT or Simultaneous Multi-Thread). Most modern x86 processors include an optimization that when one logical thread is not being used and the other transitions into what software sees as a “low power” state, what actually happens is that the partitioned resources as given to the other thread, which consequently sees a boost in performance as it is no longer contending on the back end for execution units, and now has double the store buffer and predictor entries.

But in this case, the RSB [Return Stack Buffer] entries are not zeroed out in the process, meaning that it is possible for a malicious thread to “train” the RSB predictor later used by the peer thread to guess that certain function call return paths will be used. This opens up an opportunity to cause a sibling thread to speculatively execute down a wrong path that leaves cache breadcrumbs that can be measured in order to potentially leak certain information. Paolo addresses this by adding a KVM (hypervisor) parameter that “if set, will prevent the user from disabling the HLT, MWAIT, and CSTATE exits”, preventing the other thread from preventing the hypervisor from stuffing the RSB with dummy safe values when the sibling thread goes to sleep.

Dionna Glaze posted “Add throttling detection to sev-guest”, noting that “The guest request synchronous API from SEV-SNP [AMD’s Confidential Computing feature] to the host’s security processor consumes a global resource. For this reason, AMD’s docs recommend that the host implements a throttling mechanism. In order for the guest to know it’s been throttled and should try its request again, we need some good-faith communication from the host that the request has been throttled. These patches work with the existing dev/sev-guest ABI”. 

On the subject of Confidential Compute, Kai Huang posted version 9 of a patch series “TDX host kernel support” aiming to add support for Intel’s TDX Confidential Compute extensions, while Jeremi Piotrowski posted “Support nested SNP KVM guests on Hyper-V” intending to add support for nested (hypervisor inside hypervisor) support for AMD’s Confidential Compute to the Hyper-V hypervisor as used by Microsoft Azure. Nested Confidential Compute sounds fun.

Rick Edgecombe posted version 6 of “Shadow stacks for userspace”, a series that “implements Shadow Stacks for userspace using x86’s Control-flow Enforcement Technology (CET)”. As he reminds us, CET supports both shadow stacks and indirect branch tracking (landing pads), but these patches “implements just the shadow stack part of this feature, and just for userspace”.

Michael S. Tsirkin posted “revert RNG seed mess” noting “All attempts to fix up passing RNG [random entropy] seed via setup_data entry failed. Let’s just rip out all of it. We’ll start over”.

Arnd Bergmann posted “x86: make 64-bit defconfig the default” noting that 32-bit kernel builds were “rarely what anyone wants these days”. The patch changes “the default so that the 64-bit config gets used unless the user asked for i686_defconfig, uses ARCH=i386 or runs on a system that “uname -m” identifies as i386/i486/i586/i686”.

March 10, 2023 04:41 AM

March 01, 2023

Brendan Gregg: USENIX SREcon APAC 2022: Computing Performance: What's on the Horizon

At USENIX SREcon22 APAC I gave the opening keynote on the future of computer performance, rounding up the latest developments and making predictions of where I see things heading. This talk originated from my updates to [Systems Performance 2nd Edition], and this was the first time I've given this talk in person! The video is now on [YouTube]:

The slides are [online] and as a [PDF]:
/ permalink/zoom
In Q&A I was asked about CXL (compute express link) which was fortunate as I had planned to cover it and then forgot, so the question let me talk about it (although Q&A is missing from the video). CXL in a way allows a custom memory controller to be added to a system, to increase memory capacity, bandwidth, and overall performance. My personal opinion is that I don't see a widespread need for more capacity given horizontal scaling and servers that can already exceed 1 Tbyte of DRAM; bandwidth is also helpful, but I'd be concerned about the increased latency for adding a hop to more memory. So it's interesting, but I don't think they have the killer use case for it yet. ## Realizing and exceeding a lifelong dream I began my tech career as a junior Unix sysadmin in Newcastle, NSW, Australia, in 1999, with no connection to the exciting world of tech in Silicon Valley, New York, or even nearby Sydney. As I was determined to become great at my new occupation regardless of my location, I read every sysadmin book, article, and magazine I could find on the shelf. This included SysAdmin magazine, which contained articles from various experts including Amy Rich, and a couple of advertisements: One was to submit your own articles to the magazine for publication (by writing to the editor, Rikki Endsley) and another was to attend USENIX conferences in the US and learn directly from the experts! I made both of these my goals, even though I'd never been published before and I'd never been to the US. Or even on a plane. I didn't end up getting published in SysAdmin directly, but my performance work did make it as a feature article (thanks Matty). As for attending USENIX conferences: I finally started attending and speaking at them in 2010 when a community manager encouraged me to (thanks Deirdre Straughan), and since then I've met many friends and connections, including Amy who is now USENIX President, and Rikki with whom I co-chaired the USENIX LISA18 conference. USENIX has been a great help to my career and my employers, and I hope it is just as helpful for you. It's an important vendor-neutral space to share the latest in technology. And now, helping bring USENIX conferences to Australia by giving the first keynote: I could not have scripted or expected it. It was a great privilege. ## SREcon 2023 CFP Tech moves fast, however, and I have little time to reflect on 2022 when there's 2023 to plan! I'm now program co-chair for SREcon 2023 APAC, and our 2023 conference is June 14-16 in Singapore. The call for participation ends on March 2nd 23:59 SGT! That's about 24 hours from now! ## References I've reproduced the references from my SREcon22 keynote below, so you can click on links: - [Gregg 08] Brendan Gregg, “ZFS L2ARC,” http://www.brendangregg.com/blog/2008-07-22/zfs-l2arc.html, Jul 2008 - [Gregg 10] Brendan Gregg, “Visualizations for Performance Analysis (and More),” https://www.usenix.org/conference/lisa10/visualizations-performance-analysis-and-more, 2010 - [Greenberg 11] Marc Greenberg, “DDR4: Double the speed, double the latency? Make sure your system can handle next-generation DRAM,” https://www.chipestimate.com/DDR4-Double-the-speed-double-the-latencyMake-sure-your-system-can-handle-next-generation-DRAM/Cadence/Technical-Article/2011/11/22, Nov 2011 - [Hruska 12] Joel Hruska, “The future of CPU scaling: Exploring options on the cutting edge,” https://www.extremetech.com/computing/184946-14nm-7nm-5nm-how-low-can-cmos-go-it-depends-if-you-ask-the-engineers-or-the-economists, Feb 2012 - [Gregg 13] Brendan Gregg, “Blazing Performance with Flame Graphs,” https://www.usenix.org/conference/lisa13/technical-sessions/plenary/gregg, 2013 - [Shimpi 13] Anand Lal Shimpi, “Seagate to Ship 5TB HDD in 2014 using Shingled Magnetic Recording,” https://www.anandtech.com/show/7290/seagate-to-ship-5tb-hdd-in-2014-using-shingled-magnetic-recording, Sep 2013 - [Borkmann 14] Daniel Borkmann, “net: tcp: add DCTCP congestion control algorithm,” https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e3118e8359bb7c59555aca60c725106e6d78c5ce, 2014 - [Macri 15] Joe Macri, “Introducing HBM,” https://www.amd.com/en/technologies/hbm, Jul 2015 - [Cardwell 16] Neal Cardwell, et al., “BBR: Congestion-Based Congestion Control,” https://queue.acm.org/detail.cfm?id=3022184, 2016 - [Gregg 16] Brendan Gregg, “Unikernel Profiling: Flame Graphs from dom0,” http://www.brendangregg.com/blog/2016-01-27/unikernel-profiling-from-dom0.html, Jan 2016 - [Gregg 16b] Brendan Gregg, “Linux BPF Superpowers,” https://www.brendangregg.com/blog/2016-03-05/linux-bpf-superpowers.html, 2016 - [Alcorn 17] Paul Alcorn, “Seagate To Double HDD Speed With Multi-Actuator Technology,” https://www.tomshardware.com/news/hdd-multi-actuator-heads-seagate,36132.html, 2017 - [Alcorn 17b] Paul Alcorn, “Hot Chips 2017: Intel Deep Dives Into EMIB,” https://www.tomshardware.com/news/intel-emib-interconnect-fpga-chiplet,35316.html#xenforo-comments-3112212, 2017 - [Corbet 17] Jonathan Corbet, “Two new block I/O schedulers for 4.12,” https://lwn.net/Articles/720675, Apr 2017 - [Gregg 17] Brendan Gregg, “AWS EC2 Virtualization 2017: Introducing Nitro,” http://www.brendangregg.com/blog/2017-11-29/aws-ec2-virtualization-2017.html, Nov 2017 - [Russinovich 17] Mark Russinovich, “Inside the Microsoft FPGA-based configurable cloud,” https://www.microsoft.com/en-us/research/video/inside-microsoft-fpga-based-configurable-cloud, 2017 - [Gregg 18] Brendan Gregg, “Linux Performance 2018,” http://www.brendangregg.com/Slides/Percona2018_Linux_Performance.pdf, 2018 - [Hady 18] Frank Hady, “Achieve Consistent Low Latency for Your Storage-Intensive Workloads,” https://www.intel.com/content/www/us/en/architecture-and-technology/optane-technology/low-latency-for-storage-intensive-workloads-article-brief.html, 2018 - [Joshi 18] Amit Joshi, et al., “Titus, the Netflix container management platform, is now open source,” https://netflixtechblog.com/titus-the-netflix-container-management-platform-is-now-open-source-f868c9fb5436, Apr 2018 - [Cutress 19] Dr. Ian Cutress, “Xilinx Announces World Largest FPGA: Virtex Ultrascale+ VU19P with 9m Cells,” https://www.anandtech.com/show/14798/xilinx-announces-world-largest-fpga-virtex-ultrascale-vu19p-with-9m-cells, Aug 2019 - [Gallatin 19] Drew Gallatin, “Kernel TLS and hardware TLS offload in FreeBSD 13,” https://people.freebsd.org/~gallatin/talks/euro2019-ktls.pdf, 2019 - [Bearman 20] Ian Bearman, “Exploring Profile Guided Optimization of the Linux Kernel,” https://linuxplumbersconf.org/event/7/contributions/771, 2020 - [Burnes 20] Andrew Burnes, “GeForce RTX 30 Series Graphics Cards: The Ultimate Play,” https://www.nvidia.com/en-us/geforce/news/introducing-rtx-30-series-graphics-cards, Sep 2020 - [Charlene 20] Charlene, “800G Is Coming: Set Pace to More Higher Speed Applications,” https://community.fs.com/blog/800-gigabit-ethernet-and-optics.html, May 2020 - [Cutress 20] Dr. Ian Cutress, “Insights into DDR5 Sub-timings and Latencies,” https://www.anandtech.com/show/16143/insights-into-ddr5-subtimings-and-latencies, Oct 2020 - [Ford 20] A. Ford, et al., “TCP Extensions for Multipath Operation with Multiple Addresses,” https://datatracker.ietf.org/doc/html/rfc8684, Mar 2020 - [Gregg 20] Brendan Gregg, “Systems Performance: Enterprise and the Cloud, Second Edition,” Addison-Wesley, 2020 - [Hruska 20] Joel Hruska, “Intel Demos PCIe 5.0 on Upcoming Sapphire Rapids CPUs,” https://www.extremetech.com/computing/316257-intel-demos-pcie-5-0-on-upcoming-sapphire-rapids-cpus, - Oct 2020 - [Liu 20] Linda Liu, “Samsung QVO vs EVO vs PRO: What’s the Difference? [Clone Disk],” - https://www.partitionwizard.com/clone-disk/samsung-qvo-vs-evo.html, 2020 - [Moore 20] Samuel K. Moore, “A Better Way to Measure Progress in Semiconductors,” https://spectrum.ieee.org/semiconductors/devices/a-better-way-to-measure-progress-in-semiconductors, Jul 2020 - [Peterson 20] Zachariah Peterson, “DDR5 vs. DDR6: Here's What to Expect in RAM Modules,” https://resources.altium.com/p/ddr5-vs-ddr6-heres-what-expect-ram-modules, Nov 2020 - [Salter 20] Jim Salter, “Western Digital releases new 18TB, 20TB EAMR drives,” https://arstechnica.com/gadgets/2020/07/western-digital-releases-new-18tb-20tb-eamr-drives, Jul 2020 - [Spier 20] Martin Spier, Brendan Gregg, et al., “FlameScope,” https://github.com/Netflix/flamescope, 2020 - [Tolvanen 20] Sami Tolvanen, Bill Wendling, and Nick Desaulniers, “LTO, PGO, and AutoFDO in the Kernel,” Linux Plumber’s Conference, https://linuxplumbersconf.org/event/7/contributions/798, 2020 - [Vega 20] Juan Camilo Vega, Marco Antonio Merlini, Paul Chow, “FFShark: A 100G FPGA Implementation of BPF Filtering for Wireshark,” IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2020 - [Warren 20] Tom Warren, “Microsoft reportedly designing its own ARM-based chips for servers and Surface PCs,” https://www.theverge.com/2020/12/18/22189450/microsoft-arm-processors-chips-servers-surface-report, Dec 2020 - [Alcorn 21] Paul Alcorn, “Intel Shares Alder Lake Pricing, Specs and Gaming Performance: $589 for 16 Cores,” https://www.tomshardware.com/features/intel-shares-alder-lake-pricing-specs-and-gaming-performance, Oct 2021 - [Cutress 21] Ian Cutress, “AMD Demonstrates Stacked 3D V-Cache Technology: 192 MB at 2 TB/sec,” https://www.anandtech.com/show/16725/amd-demonstrates-stacked-vcache-technology-2-tbsec-for-15-gaming, May 2021 - [Google 21] Google, “Cloud TPU,” https://cloud.google.com/tpu, 2021 - [Haken 21] Michael Haken, et al., “Delta Lake 1S Server Design Specification 1v05, https://www.opencompute.org/documents/delta-lake-1s-server-design-specification-1v05-pdf, 2021 - [Intel 21] Intel corporation, “Intel® OptaneTM Technology,” https://www.intel.com/content/www/us/en/products/docs/storage/optane-technology-brief.html, 2021 - [Kostovic 21] Aleksandar Kostovic, “Esperanto Delivers Kilocore Processor in its Supercomputer-on-a-Chip Design,” https://www.tomshardware.com/news/esperanto-kilocore-processor, Aug 2021 - [Kummrow 21] Patricia Kummrow, “The IPU: A New, Strategic Resource for Cloud Service Providers,” https://itpeernetwork.intel.com/ipu-cloud/#gs.g5pkub, Aug 2021 - [Quach 21a] Katyanna Quach, “Global chip shortage probably won't let up until 2023, warns TSMC: CEO 'still expects capacity to tighten more',” https://www.theregister.com/2021/04/16/tsmc_chip_forecast, Apr 2021 - [Quach 21b] Katyanna Quach, “IBM says it's built the world's first 2nm semiconductor chips,” https://www.theregister.com/2021/05/06/ibm_2nm_semiconductor_chips, May 2021 - [Ridley 21] Jacob Ridley, “IBM agrees with Intel and TSMC: this chip shortage isn't going to end anytime soon,” https://www.pcgamer.com/ibm-agrees-with-intel-and-tsmc-this-chip-shortage-isnt-going-to-end-anytime-soon, May 2021 - [Shilov 21] Anton Shilov, “Samsung Develops 512GB DDR5 Module with HKMG DDR5 Chips,” https://www.tomshardware.com/news/samsung-512gb-ddr5-memory-module, Mar 2021 - [Shilov 21b] Anton Shilov, “Seagate Ships 20TB HAMR HDDs Commercially, Increases Shipments of Mach.2 Drives,” https://www.tomshardware.com/news/seagate-ships-hamr-hdds-increases-dual-actuator-shipments, 2021 - [Shilov 21c] Anton Shilov, “SK Hynix Envisions 600-Layer 3D NAND & EUV-Based DRAM,” https://www.tomshardware.com/news/sk-hynix-600-layer-3d-nand-euv-dram, Mar 2021 - [SuperMicro 21] SuperMicro, “B12SPE-CPU-25G (For SuperServer Only),” https://www.supermicro.com/en/products/motherboard/B12SPE-CPU-25G, 2021 - [Thaler 21] Dave Thaler, Poorna Gaddehosur, “Making eBPF work on Windows,” https://cloudblogs.microsoft.com/opensource/2021/05/10/making-ebpf-work-on-windows, May 2021 - [TornadoVM 21] TornadoVM, “TornadoVM Run your software faster and simpler!” https://www.tornadovm.org, 2021 - [Trader 21] Tiffany Trader, “Cerebras Second-Gen 7nm Wafer Scale Engine Doubles AI Performance Over First-Gen Chip ,” https://www.enterpriseai.news/2021/04/21/latest-cerebras-second-gen-7nm-wafer-scale-engine-doubles-ai-performance-over-first-gen-chip, Apr 2021 - [Ghigoff 21] Yoann Ghigoff, et al., "BMC: Accelerating Memcached using Safe In-kernel Caching and Pre-stack Processing," Proceedings of the 18th USENIX Symposium on Networked Systems Design and Implementation, https://www.usenix.org/system/files/nsdi21-ghigoff.pdf, Apr 2021 - [Tyson 21] Mark Tyson, "Intel Sapphire Rapids utillises tiled, modular SoC architecture," https://hexus.net/tech/news/cpu/148266-intel-sapphire-rapids-utillises-tiled-modular-soc-architecture/, Aug 2021 - [Vahdat 21] Amin Vahdat, “The past, present and future of custom compute at Google,” https://cloud.google.com/blog/topics/systems/the-past-present-and-future-of-custom-compute-at-google, Mar 2021 - [Wikipedia 21] “Semiconductor device fabrication,” https://en.wikipedia.org/wiki/Semiconductor_device_fabrication, 2021 - [Wikipedia 21b] “Silicon,” https://en.wikipedia.org/wiki/Silicon, 2021 - [ZonedStorage 21] Zoned Storage, “Zoned Namespaces (ZNS) SSDs,” https://zonedstorage.io/introduction/zns, 2021 - [Cutress 21b] Dr. Ian Cutress, Andrei Frumusanu, "The Intel 12th Gen Core i9-12900K Review: Hybrid Performance Brings Hybrid Complexity," https://www.anandtech.com/show/17047/the-intel-12th-gen-core-i912900k-review-hybrid-performance-brings-hybrid-complexity, Nov 2021 - [Nash 22] Paul Nash, "Now in preview: Azure Virtual Machines with Ampere Altra Arm-based processors," https://azure.microsoft.com/en-us/blog/now-in-preview-azure-virtual-machines-with-ampere-altra-armbased-processors/, Apr 2022 - [Bonshor 22] Gavin Bonshor, "AMD Releases Milan-X CPUs With 3D V-Cache," https://www.anandtech.com/show/17323/amd-releases-milan-x-cpus-with-3d-vcache-epyc-7003, Mar 2022 - [Mann 22] Tobias Mann, "Why Intel killed its Optane memory business," https://www.theregister.com/2022/07/29/intel_optane_memory_dead/, Jul 2022 - [Torvalds 22] Linus Torvalds, “Linux 6.0-rc1,” https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=568035b01cfb107af8d2e4bd2fb9aea22cf5b868, Aug 2022 - [Whalen 22] Jeanne Whalen, "Biden’s visit shows high stakes of $20 billion Ohio chip factory," https://www.washingtonpost.com/us-policy/2022/09/09/biden-intel-ohio-chip-factory/, Sep 2022 - [Robinson 22] Dan Robinson, "Intel has a secret club in the cloud for devs to try out new chips – and you ain't in it," https://www.theregister.com/2022/09/28/intel_developer_cloud/, Sep 2022 - [CloudHypervisor 22] Cloud Hypervisor Project (Linux Foundation), "Run Cloud Virtual Machines Securely and Efficiently," https://www.cloudhypervisor.org, accessed 2022 - [Cerebras 22] Cerebras, "Cerebras Wafer-Scale Cluster," https://www.cerebras.net/product-cluster, accessed 2022 - [GrafanaLabs 22] Grafana Labs, "https://grafana.com/docs/grafana/latest/panels-visualizations/visualizations/flame-graph/," https://grafana.com/docs/grafana/latest/panels-visualizations/visualizations/flame-graph/, accessed 2022 - [Pirzada 22] Usman Pirzada, "Intel Announces The Worlds First x86 CPU With HBM Memory: Xeon Max ‘Sapphire Rapids’ Data Center CPU," https://wccftech.com/intel-announces-the-worlds-first-x86-cpu-with-hbm-memory-xeon-max-sapphire-rapids-data-center-cpu/, Nov 2022 - [Smith 22] Lyle Smith, "Samsung PM1743 PCIe Gen5 SSD First Take Review," https://www.storagereview.com/review/samsung-pm1743-pcie-gen5-ssd-first-take-review, Jan 2022 - [Barr 22] Jeff Barr, "New Amazon EC2 Instance Types In the Works – C7gn, R7iz, and Hpc7g," https://aws.amazon.com/blogs/aws/new-amazon-ec2-instance-types-in-the-works-c7gn-r7iz-and-hpc7g, Nov 2022 - [Gooding 22] Matthew Gooding, "TSMC’s US fab will make 4nm chips for Apple, AMD and Nvidia," https://techmonitor.ai/technology/silicon/tsmcs-arizona-apple-amd-nvidia, Dec 2022 - [Liu 22] Zhiye Liu, "Smuggler Hid Over 200 Alder Lake CPUs in Fake Silicone Belly," https://www.tomshardware.com/news/smuggler-hid-over-200-alder-lake-cpus-in-fake-silicone-belly, Dec 2022 - [Seagate 22] Seagato, "Exos X Series," https://www.seagate.com/au/en/products/enterprise-drives/exos-x/, accessed 2022 - [Mann 22b] Tobais Mann, "Nvidia not cutting it? Google and Amazon’s latest AI chips have arrived," https://www.theregister.com/2022/10/11/google_amazon_ai_chips_nvidia/, Oct 2022 - [Intel 22] Intel, "Intel® Developer Cloud," https://www.intel.com/content/www/us/en/developer/tools/devcloud/overview.html, accessed Dec 2022 I've taken care to cite the author names along with the talk titles and dates, including for Internet sources, instead of the common practice of just listing URLs. I followed that practice when writing some earlier books, and it has since struck me as unfair that some references had author names and some didn't. Nowadays I always include full names when known. [YouTube]: https://www.youtube.com/watch?v=zGSQdN2X_k0 [PDF]: /Slides/SREcon2022_ComputingPerformance.pdf [online]: /Slides/SREcon2022_ComputingPerformance/ [Systems Performance 2nd Edition]: /systems-performance-2nd-edition-book.html

March 01, 2023 12:00 AM

February 17, 2023

Greg Kroah-Hartman: About

Greg is a Fellow at the Linux Foundation and is responsible for the Linux kernel stable releases. He is also the maintainer of a variety of different kernel subsystems (USB, char/misc, tty/serial, driver core, staging, etc.) and has written a few books and articles about Linux kernel development. Wikipedia entry that is usually very out of date reddit AMA in 2020 reddit AMA in 2015 Great’s setup back in 2012

February 17, 2023 03:56 PM

Brendan Gregg: USENIX SREcon APAC 2023: CFP

USENIX's SREcon conference is the best venue for learning the latest in systems engineering (not just site reliability engineering) and if you have useful production stories and takeaways to share -- especially if you are in the Asia/Pacific region -- please consider submitting a talk proposal to [SREcon APAC 2023]. The [call for participation] ends on March 2nd, only two weeks away. It is held this year in Singapore, June 14-16, and I'm excited to be program co-chair with fellow Aussie [Jamie Wilkinson]. To quote from our CFP:

You build computer platforms, debug them, and support them, and you have learned something useful to share: You are invited to submit proposals to give talks at SREcon23 Asia/Pacific, which welcomes speakers from a variety of backgrounds, not just SRE, and from a variety of different-sized companies, not just those that are technology-focused. Your insights will help create a relevant, diverse, and inclusive program. Conversations are never complete when they focus just on successes; we encourage talks that focus on lessons learned from failures or hard problems.

At the seventh SREcon Asia/Pacific, we are especially seeking the deepest engineering talks: Those that cover gritty technical internals, advanced tools and techniques, and complex problems that may matter to others, whether your solutions were elegant, ugly, or unsuccessful.

We look forward to learning from speakers across the SRE and systems engineering space. This year we particularly welcome new speakers; many of our best talks have come from people with new perspectives to share and the last few years most certainly has given us all new experiences and stories we can share and from which we can learn.

At every SREcon globally, we welcome and encourage participation from all individuals in any country, including people that are underrepresented in, or excluded from, technology, including but not limited to: people of all colours, women, LGBTQ people, people with disabilities, neurodiverse participants, students, veterans, and others with unique characteristics.

Similarly, we welcome participants from diverse professional roles: QA testers, performance engineers, security teams, OS engineers, DBAs, network administrators, compliance experts, UX designers, government employees, and data scientists. Regardless of who you are or the job title you hold, if you are a technologist who faces unique challenges and shares our areas of interest, we encourage you to be a part of SREcon23 Asia/Pacific.
More details are or the [CFP page]. We're especially looking to highlight local content from the Asia/Pacific region. SREcon usually has good representation of the FAANGs (Facebook, Amazon, Apple, Netflix, Google, etc.) but I think every company has interesting stories to share. I'd love to see deeply technical talks about production incidents, where the approaches and tools used can be widely applied by others. And I'd love to hear about the ones that got away, where current tooling and approaches are insufficient. For many years I've helped with the USENIX LISA conference, which was the premier event for production systems engineering. In recent years attendees have moved to SREcon, which covers similar topics, and [LISA 2021] may have been the last time LISA runs as a stand-alone conference (see [LISA made LISA obsolete]). I think LISA still exists as the people of the LISA community, many of whom are now involved with SREcon where our engineering principles live on and develop further. If you've never been to an SREcon, check out the [SREcon APAC 2022] conference page, which links to slides and videos of the talks. [SREcon APAC 2022]: https://www.usenix.org/conference/srecon22apac/program [call for participation]: https://www.usenix.net/conference/srecon23apac/call-for-participation [CFP page]: https://www.usenix.net/conference/srecon23apac/call-for-participation [SREcon APAC 2023]: https://www.usenix.net/conference/srecon23apac [LISA 2021]: https://www.usenix.org/conference/lisa21 [LISA made LISA obsolete]: https://www.usenix.org/publications/loginonline/lisa-made-lisa-obsolete-thats-compliment [Jamie Wilkinson]: https://www.linkedin.com/in/jamie-wilkinson-8043553/

February 17, 2023 12:00 AM

February 13, 2023

Kernel Podcast: S2E2 – 2023/02/12

Summary

The latest stable kernel is Linux 6.1.11, released by Greg K-H on February 9th 2023.

The latest mainline (development) kernel is 6.2-rc7, released on February 5th 2023.

Linux 6.2 progress

A typical kernel development cycle begins with the “merge window” (period of time during which disruptive changes are allowed to be merged into the kernel) followed by a series of (weekly) Release Candidate (RC) kernels, and then the final release. In most cases, RC7 is the final RC, but it is not all that unusual to have an extra week, as is likely the case this time around. Linus said a few weeks ago, “I am expecting to do an rc8 this release regardless, just because we effectively had a lost week or two in the early rc’s”, and indeed fixes for RC8 were still coming in as recently as today. We should at this rate see RC8 tomorrow (Sunday is the normal release day), and the 6.3 merge window in another week, meaning we’ll cover the 6.3 merge window in the next edition of this podcast. In the meantime, I encourage listeners to consider subscribing and supporting LWN (Linux Weekly News), who always have a great merge window summary.

Confidential Compute (aka “CoCo”)

If there were a “theme of the moment” for the industry (other than layoffs), it would probably be Confidential Compute. It seems one can’t go more than 10 minutes without seeing a patch for some new confidential compute feature in one of the major architectures, or the system IP that goes along with it. Examples in just the past few weeks (and which we’ll cover in a bit) include patches from both Intel (TDX) and AMD (SEV-SNP) for their Confidential Compute solutions, as well as PCI pass-through support in Hyper-V for Confidential VMs. At the same time, thought is going into revising the kernel’s “threat model” to update it for a world of Confidential Compute. 

A fundamental tenet of Confidential Compute is that guests no longer necessarily have to trust the hypervisor on which they are running, and quite possibly also don’t trust the operator of the system either (whether a cloud, edge network, OEM, etc.). The theory goes that you might even have a server sitting in some (less than friendly) geographical location but still hold out a certain amount of trust for your “confidential” workloads based on properties provided by the silicon (and attested by introspecting the other physical and emulated devices provided by the system). In this model, you necessarily have to trust the silicon vendor, but maybe not much beyond that.

Elana Rehetova (Intel) posted “Linux guest kernel threat model for Confidential Computing” in which she addressed Greg Kroah-Hartman (“Greg K-H”), who apparently previously requested “that we ought to start discussing the updated threat model for kernel”. She had links to quite detailed writeups on Intel’s github. Greg replied to a point about not trusting the hypervisor with “That is, frankly, a very funny threat model. How realistic is it really given all of the other ways that a hypervisor can mess with a guest?”. And that did indeed used to be a good point. Some of the earlier attempts at Confidential Compute included architectural designs in which guest registers were not protected against single step debug (and introspection) from a hypervisor, for example. And so one can be forgiven for thinking that there are some fundamental gaps, but a lot has changed over the past few years, and the architectures have advanced quite a bit since.

Greg also noted that he “hate[s] the term “hardening”” (as applied to “hardening” device drivers against malicious hardware implementations (as opposed to just potentially buggy ones). He added, “Please just say it for what it really is, “fixing bugs to handle broken hardware”.  We’ve done that for years when dealing with PCI and USB and even CPUs doing things that they shouldn’t be doing. How is this any different in the end? So what you also are saying here now is “we do not trust any PCI devices”, so please just say that (why do you trust USB devices?)  If that is something that you all think that Linux should support, then let’s go from there.” David Alan Gilbert piled on with some context around Intel, and AMD’s implementations, and in particular that more than mere memory encryption is used; register state, guest VMSA (control), etc. – all of that and much more – is carefully managed under the new world order.

Daniel Berrange further clarified, in response to a discussion about deliberately malicious implementations of PCI and USB controllers, that, “As a baseline requirement, in the context of confidential computing the guest would not trust the hypervisor with data that needs to remain confidential, but would generally still expect it to provide a faithful implementation of a given device.” A lot of further back and forth took place with others piling on comments indicating a few folks weren’t aware of the different technical pieces involved (e.g. PCI IDE, CMA, DOE, SPDM and other acronyms) for device attestation prior to trusting it from within a guest, or that this was even possible. The thread was more informative for revealing that general knowledge of technology involved in Confidential Compute is not broadly pervasive. Perhaps there is an opportunity there for sessions at the newly revived in-person conferences taking place in ‘23.

Ongoing Development

Miguel Ojeda posted a patch introducing a new “Rust fixes” branch, noting, “While it may be a bit early to have a “fixes” branch, I guessed it would not hurt to start practicing how to do things for the future when we may get actual users. And since the opportunity presented itself, I wanted to also use this PR to bring up a “policy” topic and ideally get kernel maintainers to think about it.” He went on to describe the PR as containing a fix for a “soundness issue” related to UB (Undefined Behavior) in which “safe” rust code can nonetheless trigger UB in C code. He wanted to understand whether such fixes were truly considered fixes suitable for backport to stable and was mostly interested in addressing the policy aspect of the development process. Linus took the pull request without discussion, so presumably it wasn’t a big deal for him.

Saurabh Sengar posted “Device tree support for Hyper-V VMBus driver”, which “expands the VMBus driver to include device tree support. This feature allows for a kernel boot without the use of ACPI tables, resulting in a smaller memory footprint and potentially faster boot times. This is tested by enabling CONFIG_FLAT and OF_EARLY_FLATTREE for x86.” It isn’t articulated in the patch series, but this smells like an effort to support a special case minimal kernel – like the kind used by Amazon’s “Firecracker” for fast spinup VMs used to back ephemeral services like “functions”. It will be interesting to see what happens with this.

Elliot Berman (QUIC, part of Qualcomm) posted version 9 of a patch series “Drivers for gunyah hypervisor”. Gunyah is, “a Type-1 hypervisor independent of any high-level OS kernel, and runs in a higher CPU privilege level. It does not depend on any lower-privileged OS kernel/code for its core functionality. This increases its security and can support a much smaller trusted computing base than a Type-2 hypervisor.” The Gunyah source is available on github.

Breno Leitao posted “netpoll: Remove 4s sleep during carrier detection” noting that “Modern NICs do not seem to have this bouncing problem anymore, and this sleep slows down the machine boot unnecessarily”. What he meant is that traditionally the carrier on a link might be reported as “up” while autonegotiation was still underway. As Jakub Kicinski noted, especially on servers the “BMC [is often] communicating over NC-SI via the same NIC as gets used for netconsole. BMC will keep the PHY up, hence the carrier appearing instantly.”

Robin Murphy (Arm) posted a patch series aiming to “retire” the “iommu_ops” per bus IOMMU operations and reconcile around a common kernel implementation.

SongJae Park continues to organize periodic “Beer/Coffee/Tea” chat series virtual sessions for those interested in DAMON. The agenda and info is an a shared Google doc.

Architectures

Arm

Suzuki K Poulose posted several sets of (large) related RFC (Request For Comment) patches beginning with, “Support for Arm CCA VMs on Linux”. Arm CCA is a new feature introduced as part of the Armv9 architecture, including both the “Realm Management Extension” (RME) and associated system level IP changes required to build machines that support Confidential Compute “Realms”. In the CCA world, there are additional security states beyond the traditional Secure/Non-Secure. There is now a Realm state in which e.g. a Confidential Compute VM communicates with a new “RMM” (Realm Management Monitor) over an “RSI” (Realm Service Interface) to obtain special services on the Realm’s behalf. The RMM is separated from the traditional hypervisor and “provides standard interfaces – Realm Management Interface (RMI) – to the Normal world hypervisor to manage the VMs running in the Realm world (also called

Realms in short).” The idea is that the RMM is well known (e.g. Open Source) code that can be attested and trusted by a Realm to provide it with services on behalf of an untrusted hypervisor.

Arm include links to an updated “FVP” (Fixed Virtual Platform) modeling an RME-enabled v9 platform, alongside patched TF-A (Trusted Firmware), an RMM (Realm Management Monitor), kernel patches, an updated kvmtool (a lightweight alternative to qemu for starting VMs), and updated kvm-unit-tests. Suzuki notes that what they are seeking is feedback on:

  1. KVM integration of the Arm CCA
  2. KVM UABI for managing the Realms, seeking to generalise the operations wherever possible with other Confidential Compute solutions.
  3. Linux Guest support for Realms

kvx

Yann Sionneau (Kalrey) posted version 2 of a patch series, “Upstream kvx Linux port”, which adds support for yet another architecture, as used in the “Coolidge (aka MPPA3-80)” SoC. The architecture is a little endian VLIW (Very Long Instruction Word) with 32 and 64-bit execution modes, 64 GPRs, SIMD instructions, and (but of course) a “deep learning co-processor”. The architecture appears to borrow nomenclature from elsewhere, having both an “APIC” and a “GIC” as part of its interrupt controller story. Presumably these mean something quite different. In the mail, Yann notes that this is only an RFC at this stage, “since kvx support is not yet upstreamed into gcc/binutils”. The most infamous example of a VLIW architecture is, of course, Intel’s Itanium. It is slowly being removed from the kernel in a process that began in 2019 with the shipping of the final Itanium systems and deprecation of GCC and GLIBC support for it. If things go well, perhaps this new VLIW architecture can take Itanium’s place as the only one.

RISC-V

Anup Patel (Ventana Micro) is heavily involved in various RISC-V architecture enablement, including for the new “AIA” (Advanced Interrupt Architecture) specification, replacing the de facto use of SiFive’s “PLIC” interrupt controller. The spec has now been frozen (Anup provided a link to the frozen AIA specification) and initial patches are posted by Anup enabling support for guests to see a virtualized set of CSRs (Configuration and Status Registers). AIA is designed to be fully virtualizable, although as this author has noted from reading the spec, it does require an interaction with the IOMMU to interdict messages in order to allow for device live migration.

Sunil V L (Ventana Micro) posted patches to “Add basic ACPI support for RISC-V”. The patches come alongside others for EDK2 (UEFI, aka “Tianocore”), and Qemu (to run that firmware and boot RISC-V kernels enabled with ACPI support). This is an encouraging first step toward an embrace of the kinds of technologies required for viability in the mainstream. This author recalls the uphill battle that was getting ACPI support enabled for Arm. Perhaps the community has more experience to draw upon at this point, and a greater understanding of the importance of such standards to broader ecosystems. In any case, there were no objections this time around.

x86-64

Early last year (2022), David Woodhouse (Amazon) posted the 4th version of a patch series he had been working on titled “Parallel CPU bringup for x86_64” which aims to speed up the boot process for large SMP x86 systems. Traditionally, x86 systems would enter the Linux kernel in a single threaded mode with a “bootcpu” being the first core that happened to start Linux (not necessarily “cpu0”). Once early initialization was complete, this CPU would use a SIPI (Startup IPI or “Inter Processor Interrupt”) to signal to the “secondary” cores that they should start booting. The entire process could eventually take quite some time, and it would therefore be better if these “secondary” cores could start their initialization earlier – while the first core was getting things setup – and then rendezvous waiting for a signal to proceed.

Usama Arif (Bytedance) noted that these older patches “brought down the smpboot time from ~700ms to 100ms”. That’s a decent savings, especially when using kexec as Usama is doing (perhaps in a “Linuxboot” type of configuration with Linux as a bootloader), and at the scale of a large number of systems. Usama was interested to know whether these patches could be merged. David replied that the last time around there had been some AMD systems that broke with the patches, “We don’t *think* there are any remaining software issues; we think it’s hardware. Either an actual hardware race in CPU or chipset, or perhaps even something as simple as a voltage regulator which can’t cope with an increase in power draw from *all* the CPUs at the same time. We have prodded AMD a few times to investigate, but so far to no avail. Last time I actually spoke to Thomas [Gleixner – one of the core x86 maintainers] in person, I think he agreed that we should just merge it and disable the parallel mode for the affected AMD CPUs.”. The suggestion was to proceed to merge but to disable this feature on all AMD CPUs for the moment out of an abundance of caution.

Nikunj A Dadhania (AMD) posted patches enabling support for a “Secure TSC” for SNP (Secure Nested Paging) guests. SNP is part of AMD’s Confidential Compute strategy and securing the TSC (Time Stamp Counter) is a necessary part of enabling confidential guests to not have to trust the host hypervisor. Prior to these patches, a hypervisor could interdict the TSC, providing a different view of the passage of CPU time to the guest than reality. With the patches, “Secure TSC allows guest to securely use RDTSC/RDTSCP instructions as the parameters being used cannot be changed by hypervisor once the guest is launched. More details in the AMD64 APM Vol 2, Section “Secure TSC”. According to Nikunj, “During the boot-up of the secondary cpus, SecureTSC enabled guests need to query TSC info from Security processor (PSP). This communication channel is encrypted between the security processor and the guest, hypervisor is just the conduit to deliver the guest messages to the security processor. Each message is protected with an AEAD (AES-256 GCM).”

Rick Edgecomb (Intel) posted an updated patch series titled “Shadow stacks for userspace” that “implements Shadow Stacks for userspace using x86’s Control-flow Enforcement Technology (CET). CET consists of two related security features: shadow stacks and indirect branch tracking. This series implements just the shadow stack part of this feature, and just for userspace.” As Rick notes, “The main use case for shadow stack is providing protection against return oriented programming attacks”. ROP attacks aim to string together pre-existing “gadgets” (existing pieces of code, not necessarily actually well defined functions in themselves) by finding a vulnerability in existing code that can cause a function to jump (return) into a gadget sequence. Shadow stacks mitigate this by adding an additional, separate hardware structure, that tracks all function entry/exit sequences and ensures returns only come from real function calls (or are special cased longjmp like sequences that usually require special handling).

Jarkko Sakkinen posted some fixes for AMD’s SEV-SNP hypervisor that had been discovered by the Enarx developers. I’m mentioning it because this patch series may have been the final one to go out by the startup “Profian”, which had been seeking to commercialize support for Enarx. Profian closed its doors in the past few weeks due to the macro-economic environment. Some great developers are available on the market and looking for new opportunities. If you are hiring, or know folks who are, you can see posts from the Profian engineers on LinkedIn.

Final words

The Open Source Summit North America returns in person (and virtual) this year, from May 10th in Vancouver, British Columbia, Canada. There are several other events planned to be colocated alongside the Open Source Summit. These include the (invite only) Linux Storage, Filesystem, Memory Management, and BPF (LSF/MM/BPF) Summit for which a CFP is open. Another colocated event is the Linux Security Summit North America, the CfP of which was announced by James Morris with a link for submitting proposals.

Cyril Hrubis (SuSE) posted an announcement that the Linux Test Project (LTP) for January 2023 had been released. It includes a number of new tests, among them “dirtyc0w_shmem aka CVE-2022-2590”. They have also updated the minimum C requirement to -std=gnu99. Linux itself moved to a baseline of C11 (from the much older C99 standard) since Linux 5.18.

February 13, 2023 12:06 AM

February 08, 2023

Dave Airlie (blogspot): vulkan video: status update (anv + radv)

 Okay just a short status update.

radv H264/H265 decode:

The radv h264/h265 support has been merged to mesa main branch. It is still behind RADV_PERFTEST=video_decode flag, and should work for basics from VI/GFX8+. It still has not passed all the CTS tests.

anv H264 decode:

The anv h264 decode support has been merged to mesa main branch. It has been tested from Skylake up to DG2. It has no enable flag, just make sure to build with h264dec video-codec support. It passes all current CTS tests.

hasvk H264 decode:

I ported the anv h264 decoder to hasvk the vulkan driver for Ivybridge/Haswell. This in a draft MR (HASVK H264). I haven't given this much testing yet, it has worked in the past. I'll get to testing it before trying to get it merged.

radv AV1 decode:

I created an MR for spec discussion (radv av1). I've also cleaned up the radv AV1 decode code.

anv AV1 decode:

I've started on anv AV1 decode support for DG2. I've gotten one very simple frame to decode. I will attempt to do more. I think filmgrain is not going to be supported in the short term. I'll fill in more details on this when it's working better. I think there are a few things that might need to be changed in the AV1 decoder provisional spec for Intel, there are some derived values that ffmpeg knows that it would be nice to not derive again, and there are also some hw limits around tiles and command buffers that will need to be figured out.

February 08, 2023 04:07 AM

February 02, 2023

Linux Plumbers Conference: Preliminary Dates and Location for LPC2023

The 2023 LPC PC is pleased to announce that we’ve begun exclusive negotiations with the Omni Hotel in Richmond, VA to host Plumbers 2023 from 13-15 November. Note: These dates are not yet final (nor is the location; we have had one failure at this stage of negotiations from all the Plumbers venues we’ve chosen). We will let you know when this preliminary location gets finalized (please don’t book irrevocable travel until then).

The November dates were the only ones that currently work for the venue, but Richmond is on the same latitude as Seville in Spain, so it should still be nice and warm.

February 02, 2023 04:18 PM

Matthew Garrett: Blocking free API access to Twitter doesn't stop abuse

In one week from now, Twitter will block free API access. This prevents anyone who has written interesting bot accounts, integrations, or tooling from accessing Twitter without paying for it. A whole number of fascinating accounts will cease functioning, people will no longer be able to use tools that interact with Twitter, and anyone using a free service to do things like find Twitter mutuals who have moved to Mastodon or to cross-post between Twitter and other services will be blocked.

There's a cynical interpretation to this, which is that despite firing 75% of the workforce Twitter is still not profitable and Elon is desperate to not have Twitter go bust and also not to have to tank even more of his Tesla stock to achieve that. But let's go with the less cynical interpretation, which is that API access to Twitter is something that enables bot accounts that make things worse for everyone. Except, well, why would a hostile bot account do that?

To interact with an API you generally need to present some sort of authentication token to the API to prove that you're allowed to access it. It's easy enough to restrict issuance of those tokens to people who pay for the service. But, uh, how do the apps work? They need to be able to communicate with the service to tell it to post tweets, retrieve them, and so on. And the simple answer to that is that they use some hardcoded authentication tokens. And while registering for an API token yourself identifies that you're not using an official client, using the tokens embedded in the clients makes it look like you are. If you want to make it look like you're a human, you're already using tokens ripped out of the official clients.

The Twitter client API keys are widely known. Anyone who's pretending to be a human is using those already and will be unaffected by the shutdown of the free API tier. Services like movetodon.org do get blocked. This isn't an anti-abuse choice. It's one that makes it harder to move to other services. It's one that blocks a bunch of the integrations and accounts that bring value to the platform. It's one that hurts people who follow the rules, without hurting the ones who don't. This isn't an anti-abuse choice, it's about trying to consolidate control of the platform.

comment count unavailable comments

February 02, 2023 10:36 AM