Kernel Planet

November 20, 2017

Davidlohr Bueso: Linux v4.14: Performance Goodies

Last week Linus released the v4.14 kernel with some noticeable performance changes. The following is an unsorted and incomplete list of changes that went in. Note that the term 'performance' can be vague in that some gains in one area can negatively affect another, so take everything with a grain of salt and reach your own conclusions.

sysvipc: scale key management

We began using relativistic hash tables for managing ipc keys, which greatly improves on the previous O(N) lookups. As such, ipc_findkey() calls are significantly faster (+800% in some reaim file benchmarks) and we need not iterate all elements each time. Improvements are seen even in scenarios where the number of keys is but a handful, so this is pretty much a win from any standpoint.
[Commit 0cfb6aee70bd]
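
To make this concrete, here is a paraphrased sketch of the new lookup path (field names are approximate, not a verbatim copy of ipc/util.c): the keys live in a resizable hash table, and ipc_findkey() becomes a single expected-O(1) lookup.

static const struct rhashtable_params ipc_kht_params = {
    .head_offset         = offsetof(struct kern_ipc_perm, khtnode),
    .key_offset          = offsetof(struct kern_ipc_perm, key),
    .key_len             = sizeof(key_t),
    .automatic_shrinking = true,
};

static struct kern_ipc_perm *ipc_findkey(struct ipc_ids *ids, key_t key)
{
    /* expected O(1) hash lookup; previously a linear walk over all ids */
    return rhashtable_lookup_fast(&ids->key_ht, &key, ipc_kht_params);
}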

interval-tree: fast overlap detection

The rbtree API was extended to cache the smallest (leftmost) node: instead of doing O(log N) walks down the tree, we have the pointer always available. This allows us to extend and complete fast overlap detection for interval trees, speeding up (sub)tree searches when the interval is completely to the left or right of the current subtree's max interval. In addition, a number of other rbtree users were updated to use the new cached rbtree (rb_root_cached), such as epoll, procfs and cfq.
[Commits cd9e61ed1eeb, 410bd5ecb276, 2554db916586, b2ac2ea6296a, f808c13fd373]
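
A minimal sketch of the cached-rbtree API, assuming an illustrative item type keyed by a start offset (rb_insert_color_cached() and rb_first_cached() are the real kernel helpers): insertion tracks whether the new node became the leftmost, so the smallest element is afterwards available in O(1).

struct item {
    struct rb_node node;
    u64 start;
};

static struct rb_root_cached root = RB_ROOT_CACHED;

static void item_insert(struct item *new)
{
    struct rb_node **link = &root.rb_root.rb_node, *parent = NULL;
    bool leftmost = true;

    while (*link) {
        struct item *cur = rb_entry(*link, struct item, node);

        parent = *link;
        if (new->start < cur->start) {
            link = &(*link)->rb_left;
        } else {
            link = &(*link)->rb_right;
            leftmost = false;    /* not the smallest any more */
        }
    }
    rb_link_node(&new->node, parent, link);
    rb_insert_color_cached(&new->node, &root, leftmost);
}

static struct item *item_first(void)
{
    /* O(1): the leftmost node is cached, no O(log N) walk needed */
    struct rb_node *n = rb_first_cached(&root);

    return n ? rb_entry(n, struct item, node) : NULL;
}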

sched: waitqueue bookmarks

A situation where constant NUMA migrations of a hot page triggered a large number of page waiters being awoken exposed some issues in the waitqueue implementation. In such cases, a large number of wakeups occur while holding a spinlock, which causes significant, unbounded latencies. Unlike wake_qs (used in futexes and locks), where batched wakeups are done without the lock, waitqueue bookmarks allow the waker to pause and stop iterating the wake list, such that another process has a chance to acquire the lock, and then resume where it left off.
[Commit 3510ca20ece, 2554db916586, 11a19c7b099f]
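
Paraphrased from the v4.14 implementation in kernel/sched/wait.c: __wake_up_common() now stops after a batch of wakeups, leaves a dummy bookmark entry in the list and sets WQ_FLAG_BOOKMARK, so the caller can drop the contended lock, give others a chance to take it, then re-acquire it and resume the scan at the bookmark.

static void __wake_up_common_lock(struct wait_queue_head *wq_head,
                                  unsigned int mode, int nr_exclusive,
                                  int wake_flags, void *key)
{
    unsigned long flags;
    wait_queue_entry_t bookmark;

    bookmark.flags = 0;
    bookmark.private = NULL;
    bookmark.func = NULL;
    INIT_LIST_HEAD(&bookmark.entry);

    spin_lock_irqsave(&wq_head->lock, flags);
    nr_exclusive = __wake_up_common(wq_head, mode, nr_exclusive,
                                    wake_flags, key, &bookmark);
    spin_unlock_irqrestore(&wq_head->lock, flags);

    /* more waiters pending: let the lock breathe, then continue */
    while (bookmark.flags & WQ_FLAG_BOOKMARK) {
        spin_lock_irqsave(&wq_head->lock, flags);
        nr_exclusive = __wake_up_common(wq_head, mode, nr_exclusive,
                                        wake_flags, key, &bookmark);
        spin_unlock_irqrestore(&wq_head->lock, flags);
    }
}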

x86 PCID (Process Context Identifier)

This is a 64-bit hardware feature that allows tagging TLB entries such that upon context switching, only the required entries need be flushed. Virtualization (VT-x) has had a similar feature for a while, via VPID; on other archs it is called an address space ID. Linux's support is somewhat special. In order to avoid the x86 limitation of 4096 IDs (or processes), the implementation actually uses a PCID to identify a recently-used mm (process address space) on a per-cpu basis. An mm has no fixed PCID binding at all; instead, it is given a fresh PCID each time it is loaded, except in cases where we want to preserve the TLB, in which case we reuse a recent value. To illustrate, in a workload under kvm that ping-pongs two processes, dTLB misses were reduced by ~17x.
[Commits f39681ed0f48, b0579ade7cd8, 94b1b03b519b, 43858b4f25cf, cba4671af755, 0790c9aad849, 660da7c9228f, 10af6235e0d3]
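
A simplified user-space model of the per-cpu PCID recycling idea (constants and names are illustrative, not the kernel's): each CPU remembers the last few address spaces it ran; if the incoming mm is still remembered, its PCID and TLB entries are reused, otherwise a slot is evicted and flushed.

#include <stdbool.h>
#include <stdio.h>

#define NR_DYN_ASIDS 6                  /* small per-cpu window of PCIDs */

struct cpu_tlbstate {
    const void *last_mm[NR_DYN_ASIDS];  /* mm last seen in each slot */
    unsigned int next_victim;           /* round-robin eviction cursor */
};

/* Pick the PCID slot for mm; *need_flush says whether the TLB entries
 * tagged with that PCID belong to someone else and must be invalidated. */
static unsigned int choose_pcid(struct cpu_tlbstate *st, const void *mm,
                                bool *need_flush)
{
    unsigned int i, victim;

    for (i = 0; i < NR_DYN_ASIDS; i++) {
        if (st->last_mm[i] == mm) {
            *need_flush = false;        /* TLB still holds our entries */
            return i;
        }
    }
    victim = st->next_victim++ % NR_DYN_ASIDS;
    st->last_mm[victim] = mm;
    *need_flush = true;                 /* slot reused: flush that PCID */
    return victim;
}

int main(void)
{
    struct cpu_tlbstate st = { { 0 }, 0 };
    int mm_a, mm_b;
    bool flush;

    choose_pcid(&st, &mm_a, &flush);    /* first time seen: flush */
    choose_pcid(&st, &mm_b, &flush);    /* first time seen: flush */
    choose_pcid(&st, &mm_a, &flush);    /* ping-pong back: no flush */
    printf("mm_a reused without flush: %s\n", flush ? "no" : "yes");
    return 0;
}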

ORC (Oops Rewind Capability) Unwinder

The much acclaimed replacement for frame pointers and the (out-of-tree) DWARF unwinder. Through simplicity, the end result is faster profiling, such as for perf. Experiments show a 20x performance increase using ORC vs DWARF when calling save_stack_trace 20,000 times via a single vfs_write. Compared to frame pointers, the ORC unwinder is more accurate across interrupt entry frames, and dropping frame pointers enables a 5-10% performance improvement across the entire kernel.
[Commit ee9f8fce9964, 39358a033b2e]

mm: choose swap device according to numa node

If the system has more than one swap device and the swap devices carry node information, we can make use of this information in get_swap_pages() to decide which device to use and get better performance. This change replaces the single global swap_avail list with a per-numa-node list: each numa node sees its own priority-based list of available swap devices, and a swap device's priority is promoted on its matching node's swap_avail_list. This shows ~25% improvement on a 2-node box, benchmarking random writes on an mmapped region with SSDs attached to each node, ensuring swapping in and out.
[Commit a2468cc9bfdf]
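
A hedged sketch of the selection idea, assuming the post-change data layout (one plist per node, with a device's avail_lists[] entry promoted on its local node); the helper name is mine, not the kernel's:

static struct plist_head *swap_avail_heads;    /* one list per NUMA node */

/* Illustrative: the first entry on the local node's list is the
 * highest-priority (e.g. node-local SSD) swap device for this CPU. */
static struct swap_info_struct *pick_swap_device(void)
{
    int nid = numa_node_id();

    if (plist_head_empty(&swap_avail_heads[nid]))
        return NULL;
    return plist_first_entry(&swap_avail_heads[nid],
                             struct swap_info_struct, avail_lists[nid]);
}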

mm: reduce cost of page allocator

Upon page allocation, the per-zone statistics are updated, introducing overhead in the form of cacheline bouncing, which is responsible for ~30% of all CPU cycles spent allocating a single page. The networking folks have been known to complain about the performance degradation when dealing with the memory management subsystem, particularly the page allocator. The fact that these NUMA-associated counters are rarely used allows the counter threshold that determines the frequency of updating the global counter with the percpu counters (hence the cacheline bouncing) to be increased. This means hurting readers, but that's the point.
[Commit 3a321d2a3dde, 1d90ca897cb0, 638032224ed7]
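
A simplified user-space model of the batching scheme (threshold value and names approximate): each CPU accumulates a local delta and only folds it into the shared global counter, the cacheline-bouncing step, once the delta crosses the threshold. Raising the threshold for the rarely-read NUMA counters makes the write path cheaper at the cost of staler readers.

#include <stdatomic.h>
#include <stdio.h>

#define NUMA_STAT_THRESHOLD 32765   /* large batch, roughly what v4.14 uses */

static atomic_long global_count;            /* shared: bounces between CPUs */
static _Thread_local long percpu_delta;     /* stand-in for a per-cpu var */

static void count_numa_event(void)
{
    if (++percpu_delta >= NUMA_STAT_THRESHOLD) {
        /* the only step that touches the shared cacheline */
        atomic_fetch_add(&global_count, percpu_delta);
        percpu_delta = 0;
    }
}

int main(void)
{
    for (int i = 0; i < 100000; i++)
        count_numa_event();
    printf("global: %ld (readers may lag by up to the threshold)\n",
           atomic_load(&global_count));
    return 0;
}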

archs: multibyte memset

New calls memset16(), memset32() and memset64() are introduced, which are like memset(), but allow the caller to fill the destination with a value larger than a single byte. There are a number of places in the kernel that can benefit from using an optimized function rather than a loop; sometimes in text size, sometimes in speed, and sometimes both. When supported by the architecture, a single instruction is used, such as stosq (store a quadword) on x86-64. Zram shows a 7% performance improvement on x86 with 100MB of non-zero, deduplicatable data. If no optimized version is available, the calls fall back to the slower loop implementation.
[Commits 3b3c4babd898, 03270c13c5ff, 4c51248533ad, 48ad1abef402]
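
The kernel prototype is void *memset32(uint32_t *s, uint32_t v, size_t count) (and likewise for 16 and 64 bits). A portable model to show the semantics; the arch-optimized kernel versions replace the explicit loop with a single store instruction where possible:

#include <stddef.h>
#include <stdint.h>

/* Fill count 32-bit slots with the pattern v, like the kernel's
 * memset32(); on x86-64 the real thing can use rep stosl instead. */
static void *memset32_model(uint32_t *s, uint32_t v, size_t count)
{
    uint32_t *p = s;

    while (count--)
        *p++ = v;
    return s;
}

int main(void)
{
    uint32_t row[640];

    /* e.g. painting a 32bpp framebuffer row a solid colour in one call */
    memset32_model(row, 0x00ff00ffu, 640);
    return 0;
}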

powerpc: improve TLB flushing

A few optimisations were also added to the radix MMU TLB flushing, mostly to avoid unnecessary Page Walk Cache (PWC) flushes when the structure of the tree is not changing.
[Commit a46cc7a90fd8, 424de9c6e3f8]

There are plenty of other performance optimizations out there, including ext4 parallel file creation and quotas, additional memset improvements on sparc, transparent hugepage migration and swap improvements, ipv6 (ip6_route_output()) optimizations, etc. Again, the list here is partial and biased by me. For a fuller list of features, play with 'git log' or visit lwn (part1, part2) and kernelnewbies.

November 20, 2017 03:50 PM

November 15, 2017

Kees Cook: security things in Linux v4.14

Previously: v4.13.

Linux kernel v4.14 was released this last Sunday, and there’s a bunch of security things I think are interesting:

vmapped kernel stack on arm64
Similar to the same feature on x86, Mark Rutland and Ard Biesheuvel implemented CONFIG_VMAP_STACK for arm64, which moves the kernel stack to an isolated and guard-paged vmap area. With traditional stacks, there were two major risks when exhausting the stack: overwriting the thread_info structure (which contains the addr_limit field checked during copy_to/from_user()), and overwriting neighboring stacks (or other things allocated next to the stack). While arm64 previously moved its thread_info off the stack to deal with the former issue, this vmap change adds the last bit of protection by nature of the vmap guard pages. If the kernel tries to write past the end of the stack, it will hit the guard page and fault. (Testing for this is now possible via LKDTM’s STACK_GUARD_PAGE_LEADING/TRAILING tests.)

One aspect of the guard page protection that will need further attention (on all architectures) is that if the stack grew because of a giant Variable Length Array on the stack (effectively an implicit alloca() call), it might be possible to jump over the guard page entirely (as seen in the userspace Stack Clash attacks). Thankfully the use of VLAs is rare in the kernel. In the future, hopefully we’ll see the addition of PaX/grsecurity’s STACKLEAK plugin which, in addition to its primary purpose of clearing the kernel stack on return to userspace, makes sure stack expansion cannot skip over guard pages. This “stack probing” ability will likely also become directly available from the compiler.

set_fs() balance checking
Related to the addr_limit field mentioned above, another class of bug is finding a way to force the kernel into accidentally leaving addr_limit open to kernel memory through an unbalanced call to set_fs(). In some areas of the kernel, in order to reuse userspace routines (usually VFS or compat related), code will do something like: set_fs(KERNEL_DS); ...some code here...; set_fs(USER_DS);. When the USER_DS call goes missing (usually due to a buggy error path or exception), subsequent system calls can suddenly start writing into kernel memory via copy_to_user (where the “to user” really means “within the addr_limit range”).
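
A hedged illustration of the bug shape being caught (the function and names are made up for the example): an early error return skips the restoring set_fs(), so the task keeps running with addr_limit at KERNEL_DS.

static ssize_t read_file_kernel(struct file *file, void *buf,
                                size_t len, loff_t *pos)
{
    mm_segment_t old_fs = get_fs();
    ssize_t ret;

    set_fs(KERNEL_DS);          /* widen addr_limit for vfs_read() */
    ret = vfs_read(file, (char __user *)buf, len, pos);
    if (ret < 0)
        return ret;             /* BUG: KERNEL_DS leaks to later syscalls */
    set_fs(old_fs);             /* the restore every path must reach */
    return ret;
}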

Thomas Garnier implemented USER_DS checking at syscall exit time for x86, arm, and arm64. This means that a broken set_fs() setting will not extend beyond the buggy syscall that fails to set it back to USER_DS. Additionally, as part of the discussion on the best way to deal with this feature, Christoph Hellwig and Al Viro (and others) have been making extensive changes to avoid the need for set_fs() being used at all, which should greatly reduce the number of places where it might be possible to introduce such a bug in the future.

SLUB freelist hardening
A common class of heap attacks is overwriting the freelist pointers stored inline in the unallocated SLUB cache objects. PaX/grsecurity developed an inexpensive defense that XORs the freelist pointer with a global random value (and the storage address). Daniel Micay improved on this by using a per-cache random value, and I refactored the code a bit more. The resulting feature, enabled with CONFIG_SLAB_FREELIST_HARDENED, makes freelist pointer overwrites very hard to exploit unless an attacker has found a way to expose both the random value and the pointer location. This should render blind heap overflow bugs much more difficult to exploit.
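
Paraphrased from the v4.14 SLUB code: the stored “next free” pointer is XORed with a per-cache random value and its own storage address, so a blind overwrite decodes to a wild pointer rather than an attacker-chosen one.

#ifdef CONFIG_SLAB_FREELIST_HARDENED
static inline void *freelist_ptr(const struct kmem_cache *s, void *ptr,
                                 unsigned long ptr_addr)
{
    /* obfuscate: an attacker needs both s->random and the slot address */
    return (void *)((unsigned long)ptr ^ s->random ^ ptr_addr);
}
#endif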

Additionally, Alexander Popov implemented a simple double-free defense, similar to the “fasttop” check in the GNU C library, which will catch sequential free()s of the same pointer. (And has already uncovered a bug.)

Future work would be to provide similar metadata protections to the SLAB allocator (though SLAB doesn’t store its freelist within the individual unused objects, so it has a different set of exposures compared to SLUB).

setuid-exec stack limitation
Continuing the various additional defenses to protect against future problems related to userspace memory layout manipulation (as shown most recently in the Stack Clash attacks), I implemented an 8MiB stack limit for privileged (i.e. setuid) execs, inspired by a similar protection in grsecurity, after reworking the secureexec handling by LSMs. This complements the unconditional limit to the size of exec arguments that landed in v4.13.

randstruct automatic struct selection
While the bulk of the port of the randstruct gcc plugin from grsecurity landed in v4.13, the last of the work needed to enable automatic struct selection landed in v4.14. This means that the coverage of randomized structures, via CONFIG_GCC_PLUGIN_RANDSTRUCT, now includes one of the major targets of exploits: function pointer structures. Without knowing the build-randomized location of a callback pointer an attacker needs to overwrite in a structure, exploits become much less reliable.

structleak passed-by-reference variable initialization
Ard Biesheuvel enhanced the structleak gcc plugin to initialize all variables on the stack that are passed by reference when built with CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL. Normally the compiler will yell if a variable is used before being initialized, but it silences this warning if the variable’s address is passed into a function call first, as it has no way to tell if the function did actually initialize the contents. So the plugin now zero-initializes such variables (if they hadn’t already been initialized) before the function call that takes their address. Enabling this feature has a small performance impact, but solves many stack content exposure flaws. (In fact at least one such flaw reported during the v4.15 development cycle was mitigated by this plugin.)
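
A hedged example of the pattern the plugin defends against (the struct and helper names are invented): info has no initializer, and because its address is passed to a function the compiler stays silent; if fill_info() skips a field, copy_to_user() leaks stack garbage. With BYREF_ALL the plugin zero-initializes info before the call.

static long get_info(struct user_info __user *ubuf, int arg)
{
    struct user_info info;  /* uninitialized stack memory */

    fill_info(&info, arg);  /* may legitimately skip some fields */
    if (copy_to_user(ubuf, &info, sizeof(info)))
        return -EFAULT;     /* ...and any skipped bytes leak here */
    return 0;
}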

improved boot entropy
Laura Abbott and Daniel Micay improved early boot entropy available to the stack protector by both moving the stack protector setup later in the boot, and including the kernel command line in boot entropy collection (since with some devices it changes on each boot).

eBPF JIT for 32-bit ARM
The ARM BPF JIT had been around a while, but it didn’t support eBPF (and, as a result, did not provide constant value blinding, which meant it was exposed to being used by an attacker to build arbitrary machine code with BPF constant values). Shubham Bansal spent a bunch of time building a full eBPF JIT for 32-bit ARM which both speeds up eBPF and brings it up to date on JIT exploit defenses in the kernel.

seccomp improvements
Tyler Hicks addressed a long-standing deficiency in how seccomp could log action results. In addition to creating a way to mark a specific seccomp filter as needing to be logged with SECCOMP_FILTER_FLAG_LOG, he added a new action result, SECCOMP_RET_LOG. With these changes in place, it should be much easier for developers to inspect the results of seccomp filters, and for process launchers to generate logs for their child processes operating under a seccomp filter.
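
A minimal sketch of the new logging action (requires v4.14-era headers for SECCOMP_RET_LOG and SECCOMP_FILTER_FLAG_LOG): the filter allows everything, but uses of getpid() are logged rather than silently permitted. The SECCOMP_RET_KILL_PROCESS action mentioned below would slot in the same way as a return value.

#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    struct sock_filter filter[] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getpid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_LOG),    /* log + allow */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };

    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
    if (syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
                SECCOMP_FILTER_FLAG_LOG, &prog))
        perror("seccomp");

    printf("pid %d\n", getpid());   /* this syscall shows up in the log */
    return 0;
}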

Additionally, I finally found a way to implement an often-requested feature for seccomp, which was to kill an entire process instead of just the offending thread. This was done by creating the SECCOMP_RET_ACTION_FULL mask (née SECCOMP_RET_ACTION) and implementing SECCOMP_RET_KILL_PROCESS.

That’s it for now; please let me know if I missed anything. The v4.15 merge window is now open!

© 2017, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

November 15, 2017 05:23 AM

November 14, 2017

James Morris: Save the Dates: Linux Security Summit Events for 2018

There will be a new European version of the Linux Security Summit for 2018, in addition to the established North American event.

The dates and locations are as follows:

Stay tuned for CFP announcements!

November 14, 2017 11:24 PM

November 13, 2017

Matthew Garrett: Eben Moglen is no longer a friend of the free software community

(Note: While the majority of the events described below occurred while I was a member of the board of directors of the Free Software Foundation, I am no longer. This is my personal position and should not be interpreted as the opinion of any other organisation or company I have been affiliated with in any way)

Eben Moglen has done an amazing amount of work for the free software community, serving on the board of the Free Software Foundation and acting as its general counsel for many years, leading the drafting of GPLv3 and giving many forceful speeches on the importance of free software. However, his recent behaviour demonstrates that he is no longer willing to work with other members of the community, and we should reciprocate that.

In early 2016, the FSF board became aware that Eben was briefing clients on an interpretation of the GPL that was incompatible with the one held by the FSF. He later released this position publicly with little coordination with the FSF, and it was used by Canonical to justify their shipping ZFS in a GPL-violating way. He had provided similar advice to Debian, who were confused about the apparent conflict between the FSF's position and Eben's.

This situation was obviously problematic - Eben is clearly free to provide whatever legal opinion he holds to his clients, but his very public association with the FSF caused many people to assume that these positions were held by the FSF and the FSF were forced into the position of publicly stating that they disagreed with legal positions held by their general counsel. Attempts to mediate this failed, and Eben refused to commit to working with the FSF on avoiding this sort of situation in future[1].

Around the same time, Eben made legal threats towards another project with ties to the FSF. These threats were based on a license interpretation that ran contrary to how free software licenses had been interpreted by the community for decades, and were made without any prior discussion with the FSF. This, in conjunction with his behaviour over the ZFS issue, led to him stepping down as the FSF's general counsel.

Throughout this period, Eben disparaged FSF staff and other free software community members in various semi-public settings. In doing so he harmed the credibility of many people who have devoted significant portions of their lives to aiding the free software community. At Libreplanet earlier this year he made direct threats against an attendee - this was reported as a violation of the conference's anti-harassment policy.

Eben has acted against the best interests of an organisation he publicly represented. He has threatened organisations and individuals who work to further free software. His actions are no longer to the benefit of the free software community and the free software community should cease associating with him.

[1] Contrary to the claim provided here, Bradley was not involved in this process.

(Edit to add: various people have asked for more details of some of the accusations here. Eben is influential in many areas, and publicising details without the direct consent of his victims may put them at professional risk. I'm aware that this reduces my credibility, and it's entirely reasonable for people to choose not to believe me as a result. I will add that I said much of this several months ago, so I'm not making stuff up in response to recent events)


November 13, 2017 07:07 PM

Gustavo F. Padovan: The linuxdev-br conference was a success!

Last Saturday we had the first edition of the Linux Developer Conference Brazil. A conference born from the need for a meeting point, in Brazil, for the developers, enthusiasts and companies behind the FOSS projects that form the core of modern Linux systems, whether in smartphones, the cloud, cars or TVs.

After a few years traveling to conferences around the world, I felt that in Brazil we didn't have any forum like the ones abroad, so I came up with the idea of building one myself. I invited two friends of mine to take on the challenge, Bruno Dilly and João Moreira. We also got help from the University of Campinas, which allowed us to use its space; many thanks to Professor Islene Garcia.

Together we made linuxdev-br a success, and the talks were great. Almost 100 people attended the conference, some of them traveling from quite far places in Brazil. In the morning we had João Avelino Bellomo Filho talking about SystemTap, Lucas Villa Real talking about virtualization with GoboLinux's Runner, and Felipe Neves talking about the Zephyr project. In the afternoon we had Fabio Estevam talking about Device Tree, Arnaldo Melo on perf tools, and João Moreira on Live Patching. All videos are available here (in Portuguese).

To finish the day we had a Happy Hour paid by the sponsors of the conference. It was a great opportunity to have some beers and interact with other attendees.

I want to thank everyone who joined us in the first edition; next year it will be even better. By the way, talking about next year: the conference language will be English. We want linuxdev-br to become part of the international cycle of conferences! Stay tuned, and if you want to take part, talk or sponsor, please reach us at contact@linuxdev-br.net.

November 13, 2017 03:33 PM

November 07, 2017

Dave Airlie (blogspot): radv on Ubuntu broken in distro packages

It appears that the Ubuntu mesa 17.2.2 packages that ship radv carry patches to enable Mir support. These patches actually just break radv instead. I'd seen some people complain that simple apps don't work on radv, saying radv wasn't ready for use and asking how anyone could think of using it, and I just wondered what they had been smoking, as Fedora was working fine. Hopefully Canonical can sort that out ASAP.

November 07, 2017 07:35 PM

Pete Zaitcev: ProxyFS opened, I think

Not exactly sure if that thing is complete, and I didn't attend the announcement (at the OpenStack Summit in Sydney, presumably), but it appears that SwiftStack has open-sourced ProxyFS. The project was announced to the world a year and a half ago.

November 07, 2017 02:25 AM

October 26, 2017

Pete Zaitcev: Polite like Sphinx

Exception occurred:
   File "/usr/lib/python2.7/site-packages/sphinx/util/logging.py", line 363, in filter
     raise SphinxWarning(message % record.args)
TypeError: not all arguments converted during string formatting
The full traceback has been saved in /tmp/sphinx-err-SD2Ra4.log, if you want to report the issue to the developers.

Love how modest this package is.

October 26, 2017 08:26 PM

October 21, 2017

Pavel Machek: Prague and Nokia N900s

If you are travelling to Prague for ELCE and have a Nokia N900, N9 or N950, or spare parts for them, please take them with you. I may be able to help you install postmarketOS there (https://wiki.postmarketos.org/wiki/Main_Page), I can probably charge an N900 that does not charge, and spare parts would be useful for me. I have a talk about cameras, and will be around... https://osseu17.sched.com/event/ByYH/cheap-complex-cameras-pavel-machek-denx-software-engineering-gmbh .

October 21, 2017 10:25 PM

October 20, 2017

James Morris: Security Session at the 2017 Kernel Summit

For folks attending Open Source Summit Europe next week in Prague, note that there is a security session planned as part of the co-located Kernel Summit technical track.

This year, the Kernel Summit is divided into two components:

  1. An invitation-only maintainer summit of 30 people total, and;
  2. An open kernel summit technical track which is open to all attendees of OSS Europe.

The security session is part of the latter.  The preliminary agenda for the kernel summit technical track was announced by Ted Ts’o here:

There is also a preliminary agenda for the security session, here:

Currently, the agenda includes an update from Kees Cook on the Kernel Self Protection Project, and an update from Jarkko Sakkinen on TPM support.  I’ll provide a summary of the recent Linux Security Summit, depending on available time, perhaps focusing on security namespacing issues.

This agenda is subject to change and if you have any topics to propose, please send an email to the ksummit-discuss list.

 

October 16, 2017

Greg Kroah-Hartman: Linux Kernel Community Enforcement Statement FAQ

Based on the recent Linux Kernel Community Enforcement Statement and the article describing the background and what it means, here are some Questions/Answers to help clear things up. These are based on questions that came up when the statement was discussed among the initial round of over 200 different kernel developers.

Q: Is this changing the license of the kernel?

A: No.

Q: Seriously? It really looks like a change to the license.

A: No, the license of the kernel is still GPLv2, as before. The kernel developers are providing certain additional promises that they encourage users and adopters to rely on. And by having a specific acking process it is clear that those who ack are making commitments personally (and perhaps, if authorized, on behalf of the companies that employ them). There is nothing that says those commitments are somehow binding on anyone else. This is exactly what we have done in the past when some but not all kernel developers signed off on the driver statement.

Q: Ok, but why have this “additional permissions” document?

A: In order to help address problems caused by current and potential future copyright “trolls” aka monetizers.

Q: Ok, but how will this help address the “troll” problem?

A: “Copyright trolls” use the GPL-2.0’s immediate termination and the threat of an immediate injunction to turn an alleged compliance concern into a contract claim that gives the troll an automatic claim for money damages. The article by Heather Meeker describes this quite well, please refer to that for more details. If even a short delay is inserted for coming into compliance, that delay disrupts this expedited legal process.

By simply saying, “We think you should have 30 days to come into compliance”, we undermine the “immediacy” that supports the request to the court for an immediate injunction. The threat of an immediate injunction was used to get companies to sign contracts. The troll then goes back after the same company shortly afterwards for another known violation, and claims they're owed the financial penalty for breaking the contract. Signing contracts to pay damages to financially enrich one individual is completely at odds with our community’s enforcement goals.

We are showing that the community is not out for financial gain when it comes to license issues – though we do care about the company coming into compliance.  All we want is the modifications to our code to be released back to the public, and for the developers who created that code to become part of our community so that we can continue to create the best software that works well for everyone.

This is all still entirely focused on bringing the users into compliance. The 30 days can be used productively to determine exactly what is wrong, and how to resolve it.

Q: Ok, but why are we referencing GPL-3.0?

A: By using the terms from the GPLv3 for this, we use a very well-vetted and understood procedure for granting the opportunity to fix the failure and come into compliance. We benefit from many months of work to reach agreement on a termination provision that worked in legal systems all around the world and was entirely consistent with Free Software principles.

Q: But what is the point of the “non-defensive assertion of rights” disclaimer?

A: If a copyright holder is attacked, we don’t want or need to require that copyright holder to give the party suing them an opportunity to cure. The “non-defensive assertion of rights” is just a way to leave everything unchanged for a copyright holder that gets sued.  This is no different a position than what they had before this statement.

Q: So you are ok with using Linux as a defensive copyright method?

A: There is a current copyright troll problem that is undermining confidence in our community – where a “bad actor” is attacking companies in a way to achieve personal gain. We are addressing that issue. No one has asked us to make changes to address other litigation.

Q: Ok, this document sounds like it was written by a bunch of big companies, who is behind the drafting of it and how did it all happen?

A: Grant Likely, the chairman at the time of the Linux Foundation’s Technical Advisory Board (TAB), wrote the first draft of this document when the first copyright troll issue happened a few years ago. He did this as numerous companies and developers approached the TAB asking that the Linux kernel community do something about this new attack on our community. He showed the document to a lot of kernel developers and a few company representatives in order to get feedback on how it should be worded. After the troll seemed to go away, this work got put on the back-burner.

When the copyright troll showed back up, along with a few other “copycat” individuals, work on the document was restarted by Chris Mason, the current chairman of the TAB. He worked with the TAB members, other kernel developers, lawyers who have been trying to defend these claims in Germany, and the Linux Foundation’s lawyers, in order to rework the document so that it would actually achieve the intended benefits and be useful in stopping these new attacks. The document was then reviewed and revised with input from Linus Torvalds, and finally a document that the TAB agreed would be sufficient was finished.

That document was then sent by Greg Kroah-Hartman to over 200 of the most active kernel developers of the past year, to see if they, or their company, wished to support the document. That produced the initial “signatures” on the document, and the acks of the patch that added it to the Linux kernel source tree.

Q: How do I add my name to the document?

A: If you are a developer of the Linux kernel, simply send Greg a patch adding your name to the proper location in the document (sorting the names by last name), and he will be glad to accept it.

Q: How can my company show its support of this document?

A: If you are a developer working for a company that wishes to show that they also agree with this document, have the developer put the company name in ‘(’ ‘)’ after the developer’s name. This shows that both the developer, and the company behind the developer are in agreement with this statement.

Q: How can a company or individual that is not part of the Linux kernel community show its support of the document?

A: Become part of our community! Send us patches, surely there is something that you want to see changed in the kernel. If not, wonderful, post something on your company web site, or personal blog in support of this statement, we don’t mind that at all.

Q: I’ve been approached by a copyright troll for Netfilter. What should I do?

A: Please see the Netfilter FAQ here for how to handle this.

Q: I have another question, how do I ask it?

A: Email Greg or the TAB, and they will be glad to help answer them.

October 16, 2017 09:05 AM

Greg Kroah-Hartman: Linux Kernel Community Enforcement Statement

By Greg Kroah-Hartman, Chris Mason, Rik van Riel, Shuah Khan, and Grant Likely

The Linux kernel ecosystem of developers, companies and users has been wildly successful by any measure over the last couple of decades. Even today, 26 years after the initial creation of the Linux kernel, the kernel developer community continues to grow, with more than 500 different companies and over 4,000 different developers getting changes merged into the tree during the past year. As Greg always says, the kernel continues to change faster each year than the last; this year we were running at around 8.5 changes an hour, with 10,000 lines of code added, 2,000 modified, and 2,500 removed every hour of every day.

The stunning growth and widespread adoption of Linux, however, also requires ever evolving methods of achieving compliance with the terms of our community’s chosen license, the GPL-2.0. At this point, there is no lack of clarity on the base compliance expectations of our community. Our goals as an ecosystem are to make sure new participants are made aware of those expectations and the materials available to assist them, and to help them grow into our community.  Some of us spend a lot of time traveling to different companies all around the world doing this, and lots of other people and groups have been working tirelessly to create practical guides for everyone to learn how to use Linux in a way that is compliant with the license. Some of these activities include:

Unfortunately the same processes that we use to assure fulfillment of license obligations and availability of source code can also be used unjustly in trolling activities to extract personal monetary rewards. In particular, issues have arisen as a developer from the Netfilter community, Patrick McHardy, has sought to enforce his copyright claims in secret and for large sums of money by threatening or engaging in litigation. Some of his compliance claims are issues that should and could easily be resolved. However, he has also made claims based on ambiguities in the GPL-2.0 that no one in our community has ever considered part of compliance.  

Examples of these claims include: when distributing over-the-air firmware, requiring a cell phone maker to deliver a paper copy of the source code offer letter; claiming that the source code server must be set up with a download speed as fast as the binary server, based on the “equivalent access” language of Section 3; requiring the GPL-2.0 text to be delivered in a local language; and many others.

How he goes about this activity was recently documented very well by Heather Meeker.

Numerous active contributors to the kernel community have tried to reach out to Patrick to have a discussion about his activities, to no response. Further, the Netfilter community suspended Patrick from contributing for violations of their principles of enforcement. The Netfilter community also published their own FAQ on this matter.

While the kernel community has always supported enforcement efforts to bring companies into compliance, we have never even considered enforcement for the purpose of extracting monetary gain.  It is not possible to know an exact figure due to the secrecy of Patrick’s actions, but we are aware of activity that has resulted in payments of at least a few million Euros.  We are also aware that these actions, which have continued for at least four years, have threatened the confidence in our ecosystem.

Because of this, and to help clarify what the majority of Linux kernel community members feel is the correct way to enforce our license, the Technical Advisory Board of the Linux Foundation has worked together with lawyers in our community, individual developers, and many companies that participate in the development of, and rely on Linux, to draft a Kernel Enforcement Statement to help address both this specific issue we are facing today, and to help prevent any future issues like this from happening again.

A key goal of all enforcement of the GPL-2.0 license has and continues to be bringing companies into compliance with the terms of the license. The Kernel Enforcement Statement is designed to do just that.  It adopts the same termination provisions we are all familiar with from GPL-3.0 as an Additional Permission giving companies confidence that they will have time to come into compliance if a failure is identified. Their ability to rely on this Additional Permission will hopefully re-establish user confidence and help direct enforcement activity back to the original purpose we have all sought over the years – actual compliance.  

Kernel developers in our ecosystem may put their own acknowledgement to the Statement by sending a patch to Greg adding their name to the Statement, like any other kernel patch submission, and it will be gladly merged. Those authorized to ‘ack’ on behalf of their company may add their company name in (parenthesis) after their name as well.

Note, a number of questions did come up when this was discussed with the kernel developer community. Please see Greg’s FAQ post answering the most common ones if you have further questions about this topic.

October 16, 2017 09:00 AM

Pavel Machek: Help time travelers!

Ok, so I have various machines here. It seems only about half of them have a working RTC. Those are the boring ones.

And even the boring ones have pretty imprecise RTCs... For example the Nokia N9. I only power it up from time to time, and I believe it drifts something like a minute per month... For normal use with a SIM card, it can probably correct itself from the GSM network if you happen to have a cell phone signal, but...

More interesting machines... The old Thinkpad is running without a CMOS battery. The ARM OLPC has _three_ RTCs, but not a single working one. The N900 has a working RTC but a missing or dead backup battery. On these, the RTC driver probably knows the time is not valid, but feeds the garbage into the system time anyway. Ouch. Neither the Sharp Zaurus SL-5500 nor the C-3000 had battery backup on the RTC...

Even in new end-user machines, time quality varies a lot. "First boot, please enter time" is only accurate to seconds, if the user is careful. The RTC is usually not very accurate, either... and no one uses adjtime these days. GSM time and ntpdate are probably accurate to milliseconds, GPS can provide time down to picoseconds... And broken systems are so common that "swclock" is available in init systems to store the time in a file, so it at least does not go backwards.

https (and other crypto) depends on time... so it is important to know at least the approximate month we are in.

Is it time we handle it better?

Could we return both time and log2(expected error) from system calls?

That way we could hide the clock in GUI if time is not available or not precise to minutes, ignore certificate dates when time is not precise to months, and you would not have to send me a "Pavel, are you time traveling, again?" message next time my mailer sends email dated to 1970.
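
Purely as a strawman for the proposal above, nothing like this exists today: pair the timestamp with a coarse error bound so callers can decide whether the clock is good enough for their purpose (the stand-in implementation just guesses a bound).

#include <stdio.h>
#include <time.h>

struct timespec_err {
    struct timespec ts;
    int log2_err_ns;    /* expected error: 2^log2_err_ns nanoseconds */
};

/* Hypothetical API: a real kernel would track the bound per clock
 * source (RTC quality, NTP sync state, GPS fix, ...). */
static int clock_gettime_err(clockid_t clk, struct timespec_err *out)
{
    out->log2_err_ns = 20;  /* stand-in guess: ~1ms */
    return clock_gettime(clk, &out->ts);
}

int main(void)
{
    struct timespec_err te;

    clock_gettime_err(CLOCK_REALTIME, &te);
    if (te.log2_err_ns > 36)    /* 2^36 ns is roughly a minute */
        printf("hide the clock, we only know the approximate time\n");
    else
        printf("time is good to ~2^%d ns\n", te.log2_err_ns);
    return 0;
}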

October 16, 2017 07:38 AM

October 14, 2017

James Bottomley: Using Elliptic Curve Cryptography with TPM2

One of the most significant advances going from TPM1.2 to TPM2 was the addition of algorithm agility: The ability of TPM2 to work with arbitrary symmetric and asymmetric encryption schemes.  In practice, in spite of this much vaunted agile encryption capability, most actual TPM2 chips I’ve seen only support a small number of asymmetric encryption schemes, usually RSA2048 and a couple of Elliptic Curves.  However, the ability to support any Elliptic Curve at all is a step up from TPM1.2.  This blog post will detail how elliptic curve schemes can be integrated into existing cryptographic systems using TPM2.  However, before we start on the practice, we need at least a tiny swing through the theory of Elliptic Curves.

What is an Elliptic Curve?

An Elliptic Curve (EC) is simply the set of points that lie on the curve in the two dimensional plane (x,y) defined by the equation

y² = x³ + ax + b

which means that every elliptic curve can be parametrised by two constants a and b.  The set of all points lying on the curve plus a point at infinity is combined with an addition operation to produce an abelian (commutative) group.  The addition property is defined by drawing straight lines between two points and seeing where they intersect the curve (or picking the infinity point if they don’t intersect).  Wikipedia has a nice diagrammatic description of this here.  The infinity point acts as the identity of the addition rule and the whole group is denoted E.

The utility for cryptography is that you can define an integer multiplier operation which is simply the element added to itself n times, so for P ∈ E, you can always find Q ∈ E such that

Q = P + P + P … = n × P

And, since it’s a simple multiplication-like operation, it’s very easy to compute Q.  However, given P and Q it is computationally very difficult to get back to n.  In fact, it can be demonstrated mathematically that trying to compute n is an instance of the discrete logarithm problem, the mathematical basis for the cryptographic security of schemes like Diffie-Hellman and DSA (RSA rests on the related hardness of integer factoring).  This also means that EC keys suffer the same (actually more so) problems as RSA keys: they’re not Quantum Computing secure (vulnerable to the quantum Shor’s algorithm) and they would be instantly compromised if the discrete logarithm problem were ever solved.

Therefore, for any elliptic curve, E, you can choose a known point G ∈ E, select a large integer d and you can compute a point P = d × G.  You can then publish (P, G, E) as your public key knowing it’s computationally infeasible for anyone to derive your private key d.

For instance, Diffie-Hellman key exchange can be done by agreeing (E, G) and getting Alice and Bob to select private keys dA, dB.  Then knowing Bob’s public key PB, Alice can select a random integer r, which she publishes, and compute a key agreement as a secret point on the Elliptic Curve (r dA) × PB.  Bob can derive the same Elliptic Curve point because

(r dA) × PB = (r dA dB) × G = (r dB dA) × G = (r dB) × PA

The agreement is a point on the curve, but you can use an agreed hashing or other mechanism to get from the point to a symmetric key.

Seems simple, but the problem for computing is that we really want to use integers and right at the moment the elliptic curve is defined over all the real numbers, meaning E is of infinite size and involves floating point computations (of rather large precision).

Elliptic Curves over Finite Fields

Fortunately there is a mathematical theory of finite fields, called Galois Theory, which allows us to take the Galois Field over prime number p, which is denoted GF(p), and compute Elliptic Curve points over this field.  This derivation, which is mathematically rather complicated, is denoted E(GF(p)), where every point (x,y) is represented by a pair of integers between 0 and p-1.  There is another theory that says the number of elements in E(GF(p))

n = |E(GF(p))|

is roughly the same size as p, meaning if you choose a 32-bit prime p, you likely have a group of roughly 2^32 elements.  For every point P in E(GF(p)) it is also mathematically provable that n × P = 0, where 0 is the zero point (which was the infinity point in the real elliptic curve).

This means that you can take any point, G,  in E(GF(p)) and compute a subgroup based on it:

EG = { m × G : m ∈ Zn }

If you’re lucky, |EG| = |E(GF(p))| and G is a generator of the entire group.  However, G may only generate a subgroup, and you will find |E(GF(p))| = h × |EG|, where the integer h is called the cofactor.  In general you want the cofactor to be small (preferably less than four) for EG to be cryptographically useful.

For a computer’s purposes, EG is the elliptic curve group used for integer arithmetic in the cryptographic algorithms.  The curve and generator are then defined by (p, a, b, Gx, Gy, n, h), which are the published parameters of the key (Gx, Gy represent the x and y elements of point G).  You select a random number d as your private key and your public key P = d × G exactly as above, except now P is easy to compute with integer operations.
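
To make the integer arithmetic concrete, here is a self-contained toy implementation over GF(17), using the textbook curve y² = x³ + 2x + 2 with generator G = (5,1) of order 19 (far too small to be secure, purely illustrative):

#include <stdio.h>

#define P 17    /* field prime */
#define A 2     /* curve: y^2 = x^3 + 2x + 2 over GF(17) */

typedef struct { long x, y; int inf; } pt;  /* inf: point at infinity */

static long mod(long v) { return ((v % P) + P) % P; }

/* modular inverse via Fermat's little theorem: v^(P-2) mod P */
static long inv(long v)
{
    long r = 1, b = mod(v), e = P - 2;

    while (e) {
        if (e & 1)
            r = r * b % P;
        b = b * b % P;
        e >>= 1;
    }
    return r;
}

/* the chord-and-tangent group addition described above */
static pt ec_add(pt p, pt q)
{
    pt r = { 0, 0, 1 };
    long lam;

    if (p.inf) return q;
    if (q.inf) return p;
    if (p.x == q.x && mod(p.y + q.y) == 0)
        return r;                                       /* P + (-P) = 0 */
    if (p.x == q.x)
        lam = mod((3 * p.x * p.x + A) * inv(2 * p.y));  /* tangent */
    else
        lam = mod((q.y - p.y) * inv(q.x - p.x));        /* chord */
    r.x = mod(lam * lam - p.x - q.x);
    r.y = mod(lam * (p.x - r.x) - p.y);
    r.inf = 0;
    return r;
}

/* double-and-add scalar multiplication: n × G */
static pt ec_mul(long n, pt g)
{
    pt r = { 0, 0, 1 };

    while (n) {
        if (n & 1)
            r = ec_add(r, g);
        g = ec_add(g, g);
        n >>= 1;
    }
    return r;
}

int main(void)
{
    pt g = { 5, 1, 0 };
    long d = 7;                 /* private key */
    pt pub = ec_mul(d, g);      /* public key P = d × G */

    printf("P = (%ld, %ld)\n", pub.x, pub.y);   /* prints (0, 6) */
    return 0;
}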

Problems with Elliptic Curves

Although I stated above that solving P = d × G is equivalent in difficulty to the discrete logarithm problem, that’s not generally true.  If the discrete logarithm problem were solved, then we’d easily be able to compute d for every generator and curve, but it is possible to pick curves for which d can be easily computed without solving the discrete logarithm problem. This is the reason why you should never pick your own curve parameters (even if you think you know what you’re doing) because it’s very easy to choose a compromised curve.  As a demonstration of the difficulty of the problem: each of the major nation state actors, Russia, China and the US, publishes their own curve parameters for use in their own cryptographic EC implementations and each of them thinks the parameters published by the others is compromised in a way that allows the respective national security agencies to derive private keys.  So if nation state actors can’t tell if a curve is compromised or not, you surely won’t be able to either.

Therefore, to be secure in EC cryptography, you pick an existing curve which has been vetted and select some random generator point on it.  Of course, if you’re paranoid, that means you won’t be using any of the nation state supplied curves …

Using the TPM2 with Elliptic Curves in Cryptosystems

The initial target for this work was the openssl cryptosystem, whose libraries are widely used by other projects (like https in apache, or openssh). Originally, when I did the initial TPM2 enabling of openssl as described in this blog post, I added TPM2 as a patch to the existing TPM 1.2 openssl_tpm_engine.  Unfortunately, openssl_tpm_engine seems to be pretty much defunct at this point, so I started my own openssl_tpm2_engine as a separate git tree to begin experimenting with Elliptic Curve keys (if you don’t use git, you can download the tar file here). One of the benefits of running my own source tree is that I can now add a testing infrastructure that makes use of the IBM TPM emulator to make sure that the basic cryptographic operations all work, which means that make check functions even when a TPM2 isn’t available.  The current key creation and import algorithms use secured connections to the TPM (to avoid eavesdropping), which means it’s only really possible to construct them using the IBM TSS. To make all of this easier, I’ve set up an openSUSE Build Service repository which is building for all major architectures and the openSUSE and Fedora distributions (ignore the failures; they’re currently induced because the TPM emulator only works on 64-bit little endian systems at the moment, so make check is failing, but the TPM people at IBM are working on this, so eventually the builds should be complete).

TPM2 itself also has some annoying restrictions, the biggest of which is that it doesn’t allow you to pass in arbitrary elliptic curve parameters; you may only use elliptic curves which the TPM itself knows.  This will be annoying if you have an existing EC key you’re trying to import, because the TPM may reject it as an unknown algorithm.  For instance, openssl can compute with arbitrary EC parameters, but has 39 elliptic curves parametrised by name. By contrast, my Nuvoton TPM2 inside my Dell XPS 13 knows precisely two curves:

jejb@jarvis:~> create_tpm2_key --list-curves
prime256v1
bnp256

However, assuming you’ve picked a compatible curve for your EC private key (and you’ve defined a parent key for the storage hierarchy) you can simply import it to a TPM bound key:

create_tpm2_key -p 81000001 -w key.priv key.tpm

The tool will report an error if it can’t convert the curve parameters to a named elliptic curve known to the TPM:

jejb@jarvis:~> openssl genpkey -algorithm EC -pkeyopt ec_paramgen_curve:brainpoolP256r1 > key.priv
jejb@jarvis:~> create_tpm2_key -p 81000001 -w key.priv key.tpm
TPM does not support the curve in this EC key
openssl_to_tpm_public failed with 166
TPM_RC_CURVE - curve not supported Handle number unspecified

You can also create TPM-resident private keys simply by specifying the algorithm:

create_tpm2_key -p 81000001 --ecc bnp256 key.tpm

Once you have your TPM based EC keys, you can use them to create public keys and certificates.  For instance, you can create a self-signed X509 certificate based on the tpm key with:

openssl req -new -x509 -sha256  -key key.tpm -engine tpm2 -keyform engine -out my.crt

Why you should use EC keys with the TPM

The initial attraction is the same as for RSA keys: making it impossible to extract your private key from the system.  However, the mathematical calculations for EC keys are much simpler than for RSA keys and don’t involve finding strong primes, so it’s much simpler for the TPM (being a fairly weak calculation machine) to derive private and public EC keys.  For instance, the times taken to derive an RSA key and an EC key from the primary seed differ dramatically:

jejb@jarvis:~> time tsscreateprimary -hi o -ecc bnp256 -st
Handle 80ffffff

real 0m0.111s
user 0m0.000s
sys 0m0.014s

jejb@jarvis:~> time tsscreateprimary -hi o -rsa -st
Handle 80ffffff

real 0m20.473s
user 0m0.015s
sys 0m0.084s

so for a slow system like the TPM, using EC keys is a significant speed advantage.  Additionally, there are other advantages.  The standard EC key signature algorithm is a modification of the NIST Digital Signature Algorithm called ECDSA.  However, DSA and ECDSA require a cryptographically strong (and secret) random number for each signature, as Sony found out to their cost in the EC key compromise of the PlayStation 3.  The TPM is a good source of cryptographically strong random numbers, and if it generates the signature internally, you can be absolutely sure of keeping the input random number secret.

Why you might want to avoid EC keys altogether

In spite of the many advantages described above, EC keys suffer one additional disadvantage over RSA keys in that Elliptic Curves in general are a very hot field of mathematical research, so even if the curve you use today is genuinely not compromised, it’s not impossible that a mathematical advance tomorrow will make the curve you chose (and thus all the private keys you generated) vulnerable.  Of course, the same goes for RSA if anyone ever cracks integer factoring, but solving that problem would likely be fully published to world acclaim and recognition as a significant contribution to the advancement of number theory.  Discovering an attack on a currently used elliptic curve, on the other hand, might be better remunerated by offering to sell it privately to one of the national security agencies …

October 14, 2017 10:52 PM

October 11, 2017

Paul E. Mc Kenney: Stupid RCU Tricks: In the audience for a pair of RCU talks!

I had the privilege of attending CPPCON last month. Michael Wong, Maged Michael, and I presented a parallel-programming overview, in which I presented the "Hardware and its Habits" chapter of Is Parallel Programming Hard, And, If So, What Can You Do About It?.

But the highlight for me was actually sitting in the audience for a pair of talks by people who had implemented RCU in C++.

Ansel Sermersheim presented a two-part talk entitled Multithreading is the answer. What is the question?. The second part of this talk covered lockless containers, and used a variant of RCU to implement a low-overhead libGuarded facility in order to more easily avoid deadlocks. The implementation is similar to the Linux-kernel real-time RCU implementation by Jim Houston and Joe Korty in that the counterpart to rcu_read_unlock() actively registers a quiescent state. Ansel's implementation goes further by also driving callback invocation from rcu_read_unlock(). Now I don't recommend this for a general-purpose RCU implementation due to the possibility of deadlock should a resource need to be held across rcu_read_unlock() and acquired within the callback. However, this approach should work just fine in the case where the callbacks just free memory and the memory allocator does not contain too many RCU read-side critical sections.

Fedor Pikus presented a talk entitled Read, Copy, Update, then what? RCU for non-kernel programmers, in which he gave a quite-decent introduction to use of RCU. This introduction included an improved version of my long-standing where-to-use-RCU diagram, which I fully intend to incorporate. I had a number of but-you-could moments, including the usual "put the size in with the array" advice, ways of updating things already exposed to readers, and the fact that RCU really can tolerate multiple writers, along with some concerns about counter overflow. Nevertheless, an impressive amount of great information in a one-hour talk!

It is very good to see more people making use of RCU!

October 11, 2017 09:47 PM

October 04, 2017

Dave Airlie (blogspot): radv: a conformant Vulkan driver (with caveats)

If you take a look at the conformant vulkan list, you might see entry 220.

Software in the Public Interest, Inc. 2017-10-04 Vulkan_1_0 220
AMD Radeon R9 285 Intel i5-4460 x86_64 Linux 4.13 X.org DRI3.

This is radv, and this is the first conformance submission done under the X.org (SPI) membership of the Khronos adopter program.

This submission was a bit of a trial run for the radv developers; it covers Mesa 17.2 + llvm 5.0 on Bas's R9 285 card.

We can extend this submission to cover all VI GPUs.

In practice we pass all the same tests on CIK and Polaris GPUs, but we will have to do complete submission runs on those when we get a chance.

But a major milestone/rubberstamp has been reached: radv is now a conformant Vulkan driver. Thanks go to Bas and all the other contributors and people whose code we've leveraged!

October 04, 2017 07:49 PM

October 02, 2017

James Morris: Linux Security Summit 2017 Roundup

The 2017 Linux Security Summit (LSS) was held last month in Los Angeles over the 14th and 15th of September.  It was co-located with Open Source Summit North America (OSSNA) and the Linux Plumbers Conference (LPC).

LSS 2017 sign at conference

Once again we were fortunate to have general logistics managed by the Linux Foundation, allowing the program committee to focus on organizing technical content.  We had a record number of submissions this year and accepted approximately one third of them.  Attendance was very strong, with ~160 attendees — another record for the event.

LSS 2017 Attendees

On the day prior to LSS, attendees were able to access a day of LPC, which featured two tracks with a security focus:

Many thanks to the LPC organizers for arranging the schedule this way and allowing LSS folk to attend the day!

Realtime notes were made of these microconfs via etherpad:

I was particularly interested in the topic of better integrating LSM with containers, as there is an increasingly common requirement for nesting of security policies, where each container may run its own apparently independent security policy, and also a potentially independent security model.  I proposed the approach of introducing a security namespace, where all security interfaces within the kernel are namespaced, including LSM.  It would potentially solve the container use-cases, and also the full LSM stacking case championed by Casey Schaufler (which would allow entirely arbitrary stacking of security modules).

This would be a very challenging project, to say the least, and one which is further complicated by containers not being a first class citizen of the kernel.  This leads to security policy boundaries clashing with semantic functional boundaries, e.g. what does it mean from a security policy POV when you have namespaced filesystems but not networking?

Discussion turned to the idea that it is up to the vendor/user to configure containers in a way which makes sense for them, and similarly, they would also need to ensure that they configure security policy in a manner appropriate to that configuration.  I would say this means that semantic responsibility is pushed to the user with the kernel largely remaining a set of composable mechanisms, in relation to containers and security policy.  This provides a great deal of flexibility, but requires those building systems to take a great deal of care in their design.

There are still many issues to resolve, both upstream and at the distro/user level, and I expect this to be an active area of Linux security development for some time.  There were some excellent followup discussions in this area, including an approach which constrains the problem space. (Stay tuned)!

A highlight of the TPMs session was an update on the TPM 2.0 software stack, by Philip Tricca and Jarkko Sakkinen.  The slides may be downloaded here.  We should see a vastly improved experience over TPM 1.x with v2.0 hardware capabilities, and the new software stack.  I suppose the next challenge will be TPMs in the post-quantum era?

There were further technical discussions on TPMs and container security during subsequent days at LSS.  Bringing the two conference groups together here made for a very productive event overall.

TPMs microconf at LPC with Philip Tricca presenting on the 2.0 software stack.

This year, due to the overlap with LPC, we unfortunately did not have any LWN coverage.  There are, however, excellent writeups available from attendees:

There were many awesome talks.

The CII Best Practices Badge presentation by David Wheeler was an unexpected highlight for me.  CII refers to the Linux Foundation’s Core Infrastructure Initiative , a preemptive security effort for Open Source.  The Best Practices Badge Program is a secure development maturity model designed to allow open source projects to improve their security in an evolving and measurable manner.  There’s been very impressive engagement with the project from across open source, and I believe this is a critically important effort for security.

CII Badge Project adoption (from David Wheeler’s slides).

During Dan Cashman’s talk on SELinux policy modularization in Android O,  an interesting data point came up:

Interesting data from the talk: 44% of Android kernel vulns blocked by SELinux due to attack surface reduction. https://t.co/FnU544B3XP

— James Morris (@xjamesmorris) September 15, 2017

We of course expect to see application vulnerability mitigations arising from Mandatory Access Control (MAC) policies (SELinux, Smack, and AppArmor), but if you look closely this refers to kernel vulnerabilities.   So what is happening here?  It turns out that a side effect of MAC policies, particularly those implemented in tightly-defined environments such as Android, is a reduction in kernel attack surface.  It is generally more difficult to reach such kernel vulnerabilities when you have MAC security policies.  This is a side-effect of MAC, not a primary design goal, but nevertheless appears to be very effective in practice!

Another highlight for me was the update on the Kernel Self Protection Project lead by Kees, which is now approaching its 2nd anniversary, and continues the important work of hardening the mainline Linux kernel itself against attack.  I would like to also acknowledge the essential and original research performed in this area by grsecurity/PaX, from which this mainline work draws.

From a new development point of view, I'm thrilled to see the progress being made by Mickaël Salaün on Landlock LSM, which provides unprivileged sandboxing via seccomp and LSM.  This is a novel approach which will allow applications to define and propagate their own sandbox policies.  Similar concepts are available in other OSs, such as macOS (Seatbelt) and OpenBSD (pledge).  The great thing about Landlock is its consolidation of two existing Linux kernel security interfaces: LSM and seccomp.  This ensures re-use of existing mechanisms, and aids usability by building on concepts already familiar to Linux users.
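Landlock's user-facing interface has gone through several revisions, so purely as a concept sketch of an application sandboxing itself, here is roughly what it looks like.  Note the names below follow the syscall ABI Landlock eventually settled on upstream, not the seccomp-attached proposal presented here, so treat this as an illustration of the idea rather than the presented API:

```c
/* Concept sketch: a process restricting itself to writing only under
 * /tmp.  Names follow the later upstream Landlock syscall ABI, not the
 * 2017 seccomp-based proposal.  Error handling omitted for brevity. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/landlock.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static int sandbox_self(void)
{
	/* Declare which access types this policy handles... */
	struct landlock_ruleset_attr ruleset_attr = {
		.handled_access_fs = LANDLOCK_ACCESS_FS_WRITE_FILE,
	};
	/* ...then allow file writes only beneath /tmp. */
	struct landlock_path_beneath_attr path_beneath = {
		.allowed_access = LANDLOCK_ACCESS_FS_WRITE_FILE,
		.parent_fd = open("/tmp", O_PATH | O_CLOEXEC),
	};
	int ruleset_fd = syscall(SYS_landlock_create_ruleset,
				 &ruleset_attr, sizeof(ruleset_attr), 0);

	syscall(SYS_landlock_add_rule, ruleset_fd,
		LANDLOCK_RULE_PATH_BENEATH, &path_beneath, 0);
	/* No privilege needed beyond no_new_privs: */
	prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
	return syscall(SYS_landlock_restrict_self, ruleset_fd, 0);
}
```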

Mickaël Salaün from ANSSI talking about his Landlock LSM work at #linuxsecuritysummit 2017 pic.twitter.com/wYpbHuLgm2

— LinuxSecuritySummit (@LinuxSecSummit) September 14, 2017

Overall I found it to be an incredibly productive event, with many new and interesting ideas arising and lots of great collaboration in the hallway, lunch, and dinner tracks.

Slides from LSS may be found linked to the schedule abstracts.

We did not have a video sponsor for the event this year, and we’ll work on that again for next year’s summit.  We have discussed holding LSS again next year in conjunction with OSSNA, which is expected to be in Vancouver in August.

We are also investigating a European LSS in addition to the main summit for 2018 and beyond, as a way to help engage more widely with Linux security folk.  Stay tuned for official announcements on these!

Thanks once again to the awesome event staff at LF, especially Jillian Hall, who ensured everything ran smoothly.  Thanks also to the program committee who review, discuss, and vote on every proposal, ensuring that we have the best content for the event, and who work on technical planning for many months prior to the event.  And of course thanks to the presenters and attendees, without whom there would literally and figuratively be no event :)

See you in 2018!

 

October 02, 2017 01:52 AM

September 25, 2017

Pavel Machek: Colorful LEDs

RGB LEDs do not exist according to the Linux LED subsystem. They are modeled as three separate LEDs: red, green and blue; that matches the hardware.

Unfortunately, this has problems. Let's begin with inconsistent naming: some drivers use a :r suffix, some use :red. There's no explicit grouping of the LEDs that make up one light -- there's no place to store parameters common to the light. (LEDs could be grouped by name.)

RGB colorspace is pretty well defined, and people expect to set specific colors. Unfortunately, that does not work well with LEDs. First, LEDs are usually not balanced according to the human perception system, so full power to the LEDs (255, 255, 255) may not result in white. Second, monitors normally use gamma correction before displaying color, so (128, 128, 128) does not correspond to 50% of the light being produced. But LEDs normally use PWM, so (128, 128, 128) does correspond to 50% light. The result is that colors are completely off.
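To make the PWM-vs-gamma mismatch concrete, here is a rough userspace sketch. The phone:{red,green,blue} LED names are hypothetical (as noted above, suffixes vary between drivers), and this only addresses gamma, not white-point balancing:

```c
/* Sketch: set an RGB LED via sysfs with gamma correction applied.
 * The LED names are made up for illustration.  Link with -lm. */
#include <math.h>
#include <stdio.h>

static void set_channel(const char *color, int value)
{
	/* PWM duty is linear in emitted light, so map the perceptual
	 * 0-255 value through a gamma curve (2.2 is a common choice). */
	int duty = (int)(255.0 * pow(value / 255.0, 2.2) + 0.5);
	char path[64];

	snprintf(path, sizeof(path),
		 "/sys/class/leds/phone:%s/brightness", color);
	FILE *f = fopen(path, "w");
	if (f) {
		fprintf(f, "%d\n", duty);
		fclose(f);
	}
}

static void set_rgb(int r, int g, int b)
{
	set_channel("red", r);
	set_channel("green", g);
	set_channel("blue", b);
}
```

With this mapping, a request of (128, 128, 128) drives the PWM at about 22% duty, which is roughly the light level a gamma-corrected monitor would produce for the same value.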

I tested the HSV colorspace for the LEDs. That would have the advantage that old triggers could still use selected colors... Unfortunately, on the N900, white is something like 15% blue, so a balanced white only ever drives the blue channel across roughly 38 of its 255 steps, which would significantly reduce the number of distinct white intensities we can display.

September 25, 2017 08:30 AM

September 19, 2017

Pavel Machek: Unicsy phone

For a long time, I wanted a phone that runs Unix. And I got that: first Android, then Maemo on the Nokia N900. With Android I realized that running a Linux kernel is not enough. Android is really far away from a normal Unix machine, and I'd argue away from anything usable, too. Maemo was slightly closer, and probably could have been fixed if it were open source.

But I realized the Linux kernel is not really the most important part. There's more to Unix: compatibility with old apps, small programs where each one does one thing well, data in text formats so you can put them in git. Maemo got some parts right; at least you could run old apps in a useful way. But the most important data on the phone (contacts, calendar) were still locked away in sqlite.

And that is something I'd like to change: phone that is ssh-friendly, text-editor-friendly and git-friendly. I call it "Unicsy phone". No, I don't want to do phone `cat addressbook | grep Friend | cut -f 1`... graphical utilities are okay. But console tools still should be there, and file formats should be reasonable.

So there is the tui project, and recently the postmarketos project appeared. The Nokia N900 is mostly supported by the mainline kernel (with the exceptions of bluetooth and the camera, everything works). There's work to be done, but it looks doable.

More is missing in the userspace. The phone parts need work, as expected. What is more surprising: there's emacs org mode, with great calendar capabilities, but I could not find a matching application to display the data nicely and provide alerts. The situation is even worse for contacts; emacs org can help there, too, but there does not seem to be agreement that this is the way to go. (And again, graphical applications would be nice.)

September 19, 2017 10:17 PM

September 16, 2017

Pavel Machek: FlightGear fun

How to die in a Boeing 707, quick and easy. Take off, realize that you should set up fuel heating, open the Help menu, aim for the checklists... and hit auto startup/shutdown instead. Instantly lose all the engines. Fortunately, you are at 6000', so you start looking for an airport. Then you realize "hmm, perhaps I can do the startup thing now", and hit the menu item once again. But instead of running engines, you get fire warnings on all the engines. That does not look good. Confirm the fire, extinguish all four engines, and resume looking for an airport in range. Trim for best glide. Then engine number 3 comes back up. Then number 4. Number one, and you know it will be easy. Number two comes back as you fly over the runway... go around and do a normal approach.

September 16, 2017 11:41 AM

September 15, 2017

Michael Kerrisk (manpages): man-pages-4.13 is released

I've released man-pages-4.13. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

This release resulted from patches, bug reports, reviews, and comments from around 40 contributors. The release is rather larger than average. (The context diff runs to more than 90k lines.) The release includes more than 350 commits and contains some fairly wide-ranging formatting fix-ups that meant that all 1028 existing manual pages saw some change(s). In addition, 5 new manual pages were added.

Among the more significant changes in man-pages-4.13 are the following:

A special thanks to Eugene Syromyatnikov, who contributed 30 patches to this release!

September 15, 2017 01:37 PM

Linux Plumbers Conference: Linux Plumbers Conference Unconference Schedule Announced

Since we only have six proposals, we can schedule them in the unconference session without any need for actual voting at breakfast.  On a purely random basis, the schedule will be:

Unconference I:

09:30 Test driven development (TDD) in the kernel – Knut Omang
11:00 Support for adding DT based thermal zones at runtime – Moritz Fischer
11:50 Restartable Sequences interaction with debugger single-stepping – Mathieu Desnoyers

Unconference II:

14:00 Automated testing of LKML patches with Clang – Nick Desaulniers
14:50 ktask: multithread cpu-intensive kernel work – Daniel Jordan
16:00 Soft Affinity for Workloads – Rohit Jain

I'll add these to the Plumbers schedule (if the author doesn't already have an account, I'll show up as the speaker, but please take the above list as definitive for the actual speakers).

Looking forward to seeing you all at this exciting new event for Plumbers,

September 15, 2017 04:57 AM

September 14, 2017

Grant Likely: Arcade Panel Construction Time-Lapse Video

September 14, 2017 05:56 PM

Grant Likely: NeoPixel Arcade Buttons

September 14, 2017 05:54 PM

September 13, 2017

Grant Likely: Custom Arcade Control Panels

I've started building custom arcade controls for use with classic arcade game emulators. All Open Source and Open Hardware of course, with the source code up on GitHub.

OpenSCAD:Arcade is an arcade panel modeling tool written in OpenSCAD. It designs the arcade panel layout and produces lasercutter output and frame dimensions.

STM32F3-Discovery-Arcade is a prototype USB HID device for arcade controls. It currently supports GPIO joysticks and buttons, quadrature trackballs/spinners, and will drive up to 4 channels of NeoPixel RGB LED strings. The project has both custom STM32 firmware and a custom adaptor PCB designed with KiCad.

Please go take a look.

September 13, 2017 11:39 PM

Gustavo F. Padovan: Slides of my talk at Open Source Summit NA

I just delivered a talk today at Open Source Summit NA, here in LA, about everything we’ve been doing to support explicit synchronization on the Media and Graphics pipeline in the kernel. You can find the slides here.

The DRM side is already mainline, but V4L2 is currently my focus of work along with the linux-media community in the kernel. Blog posts about that should appear soon on this blog.

September 13, 2017 07:56 PM

September 11, 2017

Linux Plumbers Conference: New to Plumbers: Unconference on Friday

The hallway track is always a popular feature of Linux Plumbers Conference.  New ideas and solutions emerge all the time.  But sometimes you start a discussion, and want to pull others in before the conference ends, and just can’t quite make it work.

This year, we're trying an experiment at Linux Plumbers and reserving a room for an unconference session on Friday, so that ad hoc problem-solving sessions can be held for the topics with the most participant interest.

If there is a topic you want to have a 1 hour discussion around,  please put it on the etherpad with:

 

Topic:  <something short>
Host(s): <person who will host the discussion>
Description:   <describe problem you want to talk about>

 

We'll close down the topic page on Thursday night at 8pm, print the collected topics out in the morning, and post them in the room. During the breakfast period (from 8 to 9am), those wanting to participate will be given four dots to vote. Vote by placing a dot on the topics of interest until 8:45am. Sessions will be scheduled starting with the topic that has the most dots and proceeding in descending order until we run out of sessions or time.

Schedule will be posted in the room on Friday morning.

September 11, 2017 05:54 PM

September 07, 2017

James Morris: Linux Plumbers Conference Sessions for Linux Security Summit Attendees

Folks attending the 2017 Linux Security Summit (LSS) next week may be also interested in attending the TPMs and Containers sessions at Linux Plumbers Conference (LPC) on the Wednesday.

The LPC TPMs microconf will be held in the morning and led by Matthew Garrett, while the containers microconf will be run by Stéphane Graber in the afternoon.  Several security topics will be discussed in the containers session, including namespacing and stacking of LSMs, and namespacing of IMA.

Attendance on the Wednesday for LPC is at no extra cost for registered attendees of LSS.  Many thanks to the LPC organizers for arranging this!

There will be followup BOF sessions on LSM stacking and namespacing at LSS on Thursday, per the schedule.

This should be a very productive week for Linux security development: see you there!

September 07, 2017 01:44 AM

September 06, 2017

Linux Plumbers Conference: Linux Plumbers Conference Preliminary Schedule Published

You can see the schedule by clicking on the 'schedule' tab above or by going to this URL:

http://www.linuxplumbersconf.org/2017/ocw/events/LPC2017/schedule

If you’d like any changes, please email contact@linuxplumbersconf.org and we’ll see what we can do to accommodate your request.

Please also remember that the schedule is subject to change.

September 06, 2017 10:01 PM

Greg Kroah-Hartman: 4.14 == This year's LTS kernel

As the 4.13 release has now happened, the merge window for the 4.14 kernel release is now open. I mentioned this many weeks ago, but as the word doesn’t seem to have gotten very far based on various emails I’ve had recently, I figured I need to say it here as well.

So, here it is officially, 4.14 should be the next LTS kernel that I’ll be supporting with stable kernel patch backports for at least two years, unless it really is a horrid release and has major problems. If so, I reserve the right to pick a different kernel, but odds are, given just how well our development cycle has been going, that shouldn’t be a problem (although I guess I just doomed it now…)

As always, if people have questions about this, email me and I will be glad to discuss it, or talk to me in person next week at the LinuxCon^WOpenSourceSummit or Plumbers conference in Los Angeles, or at any of the other conferences I’ll be at this year (ELCE, Kernel Recipes, etc.)

September 06, 2017 02:41 PM

September 05, 2017

Kees Cook: security things in Linux v4.13

Previously: v4.12.

Here's a short summary of some of the interesting security things in Sunday's v4.13 release of the Linux kernel:

security documentation ReSTification
The kernel has been switching to formatting documentation with ReST, and I noticed that none of the Documentation/security/ tree had been converted yet. I took the opportunity to take a few passes at formatting the existing documentation and, at Jon Corbet’s recommendation, split it up between end-user documentation (which is mainly how to use LSMs) and developer documentation (which is mainly how to use various internal APIs). A bunch of these docs need some updating, so maybe with the improved visibility, they’ll get some extra attention.

CONFIG_REFCOUNT_FULL
Since Peter Zijlstra implemented the refcount_t API in v4.11, Elena Reshetova (with Hans Liljestrand and David Windsor) has been systematically replacing atomic_t reference counters with refcount_t. As of v4.13, there are now close to 125 conversions with many more to come. However, there were concerns over the performance characteristics of the refcount_t implementation from the maintainers of the net, mm, and block subsystems. In order to assuage these concerns and help the conversion progress continue, I added an “unchecked” refcount_t implementation (identical to the earlier atomic_t implementation) as the default, with the fully checked implementation now available under CONFIG_REFCOUNT_FULL. The plan is that for v4.14 and beyond, the kernel can grow per-architecture implementations of refcount_t that have performance characteristics on par with atomic_t (as done in grsecurity’s PAX_REFCOUNT).
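A typical conversion has this shape (struct obj and its helpers are invented here for illustration; the refcount_* calls are the real API):

```c
/* Sketch of a typical atomic_t -> refcount_t conversion. */
#include <linux/refcount.h>
#include <linux/slab.h>

struct obj {
	refcount_t refs;	/* was: atomic_t refs; */
	/* ... payload ... */
};

static void obj_init(struct obj *o)
{
	refcount_set(&o->refs, 1);
}

static void obj_get(struct obj *o)
{
	/* With CONFIG_REFCOUNT_FULL, saturates and WARNs on overflow. */
	refcount_inc(&o->refs);
}

static void obj_put(struct obj *o)
{
	if (refcount_dec_and_test(&o->refs))	/* true when count hits 0 */
		kfree(o);
}
```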

CONFIG_FORTIFY_SOURCE
Daniel Micay created a version of glibc's FORTIFY_SOURCE compile-time and run-time protection for finding overflows in the common string (e.g. strcpy, strcmp) and memory (e.g. memcpy, memcmp) functions. The idea is that since the compiler already knows the size of many of the buffer arguments used by these functions, it can build in checks for buffer overflows. When all the sizes are known at compile time, this can actually allow the compiler to fail the build instead of continuing with a proven overflow. When only some of the sizes are known (e.g. destination size is known at compile-time, but source size is only known at run-time) run-time checks are added to catch any cases where an overflow might happen. Adding this found several places where minor leaks were happening, and Daniel and I chased down fixes for them.
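The core trick can be sketched like so. This is a simplified illustration rather than the kernel's actual fortified string header, and fortify_overflow() is a made-up reporting hook; the real implementation can also fail the build outright when both sizes are compile-time constants:

```c
/* Simplified sketch of the FORTIFY_SOURCE idea: use the compiler's
 * knowledge of object sizes to police copies at run time. */
#include <string.h>

extern void fortify_overflow(const char *func);	/* hypothetical hook */

#define checked_memcpy(dst, src, n) ({				\
	size_t __bos = __builtin_object_size(dst, 0);		\
	/* (size_t)-1 means the compiler couldn't tell */	\
	if (__bos != (size_t)-1 && (n) > __bos)			\
		fortify_overflow(__func__);			\
	memcpy(dst, src, n);					\
})
```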

One interesting note about this protection is that it only examines the size of the whole object (via __builtin_object_size(..., 0)). If you have a string within a structure, CONFIG_FORTIFY_SOURCE as currently implemented will only make sure that you can't copy beyond the structure (so you can still overflow the string within the structure). The next step in enhancing this protection is to switch from 0 (above) to 1, which will use the closest surrounding subobject (e.g. the string). However, there are a lot of cases where the kernel intentionally copies across multiple structure fields, which means more fixes are needed before this higher level can be enabled.
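A small example makes the distinction concrete (struct msg is invented for illustration):

```c
/* Illustration of __builtin_object_size() modes for an intra-object
 * overflow.  For a pointer to m->label within struct msg *m:
 *
 *   __builtin_object_size(m->label, 0) == 128  (bytes to end of object)
 *   __builtin_object_size(m->label, 1) == 8    (closest subobject)
 *
 * So with mode 0, a 100-byte strcpy() into m->label passes the check
 * yet silently overruns label[] into body[] -- exactly the intra-object
 * case described above. */
struct msg {
	char label[8];
	char body[120];
};
```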

NULL-prefixed stack canary
Rik van Riel and Daniel Micay changed how the stack canary is defined on 64-bit systems to always make sure that the leading byte is zero. This provides a deterministic defense against overflowing string functions (e.g. strcpy), since they will either stop an overflowing read at the NULL byte, or be unable to write a NULL byte, thereby always triggering the canary check. This does reduce the entropy from 64 bits to 56 bits for overflow cases where NULL bytes can be written (e.g. memcpy), but the trade-off is worth it. (Besides, x86_64's canary was 32 bits until recently.)
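The change amounts to roughly the following (a sketch of the idea, not the exact kernel code):

```c
#include <linux/random.h>

/* Zero the low-order byte of the stack canary.  On little-endian
 * x86_64 that is the first byte in memory, so an overflowing string
 * copy either stops at the NUL or destroys it, and the canary check
 * fails either way. */
static inline unsigned long get_random_canary(void)
{
	return get_random_long() & ~0xffUL;
}
```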

IPC refactoring
Partially in support of allowing IPC structure layouts to be randomized by the randstruct plugin, Manfred Spraul and I reorganized the internal layout of how IPC is tracked in the kernel. The resulting allocations are smaller and much easier to deal with, even if I initially missed a few needed container_of() uses.

randstruct gcc plugin
I ported grsecurity’s clever randstruct gcc plugin to upstream. This plugin allows structure layouts to be randomized on a per-build basis, providing a probabilistic defense against attacks that need to know the location of sensitive structure fields in kernel memory (which is most attacks). By moving things around in this fashion, attackers need to perform much more work to determine the resulting layout before they can mount a reliable attack.

Unfortunately, due to the timing of the development cycle, only the “manual” mode of randstruct landed in upstream (i.e. marking structures with __randomize_layout). v4.14 will also have the automatic mode enabled, which randomizes all structures that contain only function pointers.
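In practice the manual annotation looks like this (both structures are invented for illustration):

```c
#include <linux/types.h>

/* Manual mode: opt a structure into per-build layout randomization
 * with the __randomize_layout annotation. */
struct conn_state {
	unsigned long	token;
	void		*buf;
	size_t		buflen;
} __randomize_layout;

/* v4.14's automatic mode will randomize structures consisting purely
 * of function pointers, like this one, without any annotation: */
struct conn_ops {
	int (*connect)(void *ctx);
	int (*disconnect)(void *ctx);
};
```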

A large number of fixes to support randstruct have been landing from v4.10 through v4.13, most of which were already identified and fixed by grsecurity, but many were novel, either in newly added drivers, in whitelisted cross-structure casts, in refactorings (like the IPC one noted above), or in a corner case on ARM found during upstream testing.

lower ELF_ET_DYN_BASE
One of the issues identified from the Stack Clash set of vulnerabilities was that it was possible to collide stack memory with the highest portion of a PIE program’s text memory since the default ELF_ET_DYN_BASE (the lowest possible random position of a PIE executable in memory) was already so high in the memory layout (specifically, 2/3rds of the way through the address space). Fixing this required teaching the ELF loader how to load interpreters as shared objects in the mmap region instead of as a PIE executable (to avoid potentially colliding with the binary it was loading). As a result, the PIE default could be moved down to ET_EXEC (0x400000) on 32-bit, entirely avoiding the subset of Stack Clash attacks. 64-bit could be moved to just above the 32-bit address space (0x100000000), leaving the entire 32-bit region open for VMs to do 32-bit addressing, but late in the cycle it was discovered that Address Sanitizer couldn’t handle it moving. With most of the Stack Clash risk only applicable to 32-bit, fixing 64-bit has been deferred until there is a way to teach Address Sanitizer how to load itself as a shared object instead of as a PIE binary.

early device randomness
I noticed that early device randomness wasn’t actually getting added to the kernel entropy pools, so I fixed that to improve the effectiveness of the latent_entropy gcc plugin.

That’s it for now; please let me know if I missed anything. As a side note, I was rather alarmed to discover that due to all my trivial ReSTification formatting, and tiny FORTIFY_SOURCE and randstruct fixes, I made it into the most active 4.13 developers list (by patch count) at LWN with 76 patches: a whopping 0.6% of the cycle’s patches. ;)

Anyway, the v4.14 merge window is open!

© 2017, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.
Creative Commons License

September 05, 2017 11:01 PM

Linux Plumbers Conference: Testing and Fuzzing Microconference Accepted into the Linux Plumbers Conference

We’re pleased to announce that newcomer Microconference Testing and Fuzzing will feature at Plumbers in Los Angeles this year.

The agenda will feature the three fuzzers used for the Linux kernel (Trinity, Syzkaller and Perf), along with discussions of formal verification tools, how to test stable trees, testing frameworks, and a discussion and demonstration of the drm/i915 check-in and test infrastructure.

Additionally, we will hold a session aimed at improving the testing process for linux-stable and distro kernels. Please plan to attend if you have input into how to integrate additional testing and make these kernels more reliable. Participants will include Greg Kroah-Hartman and major distro kernel maintainers.

For more details on this, please see this microconference’s wiki page.

We hope to see you there!

September 05, 2017 08:04 AM

August 26, 2017

Matt Domsch: Twilio Voice to Pagerduty alert using Python Flask, Zappa, AWS Lambda & AWS API Gateway

My SaaS product DevOps team at Quest Software uses several monitoring services to notice problems (hopefully before end users see them), and raises alerts for our team using PagerDuty. We also frequently need to integrate with existing company and partner products, for example our internal helpdesk and customer-facing technical-support processes. In this case, the helpdesk team wanted to have a phone number they could call to raise an alert to our team. The first suggestion was to simply put my name down as the 24×7 on-call contact, and make it my problem to alert the right people. I scoffed. We already had PagerDuty in place – why couldn't we use that too? Simply because we didn't have a phone number hooked up to PagerDuty. So, let's fix that.

A few searches quickly turned up a PagerDuty blog where David Hayes had done exactly this. Excellent! However, it was written to use Google App Engine, and my team has their processes predominantly in Azure and AWS. I didn't want to introduce yet another set of cloud services for something conceptually so simple.

Twilio’s quickstarts do a nice job of showing how to use their API, and these use Flask for the web framework. How can I use Flask apps in AWS Lambda? Here enters Zappa, a tool for deploying Flask and Django apps into AWS Lambda & AWS API Gateway. Slick! Now I have all the pieces I need.

You can find the code on github. I've extended the quickstarts slightly, to have the phone response first prompt for the application that is experiencing issues, and record that in a session cookie to be retrieved later. Then it prompts the user to leave a message. With these two pieces of information, we have enough to create the PagerDuty incident for the proper application, including information about the caller gathered from Caller ID (in case the recording is garbled), and a link to the recorded message. Not too shabby for ~125 lines of "my" code, at a cost of ~$1/month to Twilio for the phone number, almost $0.00 for AWS, and a couple pennies if anyone actually calls to raise an alert.

August 26, 2017 07:19 PM

August 24, 2017

Pete Zaitcev: Oh not again

Fedora is mulling dropping 32-bit x86 again, after F26, which means I need to buy a new router. It's not like I cannot afford one... But it's such a hassle to migrate. I'm thinking about installing one in the background and then re-numbering it, in order to minimize issues. Even then, I cannot test, for instance, that VLANs work right until I actually phase the box into production. It's much easier to keep a compatible 32-bit box mirrored and ready on stand-by.

In a sense, the amazing ease of upgrades in modern Fedora lulled me into this. Before, I re-installed anyway, and so could roll 64-bit just as easily.

P.S. According to records at the hoster, my primary public VM was installed as Fedora 15 and has been continuously upgraded since then.

August 24, 2017 07:21 PM

August 17, 2017

Linux Plumbers Conference: Tracing/BPF Microconference Accepted into the Linux Plumbers Conference

Following on from the successful Tracing Microconference last year, we’re pleased to announce there will be a follow on at Plumbers in Los Angeles this year.

The agenda for this year will not focus only on tracing, but will also include several topics around eBPF, since eBPF now interacts with tracing and there is still a lot of work to accomplish there, such as building an infrastructure around the current tools to compile and utilize eBPF within the tracing framework. Topics outside of eBPF will include enhancing uprobes and tracing virtualized and layered environments. Of particular interest are new techniques to improve kernel to user space tracing integration. This includes usage of uftrace and better symbol resolution of user space addresses from within the kernel. Additionally there will be a discussion on the challenges of real-world use cases by non-kernel engineers.

For more details on this, please see this microconference’s wiki page.

We hope to see you there!

August 17, 2017 04:51 PM

August 16, 2017

Linux Plumbers Conference: Trusted Platform Module Microconference Accepted into the Linux Plumbers Conference

Following on from the TPM Microconference last year, we’re pleased to announce there will be a follow on at Plumbers in Los Angeles this year.

The agenda for this year will focus on a renewed attempt to unify the 2.0 TSS; cryptosystem integration to make TPMs just work for the average user; the current state of measured boot and where we’re going; using TXT with TPM in Linux and using TPM from containers.

For more details on this, please see this microconference’s wiki page

We hope to see you there!

August 16, 2017 12:01 AM

August 14, 2017

Dave Airlie (blogspot): radv on SI and CIK GPU - update

I recently acquired an r7 360 (BONAIRE) and spent some time getting radv stable and passing the same set of conformance tests that VI and Polaris pass.

The main missing piece was 10-bit integer format clamping to work around a bug in the SI/CIK fragment shader output hardware, where it truncates instead of clamping. The other missing piece was code for handling f16->f32 conversions according to the Vulkan spec, which I'd previously fixed for VI.

I also looked at a trace from amdgpu-pro and noticed it was using ds_swizzle for the derivative calculations, which avoided accessing LDS memory. I wrote support to use this path for radv/radeonsi, since LLVM has supported the intrinsic for a while now.

With these fixed CIK is pretty much in the same place as VI/Polaris.

I then plugged in my SI (Tahiti), and got lots of GPU hangs and crashes. I fixed a number of SI-specific bugs (tiling and MSAA handling, stencil tiling). However, even with those fixed I was getting random hangs, and a bunch of people on a bugzilla had noticed the same thing. I eventually discovered that adding a shader pipeline and cache flush at the end of every command buffer avoided the hangs (this took a few days to narrow down exactly). We aren't 100% sure why this is required on SI only; it may be a kernel bug or a command processor bug, but it does mean radv on SI can now run games without hanging.

There are still a few CTS tests outstanding on SI only, and I'll probably get to them eventually, however I also got an RX Vega and once I get a newer BIOS for it from AMD I shall be spending some time fixing the radv support for it.

August 14, 2017 03:16 AM

August 10, 2017

Linux Plumbers Conference: Scheduler Workloads Microconference Accepted into the Linux Plumbers Conference

New to Linux Plumbers Conference this year, the Scheduler Workloads Microconference will focus on understanding various workloads and their impact on the Linux Kernel Scheduler. The objective is to initiate a cross organizational and architectural discussion involving currently available (or in development) benchmarks and their effectiveness in evaluating the scheduler for these workloads.

The agenda for this year will focus on sharing current workload and benchmark tools and traces and how these can be used to improve the various Linux subsystems, including power management and real time.  Given that benchmarking the Linux scheduler is a controversial topic and often depends on proprietary tools, we’ll also discuss how to develop fully open source tools and benchmarks for this.

For more details on this, please see this microconference’s wiki page.

We hope to see you there!

August 10, 2017 07:10 PM

August 08, 2017

Daniel Vetter: Why Github can't host the Linux Kernel Community

A while back at the awesome maintainerati I chatted with a few great fellow maintainers about how to scale really big open source projects, and how github forces projects into a certain way of scaling. The linux kernel has an entirely different model, which maintainers hosting their projects on github don’t understand, and I think it’s worth explaining why and how it works, and how it’s different.

Another motivation to finally get around to typing this all up is the HN discussion on my “Maintainers Don’t Scale” talk, where the top comment boils down to “… why don’t these dinosaurs use modern dev tooling?”. A few top kernel maintainers vigorously defend mailing lists and patch submissions over something like github pull requests, but at least some folks from the graphics subsystem would love more modern tooling which would be much easier to script. The problem is that github doesn’t support the way the linux kernel scales out to a huge number of contributors, and therefore we can’t simply move, not even just a few subsystems. And this isn’t about just hosting the git data, that part obviously works, but how pull requests, issues and forks work on github.

Scaling, the Github Way

Git is awesome, because everyone can fork and create branches and hack on the code very easily. And eventually you have something good, and you create a pull request for the main repo and get it reviewed, tested and merged. And github is awesome, because it figured out a UI that makes this complex stuff all nice&easy to discover and learn about, and so makes it a lot simpler for new folks to contribute to a project.

But eventually a project becomes a massive success, and no amount of tagging, labelling, sorting, bot-herding and automating will be able to keep on top of all the pull requests and issues in a repository, and it's time to split things up into more manageable pieces again. More importantly, with a certain size and age of a project different parts need different rules and processes: the shiny new experimental library has different stability and CI criteria than the main code, and maybe you have some dumpster pile of deprecated plugins that aren't supported but that you can't yet delete: you need to split up your humongous project into sub-projects, each with their own flavour of process and merge criteria and their own repo with their own pull request and issue tracking. Generally it takes a few tens to a few hundreds of full-time contributors until the pain is big enough that such a huge reorganization is necessary.

Almost all projects hosted on github do this by splitting up their monorepo source tree into lots of different projects, each with its distinct set of functionality. Usually that results in a bunch of things that are considered the core, plus piles of plugins and libraries and extensions. All tied together with some kind of plugin or package manager, which in some cases directly fetches stuff from github repos.

Since almost every big project works like this I don't think it's necessary to delve into the benefits. But I'd like to highlight some of the issues this is causing:

Interlude: Why Pull Requests Exist

The linux kernel is one of the few projects I'm aware of which isn't split up like this. Before we look at how that works - the kernel is a huge project and simply can't be run without some sub-project structure - I think it's interesting to look at why git does pull requests at all: on github the pull request is the one true way for contributors to get their changes merged. But in the kernel, changes are submitted as patches sent to mailing lists, even long after git has been widely adopted.

But the very first version of git supported pull requests. The audience of these first, rather rough, releases was kernel maintainers; git was written to solve Linus Torvalds' maintainer problems. Clearly it was needed and useful, but not to handle changes from individual contributors: even today, and much more so back then, pull requests are used to forward the changes of an entire subsystem, or to synchronize code refactoring or similar cross-cutting changes across different sub-projects. As an example, the 4.12 network pull request from David S. Miller, committed by Linus: it contains 2k+ commits from 600 contributors and a bunch of merges for pull requests from subordinate maintainers. But almost all the patches themselves are committed by maintainers after picking up the patches from mailing lists, not by the authors themselves. This kernel process peculiarity that authors generally don't commit into shared repositories is also why git tracks the committer and author separately.

Github’s innovation and improvement was then to use pull requests for everything, down to individual contributions. But that wasn’t what they were originally created for.

Scaling, the Linux Kernel Way

At first glance the kernel looks like a monorepo, with everything smashed into one place in Linus’ main repo. But that’s very far from it:

At first this just looks like a complicated way to fill everyone’s disk space with lots of stuff they don’t care about, but there’s a pile of compounding minor benefits that add up:

In short, I think this is a strictly more powerful model, since you can always fall back to doing things exactly like you would with multiple disjoint repositories. Heck, there are even kernel drivers which are in their own repository, disjoint from the main kernel tree, like the proprietary Nvidia driver. Well, it's just a bit of source code glue around a blob, but since it can't contain anything from the kernel for legal reasons, it is the perfect example.

This looks like a monorepo horror show!

Yes and no.

At first glance the linux kernel looks like a monorepo because it contains everything. And lots of people learned that monorepos are really painful, because past a certain size they just stop scaling.

But looking closer, it’s very, very far away from a single git repository. Just looking at the upstream subsystem and driver repositories gives you a few hundred. If you look at the entire ecosystem, including hardware vendors, distributions, other linux-based OS and individual products, you easily have a few thousand major repositories, and many, many more in total. Not counting any git repo that’s just for private use by individual contributors.

The crucial distinction is that linux has one single file hierarchy as the shared namespace across everything, but lots and lots of different repos for all the different pieces and concerns. It’s a monotree with multiple repositories, not a monorepo.

Examples, please!

Before I go into explaining why github cannot currently support this workflow, at least if you want to retain the benefits of the github UI and integration, we need some examples of how this works in practice. The short summary is that it’s all done with git pull requests between maintainers.

The simple case is percolating changes up the maintainer hierarchy, until it eventually lands in a tree somewhere that is shipped. This is easy, because the pull request only ever goes from one repository to the next, and so could be done already using the current github UI.

Much more fun are cross-subsystem changes, because then the pull request flow stops being an acyclic graph and morphs into a mesh. The first step is to get the changes reviewed and tested by all the involved subsystems and their maintainers. In the github flow this would be a pull request submitted to multiple repositories simultaneously, with the one single discussion stream shared among them all. Since this is the kernel, this step is done through patch submission with a pile of different mailing lists and maintainers as recipients.

The way it’s reviewed is usually not the way it’s merged, instead one of the subsystems is selected as the leading one and takes the pull requests, as long as all other maintainers agree to that merge path. Usually it’s the subsystem most affected by a set of changes, but sometimes also the one that already has some other work in-flight which conflicts with the pull request. Sometimes also an entirely new repository and maintainer crew is created, this often happens for functionality which spans the entire tree and isn’t neatly contained to a few files and directories in one place. A recent example is the DMA mapping tree, which tries to consolidate work that thus far has been spread across drivers, platform maintainers and architecture support groups.

But sometimes there’s multiple subsystems which would both conflict with a set of changes, and which would all need to resolve some non-trivial merge conflict. In that case the patches aren’t just directly applied (a rebasing pull request on github), but instead the pull request with just the necessary patches, based on a commit common to all subsystems, is merged into all subsystem trees. The common baseline is important to avoid polluting a subsystem tree with unrelated changes. Since the pull is for a specific topic only, these branches are commonly called topic branches.

One example I was involved with added code for audio-over-HDMI support, which spanned both the graphics and sound driver subsystems. The same commits from the same pull request were merged into both the Intel graphics driver tree and the sound subsystem tree.

An entirely different argument that this isn't insane: the only other relevant general-purpose large-scale OS project in the world also decided to have a monotree, with a commit flow modelled similarly to what's going on in linux. I'm talking about the folks with such a huge tree that they had to write an entire new GVFS virtual filesystem provider to support it …

Dear Github

Unfortunately github doesn't support this workflow, at least not natively in the github UI. It can of course be done with just plain git tooling, but then you're back to patches on mailing lists and pull requests over email, applied manually. In my opinion that's the one real reason why the kernel community cannot benefit from moving to github. There's also the minor issue of a few top maintainers being extremely outspoken against github in general, but that's not really a technical issue. And it's not just the linux kernel; it's all huge projects on github in general which struggle with scaling, because github doesn't really give them the option to scale to multiple repositories while sticking with a monotree.

In short, I have one simple feature request to github:

Please support pull requests and issue tracking spanning different repos of a monotree.

Simple idea, huge implications.

Repositories and Organizations

First, it needs to be possible to have multiple forks of the same repo in one organization. Just look at git.kernel.org: most of these repositories are not personal. And even if you might have different organizations for e.g. different subsystems, requiring an organization for each repo is silly amounts of overkill and just makes access and user management unnecessarily painful. In graphics for example we'd have 1 repo each for the userspace test suite, the shared userspace library, and a common set of tools and scripts used by maintainers and developers, which would work in github. But then we'd have the overall subsystem repo, plus a repository for core subsystem work and additional repositories for each big driver. Those would all be forks, which github doesn't do. And each of these repos has a bunch of branches, at least one for feature work, and another one for bugfixes for the current release cycle.

Combining all branches into one repository wouldn’t do, since the point of splitting repos is that pull requests and issues are separated, too.

Related, it needs to be possible to establish the fork relationship after the fact. For new projects who've always been on github this isn't a big deal. But linux will be able to move at most a subsystem at a time, and there are already tons of linux repositories on github which aren't proper github forks of one another.

Pull Requests

Pull requests need to be attached to multiple repos at the same time, while keeping one unified discussion stream. You can already reassign a pull request to a different branch of a repo, but not to multiple repositories at the same time. Reassigning pull requests is really important, since new contributors will just create pull requests against what they think is the main repo. Bots can then shuffle those around to all the repos listed in e.g. a MAINTAINERS file for a given set of files and changes a pull request contains. When I chatted with githubbers I originally suggested they'd implement this directly. But I think as long as it's all scriptable that's better left to individual projects, since there's no real standard.

There's a pretty funky UI challenge here, since the patch list might be different depending upon the branch the pull request is against. But that's not always a user error; one repo might simply have merged a few patches already.

Also, the pull request status needs to be different for each repo. One maintainer might close it without merging, since they agreed that the other subsystem will pull it in, while the other maintainer will merge and close the pull. Another tree might even close the pull request as invalid, since it doesn’t apply to that older version or vendor fork. Even more fun, a pull request might get merged multiple times, in each subsystem with a different merge commit.

Issues

Like pull requests, issues can be relevant for multiple repos, and might need to be moved around. An example would be a bug that’s first reported against a distribution’s kernel repository. After triage it’s clear it’s a driver bug still present in the latest development branch and hence also relevant for that repo, plus the main upstream branch and maybe a few more.

Status should again be separate, since a push to one repo doesn't make the bugfix instantly available in all of them. It might even need additional work to get backported to older kernels or distributions, and some might even decide that's not worth it and close it as WONTFIX, even though it's marked as successfully resolved in the relevant subsystem repository.

Summary: Monotree, not Monorepo

The Linux Kernel is not going to move to github. But bringing the Linux way of scaling, a monotree with multiple repos, to github as a concept will be really beneficial for all the huge projects already there: it'll give them a new, and in my opinion more powerful, way to handle their unique challenges.

August 08, 2017 12:00 AM

August 07, 2017

Paul E. Mc Kenney: Book review: "Antifragile: Things That Gain From Disorder"

This is the fourth and final book in Nassim Taleb's Incerto series, which makes a case for antifragility as a key component of design, taking the art of design one step beyond robustness. An antifragile system is one where variation, chaos, stress, and errors improve the results. For example, within limits, stressing muscles and bones makes them stronger. In contrast, stressing a device made of (say) aluminum will eventually cause it to fail. Taleb gives a lengthy list of examples in Table 1 starting on page 23, some of which seem more plausible than others. One implausible entry, for example, lists rule-based systems as fragile, principles-based systems as robust, and virtue-based systems as antifragile. Although I can imagine a viewpoint where this makes sense, any expectation that a significantly large swath of present-day society will agree on a set of principles (never mind virtues!) seems insanely optimistic. The table nevertheless provides much good food for thought.

Taleb states that he has constructed antifragile financial strategies using insurance to control downside risks. But he also states on page 6 “Thou shalt not have antifragility at the expense of the fragility of others.” Perhaps Taleb figures that few will shed tears for any difficulties that insurance companies might get into, perhaps he is taking out policies that are too small to have material effect on the insurance company in question, or perhaps his policies are counter to the insurance company's main business, so that payouts to Taleb are anticorrelated with payouts to the company's other customers. One presumes that he has thought this through carefully, because a bankrupt insurance company might not be all that effective at controlling his downside risks.

Appendix I beginning on page 435 gives a graphical summary of the book's main messages. Figure 28 on page 441 is good grist for the mills of those who would like humanity to become an intergalactic species: After all, confining the human race seems likely to limit its upside. (One counterargument would posit that a finite object might have unbounded value, but such counterarguments typically rely on there being a very large number of human beings interested in that finite object, which some would consider to counter this counterargument.)

The right-hand portion of Figure 30 on page 442 illustrates what the author calls local antifragility and global fragility. To see this, imagine that the x-axis represents variation from nominal conditions, and the y-axis represents payoff, with large positive payoffs being highly desired. The right-hand portion shows something not unrelated to the function x^2-x^4, which gives higher payoffs as you move in either direction from x=0, peaking when x reaches one divided by the square root of two (either positive or negative), dropping back to zero when x reaches +1 or -1, and dropping like a rock as one ventures further away from x=0. The author states that this local antifragility and global fragility is the most dangerous of all, but given that he repeatedly stresses that antifragile systems are antifragile only up to a point, this dangerous situation would seem to be the common case. Those of us who believe that life is inherently dangerous should have no problem with this apparent contradiction.
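Making the described shape explicit (my notation, not the book's):

```latex
f(x) = x^{2} - x^{4}, \qquad
f'(x) = 2x\,(1 - 2x^{2}) = 0 \;\Longrightarrow\; x \in \left\{0,\ \pm\tfrac{1}{\sqrt{2}}\right\},
\qquad f\!\left(\pm\tfrac{1}{\sqrt{2}}\right) = \tfrac{1}{4}, \qquad f(\pm 1) = 0.
```

Payoff improves for small deviations from nominal, peaks at |x| = 1/√2, returns to zero at |x| = 1, and collapses beyond that: locally antifragile, globally fragile.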

But what does all of this have to do with parallel programming???

Well, how about “Is RCU antifragile?”

One case for RCU antifragility is the batching optimizations that allow many (as in thousands) concurrent requests to share the same grace-period computation. Therefore, the heavier the update-side load on RCU, the more efficiently RCU operates.
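That batching is visible in the shape of the update-side API: updaters queue callbacks rather than each waiting out a private grace period. A minimal sketch (struct foo is invented for illustration):

```c
/* Many concurrent updaters can do this; RCU batches their callbacks so
 * thousands of requests share one grace-period computation. */
#include <linux/rculist.h>
#include <linux/slab.h>

struct foo {
	struct list_head list;
	struct rcu_head rcu;
};

static void free_foo(struct rcu_head *head)
{
	kfree(container_of(head, struct foo, rcu));
}

/* Caller holds the update-side lock protecting the list. */
static void remove_foo(struct foo *f)
{
	list_del_rcu(&f->list);		/* unlink; readers may still see it */
	call_rcu(&f->rcu, free_foo);	/* free after a shared grace period */
}
```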

However, load is but one of many aspects of RCU's environment that might be varied. For an extreme example, RCU is exceedingly fragile with respect to small perturbations of the program counter, as Peter Sewell so ably demonstrated, by running emacs, no less. RCU is also fragile with respect to timekeeping anomalies, for example, it can emit false-positive RCU CPU stall warnings if different CPUs have tens-of-seconds disagreements as to the current time. However, the aforementioned bones and muscles are similarly fragile with respect to any number of chemical substances (AKA “poisons”), to say nothing of well-known natural phenomena such as lightning bolts and landslides.

Even when excluding hardware misbehavior such as auto-perturbing program counters and unsynchronized clocks, RCU would still be subject to software aging, and RCU has in fact required multiple interventions from its developers and maintainer in order to keep up with changing hardware, workloads, and usage. One could therefore argue that RCU is fragile with respect to perturbations of time, although the combination of RCU and its developers, reviewers, and maintainer seems to have kept up reasonably well thus far.

On the other hand, perhaps it is unrealistic to evaluate the antifragility of software without including black-hat hackers. Achieving antifragility in that sort of environment is still very much a grand challenge problem, but a challenge that must be faced. Oh, you think RCU is too low-level for this sort of attack? There was a time when I thought so. And then came rowhammer.

So please be careful, and, where possible, antifragile! It is after all a real world out there!!!

August 07, 2017 04:36 AM

August 03, 2017

Linux Plumbers Conference: Book Your Hotel for Plumbers by 18 August

As a reminder, we have a block of rooms at the JW Marriott LA Live available to attendees at the discounted conference rate of $259/night (plus applicable taxes). High speed internet is included in the room rate.

Our discounted room rate expires at 5:00 pm PST on August 18. We encourage you to book today!

Visit our Attend page for additional details.

August 03, 2017 10:15 PM

July 29, 2017

Linux Plumbers Conference: Late Registration Begins Soon

Late registration for Linux Plumbers Conference begins on 31 July. If you want to take advantage of the standard registration fees, register now via this link.

Standard registration is $550, late registration will be $650.

July 29, 2017 07:21 PM

Linux Plumbers Conference: Checkpoint-Restart Microconference Accepted into the Linux Plumbers Conference

Following on from the successful Checkpoint-Restart Microconference last year, we're pleased to announce that there will be another at Plumbers in Los Angeles this year.

The agenda this year will focus on specific use cases of Checkpoint-Restart, such as High Performance Computing, and state saving uses such as job scheduling and hot standby.  In addition we'll be looking at enhancements such as performance, using userfaultfd for dirty memory tracking in iterative migration, and what it would take to have unprivileged checkpoint-restart.  Finally, we'll have discussions on checkpoint-restart aware applications and what sort of testing needs to be applied to the upstream kernel to prevent any checkpoint-restore API breakage as it evolves.

For more details on this, please see this microconference’s wiki page.

We hope to see you there!

July 29, 2017 04:42 PM

July 21, 2017

Michael Kerrisk (manpages): man-pages-4.12 is released

I've released man-pages-4.12. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

This release resulted from patches, bug reports, reviews, and comments from around 30 contributors. It includes just under 200 commits changing around 90 pages. This is a relatively small release, with one new manual page, ioctl_getfsmap(2). The most significant change in the release consists of a number of additions and improvements in the ld.so(8) page.

July 21, 2017 06:53 PM

July 20, 2017

Paul E. Mc Kenney: Parallel Programming: Getting the English text out of the way

We have been making good progress on the next release of Is Parallel Programming Hard, And, If So, What Can You Do About It?, and hope to have a new release out soonish.

In the meantime, for those of you for whom the English text in this book has simply gotten in the way, there is now an alternative:

perfbook_cn_cover

On the off-chance that any of you are seriously interested, this is available from Amazon China, JD.com, Taobao.com, and Dangdang.com. For the rest of you, you have at least seen the picture.  ;–)

July 20, 2017 02:37 AM

July 18, 2017

Matthew Garrett: Avoiding TPM PCR fragility using Secure Boot

In measured boot, each component of the boot process is "measured" (ie, hashed and that hash recorded) in a register in the Trusted Platform Module (TPM) built into the system. The TPM has several different registers (Platform Configuration Registers, or PCRs) which are typically used for different purposes - for instance, PCR0 contains measurements of various system firmware components, PCR2 contains any option ROMs, PCR4 contains information about the partition table and the bootloader. The allocation of these is defined by the PC Client working group of the Trusted Computing Group. However, once the boot loader takes over, we're outside the spec[1].

One important thing to note here is that the TPM doesn't actually have any ability to directly interfere with the boot process. If you try to boot modified code on a system, the TPM will contain different measurements but boot will still succeed. What the TPM can do is refuse to hand over secrets unless the measurements are correct. This allows for configurations where your disk encryption key can be stored in the TPM and then handed over automatically if the measurements are unaltered. If anybody interferes with your boot process then the measurements will be different, the TPM will refuse to hand over the key, your disk will remain encrypted and whoever's trying to compromise your machine will be sad.

The problem here is that a lot of things can affect the measurements. Upgrading your bootloader or kernel will do so. At that point, if you reboot, your disk fails to unlock and you become unhappy. To get around this your update system needs to notice that a new component is about to be installed, generate the new expected hashes and re-seal the secret to the TPM using the new hashes. If there are several different points in the update where this can happen, this can quite easily go wrong. And if it goes wrong, you're back to being unhappy.

Is there a way to improve this? Surprisingly, the answer is "yes" and the people to thank are Microsoft. Appendix A of a basically entirely unrelated spec defines a mechanism for storing the UEFI Secure Boot policy and used keys in PCR 7 of the TPM. The idea here is that you trust your OS vendor (since otherwise they could just backdoor your system anyway), so anything signed by your OS vendor is acceptable. If someone tries to boot something signed by a different vendor then PCR 7 will be different. If someone disables secure boot, PCR 7 will be different. If you upgrade your bootloader or kernel, PCR 7 will be the same. This simplifies things significantly.

I've put together a (not well-tested) patchset for Shim that adds support for including Shim's measurements in PCR 7. In conjunction with appropriate firmware, it should then be straightforward to seal secrets to PCR 7 and not worry about things breaking over system updates. This makes tying things like disk encryption keys to the TPM much more reasonable.

However, there's still one pretty major problem, which is that the initramfs (ie, the component responsible for setting up the disk encryption in the first place) isn't signed and isn't included in PCR 7[2]. An attacker can simply modify it to stash any TPM-backed secrets or mount the encrypted filesystem and then drop to a root prompt. This, uh, reduces the utility of the entire exercise.

The simplest solution to this that I've come up with depends on how Linux implements initramfs files. In its simplest form, an initramfs is just a cpio archive. In its slightly more complicated form, it's a compressed cpio archive. And in its peak form of evolution, it's a series of compressed cpio archives concatenated together. As the kernel reads each one in turn, it extracts it over the previous ones. That means that any files in the final archive will overwrite files of the same name in previous archives.

My proposal is to generate a small initramfs whose sole job is to get secrets from the TPM and stash them in the kernel keyring, and then measure an additional value into PCR 7 in order to ensure that the secrets can't be obtained again. Later disk encryption setup will then be able to set up dm-crypt using the secret already stored within the kernel. This small initramfs will be built into the signed kernel image, and the bootloader will be responsible for appending it to the end of any user-provided initramfs. This means that the TPM will only grant access to the secrets while trustworthy code is running - once the secret is in the kernel it will only be available for in-kernel use, and once PCR 7 has been modified the TPM won't give it to anyone else. A similar approach for some kernel command-line arguments (the kernel, module-init-tools and systemd all interpret the kernel command line left-to-right, with later arguments overriding earlier ones) would make it possible to ensure that certain kernel configuration options (such as the iommu) weren't overridable by an attacker.
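As a rough sketch of the "stash it in the kernel keyring" step, the early initramfs could do something like the following. The key type and description here are my assumptions for illustration, not part of the proposal:

```c
/* Sketch: stash a TPM-unsealed secret in the kernel keyring so later
 * dm-crypt setup can use it without it ever being readable again from
 * userspace.  "logon" keys can be used by the kernel but cannot be
 * read back by user programs, which fits the threat model.
 * Link with -lkeyutils. */
#include <stddef.h>
#include <keyutils.h>

static key_serial_t stash_disk_secret(const void *secret, size_t len)
{
	return add_key("logon", "cryptsetup:rootfs", secret, len,
		       KEY_SPEC_USER_KEYRING);
}
```

Conveniently, dm-crypt gained the ability to take its key from the kernel keyring in v4.12, which fits this flow.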

There's obviously a few things that have to be done here (standardise how to embed such an initramfs in the kernel image, ensure that LUKS knows how to use the kernel keyring, teach all relevant bootloaders how to handle these images), but overall this should make it practical to use PCR 7 as a mechanism for supporting TPM-backed disk encryption secrets on Linux without introducing a huge support burden in the process.

[1] The patchset I've posted to add measured boot support to Grub uses PCRs 8 and 9 to measure various components during the boot process, but other bootloaders may have different policies.

[2] This is because most Linux systems generate the initramfs locally rather than shipping it pre-built. It may also get rebuilt on various userspace updates, even if the kernel hasn't changed. Including it in PCR 7 would reintroduce exactly the fragility we're trying to avoid and defeat the point of all of this.


July 18, 2017 06:48 AM

July 13, 2017

Linux Plumbers Conference: VFIO/IOMMU/PCI Microconference Accepted into Linux Plumbers Conference

Following on from the successful PCI Microconference at Plumbers last year we’re pleased to announce a follow on this year with an expanded scope.

The agenda this year will focus on overlap and common development between VFIO/IOMMU/PCI subsystems, and in particular how consolidation of the shared virtual memory (SVM) API can drive an even tighter coupling between them.

This year we will also focus on user visible aspects such as using SVM to share page tables with devices and reporting I/O page faults to userspace in addition to discussing PCI and IOMMU interfaces and potential improvements.

For more details on this, please see this microconference’s wiki page.

We hope to see you there!

July 13, 2017 05:20 PM

July 11, 2017

Linux Plumbers Conference: Power Management and Energy-awareness Microconference Accepted into Linux Plumbers Conference

Following on from the successful Power Management and Energy-awareness Microconference at Plumbers last year we’re pleased to announce a follow on this year.

The agenda this year will focus on a range of topics, including CPUfreq core improvements and schedutil governor extensions, how best to use scheduler signals to balance energy consumption and performance, and user-space interfaces to control capacity and utilization estimates.  We’ll also discuss selective throttling in thermally constrained systems, runtime PM for ACPI, CPU cluster idling and the possibility of implementing resume from hibernation in a bootloader.

For more details on this, please see this microconference’s wiki page.

We hope to see you there!

July 11, 2017 04:15 PM

James Morris: Linux Security Summit 2017 Schedule Published

The schedule for the 2017 Linux Security Summit (LSS) is now published.

LSS will be held on September 14th and 15th in Los Angeles, CA, co-located with the new Open Source Summit (which includes LinuxCon, ContainerCon, and CloudCon).

The cost of LSS for attendees is $100 USD. Register here.

Highlights from the schedule include the following refereed presentations:

There will also be the usual Linux kernel security subsystem updates, and BoF sessions (with LSM namespacing and LSM stacking sessions already planned).

See the schedule for full details of the program, and follow the twitter feed for the event.

This year, we’ll also be co-located with the Linux Plumbers Conference, which will include a containers microconference with several security development topics, and likely also a TPMs microconference.

A good critical mass of Linux security folk should be present across all of these events!

Thanks to the LSS program committee for carefully reviewing all of the submissions, and to the event staff at Linux Foundation for expertly planning the logistics of the event.

See you in Los Angeles!

July 11, 2017 11:30 AM

July 10, 2017

Kees Cook: security things in Linux v4.12

Previously: v4.11.

Here’s a quick summary of some of the interesting security things in last week’s v4.12 release of the Linux kernel:

x86 read-only and fixed-location GDT
With kernel memory base randomization, it was still possible to figure out the per-cpu base address via the “sgdt” instruction, since it would reveal the per-cpu GDT location. To solve this, Thomas Garnier moved the GDT to a fixed location. And to solve the risk of an attacker targeting the GDT directly with a kernel bug, he also made it read-only.
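
For the curious, this is roughly what the leak looked like: sgdt is not a privileged instruction (absent UMIP), so plain userspace code could recover the GDT location. A small x86-64 sketch (mine, purely illustrative):

    /* Hedged illustration of why a fixed, read-only GDT matters: before
     * the change, unprivileged code could recover the (randomized) GDT
     * address with a single instruction. x86-64 only. */
    #include <stdio.h>
    #include <stdint.h>

    struct __attribute__((packed)) gdt_desc {
        uint16_t limit;
        uint64_t base;
    };

    int main(void)
    {
        struct gdt_desc d;

        /* sgdt works from ring 3 unless the CPU enforces UMIP */
        asm volatile("sgdt %0" : "=m"(d));
        printf("GDT base: %#llx limit: %#x\n",
               (unsigned long long)d.base, d.limit);
        return 0;
    }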

usercopy consolidation
After hardened usercopy landed, Al Viro decided to take a closer look at all the usercopy routines and then consolidated the per-architecture uaccess code into a single implementation. The per-architecture implementations were functionally very similar to one another, so it made sense to remove the redundancy. In the process, he uncovered a number of unhandled corner cases in various architectures (that got fixed by the consolidation), and made hardened usercopy available on all remaining architectures.

ASLR entropy sysctl on PowerPC
Continuing to expand architecture support for the ASLR entropy sysctl, Michael Ellerman implemented the calculations needed for PowerPC. This lets userspace choose to crank up the entropy used for memory layouts.

LSM structures read-only
James Morris used __ro_after_init to make the LSM structures read-only after boot. This removes them as a desirable target for attackers. Since the hooks are called from all kinds of places in the kernel, this was a favorite method for attackers to hijack kernel execution. (A similar target used to be the system call table, but that has long since been made read-only.) Be wary that CONFIG_SECURITY_SELINUX_DISABLE removes this protection, so make sure that config stays disabled.

KASLR enabled by default on x86
With many distros already enabling KASLR on x86 with CONFIG_RANDOMIZE_BASE and CONFIG_RANDOMIZE_MEMORY, Ingo Molnar felt the feature was mature enough to be enabled by default.

Expand stack canary to 64 bits on 64-bit systems
The stack canary value used by CONFIG_CC_STACKPROTECTOR is most powerful on x86 since it is different per task. (Other architectures run with a single canary for all tasks.) While the first canary chosen on x86 (and other architectures) was a full unsigned long, the subsequent canaries chosen per-task for x86 were being truncated to 32 bits. Daniel Micay fixed this, so now x86 (and future architectures that gain per-task canary support) have significantly increased entropy for stack-protector.

Expanded stack/heap gap
Hugh Dickins, with input from many other folks, improved the kernel’s mitigation against having the stack and heap crash into each other. This is a stop-gap measure to help defend against the Stack Clash attacks. Additional hardening needs to come from the compiler to produce “stack probes” when doing large stack expansions. Any Variable Length Array on the stack or alloca() usage needs to have machine code generated to touch each page of memory within those areas, with single-page granularity, to let the kernel know that the stack is expanding.

That’s it for now; please let me know if I missed anything. The v4.13 merge window is open!

Edit: Brad Spengler pointed out that I failed to mention the CONFIG_SECURITY_SELINUX_DISABLE issue with read-only LSM structures. This has been added now.

© 2017, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.
Creative Commons License

July 10, 2017 08:24 AM

Dave Airlie (blogspot): radv and the vulkan deferred demo - no fps left behind!

A little while back I took to wondering why one particular demo from the Sascha Willems vulkan demos was a lot slower on radv compared to amdgpu-pro. Like half the speed slow.

I internally titled this my "no fps left behind" project.

The deferred demo does an offscreen rendering to 3 2048x2048 color attachments and one 2048x2048 D32S8 depth attachment. It then does a rendering pass using those down to a 1280x720 screen image.

Bas identified the first cause as probably being that we were doing clear color eliminations on the offscreen surfaces when we didn't need to. AMD GPUs have a delta color compression feature, and with certain clear values you don't need to do the clear color elimination step. This brought me back from about 1/2 the FPS to about 3/4; however, it took me quite a while to figure out where the rest of the FPS were hiding.

I took a few diversions in my testing. I pulled in some experimental patches to allow the depth buffer to be texture-cache compatible, so I could bypass the depth decompression pass; however, this didn't seem to budge the numbers too much.

I found a bunch of registers we were setting different values from -pro, nothing too much came of these.

I found some places we were using a compute shader to fill some DCC or htile surfaces to a value, then doing a clear and overwriting the values, not much help.

I noticed the vertex descriptions and buffer attachments on amdgpu-pro were done quite differently to how radv does it. With vulkan you have vertex descriptors and bindings; with radv we generate a set of hw descriptors from the combination of both descriptors and bindings. The pro driver uses typed buffer loads in the shader to embed the descriptor contents in the shader, then it only updates the hw descriptors for the buffer bindings. This seems like it might be more efficient. Guess what: no help. (LLVM just grew support for typed buffer loads, so we could probably move to this scheme if we wished now.)

I dug out some patches that inline all the push constants and some descriptors so our shaders had less overhead (this really helps our meta shaders have less impact). No help.

I noticed they export the shader results in a different order from the fragment shader, and always at the end. (no help). The vertex shader emits pos first, (no help). The vertex shader uses off exports for unused channels, (no help).

I went on holidays for a week and came back to stare at the traces again, when my brain finally noticed something I'd missed. When binding the 3 color buffers, the addresses given as the base address were unusual. A surface has a 40-bit address; normally, for alignment and tiling, the bottom 16 bits are 0, and we shift 8 of those off completely before writing them. This means the bottom 8 bits of the written base address should be 0, and the CIK docs from AMD say that. However, the pro traces didn't have these at 0. It appears from earlier evergreen/cayman documents that these register bits control some tiling offset bits. After writing a hacky patch to set the values, I managed to get back the rest of the FPS I was missing in the deferred demo. I discussed this with AMD developers, and we worked out that the addrlib library has an API for working out these values, and it seems that using them allows better memory bandwidth utilisation. I've written a patch to try and use these values correctly and sent it out along with the DCC avoidance patch.

Now I'm not sure this will help any real apps; we may not be hitting limitations in that area, and I'm never happy with the benchmarks I run myself. I thought I saw some FPS difference with some madmax scenes, but I might be lying to myself. Once the patches land in mesa I'm sure others will run benchmarks and we can see if there is any use case where they have an effect. The AMD radeonsi OpenGL driver can do the same tweaks, so hopefully there will be some benefit there as well.

Otherwise I can just write this off as making the deferred demo run at parity, removing at least one of the deltas radv has compared to the pro driver. Some of the other differences I discovered along the way might also have some promise in other scenarios, so I'll keep an eye on them.

Thanks to Bas, Marek and Christian for looking into what the magic meant!

July 10, 2017 08:08 AM

Dave Airlie: Migrating to blogspot

Due to lots of people telling me LJ is bad, mm'kay, I've migrated to blogspot.

New blog is/will be here: https://airlied.blogspot.com

July 10, 2017 06:36 AM

Dave Airlie (blogspot): Migrating my blog here

I'm moving my blog from LJ to blogspot, because people keep telling me LJ is up to no good, like hacking DNC servers and interfering in elections.

July 10, 2017 06:29 AM

July 08, 2017

Kernel Podcast: Linux Kernel Podcast for 2017/07/07

Audio: http://traffic.libsyn.com/jcm/20170707.mp3

Linux 4.12 final is released, the 4.13 merge window opens, and various assorted ongoing kernel development is described in detail.

Editorial note

Reports of this podcast’s demise are greatly exaggerated. But it is worth noting that recording this weekly is HARD. That said, I am going to work on automation (I want the podcast to effectively write itself by providing a web UI on top of LKML threads that allows anyone to write summaries, add author bios, links, etc. – and expand this to other communities) but that will all take some time. Until that happens, we’ll just have to live with some breaks.

Announcements

Linus Torvalds announced Linux 4.12 final. In his announcement mail, Linus reflects that “4.12 is just plain big”, noting that, this was “one of the bigger releases historically, and I think only 4.9 ends up having had more commits. And 4.9 was big at least partly because Greg announced it was an LTS [Long Term Support – receiving updates for several years] kernel”. In pure numbers, 4.12 adds over a million lines of code over 4.11, about half of which can be attributed to enablement for the AMD Vega GPU support. As usual, both Linux Weekly News (LWN) and KernelNewbies have excellent, and highly detailed summaries. Listeners are encouraged to support real kernel journalism by subscribing to Linux Weekly News and visiting lwn.net.

Theodore (Ted) Ts’o posted “Next steps and plans for the 2017 Maintainer and Kernel Summits”. He reminds everyone of the (slightly) revised format of this year’s Kernel Summit (which is, as is often the case, co-located with a Linux Foundation event in the form of the Open Source Summit Prague in October). Notably, a program committee is established to help encourage submissions from those who feel they should be present at the event. To learn more, see the mailing list archives containing the announcement: https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss (technically the deadline has already passed, or is tomorrow, depending)

Greg K-H (Kroah-Hartman) announced Linux 4.4.76, 4.9.36, and 4.11.9.

Willy Tarreau announced Linux 3.10.106, including a reminder that this “LTS” [Long Term Stable] kernel is “scheduled for end of life on end of October”.

Steven Rostedt released preempt-rt (“Real Time”) kernels 3.10.107-rt122, 3.18.59-rt65, 4.4.75-rt88, and 4.9.35-rt25, all of which were simply rebases to stable kernel updates and had “no RT specific changes”. It will be interesting to see if some of the hotplug fixes Thomas Gleixner has sent for Linux 4.13 will resolve issues seen by some RT users when doing hotplug.

Sebastian Andrzej Siewior announced preempt-rt (“Real time”) kernels v4.9.33-rt23, and v4.11.7-rt3, which still notes potential for a deadlock under CPU hotplug.

Stephen Hemminger announced iproute2 version 4.12.0, matching Linux 4.12. This includes support for features present in the new kernel, including flower support and enhancements to the TC (Traffic Control) code: https://www.kernel.org/pub/linux/utils/net/iproute2/iproute2-4.12.0.tar.gz

Bartosz Golaszewski posted libgpiod v0.3:
https://github.com/brgl/libgpiod/releases/tag/v0.3

Mathieu Desnoyers announced LTTng modules 2.10.0-rc2, 2.9.3, 2.8.6, including support for “4.12 release candidate kernels”.

The 4.13 merge window

With the opening of the 4.13 merge window, many pull requests have begun flowing for what will become the new hotness in another couple of months. We won’t summarize each in detail (that resulted in a one hour long podcast the last time…) but will instead call out a few “interesting” changes of note. Stephen Rothwell also promptly updated his daily linux-next tree with the usual disclaimer that “Please do not add any v4.14 material to you[r] linux-next included branches until after v4.13-rc1 has been released”.

ACPI. Rafael J. Wysocki posted “ACPI updates for v4.13-rc1”, which includes an update to the ACPICA (ACPI Component Architecture) release of 20170531 that adds support to the OS-independent ACPICA layer for ACPI 6.2. This includes a number of new tables, including the PPTT (Processor Properties and Topology Table) that some of us have wanted to see for many years (as a means to more fully describe the NUMA properties of ARM servers, as just a random example…). In addition, Kees Cook has done some work to clean up the use of function pointer structures in ACPICA to use “designated initializers” so as “to make the structure layout randomization GCC plugin work with it”. All in all, this is a nice set of updates for all architectures.

AppArmor. John Johansen noted in his earlier pull request (to James Morris, who owns overall security subsystem pull requests headed to Linus) that an attempt was being made to get many of the Ubuntu specific AppArmor patches upstreamed. The 4.13 patches “introduces the domain labeling base code that Ubuntu has been carrying for several years”. He then plans to begin to RFC other Ubuntu-specific patches in later cycles.

ARM. Arnd Bergmann notes a number of changes to 64-bit ARM platforms, including work done by Timur Tabi to change kernel def(ault)config files to enable “a number of options that are typically required for server platforms”. It’s only been many years since this should have been the case in upstream Linux. Meanwhile, in a separate pull for “ARM: 64-bit DT [DeviceTree] updates”, support is added for many new boards (“For the first time I can remember, this is actually larger than the corresponding branch for 32-bit platforms”) including new varieties of “OrangePi” based on Allwinner chipsets.

Docs. Jon(athan) Corbet had noted that “You’ll also encounter more than the usual number of conflicts, which is saying something”. Linus “fixed the ones that were actual data conflicts” but he had some suggestions for how Kbuild could be modified such that a “make allmodconfig” checked for the existence of various files being referenced in the rst documentation source files. He also noted that he was happy to see docbook “finally gone” but that sphinx, the tool used to generate documentation now, “isn’t exactly a speed demon”.

Hotplug. As noted elsewhere, Thomas Gleixner posted a pull request for various smp hotplug fixes that includes replacing an “open coded RWSEM [Read Write Semaphore] with a percpu RWSEM”. This is done to enable full coverage by the kernel’s “lockdep” locking dependency checker in order to catch hotplug deadlocks that have been seen on certain RT (Real Time) systems.

IRQ. Thomas Gleixner posted “irq updates for 4.13”, which includes “Expand the generic infrastructure handling the irq migration on CPU hotplug and convert X86 over to it” in preparation for cleaning up affinity management on blk multiqueue devices (preventing interrupts from being moved around during hotplug by instead shutting down affine interrupts intended to be always routed to a specific CPU). Thomas notes that “Jens [the blk maintainer] acked them and agreed that they should go with the irq changes”, but Linus later pushed back strongly after hitting merge conflicts that made him feel that some of these changes should have gone in via the blk tree instead of clashing with it. Linus was also concerned about whether the onlining code worked at all.

Objtool. Ingo Molnar posted a pull request including changes to the “objtool” utility intended to allow the tracking of stack pointer modifications through “machine instructions of disassembled functions found in kernel .o files”. The idea is to remove a dependency upon compiling the kernel with the CONFIG_FRAME_POINTERS=y option (which causes a larger stack frame and possible additional register pressure on some architectures) while still retaining the ability to generate correct kernel debuginfo data in the future.

PCI. Thomas Gleixner posted “x86/PCI updates for 4.13”, which includes work to separate PCI config space accessors from using a global PCI lock. Apparently, x86 already had an additional PCI config lock and so two layers of redundant locking were being employed, while neither was strictly necessary in the case of ECAM (“mmconfig”) based configuration, since “access to the extended configuration space [MMIO based configuration in PCIe] does not require locking”. Thomas also notes that a commit which had switched x86 to use ECAM [the MMIO mode] by default was removed so it will still use “type1 accessors” (the “old fashioned way” that Linus is so happy with) serialized by x86 internal locking for primary configuration space. This set of patches came in through x86 via Thomas with Bjorn Helgaas’s (PCI maintainer) permission.

RCU. Ingo Molnar noted that “The sole purpose of these changes is to shrink and simplify the RCU code base, which has suffered from creeping bloat”.

Scheduler. Ingo Molnar posted a pull request that included a number of changes, among them being NUMA scheduling improvements to address regressions seen when comparing 4.11 based kernels to older ones, from Rik van Riel.

VFS. Al Viro went to town with VFS updates split into more than 10 parts (yes, really, actually 11 as of this writing). These are caused by various intrusive changes which impact many parts of the kernel tree. Linus said he would “*much* rather do five separate pull requests where each pull has a stated reason and target, than do one big mixed-up one”. Which is good because Viro promised many more than 5. Patch series number 11 got the most feedback so far.

X86. Ingo Molnar also went to town, in typical fashion, with many different updates to the kernel. These included mm changes enabling more Intel 5-level paging features (switching the “GUP” or “Get User Pages” code over to the newer generic kernel implementation shared by other architectures), and “[C]ontinued work to add PCID [Process Context ID] support”. Per-process context IDs allow for TLB (Translation Lookaside Buffer – the micro caches that store virtual to physical memory translations following page table walks by the hardware walkers) flush infrastructure optimizations on legacy architectures such as x86 that do not have certain TLB hardware optimizations. Ingo also posted microcode updates that include support for saving microcode pointers and wiring them up for use early in the “resume-from-RAM” case, and fixes to the Hyper-V guest support that add a synthetic CPU MSR (Model Specific Register) providing the CPU TSC frequency to the guest.

Ongoing Development

ARM. Will Deacon posted the fifth version of a patch series entitled “Add support for the ARMv8.3 Statistical Profiling Extension”, which provides a linear, virtually addressed memory buffer containing statistical samples (subject to various filtering) related to processor operations of interest that are performed by running (application) code. Sample records take the form of “packets”, which contain very detailed amounts of information, such as the virtual PC (Program Counter) address of a branch instruction, its type (conditional, unconditional, etc.), number of cycles waiting for the instruction to issue, the target, cycles spent executing the branch instruction, associated events (e.g. misprediction), and so on. Detailed information about the new extension is available in the ARM ARM, and is summarized in a blog post, here: https://community.arm.com/processors/b/blog/posts/statistical-profiling-extension-for-armv8-a

RISC-V. Palmer Dabbelt posted v4 of the enablement patch series adding support for the Open Source RISC-V architecture (which will then require various enablement for specific platforms that implement the architecture). In his patch posting, he notes changes from the previous version 3 that include disabling cmpxchg64 (an atomic 64-bit compare-and-exchange operation, which can’t be implemented atomically on 32-bit systems) on 32-bit, adding an ELF_HWCAP (hardware capability) within binaries in order for users to determine the ISA of the machine, and various other miscellaneous changes. He asks for consideration that this be merged during the ongoing merge window for 4.13, which remains to be seen. We will track this in future episodes.
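
For readers unfamiliar with the primitive, the C11 equivalent below (an illustration of mine, unrelated to the port itself) shows the semantics of a 64-bit compare-and-swap, which a 32-bit ISA without a native 64-bit CAS cannot provide in a single instruction:

    /* Hedged illustration of cmpxchg64-style semantics: swap in a new
     * 64-bit value only if the location still holds the expected one,
     * as one indivisible operation. */
    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        _Atomic uint64_t v = 42;
        uint64_t expected = 42;

        /* succeeds iff v still holds 'expected'; stores 100 atomically */
        if (atomic_compare_exchange_strong(&v, &expected, 100))
            printf("swapped, v = %llu\n", (unsigned long long)v);
        return 0;
    }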

FOLL_FORCE. Keno Fischer noted that “Yes, people use FOLL_FORCE”, referencing a commit from Linus in which an effort had been made to “try to remove use of FOLL_FORCE entirely” on the procfs (/proc) filesystem. Keno says “We used these semantics as a hardening mechanism in the julia JIT. By opening /proc/self/mem and using these semantics, we could avoid needing RWX pages, or a dual mapping approach”. In other words, they cheat and don’t setup direct RWX mappings ahead of time but instead get access to them via the backdoor using the kernel’s “/proc/self/mem” interface directly. Linus replied, “Oh, we’ll just re-instate the kernel behavior, it was more an optimistic “maybe nobody will notice” thing, and apparently people did notice”.
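
A minimal sketch of the technique (mine, not Keno’s actual code; the payload byte is arbitrary): writes through /proc/self/mem use FOLL_FORCE semantics, so they succeed even against a read-only mapping, which is what lets a JIT avoid RWX pages:

    /* Hedged sketch: patch a read-only page via /proc/self/mem, relying
     * on the FOLL_FORCE behavior discussed in the thread. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* a read-only page standing in for JIT'd code */
        unsigned char *page = mmap(NULL, 4096, PROT_READ,
                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        unsigned char payload[] = { 0xc3 }; /* e.g. a 'ret' instruction */
        int fd = open("/proc/self/mem", O_RDWR);

        if (page == MAP_FAILED || fd < 0)
            return 1;
        /* pwrite succeeds despite PROT_READ, courtesy of FOLL_FORCE */
        if (pwrite(fd, payload, sizeof(payload), (off_t)(uintptr_t)page) < 0)
            perror("pwrite");
        printf("first byte is now %#x\n", page[0]);
        close(fd);
        return 0;
    }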

GICv4. Marc Zyngier posted version 2 of a patch series entitled “irqchip: KVM: Add support for GICv4”, a “(monster of a) series [that] implements full support for GICv4, bringing direct injection of MSIs [Message Signalled Interrupts] to KVM on arm and arm64, assuming you have the right hardware (which is quite unlikely)”. Marc says that the “stack has been *very lightly* tested on an arm64 model, with a PCI virtio block device passed from the host to a guest (using kvmtool and Jean-Philippe Brucker’s excellent VFIO support patches). As it has never seen any HW, I expect things to be subtly broken, so go forward and test if you can, though I’m mostly interested in people reviewing the code at the moment”. It’s awesome to see 64-bit ARM systems on par with legacy architectures when it comes to VM interrupt injection.

GPIO. Andy Shevchenko posted a patch (with Linus Walleij’s approval) noting that Intel would help to maintain GPIO ACPI support in the GPIO subsystem.

Hardlockup. Nicholas Piggin posted “[RFC] arch hardlockup detector interfaces improvement” which aims to “make it easier for architectures that have their own NMI / hard lockup detector to reuse various configuration interfaces that are provided by generic detectors (cmdline, sysctl, suspend/resume calls)”. He “do[es] this by adding a separate CONFIG_SOFTLOCKUP_DETECTOR [kernel configuration option], and juggling around what goes under config options. HAVE_NMI_WATCHDOG continues to be the config for arch to override the hard lockup detector, which is expanded to cover a few more cases”.

HMM. Jérôme Glisse posted “Cache coherent device memory (CDM) with HMM” which layers above his previous HMM (Heterogeneous Memory Management) to provide a generic means to manage device memory that behaves much like regular system memory but may still need managing “in isolation from regular memory” (for any number of reasons, including NUMA effects). This is particularly useful in the case of a coherently attached system bus being used to connect on-device memory, such as CAPI or CCIX. [disclaimer: this author chairs the CCIX software working group]

Hyper-V. KY Srinivasan posted an updated version of his “Hyper-V: paravirtualized remote TLB flushing and hypercall improvements” patches, which aim to optimize the case of remote TLB flushing on other vCPUs within a guest. TLBs are micro caches that store VA (Virtual Address) to PA (Physical Address) translations for VMAs (Virtual Memory Areas) that need to be invalidated during a context switch operation from one process to another. Typically, an Operating System may either utilize an IPI (Inter-Processor-Interrupt) to schedule a remote function on other CPUs that will tear down their TLB entries, or – on more enlightened and sophisticated modern computer architectures – may perform a hardware broadcast invalidation instruction that achieves the same without the gratuitous overhead. On x86 systems, IPIs are commonly used by guest operating systems and their impact can be reduced by providing special guest hypercalls allowing for hypervisor assistance in place of broadcast IPIs. Jork Loeser also posted a patch updating the Hyper-V vPCI driver to “use the Server-2016 version of the vPCI protocol, fixing MSI creation”.

ILP32. Yury Norov posted version 8 of a patch series entitled “ILP32 for ARM64” which aims to enable support for the optional ILP32 (32-bit int, long, and pointer) userspace ABI on 64-bit ARM processors. In ways similar to “x32” on 64-bit “x86” systems, ILP32 aims to provide the benefits of the new ARMv8 ISA without having to use 64-bit data types and pointers for code that doesn’t actually require such large data or a large address space. Pointers (pun intended) are provided to an example kernel, GLIBC, and an OpenSuSE-based Linux distribution built against the newer ABI.

IMC Instrumentation Support. Madhavan Srinivasan posted version 10 of a patch series entitled “IMC Instrumentation Support” which aims to provide support for “In-Memory-Collection” infrastructure present in IBM POWER9 processors. IMC apparently “contains various Performance Monitoring Units (PMUs) at Nest level (these are on-chip but off-core), Core level and Thread level. The Nest PMU counters are handled by a Nest IMC microcode which runs in the OCC (On-Chip Controller) complex. The microcode collects the counter data and moves the nest IMC counter data to memory”. This effectively seems to be a microcontroller managed mechanism for providing certain core and uncore counter data using a standardized interface.

Intel FPGA Device Drivers. Wu Hao posted version 2 of a patch series entitled “Intel FPGA Device Drivers”, which “provides interfaces for userspace applications to configure, enumerate, open and access FPGA accelerators on platforms equipped with Intel(R) PCIe based FPGA solutions and enables system level management functions such as FPGA partial reconfiguration, power management and virtualization”. In other words, many of the capabilities required for datacenter level deployment of PCIe-attached FPGA accelerators.

Interconnects. Georgi Djakov posted version 2 of a patch series entitled “Introduce on-chip interconnect API”, which aims to provide a generic API to help manage the many varied high performance interconnects present on modern high-end System-on-Chip “processors”. As he notes, “Modern SoCs have multiple processors and various dedicated cores (video, gpu, graphics, modem). These cores are talking to each other and can generate a lot of data flowing through the on-chip interconnects. These interconnect buses could form different topologies such as crossbar, point to point buses, hierarchical buses or use the network-on-chip concept”. The API provides an ability (subject to hardware support thereof) to control bandwidth use, QoS (Quality-of-Service), and other settings. It also includes code to enable the Qualcomm msm8916 interconnect with a layered driver.

IRQs. Daniel Lezcano posted version 10 of a patch series entitled “irq: next irq tracking” which aims to predict future IRQ occurrences based upon previous system behavior. “As previously discussed the code is not enabled by default, hence compiled out”. A small circular buffer is used to keep track of non-timer interrupt sources. “A third patch provides the mathematics to compute the regular intervals”. The goal is to predict future expected system wakeups, which is useful from a latency perspective, as well as for various scheduling, or energy calculations later on.

Memory Allocation Watchdog. Tetsuo Handa posted version 9 of a patch series entitled “mm: Add memory allocation watchdog kernel thread”, which “adds a watchdog which periodically reports number of memory allocating tasks, dying tasks and OOM victim tasks when some task is spending too long time inside __alloc_pages_slowpath() [the code path called when a running program – known as a task within the kernel – must synchronously block and wait for new memory pages to become available for allocation]”. Tetsuo adds, “Thanks to the OOM [Out-Of-Memory] reaper, which can guarantee forward progress (by selecting the next OOM victim) as long as the OOM killer can be invoked, we can start testing low memory situations which were previously too difficult to test. And we are now aware that there are still corner cases remaining where the system hangs without invoking the OOM killer”. The patch aims to help determine whether long hangs are caused by memory allocation stalls.

Memory Protection Keys. Ram Pai posted version 5 of a patch series entitled “powerpc: Memory Protection Keys”, which aims to enable a feature in future ISA3.0 compliant POWER architecture platforms comparable to the “memory protection keys” added by Intel to their Intel x64 Architecture (“x86” variant). As Ram notes, “The overall idea: A process allocates a key and associates it with an address range within its address space. The process then can dynamically set read/write permissions on the key without involving the kernel. Any code that violates the permissions of the address space, as defined by its associated key, will receive a segmentation fault”. The patches enable support on the “PPC64 HPTE platform” and are noted to have passed all of the same tests as on x86.
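
The x86 flavor of this model is already usable and gives a feel for the semantics the POWER patches target. A hedged sketch (assumes a pkeys-capable CPU and glibc 2.27+ for the pkey_* wrappers):

    /* Hedged x86 illustration of the protection-key model described
     * above; the POWER series provides analogous semantics. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        int pkey = pkey_alloc(0, 0);
        char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (pkey < 0 || buf == MAP_FAILED) {
            perror("pkey_alloc/mmap");
            return 1;
        }
        if (pkey_mprotect(buf, 4096, PROT_READ | PROT_WRITE, pkey)) {
            perror("pkey_mprotect");    /* tag the range with the key */
            return 1;
        }
        buf[0] = 'x';                   /* allowed */

        /* Revoke access purely in userspace: no syscall, just a register
         * write (WRPKRU on x86). A later access would now segfault. */
        pkey_set(pkey, PKEY_DISABLE_ACCESS);

        pkey_set(pkey, 0);              /* restore access */
        printf("%c\n", buf[0]);
        return 0;
    }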

Modules. Djalal Harouni posted version 4 of a patch series entitled “modules: automatic module loading restrictions”, which adds a new global sysctl flag, as well as a per-task one, called “modules_autoload_mode”. “This new flag allows to control only automatic module loading [the kernel-invoked auto loading of certain modules in response to user or system actions] and if it is allowed or not, aligning in the process the implicit operation with the explicit [existing option to disable all module loading] one where both are now covered by capabilities checks”. The idea is to prevent certain classes of security exploit wherein – for example – a system can be caused to load a vulnerable network module by sending it a certain packet, or an application calling a certain kernel function. Other such classes of attack exist against automatic module loading, and have been the subject of a number of CVE [Common Vulnerabilities and Exposures] releases requiring frantic system patching. This feature will allow sysadmins to limit module auto loading on some classes of systems (especially embedded/IoT devices).

Network filtering. Shubham Bansal posted an RFC patch entitled “RFC: arm eBPF JIT compiler” which “is the first implementation of eBPF JIT for [32-bit] ARM”. Russell King had various questions, including whether the code handled “endian issues” well, to which Shubham replied that he had not tested it with BE (Big Endian) but was interested in setting up qemu to run Big Endian ARM models and would welcome help improving the code.

NMI. Adrien Mahieux posted “x86/kernel: Add generic handler for NMI events” which “adds a generic handler where sysadmins can specify the behavior to adopt for each NMI event code. List of events is provided at module load or on the kernel cmdline”, so kdump can also be triggered upon boot errors. The options include silently ignoring NMIs (which actually passes them through to the next handler), drop NMIs (actually discard them), or to panic the kernel immediately. An example given is using the drop parameter during kdump in order to prevent a second NMI from triggering a panic while another crash dump is already capturing from the first.

Randomness. Jason A. Donenfeld posted version 4 of a patch series entitled “Unseeded In-Kernel Randomness Fixes” which aims to address “a problem with get_random_bytes being used before the RNG [Random Number Generator] has actually been seeded [given an initial set of values following boot time]. The solution for fixing this appears to be multi-pronged. One of those prongs involves adding a simple blocking API so that modules that use the RNG in process context can just sleep (in an interruptible manner) until the RNG is ready to be used. This winds up being a very useful API that covers a few use cases, several of which are included in this patch set”.
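
A kernel-style sketch of how a module might use the blocking call (the surrounding function is hypothetical; wait_for_random_bytes is the name used in the series):

    /* Hypothetical module code: sleep until the RNG is seeded, then
     * fetch key material. */
    static int my_module_get_key(u8 *key, size_t len)
    {
        int ret;

        ret = wait_for_random_bytes();  /* interruptible sleep */
        if (ret)
            return ret;                 /* e.g. interrupted by a signal */

        get_random_bytes(key, len);     /* RNG is now safely seeded */
        return 0;
    }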

Scheduler. Nico[las] Pitre posted “scheduler tinification” which “makes it possible to configure out some parts of the scheduler such as the deadline and realtime scheduler classes. The saving in kernel footprint is non negligible”. In the examples cited, kernel text shrinks by almost 8K, which is significant in some very small Linux systems, such as in IoT.

S.A.R.A. Salvatore Mesoraca posted “S.A.R.A. a new stacked LSM” (which your author is choosing to pronounce as in “Sarah”, for various reasons, and apparently actually stands for “S.A.R.A is Another Recursive Acronym”). This is “a stacked Linux Security Module that aims to collect heterogeneous security measures, providing a common interface to manage them. It can be useful to allow minor security features to use advanced management options, like user-space configuration files and tools, without too much overhead”.

Secure Memory Encryption (SME). Tom Lendacky posted version 8 of a patch series that implements support in Linux for this feature of certain future AMD CPUs. “SME can be used to mark individual pages of memory as encrypted through the page tables. A page of memory that is marked encrypted will be automatically decrypted when read from DRAM and will be automatically encrypted when written to DRAM”. In other words, SME allows a datacenter operator to build systems in which all data leaving the SoC is encrypted either at rest (on disk), or when hitting external memory buses that might (theoretically) be monitored. When combined with other features, such as “another AMD processor feature called Secure Encrypted Virtualization (SEV)”, it becomes possible to protect user data from intrusive monitoring by hypervisor operators (whether malicious or coerced). This is the correct way to provide memory encryption. While others have built a nonsense known as “enclaves”, the AMD approach correctly solves a more general problem. The AMD patches update various pieces of kernel infrastructure, from the UEFI code to IOMMU support, to carry page encryption state through.

SMIs. Kan Liang posted version 2 of a patch entitled “measure SMI cost (user)” which adds a “new sysfs entry /sys/device/cpu/freeze_on_smi” which will cause the “FREEZE_WHILE_SMM” bit in the Intel “IA32_DEBUGCTL” processor control register to be set. Once it is set, “the PMU core counters will freeze on SMI handler”. This can be used with the “new --smi-cost mode in perf stat…to measure the SMI cost by calculating unhalted core cycles and aperf results”. SMIs, or “System Management Interrupts”, are also referred to as “cycle stealing” in that they are used by platform firmware to perform various housekeeping tasks using the application processor cores, usually without either the Operating System’s, or the user’s knowledge. SMIs are used by OEMs and ODMs to “add value”, but they are also used for such things as system fan control and other essentials. What should happen, of course, is that a generic management controller should be defined to handle this, but it was easier for the industry to build the mess that is SMIs, and for Intel to then add tracking for users to see where bad latencies come from.

Speculative Page Faults. Laurent Dufour posted version 5 of a patch series entitled “Speculative page faults”, which is “a port on kernel 4.12 of the work done by Peter Zijlstra to handle page fault without holding the mm semaphore”. As he notes, “The idea is to try to handle user space page faults without holding the mmap_sem [a per-task – the kernel side name for a running process – semaphore that is shared by all threads within a process]. This should allow better concurrency for massively threaded processes since the page fault handler will not wait for other threads[‘] memory layout change to be done, assuming that this change is done in another part of the process’s memory space. This type of page fault is named speculative page fault. If the speculative page fault fails because a concurrency is detected or because underlying PMD [Page Middle Directory] or PTE [Page Table Entry] tables are not yet allocat[ed], it [fails] its processing and a classic page fault is then tried”.

THP. Kirill A. Shutemov posted a “HELP-NEEDED” thread entitled “Do not lose dirty bit on THP pages”, in which he notes that Vlastimil Babka “noted that pmdp_invalidate [Page Middle Directory Pointer invalidate] is not atomic and we can lose dirty and access bits if CPU sets them after pmdp dereference, but before set_pmd_at()”. Kirill notes that this doesn’t currently happen to lead to user-visible problems in the current kernel, but “fixing this would be critical for future work on THP: both huge-ext4 and THP [Transparent Huge Pages] swap out rely on proper dirty tracking”. By access and dirty tracking, Kirill means page table bits that indicate whether a page has been accessed or contains dirty data which should be written back to storage. Such bits are updated by hardware automatically on memory access. He adds that “Unfortunately, there’s no way to address the issue in a generic way. We need to fix all architectures that support THP one-by-one”. Hence the topic of the thread containing the words “HELP-NEEDED”. Martin Schwidefsky had some feedback to the proposed solution that it would not work on s390, but that if pmdp_invalidate returned the old entry, that could be used in order to update certain logic based on the dirty bits. Andrea Arcangeli replied to Martin, “That to me seems the simplest fix”. Separately, Kirill posted the “Last bits for initial 5-level paging” on x86.

Timers. Christoph Hellwig posted “RFC: better timer interface”, a patch series which “attempts to provide a “modern” timer interface where the callback gets the timer_list structure as an argument so that it can use container_of instead of having to cast to/from unsigned long all the time”. Arnd Bergmann noted that “This looks really nice, but what is the long-term plan for the interface? Do you expect that we will eventually change all 700+ users of timer_list to the new type, or do we keep both variants around indefinitely to avoid having to do mass-conversions?”. Christoph thought it was possible to perform a wholesale conversion, but that “it might take some time”.
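
The mechanics of container_of are easy to demonstrate outside the kernel. This runnable userspace mock-up (my sketch of the proposed shape, not the actual patches) shows a callback that takes the timer itself and recovers its containing structure without the old unsigned long cast:

    /* Hedged mock-up of the proposed callback shape, in userspace. */
    #include <stddef.h>
    #include <stdio.h>

    /* userspace re-implementation of the kernel's container_of */
    #define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

    struct timer_list { int dummy; };   /* stand-in for the kernel type */

    struct my_device {
        int pending;
        struct timer_list timer;
    };

    /* new shape: the callback receives the timer_list itself and
     * recovers the containing object via container_of */
    static void my_timer_fn(struct timer_list *t)
    {
        struct my_device *dev = container_of(t, struct my_device, timer);

        dev->pending = 0;
        printf("pending cleared for device %p\n", (void *)dev);
    }

    int main(void)
    {
        struct my_device dev = { .pending = 1 };

        my_timer_fn(&dev.timer);   /* simulate the timer firing */
        return 0;
    }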

Thunderbolt. Mika Westerberg posted version 3 of a patch series implementing “Thunderbolt security levels and NVM firmware upgrade”. Apparently, “PCs running Intel Falcon Ridge or newer need these in order to connect devices if the security level is set to “user(SL1) or secure(SL2)” from BIOS” and “The security levels were added to prevent DMA attacks when PCIe is tunneled over Thunderbolt fabric where IOMMU is not available or cannot be enabled for different reasons”. While cool, it is slightly saddening that some of the awesome demos from recent DEFCONs will be slightly harder to reproduce by nation state actors and those who really need to get outside more often.

VAS. Sukadev Bhattiprolu posted version 5 of a patch series entitled “Enable VAS”, a “hardware subsystem referred to as the Virtual Accelerator Switchboard” in the IBM POWER9 architecture. According to Sukadev, “VAS allows kernel subsystems and user space processes to directly access the Nest Accelerator (NX) engines which implement compression and encryption algorithms in the hardware”. In other words, these are simple workload acceleration engines that were previously only available using special (“icswx”) privileged instructions in earlier versions of POWER machines and are now to be available to userspace applications through a multiplexing API.

WMI. Darren Hart posted an updated “Convert WMI to a proper bus” patch series, which “converts WMI [Windows Management Instrumentation] into a proper bus, adds some useful information via sysfs, and exposes the embedded MOF binary. It converts dell-wmi to use the WMI bus architecture”. WMI is required to manage various contemporary (especially laptop) hardware, including backlights.

Xen. Juergen Gross posted “xen: add sysfs node for guest type” which provides information known to the guest kernel but not previously exposed to userspace, including the type of virtualization in use (HVM, PV, or PVH), and so on.

zRam. Minchan Kim posted an RFC patch entitled “writeback incompressible pages to storage”, which seeks to have the best of both worlds – compression of RAM while handling cases where memory is incompressible. In the case that an admin sets up a suitable block device, it can be arranged that incompressible pages are written out to storage instead of occupying RAM.

zswap. Srividya Desireddy posted version 2 of a patch that seeks to explicitly test for so-called “zero-filled” pages before submitting them for compression. This saves time and energy, and reduces application startup time (on the order of about 3% in the example given).

 

July 08, 2017 09:31 PM

July 06, 2017

Rusty Russell: Broadband Speeds, 2 Years Later

Two years ago, considering the blocksize debate, I made two attempts to measure average bandwidth growth, first using Akamai serving numbers (which gave an answer of 17% per year), and then using fixed-line broadband data from OFCOM UK, which gave an answer of 30% per annum.

We have two years more of data since then, so let’s take another look.

OFCOM (UK) Fixed Broadband Data

First, the OFCOM data:

So in the last two years, we’ve seen a 26% increase in download speed, and a 22% increase in upload, bringing us down from 36/37% to 33% over the 8 years. The divergence of download and upload improvements is concerning (I previously assumed they were the same, but we have to design for the lesser of the two for a peer-to-peer system).
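
As a quick consistency check, and assuming (my reading, not stated explicitly here) that these percentages are annualized rates: blending six earlier years at roughly 36% with the last two years at 26% reproduces the quoted 8-year average,

    \left(1.36^{6} \times 1.26^{2}\right)^{1/8} - 1 \approx 0.334,

i.e. about 33% per annum.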

The idea that upload speed may be topping out is reflected in the Nov-2016 report, which notes only an 8% upload increase in services advertised as “30Mbit” or above.

Akamai’s State Of The Internet Reports

Now let’s look at Akamai’s Q1 2016 report and Q1-2017 report.

This gives an estimate of 19% per annum in the last two years. Reassuringly, the US and UK (both fairly high-bandwidth countries, considered in my previous post to be a good estimate for the future of other countries) have increased by 26% and 19% in the last two years, indicating there’s no immediate ceiling to bandwidth.

You can play with the numbers for different geographies on the Akamai site.

Conclusion: 19% Is A Conservative Estimate

17% growth now seems a little pessimistic: in the last 9 years the US Akamai numbers suggest the US has increased by 19% per annum, the UK by almost 21%.  The gloss seems to be coming off the UK fixed-broadband numbers, but they’re still 22% upload increase for the last two years.  Even Australia and the Philippines have managed almost 21%.

July 06, 2017 10:01 AM

June 29, 2017

Linux Plumbers Conference: Containers Microconference accepted into Linux Plumbers Conference

Following on from the Containers Microconference last year, we’re pleased to announce there will be a follow on at Plumbers in Los Angeles this year.

The agenda for this year will focus on unsolved issues and other problem areas in the Linux kernel container interfaces, with the goal of allowing all container runtimes and orchestration systems to provide enhanced services.  Of particular interest is the unprivileged use of container APIs, which can be used both to enable self-containerising applications and to deprivilege (make more secure) container orchestration systems.  In addition we will be discussing the potential addition of new namespaces: LSM for per-container security modules; IMA for per-container integrity and appraisal; and file capabilities to allow setcap binaries to run within unprivileged containers.

For more details on this, please see this microconference’s wiki page.

We hope to see you there!

June 29, 2017 05:59 PM

June 20, 2017

Arnaldo Carvalho de Melo: Pahole in the news

Found another interesting article, this time mentioning a tool I wrote long ago that, at least for kernel object files, has been working for a long time without much care on my part: pahole. Go read a bit about it in Will Cohen’s “How to avoid wasting megabytes of memory a few bytes at a time” article.

Guess I should try running a companion script that tries to process all .o files in debuginfo packages to see how bad it is for non-kernel files, with all the DWARF changes over these years…
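
For anyone who hasn’t met the tool: pahole reports the alignment holes the compiler leaves in structs. A tiny made-up example of the kind of waste it finds (on a typical 64-bit ABI the first layout occupies 24 bytes, the reordered one 16):

    /* Hedged example of struct padding; the struct names are made up. */
    #include <stdio.h>

    struct padded {
        int   a;   /* 4 bytes, then a 4-byte hole */
        void *p;   /* 8 bytes, 8-byte aligned */
        int   b;   /* 4 bytes, then 4 bytes of tail padding */
    };

    struct reordered {
        void *p;   /* 8 bytes */
        int   a;   /* 4 bytes */
        int   b;   /* 4 bytes; no holes left */
    };

    int main(void)
    {
        printf("padded: %zu, reordered: %zu\n",
               sizeof(struct padded), sizeof(struct reordered));
        return 0;
    }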


June 20, 2017 03:49 PM

June 15, 2017

Linux Plumbers Conference: Early Bird Rate Registration Ending Soon

A reminder that our Early Bird registration rate is ending soon. The last day at the Early Bird rate of $400 is Sunday, June 18th. We are also almost sold out of Early Bird slots (15% of our quota left). Get yours soon!
Starting June 19th, registration will be at the regular rate of $550.
Please see the Attend page for info.

June 15, 2017 11:20 PM

June 14, 2017

Paul E. Mc Kenney: Stupid RCU Tricks: Simplifying Linux-kernel RCU

The last month or two has seen a lot of work simplifying the Linux-kernel RCU implementation, with more than 2700 net lines of code removed. The remainder of this post lists the user-visible changes, along with alternative ways to get the corresponding job done.


  1. The infamous CONFIG_RCU_KTHREAD_PRIO Kconfig parameter is now defunct, but the rcutree.kthread_prio kernel boot parameter gets the job done (see the example boot command line after this list).
  2. The CONFIG_NO_HZ_FULL_SYSIDLE Kconfig parameter has kicked the bucket. There is no replacement because no one was using it. If you need it, revert the -rcu commit tagged by sysidle.2017.05.11a.
  3. The CONFIG_PROVE_RCU_REPEATEDLY Kconfig parameter is no more. There is no replacement because as far as I know, no one has used it for many years. It was a great help in tracking down lockdep-RCU warnings back in the day, but these warnings are now sufficiently rare that finding them one boot at a time is no longer a problem. If you need it, do the obvious hacking on Kconfig and lockdep.c.
  4. The CONFIG_SPARSE_RCU_POINTER Kconfig parameter now rests in peace. There is no replacement because there doesn't seem to be any reason for RCU's sparse checking to be the only such checking that is optional. If you really need to disable RCU's sparse checking, hand-edit the definition as needed.
  5. The CONFIG_CLASSIC_SRCU Kconfig parameter bought the farm. This was only present to handle massive failures of the new Tree/Tiny SRCU implementations, but these appear to be quite reliable and should be used instead of Classic SRCU.
  6. RCU's debugfs tracing is done for. As far as I know, I was the only real user, and I haven't used it in years. If you need it, revert the -rcu commit tagged by debugfs.2017.05.15a.
  7. The CONFIG_RCU_NOCB_CPU_NONE, CONFIG_RCU_NOCB_CPU_ZERO, and CONFIG_RCU_NOCB_CPU_ALL Kconfig parameters have departed. Use the rcu_nocbs kernel boot parameter instead, which can do quite a bit more than those Kconfig parameters ever could.
  8. Tiny RCU's event tracing and RCU CPU stall warnings are now pushing up daisies. The point of Tiny RCU is to be tiny and educational, and these added features were not helping reach either of these two goals. The replacement is to reproduce the problem with Tree RCU.
  9. These changes should matter only to people running rcutorture:

    1. The CONFIG_RCU_TORTURE_TEST_SLOW_PREINIT and CONFIG_RCU_TORTURE_TEST_SLOW_PREINIT_DELAY Kconfig parameters have been entombed: Use the rcutree.gp_preinit_delay kernel boot parameter instead.
    2. The CONFIG_RCU_TORTURE_TEST_SLOW_INIT and CONFIG_RCU_TORTURE_TEST_SLOW_INIT_DELAY Kconfig parameters have given up the ghost: Use the rcutree.gp_init_delay kernel boot parameter instead.
    3. The CONFIG_RCU_TORTURE_TEST_SLOW_CLEANUP and CONFIG_RCU_TORTURE_TEST_SLOW_CLEANUP_DELAY Kconfig parameters have passed on: Use the rcutree.gp_cleanup_delay kernel boot parameter instead.
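
To illustrate the Kconfig-to-boot-parameter moves above, a boot command line might now carry settings like the following (the values are purely illustrative, not recommendations):

    rcu_nocbs=1-7 rcutree.kthread_prio=49 rcutree.gp_init_delay=3

Here rcu_nocbs=1-7 offloads RCU callbacks from CPUs 1-7, rcutree.kthread_prio sets the priority of RCU's kthreads, and rcutree.gp_init_delay slows grace-period initialization for rcutorture runs.
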
There will probably be a few more simplifications in the near future, but this should be at least enough for one merge window!

June 14, 2017 09:03 PM

June 12, 2017

Linux Plumbers Conference: RDMA Microconference Accepted into the Linux Plumbers Conference

Following on from the successful RDMA Microconference last year, which resulted in a lot of fruitful discussions, we’re pleased to announce there will be a follow on at Plumbers in Los Angeles this year.

In addition to looking at the usual kernel core gaps and ABI issues, documentation and testing, we’ll also be looking at new fabrics (including NVMe), challenges in implementing virtual RDMA devices, and integration possibilities with netdev.

For more details on this, please see this microconference’s wiki page.

June 12, 2017 09:53 PM