Kernel Planet

October 16, 2017

Greg Kroah-Hartman: Linux Kernel Community Enforcement Statement FAQ

Based on the recent Linux Kernel Community Enforcement Statement and the article describing the background and what it means, here are some questions and answers to help clear things up. These are based on questions that came up when the statement was discussed among the initial round of over 200 different kernel developers.

Q: Is this changing the license of the kernel?

A: No.

Q: Seriously? It really looks like a change to the license.

A: No, the license of the kernel is still GPLv2, as before. The kernel developers are providing certain additional promises that they encourage users and adopters to rely on. And by having a specific acking process it is clear that those who ack are making commitments personally (and perhaps, if authorized, on behalf of the companies that employ them). There is nothing that says those commitments are somehow binding on anyone else. This is exactly what we have done in the past when some but not all kernel developers signed off on the driver statement.

Q: Ok, but why have this “additional permissions” document?

A: In order to help address problems caused by current and potential future copyright “trolls” aka monetizers.

Q: Ok, but how will this help address the “troll” problem?

A: “Copyright trolls” use the GPL-2.0’s immediate termination and the threat of an immediate injunction to turn an alleged compliance concern into a contract claim that gives the troll an automatic claim for money damages. The article by Heather Meeker describes this quite well, please refer to that for more details. If even a short delay is inserted for coming into compliance, that delay disrupts this expedited legal process.

By simply saying, “We think you should have 30 days to come into compliance”, we undermine the “immediacy” that supports the request to the court for an immediate injunction. The threat of an immediate injunction was used to get companies to sign contracts. The troll then goes back after the same company for another known violation shortly afterwards and claims to be owed the financial penalty for breaking the contract. Signing contracts to pay damages to financially enrich one individual is completely at odds with our community’s enforcement goals.

We are showing that the community is not out for financial gain when it comes to license issues – though we do care about the company coming into compliance.  All we want is the modifications to our code to be released back to the public, and for the developers who created that code to become part of our community so that we can continue to create the best software that works well for everyone.

This is all still entirely focused on bringing the users into compliance. The 30 days can be used productively to determine exactly what is wrong, and how to resolve it.

Q: Ok, but why are we referencing GPL-3.0?

A: By using the terms from the GPLv3 for this, we rely on a very well-vetted and understood procedure for granting the opportunity to fix the failure and come into compliance. We benefit from many months of work to reach agreement on a termination provision that worked in legal systems all around the world and was entirely consistent with Free Software principles.

Q: But what is the point of the “non-defensive assertion of rights” disclaimer?

A: If a copyright holder is attacked, we don’t want or need to require that copyright holder to give the party suing them an opportunity to cure. The “non-defensive assertion of rights” is just a way to leave everything unchanged for a copyright holder that gets sued.  This is no different a position than what they had before this statement.

Q: So you are ok with using Linux as a defensive copyright method?

A: There is a current copyright troll problem that is undermining confidence in our community – where a “bad actor” is attacking companies in a way to achieve personal gain. We are addressing that issue. No one has asked us to make changes to address other litigation.

Q: Ok, this document sounds like it was written by a bunch of big companies, who is behind the drafting of it and how did it all happen?

A: Grant Likely, the chairman at the time of the Linux Foundation’s Technical Advisory Board (TAB), wrote the first draft of this document when the first copyright troll issue happened a few years ago. He did this because numerous companies and developers approached the TAB asking that the Linux kernel community do something about this new attack on our community. He showed the document to a lot of kernel developers and a few company representatives in order to get feedback on how it should be worded. After the troll seemed to go away, this work got put on the back-burner.

When the copyright troll showed back up, along with a few other “copycat” individuals, work on the document was restarted by Chris Mason, the current chairman of the TAB. He worked with the TAB members, other kernel developers, lawyers who have been defending against these claims in Germany, and the Linux Foundation’s lawyers, in order to rework the document so that it would actually achieve the intended benefits and be useful in stopping these new attacks. The document was then reviewed and revised with input from Linus Torvalds, and finally a document that the TAB agreed would be sufficient was finished.

That document was then sent by Greg Kroah-Hartman to over 200 of the most active kernel developers of the past year to see if they, or their company, wished to support the document. That produced the initial “signatures” on the document, and the acks of the patch that added it to the Linux kernel source tree.

Q: How do I add my name to the document?

A: If you are a developer of the Linux kernel, simply send Greg a patch adding your name to the proper location in the document (sorting the names by last name), and he will be glad to accept it.

Q: How can my company show its support of this document?

A: If you are a developer working for a company that wishes to show that it also agrees with this document, have the developer put the company name in parentheses ‘(’ ‘)’ after the developer’s name. This shows that both the developer and the company behind the developer are in agreement with this statement.

Q: How can a company or individual that is not part of the Linux kernel community show its support of the document?

A: Become part of our community! Send us patches; surely there is something that you want to see changed in the kernel. If not, wonderful: post something on your company web site or personal blog in support of this statement, we don’t mind that at all.

Q: I’ve been approached by a copyright troll for Netfilter. What should I do?

A: Please see the Netfilter FAQ here for how to handle this.

Q: I have another question, how do I ask it?

A: Email Greg or the TAB, and they will be glad to help answer it.

October 16, 2017 09:05 AM

Greg Kroah-Hartman: Linux Kernel Community Enforcement Statement

By Greg Kroah-Hartman, Chris Mason, Rik van Riel, and Shuah Khan

The Linux kernel ecosystem of developers, companies and users has been wildly successful by any measure over the last couple of decades. Even today, 26 years after the initial creation of the Linux kernel, the kernel developer community continues to grow, with more than 500 different companies and over 4,000 different developers getting changes merged into the tree during the past year. As Greg says every year, the kernel continues to change faster than the year before; this year we were running at around 8.5 changes an hour, with 10,000 lines of code added, 2,000 modified, and 2,500 removed every hour of every day.

The stunning growth and widespread adoption of Linux, however, also requires ever evolving methods of achieving compliance with the terms of our community’s chosen license, the GPL-2.0. At this point, there is no lack of clarity on the base compliance expectations of our community. Our goals as an ecosystem are to make sure new participants are made aware of those expectations and the materials available to assist them, and to help them grow into our community.  Some of us spend a lot of time traveling to different companies all around the world doing this, and lots of other people and groups have been working tirelessly to create practical guides for everyone to learn how to use Linux in a way that is compliant with the license. Some of these activities include:

Unfortunately, the same processes that we use to assure fulfillment of license obligations and availability of source code can also be used unjustly in trolling activities to extract personal monetary rewards. In particular, issues have arisen as a developer from the Netfilter community, Patrick McHardy, has sought to enforce his copyright claims in secret and for large sums of money by threatening or engaging in litigation. Some of his claims concern compliance issues that could and should easily be resolved. However, he has also made claims based on ambiguities in the GPL-2.0 that no one in our community has ever considered part of compliance.

Examples of these claims include: requiring source distribution for over-the-air firmware; requiring a cell phone maker to deliver a paper copy of the source code offer letter; claiming that the source code server must be set up with a download speed as fast as the binary server, based on the “equivalent access” language of Section 3; requiring the GPL-2.0 text to be delivered in a local language; and many others.

How he goes about this activity was recently documented very well by Heather Meeker.

Numerous active contributors to the kernel community have tried to reach out to Patrick to have a discussion about his activities, but received no response. Further, the Netfilter community suspended Patrick from contributing for violations of their principles of enforcement. The Netfilter community also published their own FAQ on this matter.

While the kernel community has always supported enforcement efforts to bring companies into compliance, we have never even considered enforcement for the purpose of extracting monetary gain.  It is not possible to know an exact figure due to the secrecy of Patrick’s actions, but we are aware of activity that has resulted in payments of at least a few million Euros.  We are also aware that these actions, which have continued for at least four years, have threatened the confidence in our ecosystem.

Because of this, and to help clarify what the majority of Linux kernel community members feel is the correct way to enforce our license, the Technical Advisory Board of the Linux Foundation has worked together with lawyers in our community, individual developers, and many companies that participate in the development of, and rely on, Linux to draft a Kernel Enforcement Statement, both to help address the specific issue we are facing today and to help prevent any future issues like it from happening again.

A key goal of all enforcement of the GPL-2.0 license has and continues to be bringing companies into compliance with the terms of the license. The Kernel Enforcement Statement is designed to do just that.  It adopts the same termination provisions we are all familiar with from GPL-3.0 as an Additional Permission giving companies confidence that they will have time to come into compliance if a failure is identified. Their ability to rely on this Additional Permission will hopefully re-establish user confidence and help direct enforcement activity back to the original purpose we have all sought over the years – actual compliance.  

Kernel developers in our ecosystem may put their own acknowledgement to the Statement by sending a patch to Greg adding their name to the Statement, like any other kernel patch submission, and it will be gladly merged. Those authorized to ‘ack’ on behalf of their company may add their company name in (parenthesis) after their name as well.

Note, a number of questions did come up when this was discussed with the kernel developer community. Please see Greg’s FAQ post answering the most common ones if you have further questions about this topic.

October 16, 2017 09:00 AM

Pavel Machek: Help time travelers!

Ok, so I have various machines here. It seems only about half of them have a working RTC. Those are the boring ones.

And even the boring ones have pretty imprecise RTCs... For example the Nokia N9. I only power it up from time to time, and I believe it drifts something like a minute per month... For normal use with a SIM card, it can probably correct itself from the GSM network if you happen to have a cell phone signal, but...

More interesting machines... The old ThinkPad is running without a CMOS battery. The ARM OLPC has _three_ RTCs, but not a single working one. The N900 has a working RTC but no, or a dead, backup battery. On these, the RTC driver probably knows the time is not valid, but feeds the garbage into the system time anyway. Ouch. Neither the Sharp Zaurus SL-5500 nor the C-3000 had battery backup for the RTC...

Even in new end-user machines, time quality varies a lot. "First boot, please enter time" is only accurate to seconds, if the user is careful. The RTC is usually not very accurate either... and no one uses adjtime these days. GSM time and ntpdate are probably accurate to milliseconds, and GPS can provide time down to nanoseconds... And broken systems are so common that "swclock" is available in init systems to store the time in a file, so that it at least does not go backwards.

https (and other crypto) depends on time... so it is important to know at least the approximate month we are in.

Is it time we handle it better?

Could we return both time and log2(expected error) from system calls?

That way we could hide the clock in GUI if time is not available or not precise to minutes, ignore certificate dates when time is not precise to months, and you would not have to send me a "Pavel, are you time traveling, again?" message next time my mailer sends email dated to 1970.
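A rough sketch of what such an interface could look like, written as ordinary Python rather than a real system call; the source names and error bounds below are invented for illustration only:

```python
import math
import time

def clock_with_error(source="rtc"):
    """Hypothetical clock query returning (unix_time, log2_error_seconds).

    The sources and error bounds are illustrative assumptions, not any
    real kernel interface.
    """
    error_bounds = {             # rough, assumed error bounds in seconds
        "gps": 1e-7,             # GPS-disciplined clock: ~100 ns
        "ntp": 1e-2,             # NTP-synced: ~10 ms
        "gsm": 1e-3,             # network time from a cell tower
        "rtc": 60.0,             # drifting battery-backed RTC: ~1 minute
        "none": 1e9,             # no valid source: decades of uncertainty
    }
    return time.time(), math.log2(error_bounds[source])

def should_show_clock(log2_err):
    # Hide the clock in the GUI unless time is good to about a minute.
    return log2_err <= math.log2(60)

def can_check_cert_dates(log2_err):
    # Certificate validity needs time good to roughly a month.
    return log2_err <= math.log2(30 * 24 * 3600)

t, err = clock_with_error("rtc")
assert should_show_clock(err)          # a minute of drift: still show the clock
t, err = clock_with_error("none")
assert not can_check_cert_dates(err)   # no valid time source: skip cert dates
```

The point is that callers could then make policy decisions (show the clock, trust certificate dates) against an explicit error bound, instead of blindly trusting a possibly garbage wall-clock value.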

October 16, 2017 07:38 AM

October 14, 2017

James Bottomley: Using Elliptic Curve Cryptography with TPM2

One of the most significant advances going from TPM1.2 to TPM2 was the addition of algorithm agility: The ability of TPM2 to work with arbitrary symmetric and asymmetric encryption schemes.  In practice, in spite of this much vaunted agile encryption capability, most actual TPM2 chips I’ve seen only support a small number of asymmetric encryption schemes, usually RSA2048 and a couple of Elliptic Curves.  However, the ability to support any Elliptic Curve at all is a step up from TPM1.2.  This blog post will detail how elliptic curve schemes can be integrated into existing cryptographic systems using TPM2.  However, before we start on the practice, we need at least a tiny swing through the theory of Elliptic Curves.

What is an Elliptic Curve?

An Elliptic Curve (EC) is simply the set of points that lie on the curve in the two dimensional plane (x,y) defined by the equation

y² = x³ + ax + b

which means that every elliptic curve can be parametrised by two constants a and b.  The set of all points lying on the curve plus a point at infinity is combined with an addition operation to produce an abelian (commutative) group.  The addition property is defined by drawing straight lines between two points and seeing where they intersect the curve (or picking the infinity point if they don’t intersect).  Wikipedia has a nice diagrammatic description of this here.  The infinity point acts as the identity of the addition rule and the whole group is denoted E.

The utility for cryptography is that you can define an integer multiplier operation which is simply the element added to itself n times, so for P ∈ E, you can always find Q ∈ E such that

Q = P + P + P … = n × P

And, since it’s a simple multiplication-like operation, it’s very easy to compute Q.  However, given P and Q it is computationally very difficult to get back to n.  Recovering n is the discrete logarithm problem in the elliptic curve group, the same class of hard problem that underpins classical Diffie-Hellman (RSA rests on the related problem of integer factorisation).  This also means that EC keys suffer the same (actually more so) problems as RSA keys: they’re not Quantum Computing secure (Shor’s algorithm solves both discrete logarithms and factoring) and they would be instantly compromised if an efficient solution to the discrete logarithm problem were ever found.

Therefore, for any elliptic curve, E, you can choose a known point G ∈ E, select a large integer d and you can compute a point P = d × G.  You can then publish (P, G, E) as your public key knowing it’s computationally infeasible for anyone to derive your private key d.

For instance, Diffie-Hellman key exchange can be done by agreeing (E, G) and getting Alice and Bob to select private keys dA, dB.  Then knowing Bob’s public key PB, Alice can select a random integer r, which she publishes, and compute a key agreement as a secret point on the Elliptic Curve (r dA) × PB.  Bob can derive the same Elliptic Curve point because

(r dA) × PB = (r dA dB) × G = (r dB dA) × G = (r dB) × PA

The agreement is a point on the curve, but you can use an agreed hashing or other mechanism to get from the point to a symmetric key.

Seems simple, but the problem for computing is that we really want to use integers and right at the moment the elliptic curve is defined over all the real numbers, meaning E is of infinite size and involves floating point computations (of rather large precision).

Elliptic Curves over Finite Fields

Fortunately there is a mathematical theory of finite fields, called Galois Theory, which allows us to take the Galois Field over a prime number p, denoted GF(p), and compute Elliptic Curve points over this field.  This derivation, which is mathematically rather complicated, is denoted E(GF(p)), where every point (x,y) is represented by a pair of integers between 0 and p-1.  There is another theorem (Hasse’s theorem) which says that the number of elements in E(GF(p))

n = |E(GF(p))|

is roughly the same size as p, meaning if you choose a 32 bit prime p, you likely have a group of roughly 2^32 elements.  For every point P in E(GF(p)) it is also mathematically provable that n × P = 0, where 0 is the zero point (which was the infinity point in the real elliptic curve).

This means that you can take any point, G,  in E(GF(p)) and compute a subgroup based on it:

EG = { m × G : m ∈ Zn }

If you’re lucky |EG| = |E(GF(p))| and G is the generator of the entire group.  However, G may only generate a subgroup, in which case |EG| = |E(GF(p))|/h, where the integer h is called the cofactor.  In general you want the cofactor to be small (preferably less than four) for EG to be cryptographically useful.

For a computer’s purposes, EG is the elliptic curve group used for integer arithmetic in the cryptographic algorithms.  The curve and generator are then defined by (p, a, b, Gx, Gy, n, h), which are the published parameters of the key (Gx, Gy represent the x and y coordinates of point G).  You select a random number d as your private key and your public key is P = d × G exactly as above, except now P is easy to compute with integer operations.
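To make the integer arithmetic concrete, here is a Python sketch using a standard textbook-sized curve (y² = x³ + 2x + 2 over GF(17), with G = (5,1) generating all n = 19 points); real curves use primes of 256 bits or more:

```python
# Toy curve y^2 = x^3 + 2x + 2 over GF(17), with generator G = (5, 1) of
# order n = 19 (a standard textbook example; not secure at this size).
p, a, b = 17, 2, 2
G, n = (5, 1), 19
O = None  # the point at infinity, the group identity

def ec_add(P, Q):
    """Add two points using the chord-and-tangent rule, with arithmetic mod p."""
    if P is O:
        return Q
    if Q is O:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return O  # P + (-P) = identity
    if P == Q:
        s = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p  # tangent slope
    else:
        s = (y2 - y1) * pow(x2 - x1, -1, p) % p         # chord slope
    x3 = (s * s - x1 - x2) % p
    return (x3, (s * (x1 - x3) - y1) % p)

def ec_mul(k, P):
    """Double-and-add: k x P takes only O(log k) group operations,
    which is why the 'forward' direction of EC crypto is cheap."""
    R = O
    while k:
        if k & 1:
            R = ec_add(R, P)
        P = ec_add(P, P)
        k >>= 1
    return R

assert ec_mul(n, G) is O  # n x G = 0, as stated in the text

# Key generation and Diffie-Hellman agreement exactly as described above:
dA, dB = 3, 7                            # private keys
PA, PB = ec_mul(dA, G), ec_mul(dB, G)    # published public keys
assert ec_mul(dA, PB) == ec_mul(dB, PA)  # both sides reach the same point
```

Going the other way (recovering d from P and G) on a real-sized curve would require solving the discrete logarithm problem, which is exactly the asymmetry the cryptography relies on.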

Problems with Elliptic Curves

Although I stated above that solving P = d × G is equivalent in difficulty to the discrete logarithm problem, that’s not generally true.  If the discrete logarithm problem were solved, then we’d easily be able to compute d for every generator and curve, but it is possible to pick curves for which d can be easily computed without solving the discrete logarithm problem. This is the reason why you should never pick your own curve parameters (even if you think you know what you’re doing), because it’s very easy to choose a compromised curve.  As a demonstration of the difficulty of the problem: each of the major nation state actors, Russia, China and the US, publishes their own curve parameters for use in their own cryptographic EC implementations, and each of them thinks the parameters published by the others are compromised in a way that allows the respective national security agencies to derive private keys.  So if nation state actors can’t tell whether a curve is compromised or not, you surely won’t be able to either.

Therefore, to be secure in EC cryptography, you pick an existing curve which has been vetted and select some random Generator Point on it.  Of course, if you’re paranoid, that means you won’t be using any of the nation state supplied curves …

Using the TPM2 with Elliptic Curves in Cryptosystems

The initial target for this work was the openssl cryptosystem, whose libraries are widely used to build other systems (like https in apache, or openssh). Originally, when I did the initial TPM2 enabling of openssl as described in this blog post, I added TPM2 as a patch to the existing TPM 1.2 openssl_tpm_engine.  Unfortunately, openssl_tpm_engine seems to be pretty much defunct at this point, so I started my own openssl_tpm2_engine as a separate git tree to begin experimenting with Elliptic Curve keys (if you don’t use git, you can download the tar file here). One of the benefits of running my own source tree is that I can now add a testing infrastructure that makes use of the IBM TPM emulator to make sure that the basic cryptographic operations all work, which means that make check functions even when a TPM2 isn’t available.  The current key creation and import algorithms use secured connections to the TPM (to avoid eavesdropping), which means it’s only really possible to construct them using the IBM TSS. To make all of this easier, I’ve set up an openSUSE Build Service repository which is building for all major architectures and the openSUSE and Fedora distributions (ignore the failures: they’re currently induced because the TPM emulator only works on 64 bit little endian systems, so make check is failing, but the TPM people at IBM are working on this, so eventually the builds should be complete).

TPM2 itself also has some annoying restrictions.  The biggest of which is that it doesn’t allow you to pass in arbitrary elliptic curve parameters; you may only use elliptic curves which the TPM itself knows.  This will be annoying if you have an existing EC key you’re trying to import because the TPM may reject it as an unknown algorithm.  For instance, openssl can actually compute with arbitrary EC parameters, but has 39 current elliptic curves parametrised by name. By contrast, my Nuvoton TPM2 inside my Dell XPS 13 knows precisely two curves:

jejb@jarvis:~> create_tpm2_key --list-curves
prime256v1
bnp256

However, assuming you’ve picked a compatible curve for your EC private key (and you’ve defined a parent key for the storage hierarchy) you can simply import it to a TPM bound key:

create_tpm2_key -p 81000001 -w key.priv key.tpm

The tool will report an error if it can’t convert the curve parameters to a named elliptic curve known to the TPM:

jejb@jarvis:~> openssl genpkey -algorithm EC -pkeyopt ec_paramgen_curve:brainpoolP256r1 > key.priv
jejb@jarvis:~> create_tpm2_key -p 81000001 -w key.priv key.tpm
TPM does not support the curve in this EC key
openssl_to_tpm_public failed with 166
TPM_RC_CURVE - curve not supported Handle number unspecified

You can also create TPM resident private keys simply by specifying the algorithm:

create_tpm2_key -p 81000001 --ecc bnp256 key.tpm

Once you have your TPM based EC keys, you can use them to create public keys and certificates.  For instance, you can create a self-signed X509 certificate based on the TPM key by:

openssl req -new -x509 -sha256  -key key.tpm -engine tpm2 -keyform engine -out my.crt

Why you should use EC keys with the TPM

The initial attraction is the same as for RSA keys: making it impossible to extract your private key from the system.  However, the mathematical calculations for EC keys are much simpler than for RSA keys and don’t involve finding strong primes, so it’s much simpler for the TPM (being a fairly weak calculation machine) to derive private and public EC keys.  For instance, the times taken to derive an RSA key and an EC key from the primary seed differ dramatically:

jejb@jarvis:~> time tsscreateprimary -hi o -ecc bnp256 -st
Handle 80ffffff

real 0m0.111s
user 0m0.000s
sys 0m0.014s

jejb@jarvis:~> time tsscreateprimary -hi o -rsa -st
Handle 80ffffff

real 0m20.473s
user 0m0.015s
sys 0m0.084s

so for a slow system like the TPM, using EC keys is a significant speed advantage.  Additionally, there are other advantages.  The standard EC key signature algorithm is an elliptic curve modification of the NIST Digital Signature Algorithm, called ECDSA.  However, DSA and ECDSA require a cryptographically strong (and secret) random number for every signature, as Sony found out to its cost in the PlayStation 3 EC key compromise.  The TPM is a good source of cryptographically strong random numbers, and if it generates the signature internally, you can be absolutely sure of keeping the input random number secret.
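The PlayStation 3 failure mode can be sketched on the same kind of textbook toy curve (y² = x³ + 2x + 2 over GF(17), |⟨G⟩| = 19): if the "random" k is reused for two signatures, simple modular arithmetic recovers both the nonce and the private key. This is an illustration only, with unhashed integer "messages", not production ECDSA:

```python
# Why ECDSA nonce reuse is fatal: toy curve y^2 = x^3 + 2x + 2 over GF(17),
# generator G = (5, 1) of prime order n = 19.  Illustration only.
p, a = 17, 2
G, n = (5, 1), 19

def ec_add(P, Q):
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None  # the point at infinity
    if P == Q:
        s = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p
    else:
        s = (y2 - y1) * pow(x2 - x1, -1, p) % p
    x3 = (s * s - x1 - x2) % p
    return (x3, (s * (x1 - x3) - y1) % p)

def ec_mul(k, P):
    R = None
    while k:
        if k & 1:
            R = ec_add(R, P)
        P = ec_add(P, P)
        k >>= 1
    return R

def sign(z, d, k):
    """Textbook ECDSA: r = (k x G).x mod n, s = k^-1 (z + r d) mod n."""
    r = ec_mul(k, G)[0] % n
    return r, pow(k, -1, n) * (z + r * d) % n

d = 7            # the signer's private key
k = 5            # the nonce -- FATALLY reused for two messages
z1, z2 = 11, 4   # two (toy, unhashed) message values
r, s1 = sign(z1, d, k)
_, s2 = sign(z2, d, k)

# An attacker seeing both signatures recovers k, then d, mod n:
k_rec = (z1 - z2) * pow(s1 - s2, -1, n) % n
d_rec = (s1 * k_rec - z1) * pow(r, -1, n) % n
assert (k_rec, d_rec) == (k, d)
```

The recovery step uses nothing but the two public signatures and the curve order, which is why having the TPM generate (and keep) the random number internally matters so much.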

Why you might want to avoid EC keys altogether

In spite of the many advantages described above, EC keys suffer one additional disadvantage over RSA keys in that Elliptic Curves in general are a very hot field of mathematical research, so even if the curve you use today is genuinely not compromised, it’s not impossible that a mathematical advance tomorrow will make the curve you chose (and thus all the private keys you generated) vulnerable.  Of course, the same goes for RSA if anyone ever cracks the integer factorisation problem, but solving that problem would likely be fully published, to world acclaim and recognition as a significant contribution to the advancement of number theory.  Discovering an attack on a currently used elliptic curve, on the other hand, might be better remunerated by offering to sell it privately to one of the national security agencies …

October 14, 2017 10:52 PM

October 11, 2017

Paul E. Mc Kenney: Stupid RCU Tricks: In the audience for a pair of RCU talks!

I had the privilege of attending CPPCON last month. Michael Wong, Maged Michael, and I presented a parallel-programming overview, in which I presented the "Hardware and its Habits" chapter of Is Parallel Programming Hard, And, If So, What Can You Do About It?.

But the highlight for me was actually sitting in the audience for a pair of talks by people who had implemented RCU in C++.

Ansel Sermersheim presented a two-part talk entitled Multithreading is the answer. What is the question?. The second part of this talk covered lockless containers, and used a variant of RCU to implement a low-overhead libGuarded facility in order to more easily avoid deadlocks. The implementation is similar to the Linux-kernel real-time RCU implementation by Jim Houston and Joe Korty in that the counterpart to rcu_read_unlock() actively registers a quiescent state. Ansel's implementation goes further by also driving callback invocation from rcu_read_unlock(). Now I don't recommend this for a general-purpose RCU implementation due to the possibility of deadlock should a resource need to be held across rcu_read_unlock() and acquired within the callback. However, this approach should work just fine in the case where the callbacks just free memory and the memory allocator does not contain too many RCU read-side critical sections.
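As a rough illustration (in Python, and not any real library's API), the scheme described, where unlock actively registers a quiescent state and drives callback invocation, might be sketched like this; the class and method names are invented for the sketch:

```python
# A deliberately simplified, lock-based toy model of an RCU variant in which
# read_unlock() registers a quiescent state and invokes ready callbacks.
# Illustration of the idea only, not a usable RCU implementation.
import threading

class ToyRCU:
    def __init__(self):
        self._lock = threading.Lock()
        self._quiescent = {}   # thread id -> quiescent-state counter
        self._callbacks = []   # (counter snapshot at call_rcu time, callback)

    def register_thread(self, tid):
        with self._lock:
            self._quiescent[tid] = 0

    def read_lock(self, tid):
        pass  # readers need no action on entry in this toy scheme

    def read_unlock(self, tid):
        with self._lock:
            self._quiescent[tid] += 1  # actively register a quiescent state
            # A callback is ready once every registered thread has passed a
            # quiescent state since the callback was queued.
            ready = [cb for snap, cb in self._callbacks
                     if all(self._quiescent[t] > snap[t] for t in snap)]
            self._callbacks = [(snap, cb) for snap, cb in self._callbacks
                               if cb not in ready]
        for cb in ready:
            cb()  # callback invocation driven from read_unlock

    def call_rcu(self, cb):
        with self._lock:
            self._callbacks.append((dict(self._quiescent), cb))

rcu = ToyRCU()
freed = []
for t in (1, 2):
    rcu.register_thread(t)

rcu.read_lock(1)
rcu.call_rcu(lambda: freed.append("old version"))
rcu.read_unlock(1)   # thread 2 has not passed a quiescent state yet
assert freed == []
rcu.read_lock(2)
rcu.read_unlock(2)   # now every registered reader has, so the callback runs
assert freed == ["old version"]
```

Even in this toy, the deadlock hazard mentioned above is visible: the callback runs in the context of whichever reader happens to exit last, so any resource held across read_unlock() and also needed by the callback would deadlock.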

Fedor Pikus presented a talk entitled Read, Copy, Update, then what? RCU for non-kernel programmers, in which he gave a quite-decent introduction to use of RCU. This introduction included an improved version of my long-standing where-to-use-RCU diagram, which I fully intend to incorporate. I had a number of but-you-could moments, including the usual "put the size in with the array" advice, ways of updating things already exposed to readers, and the fact that RCU really can tolerate multiple writers, along with some concerns about counter overflow. Nevertheless, an impressive amount of great information in a one-hour talk!

It is very good to see more people making use of RCU!

October 11, 2017 09:47 PM

October 04, 2017

Dave Airlie (blogspot): radv: a conformant Vulkan driver (with caveats)

If you take a look at the conformant vulkan list, you might see entry 220.

Software in the Public Interest, Inc. 2017-10-04 Vulkan_1_0 220
AMD Radeon R9 285 Intel i5-4460 x86_64 Linux 4.13 X.org DRI3.

This is radv, and this is the first conformance submission done under the X.org (SPI) membership of the Khronos adopter program.

This submission was a bit of a trial run for the radv developers; it covered Mesa 17.2 + LLVM 5.0 on Bas's R9 285 card.

We can extend this submission to cover all VI GPUs.

In practice we pass all the same tests on CIK and Polaris GPUs, but we will have to do complete submission runs on those when we get a chance.

But a major milestone/rubberstamp has been reached: radv is now a conformant Vulkan driver. Thanks go to Bas and all the other contributors and people whose code we've leveraged!

October 04, 2017 07:49 PM

October 02, 2017

James Morris: Linux Security Summit 2017 Roundup

The 2017 Linux Security Summit (LSS) was held last month in Los Angeles over the 14th and 15th of September.  It was co-located with Open Source Summit North America (OSSNA) and the Linux Plumbers Conference (LPC).

LSS 2017 sign at conference

LSS 2017

Once again we were fortunate to have general logistics managed by the Linux Foundation, allowing the program committee to focus on organizing technical content.  We had a record number of submissions this year and accepted approximately one third of them.  Attendance was very strong, with ~160 attendees — another record for the event.

LSS 2017 Attendees

LSS 2017 Attendees

On the day prior to LSS, attendees were able to access a day of LPC, which featured two tracks with a security focus:

Many thanks to the LPC organizers for arranging the schedule this way and allowing LSS folk to attend the day!

Realtime notes were made of these microconfs via etherpad:

I was particularly interested in the topic of better integrating LSM with containers, as there is an increasingly common requirement for nesting of security policies, where each container may run its own apparently independent security policy, and also a potentially independent security model.  I proposed the approach of introducing a security namespace, where all security interfaces within the kernel are namespaced, including LSM.  It would potentially solve the container use-cases, and also the full LSM stacking case championed by Casey Schaufler (which would allow entirely arbitrary stacking of security modules).

This would be a very challenging project, to say the least, and one which is further complicated by containers not being a first-class citizen of the kernel.  This leads to security policy boundaries clashing with semantic functional boundaries, e.g. what does it mean from a security policy POV when you have namespaced filesystems but not networking?

Discussion turned to the idea that it is up to the vendor/user to configure containers in a way which makes sense for them, and similarly, they would also need to ensure that they configure security policy in a manner appropriate to that configuration.  I would say this means that semantic responsibility is pushed to the user with the kernel largely remaining a set of composable mechanisms, in relation to containers and security policy.  This provides a great deal of flexibility, but requires those building systems to take a great deal of care in their design.

There are still many issues to resolve, both upstream and at the distro/user level, and I expect this to be an active area of Linux security development for some time.  There were some excellent followup discussions in this area, including an approach which constrains the problem space. (Stay tuned)!

A highlight of the TPMs session was an update on the TPM 2.0 software stack, by Philip Tricca and Jarkko Sakkinen.  The slides may be downloaded here.  We should see a vastly improved experience over TPM 1.x with v2.0 hardware capabilities, and the new software stack.  I suppose the next challenge will be TPMs in the post-quantum era?

There were further technical discussions on TPMs and container security during subsequent days at LSS.  Bringing the two conference groups together here made for a very productive event overall.

TPMs microconf at LPC with Philip Tricca presenting on the 2.0 software stack.

This year, due to the overlap with LPC, we unfortunately did not have any LWN coverage.  There are, however, excellent writeups available from attendees:

There were many awesome talks.

The CII Best Practices Badge presentation by David Wheeler was an unexpected highlight for me.  CII refers to the Linux Foundation’s Core Infrastructure Initiative, a preemptive security effort for Open Source.  The Best Practices Badge Program is a secure development maturity model designed to allow open source projects to improve their security in an evolving and measurable manner.  There’s been very impressive engagement with the project from across open source, and I believe this is a critically important effort for security.

CII Badge Project adoption (from David Wheeler’s slides).

During Dan Cashman’s talk on SELinux policy modularization in Android O,  an interesting data point came up:

Interesting data from the talk: 44% of Android kernel vulns blocked by SELinux due to attack surface reduction. https://t.co/FnU544B3XP

— James Morris (@xjamesmorris) September 15, 2017

We of course expect to see application vulnerability mitigations arising from Mandatory Access Control (MAC) policies (SELinux, Smack, and AppArmor), but if you look closely this refers to kernel vulnerabilities.   So what is happening here?  It turns out that a side effect of MAC policies, particularly those implemented in tightly-defined environments such as Android, is a reduction in kernel attack surface.  It is generally more difficult to reach such kernel vulnerabilities when you have MAC security policies.  This is a side-effect of MAC, not a primary design goal, but nevertheless appears to be very effective in practice!

Another highlight for me was the update on the Kernel Self Protection Project led by Kees, which is now approaching its 2nd anniversary, and continues the important work of hardening the mainline Linux kernel itself against attack.  I would like to also acknowledge the essential and original research performed in this area by grsecurity/PaX, from which this mainline work draws.

From a new development point of view, I’m thrilled to see the progress being made by Mickaël Salaün, on Landlock LSM, which provides unprivileged sandboxing via seccomp and LSM.  This is a novel approach which will allow applications to define and propagate their own sandbox policies.  Similar concepts are available in other OSs such as OSX (seatbelt) and BSD (pledge).  The great thing about Landlock is its consolidation of two existing Linux kernel security interfaces: LSM and Seccomp.  This ensures re-use of existing mechanisms, and aids usability by utilizing already familiar concepts for Linux users.

Mickaël Salaün from ANSSI talking about his Landlock LSM work at #linuxsecuritysummit 2017 pic.twitter.com/wYpbHuLgm2

— LinuxSecuritySummit (@LinuxSecSummit) September 14, 2017

Overall I found it to be an incredibly productive event, with many new and interesting ideas arising and lots of great collaboration in the hallway, lunch, and dinner tracks.

Slides from LSS may be found linked to the schedule abstracts.

We did not have a video sponsor for the event this year, and we’ll work on that again for next year’s summit.  We have discussed holding LSS again next year in conjunction with OSSNA, which is expected to be in Vancouver in August.

We are also investigating a European LSS in addition to the main summit for 2018 and beyond, as a way to help engage more widely with Linux security folk.  Stay tuned for official announcements on these!

Thanks once again to the awesome event staff at LF, especially Jillian Hall, who ensured everything ran smoothly.  Thanks also to the program committee who review, discuss, and vote on every proposal, ensuring that we have the best content for the event, and who work on technical planning for many months prior to the event.  And of course thanks to the presenters and attendees, without whom there would literally and figuratively be no event :)

See you in 2018!

 

October 02, 2017 01:52 AM

September 25, 2017

Pavel Machek: Colorful LEDs

RGB LEDs do not exist according to the Linux LED subsystem. They are modeled as three separate LEDs, red, green and blue; that matches the hardware.

Unfortunately, this has problems. Let's begin with inconsistent naming: some drivers use a :r suffix, some use :red. There's no explicit grouping of the LEDs that form one light -- and thus no place to store parameters common to the whole light. (LEDs could be grouped by name.)

RGB colorspace is pretty well defined, and people expect to set specific colors. Unfortunately, that does not work well with LEDs. First, LEDs are usually not balanced according to the human visual system, so driving all three at full power (255, 255, 255) may not result in white. Second, monitors normally apply gamma correction before displaying a color, so (128, 128, 128) does not correspond to 50% of light being produced. But LEDs normally use PWM, so (128, 128, 128) does correspond to 50% light. The result is that colors are completely off.

I tested the HSV colorspace for the LEDs. That would have the advantage that old triggers could still use selected colors... Unfortunately, on the N900, white is something like 15% blue, which would significantly reduce the number of white intensities we can display.

September 25, 2017 08:30 AM

September 19, 2017

Pavel Machek: Unicsy phone

For a long time, I wanted a phone that runs Unix. And I got that: first Android, then Maemo on the Nokia N900. With Android I realized that running the Linux kernel is not enough. Android is really far away from a normal Unix machine, and I'd argue away from anything usable, too. Maemo was slightly closer, and probably could have been fixed had it been open source.

But I realized the Linux kernel is not really the most important part. There's more to Unix: compatibility with old apps, small programs where each one does one thing well, data in text formats so you can put them in git. Maemo got some parts right; at least you could run old apps in a useful way. But the most important data on the phone (contacts, calendar) were still locked away in sqlite.

And that is something I'd like to change: a phone that is ssh-friendly, text-editor-friendly and git-friendly. I call it the "Unicsy phone". No, I don't want to do `cat addressbook | grep Friend | cut -f 1` on the phone... graphical utilities are okay. But console tools should still be there, and the file formats should be reasonable.

So there is the tui project, and recently the postmarketOS project appeared. The Nokia N900 is mostly supported by the mainline kernel (with the exceptions of Bluetooth and the camera, everything works). There's work to be done, but it looks doable.

More is missing in the userspace. The phone parts need work, as expected. What is more surprising: there's Emacs Org mode, with great calendar capabilities, but I could not find a matching application to display the data nicely and provide alerts. The situation is even worse for contacts; Emacs Org can help there, too, but there does not seem to be agreement that this is the way to go. (And again, graphical applications would be nice.)

September 19, 2017 10:17 PM

September 16, 2017

Pavel Machek: FlightGear fun

How to die in a Boeing 707, quick and easy. Take off, realize that you should set up fuel heating, select the Help menu, aim for checklists... and hit auto startup/shutdown. Instantly lose all the engines. Fortunately, you are at 6000', so you start looking for an airport. Then you realize "hmm, perhaps I can do the startup thing now", and hit the menu item once again. But instead of running engines, you get fire warnings on all the engines. That does not look good. Confirm the fire, extinguish all four engines, and resume looking for an airport in range. Trim for best glide. Then number 3 comes up. Then number 4. Number one, and you know it will be easy. Number two as you fly over the runway... go around and do a normal approach.

September 16, 2017 11:41 AM

September 15, 2017

Michael Kerrisk (manpages): man-pages-4.13 is released

I've released man-pages-4.13. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

This release resulted from patches, bug reports, reviews, and comments from around 40 contributors. The release is rather larger than average. (The context diff runs to more than 90k lines.) The release includes more than 350 commits and contains some fairly wide-ranging formatting fix-ups that meant that all 1028 existing manual pages saw some change(s). In addition, 5 new manual pages were added.

Among the more significant changes in man-pages-4.13 are the following:

A special thanks to Eugene Syromyatnikov, who contributed 30 patches to this release!

September 15, 2017 01:37 PM

Linux Plumbers Conference: Linux Plumbers Conference Unconference Schedule Announced

Since we only have six proposals, we can schedule them in the unconference session without any need for actual voting at breakfast.  On a purely random basis, the schedule will be:

Unconference I:

09:30 Test driven development (TDD) in the kernel – Knut Omang
11:00 Support for adding DT based thermal zones at runtime – Moritz Fischer
11:50 Restartable Sequences interaction with debugger single-stepping – Mathieu Desnoyers

Unconference II:

14:00 Automated testing of LKML patches with Clang – Nick Desaulniers
14:50 ktask: multithread cpu-intensive kernel work – Daniel Jordan
16:00 Soft Affinity for Workloads – Rohit Jain

I’ll add these to the Plumbers schedule (if an author doesn’t already have an account, I’ll show up as the speaker, but please take the above list as definitive for the actual speakers).

Looking forward to seeing you all at this exciting new event for Plumbers,

September 15, 2017 04:57 AM

September 14, 2017

Grant Likely: Arcade Panel Construction Time-Lapse Video

September 14, 2017 05:56 PM

Grant Likely: NeoPixel Arcade Buttons

September 14, 2017 05:54 PM

September 13, 2017

Grant Likely: Custom Arcade Control Panels

I’ve started building custom arcade controls for use with classic arcade game emulators. All Open Source and Open Hardware of course, with the source code up on GitHub.

OpenSCAD:Arcade is an arcade panel modeling tool written in OpenSCAD. It designs the arcade panel layout and produces lasercutter output and frame dimensions.

STM32F3-Discovery-Arcade is a prototype USB HID device for arcade controls. It currently supports GPIO joysticks and buttons, quadrature trackballs/spinners, and will drive up to 4 channels of NeoPixel RGB LED strings. The project has both custom STM32 firmware and a custom adaptor PCB designed with KiCad.

Please go take a look.

September 13, 2017 11:39 PM

Gustavo F. Padovan: Slides of my talk at Open Source Summit NA

I just delivered a talk today at Open Source Summit NA, here in LA, about everything we’ve been doing to support explicit synchronization on the Media and Graphics pipeline in the kernel. You can find the slides here.

The DRM side is already mainline, but V4L2 is currently my focus of work along with the linux-media community in the kernel. Blog posts about that should appear soon on this blog.

September 13, 2017 07:56 PM

September 11, 2017

Linux Plumbers Conference: New to Plumbers: Unconference on Friday

The hallway track is always a popular feature of Linux Plumbers Conference.  New ideas and solutions emerge all the time.  But sometimes you start a discussion, and want to pull others in before the conference ends, and just can’t quite make it work.

This year, we’re trying an experiment at Linux Plumbers and reserving a room for an unconference session on Friday, so that ad hoc problem-solving sessions can be held for the topics with the most participant interest.

If there is a topic you want to have a 1 hour discussion around,  please put it on the etherpad with:

 

Topic:  <something short>
Host(s): <person who will host the discussion>
Description:   <describe problem you want to talk about>

 

We’ll close down the topic page on Thursday night at 8pm, then print out the collected topics in the morning and post them in the room. During the breakfast period (from 8 to 9am), those wanting to participate will be given four dots to vote with. Vote by placing a dot on the topics of interest until 8:45am. Sessions will be scheduled starting with the topic with the most dots, in descending order, until we run out of sessions or time.

Schedule will be posted in the room on Friday morning.

September 11, 2017 05:54 PM

September 07, 2017

James Morris: Linux Plumbers Conference Sessions for Linux Security Summit Attendees

Folks attending the 2017 Linux Security Summit (LSS) next week may be also interested in attending the TPMs and Containers sessions at Linux Plumbers Conference (LPC) on the Wednesday.

The LPC TPMs microconf will be held in the morning and led by Matthew Garrett, while the containers microconf will be run by Stéphane Graber in the afternoon.  Several security topics will be discussed in the containers session, including namespacing and stacking of LSMs, and namespacing of IMA.

Attendance on the Wednesday for LPC is at no extra cost for registered attendees of LSS.  Many thanks to the LPC organizers for arranging this!

There will be followup BOF sessions on LSM stacking and namespacing at LSS on Thursday, per the schedule.

This should be a very productive week for Linux security development: see you there!

September 07, 2017 01:44 AM

September 06, 2017

Linux Plumbers Conference: Linux Plumbers Conference Preliminary Schedule Published

You can see the schedule by clicking on the ‘schedule’ tab above or by going to this url

http://www.linuxplumbersconf.org/2017/ocw/events/LPC2017/schedule

If you’d like any changes, please email contact@linuxplumbersconf.org and we’ll see what we can do to accommodate your request.

Please also remember that the schedule is subject to change.

September 06, 2017 10:01 PM

Greg Kroah-Hartman: 4.14 == This year's LTS kernel

As the 4.13 release has now happened, the merge window for the 4.14 kernel release is now open. I mentioned this many weeks ago, but as the word doesn’t seem to have gotten very far based on various emails I’ve had recently, I figured I need to say it here as well.

So, here it is officially, 4.14 should be the next LTS kernel that I’ll be supporting with stable kernel patch backports for at least two years, unless it really is a horrid release and has major problems. If so, I reserve the right to pick a different kernel, but odds are, given just how well our development cycle has been going, that shouldn’t be a problem (although I guess I just doomed it now…)

As always, if people have questions about this, email me and I will be glad to discuss it, or talk to me in person next week at the LinuxCon^WOpenSourceSummit or Plumbers conference in Los Angeles, or at any of the other conferences I’ll be at this year (ELCE, Kernel Recipes, etc.)

September 06, 2017 02:41 PM

September 05, 2017

Kees Cook: security things in Linux v4.13

Previously: v4.12.

Here’s a short summary of some of the interesting security things in Sunday’s v4.13 release of the Linux kernel:

security documentation ReSTification
The kernel has been switching to formatting documentation with ReST, and I noticed that none of the Documentation/security/ tree had been converted yet. I took the opportunity to take a few passes at formatting the existing documentation and, at Jon Corbet’s recommendation, split it up between end-user documentation (which is mainly how to use LSMs) and developer documentation (which is mainly how to use various internal APIs). A bunch of these docs need some updating, so maybe with the improved visibility, they’ll get some extra attention.

CONFIG_REFCOUNT_FULL
Since Peter Zijlstra implemented the refcount_t API in v4.11, Elena Reshetova (with Hans Liljestrand and David Windsor) has been systematically replacing atomic_t reference counters with refcount_t. As of v4.13, there are now close to 125 conversions with many more to come. However, there were concerns over the performance characteristics of the refcount_t implementation from the maintainers of the net, mm, and block subsystems. In order to assuage these concerns and help the conversion progress continue, I added an “unchecked” refcount_t implementation (identical to the earlier atomic_t implementation) as the default, with the fully checked implementation now available under CONFIG_REFCOUNT_FULL. The plan is that for v4.14 and beyond, the kernel can grow per-architecture implementations of refcount_t that have performance characteristics on par with atomic_t (as done in grsecurity’s PAX_REFCOUNT).

CONFIG_FORTIFY_SOURCE
Daniel Micay created a version of glibc’s FORTIFY_SOURCE compile-time and run-time protection for finding overflows in the common string (e.g. strcpy, strcmp) and memory (e.g. memcpy, memcmp) functions. The idea is that since the compiler already knows the size of many of the buffer arguments used by these functions, it can already build in checks for buffer overflows. When all the sizes are known at compile time, this can actually allow the compiler to fail the build instead of continuing with a proven overflow. When only some of the sizes are known (e.g. destination size is known at compile-time, but source size is only known at run-time) run-time checks are added to catch any cases where an overflow might happen. Adding this found several places where minor leaks were happening, and Daniel and I chased down fixes for them.

One interesting note about this protection is that it only examines the size of the whole object (via __builtin_object_size(..., 0)). If you have a string within a structure, CONFIG_FORTIFY_SOURCE as currently implemented will only make sure that you can’t copy beyond the structure (so you can still overflow the string within the structure). The next step in enhancing this protection is to switch from mode 0 (above) to mode 1, which will use the closest surrounding subobject (e.g. the string). However, there are a lot of cases where the kernel intentionally copies across multiple structure fields, which means more fixes are needed before this stricter level can be enabled.

NULL-prefixed stack canary
Rik van Riel and Daniel Micay changed how the stack canary is defined on 64-bit systems to always make sure that the leading byte is zero. This provides a deterministic defense against overflowing string functions (e.g. strcpy), since they will either stop an overflowing read at the NULL byte, or be unable to write a NULL byte, thereby always triggering the canary check. This does reduce the entropy from 64 bits to 56 bits for overflow cases where NULL bytes can be written (e.g. memcpy), but the trade-off is worth it. (Besides, x86_64’s canary was 32-bits until recently.)

IPC refactoring
Partially in support of allowing IPC structure layouts to be randomized by the randstruct plugin, Manfred Spraul and I reorganized the internal layout of how IPC is tracked in the kernel. The resulting allocations are smaller and much easier to deal with, even if I initially missed a few needed container_of() uses.

randstruct gcc plugin
I ported grsecurity’s clever randstruct gcc plugin to upstream. This plugin allows structure layouts to be randomized on a per-build basis, providing a probabilistic defense against attacks that need to know the location of sensitive structure fields in kernel memory (which is most attacks). By moving things around in this fashion, attackers need to perform much more work to determine the resulting layout before they can mount a reliable attack.

Unfortunately, due to the timing of the development cycle, only the “manual” mode of randstruct landed in upstream (i.e. marking structures with __randomize_layout). v4.14 will also have the automatic mode enabled, which randomizes all structures that contain only function pointers.

A large number of fixes to support randstruct have been landing from v4.10 through v4.13. Most were already identified and fixed by grsecurity, but many were novel: in newly added drivers, in whitelisted cross-structure casts, in refactorings (like the IPC work noted above), or in a corner case on ARM found during upstream testing.

lower ELF_ET_DYN_BASE
One of the issues identified from the Stack Clash set of vulnerabilities was that it was possible to collide stack memory with the highest portion of a PIE program’s text memory since the default ELF_ET_DYN_BASE (the lowest possible random position of a PIE executable in memory) was already so high in the memory layout (specifically, 2/3rds of the way through the address space). Fixing this required teaching the ELF loader how to load interpreters as shared objects in the mmap region instead of as a PIE executable (to avoid potentially colliding with the binary it was loading). As a result, the PIE default could be moved down to ET_EXEC (0x400000) on 32-bit, entirely avoiding the subset of Stack Clash attacks. 64-bit could be moved to just above the 32-bit address space (0x100000000), leaving the entire 32-bit region open for VMs to do 32-bit addressing, but late in the cycle it was discovered that Address Sanitizer couldn’t handle it moving. With most of the Stack Clash risk only applicable to 32-bit, fixing 64-bit has been deferred until there is a way to teach Address Sanitizer how to load itself as a shared object instead of as a PIE binary.

early device randomness
I noticed that early device randomness wasn’t actually getting added to the kernel entropy pools, so I fixed that to improve the effectiveness of the latent_entropy gcc plugin.

That’s it for now; please let me know if I missed anything. As a side note, I was rather alarmed to discover that due to all my trivial ReSTification formatting, and tiny FORTIFY_SOURCE and randstruct fixes, I made it into the most active 4.13 developers list (by patch count) at LWN with 76 patches: a whopping 0.6% of the cycle’s patches. ;)

Anyway, the v4.14 merge window is open!

© 2017, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.
Creative Commons License

September 05, 2017 11:01 PM

Linux Plumbers Conference: Testing and Fuzzing Microconference Accepted into the Linux Plumbers Conference

We’re pleased to announce that newcomer Microconference Testing and Fuzzing will feature at Plumbers in Los Angeles this year.

The agenda will feature the three fuzzers used for the Linux kernel (Trinity, Syzkaller, and Perf), along with discussions of formal verification tools, how to test stable trees, testing frameworks, and a discussion and demonstration of the drm/i915 checkin and test infrastructure.

Additionally, we will hold a session aimed at improving the testing process for linux-stable and distro kernels. Please plan to attend if you have input into how to integrate additional testing and make these kernels more reliable. Participants will include Greg Kroah-Hartman and major distro kernel maintainers.

For more details on this, please see this microconference’s wiki page.

We hope to see you there!

September 05, 2017 08:04 AM

August 26, 2017

Matt Domsch: Twilio Voice to Pagerduty alert using Python Flask, Zappa, AWS Lambda & AWS API Gateway

My SaaS product DevOps team at Quest Software uses several monitoring services to notice problems (hopefully before end users see them), and raises alerts for our team using PagerDuty. We also frequently need to integrate with existing company and partner products, for example our internal helpdesk and customer-facing technical-support processes. In this case, the helpdesk team wanted to have a phone number they could call to raise an alert to our team. The first suggestion was to simply put my name down as the 24×7 on-call contact, and make it my problem to alert the right people. I scoffed. We already had PagerDuty in place – why couldn’t we use that too? Simply because we didn’t have a phone number hooked up to PagerDuty. So, let’s fix that.

A few searches quickly turned up a PagerDuty blog post where David Hayes had done exactly this. Excellent! However, it was written to use Google App Engine, and my team has their processes predominantly in Azure and AWS. I didn’t want to introduce yet another set of cloud services for something conceptually so simple.

Twilio’s quickstarts do a nice job of showing how to use their API, and these use Flask for the web framework. How can I use Flask apps in AWS Lambda? Here enters Zappa, a tool for deploying Flask and Django apps into AWS Lambda & AWS API Gateway. Slick! Now I have all the pieces I need.

You can find the code on github. I’ve extended the quickstarts slightly, so that the phone response first prompts for the application that is experiencing issues and records that in a session cookie to be retrieved later. Then it prompts the user to leave a message. With these two pieces of information, we have enough to create the PagerDuty incident for the proper application, including information about the caller gathered from Caller ID (in case the recording is garbled), and a link to the recorded message. Not too shabby for ~125 lines of “my” code, at a cost of ~$1/month to Twilio for the phone number, almost $0.00 for AWS, and a couple pennies if anyone actually calls to raise an alert.

August 26, 2017 07:19 PM

August 24, 2017

Pete Zaitcev: Oh not again

Fedora is mulling dropping 32-bit x86 again, after F26, which means I need to buy a new router. It's not like I cannot afford one... But it's such a hassle to migrate. I'm thinking about installing one in the background and then re-numbering it, in order to minimize issues. Even then, I cannot test, for instance, that VLANs work right, until I actually phase the box into production. It's much easier to keep a compatible 32-bit box mirrored and ready on stand-by.

In a sense, the amazing ease of upgrades in modern Fedora lulled me into this. Before, I re-installed anyway, and so could roll 64-bit just as easily.

P.S. According to records at the hoster, my primary public VM was installed as Fedora 15 and continuously upgraded since then.

August 24, 2017 07:21 PM

August 17, 2017

Linux Plumbers Conference: Tracing/BPF Microconference Accepted into the Linux Plumbers Conference

Following on from the successful Tracing Microconference last year, we’re pleased to announce there will be a follow on at Plumbers in Los Angeles this year.

The agenda for this year will not focus only on tracing, but will also include several topics around eBPF, as eBPF now interacts with tracing and there is still a lot of work to accomplish, such as building infrastructure around the current tools to compile and utilize eBPF within the tracing framework. Topics outside of eBPF will include enhancing uprobes and tracing virtualized and layered environments. Of particular interest are new techniques to improve kernel-to-user-space tracing integration, including usage of uftrace and better symbol resolution of user space addresses from within the kernel. Additionally, there will be a discussion on the challenges of real-world use cases by non-kernel engineers.

For more details on this, please see this microconference’s wiki page.

We hope to see you there!

August 17, 2017 04:51 PM

August 16, 2017

Linux Plumbers Conference: Trusted Platform Module Microconference Accepted into the Linux Plumbers Conference

Following on from the TPM Microconference last year, we’re pleased to announce there will be a follow on at Plumbers in Los Angeles this year.

The agenda for this year will focus on a renewed attempt to unify the 2.0 TSS; cryptosystem integration to make TPMs just work for the average user; the current state of measured boot and where we’re going; using TXT with TPM in Linux and using TPM from containers.

For more details on this, please see this microconference’s wiki page.

We hope to see you there!

August 16, 2017 12:01 AM

August 14, 2017

Dave Airlie (blogspot): radv on SI and CIK GPU - update

I recently acquired an r7 360 (BONAIRE) and spent some time getting radv stable and passing the same set of conformance tests that VI and Polaris pass.

The main missing thing was 10-bit integer format clamping, working around a bug in the SI/CIK fragment shader output hardware where it truncates instead of clamping. The other missing piece was code for handling f16->f32 conversions according to the Vulkan spec, which I'd previously fixed for VI.

I also looked at a trace from amdgpu-pro and noticed it was using ds_swizzle for the derivative calculations, which avoided accessing LDS memory. I wrote support to use this path in radv/radeonsi, since LLVM has supported the intrinsic for a while now.

With these fixed CIK is pretty much in the same place as VI/Polaris.

I then plugged in my SI (Tahiti) and got lots of GPU hangs and crashes. I fixed a number of SI-specific bugs (tiling and MSAA handling, stencil tiling). However, even with those fixed I was getting random hangs, and a bunch of people on a bugzilla had noticed the same thing. I eventually discovered that adding a shader pipeline and cache flush at the end of every command buffer fixed the hangs (this took a few days to narrow down exactly). We aren't 100% sure why this is required on SI only (it may be a kernel bug or a command processor bug), but it does mean radv on SI can now run games without hanging.

There are still a few CTS tests outstanding on SI only, and I'll probably get to them eventually, however I also got an RX Vega and once I get a newer BIOS for it from AMD I shall be spending some time fixing the radv support for it.

August 14, 2017 03:16 AM

August 10, 2017

Linux Plumbers Conference: Scheduler Workloads Microconference Accepted into the Linux Plumbers Conference

New to Linux Plumbers Conference this year, the Scheduler Workloads Microconference will focus on understanding various workloads and their impact on the Linux Kernel Scheduler. The objective is to initiate a cross organizational and architectural discussion involving currently available (or in development) benchmarks and their effectiveness in evaluating the scheduler for these workloads.

The agenda for this year will focus on sharing current workload and benchmark tools and traces and how these can be used to improve the various Linux subsystems, including power management and real time.  Given that benchmarking the Linux scheduler is a controversial topic and often depends on proprietary tools, we’ll also discuss how to develop fully open source tools and benchmarks for this.

For more details on this, please see this microconference’s wiki page.

We hope to see you there!

August 10, 2017 07:10 PM

August 08, 2017

Daniel Vetter: Why Github can't host the Linux Kernel Community

A while back at the awesome maintainerati I chatted with a few great fellow maintainers about how to scale really big open source projects, and how github forces projects into a certain way of scaling. The linux kernel has an entirely different model, which maintainers hosting their projects on github don’t understand, and I think it’s worth explaining why and how it works, and how it’s different.

Another motivation to finally get around to typing this all up is the HN discussion on my “Maintainers Don’t Scale” talk, where the top comment boils down to “… why don’t these dinosaurs use modern dev tooling?”. A few top kernel maintainers vigorously defend mailing lists and patch submissions over something like github pull requests, but at least some folks from the graphics subsystem would love more modern tooling which would be much easier to script. The problem is that github doesn’t support the way the linux kernel scales out to a huge number of contributors, and therefore we can’t simply move, not even just a few subsystems. And this isn’t about just hosting the git data, that part obviously works, but how pull requests, issues and forks work on github.

Scaling, the Github Way

Git is awesome, because everyone can fork and create branches and hack on the code very easily. And eventually you have something good, and you create a pull request for the main repo and get it reviewed, tested and merged. And github is awesome, because it figured out a UI that makes this complex stuff all nice&easy to discover and learn about, and so makes it a lot simpler for new folks to contribute to a project.

But eventually a project becomes a massive success, and no amount of tagging, labelling, sorting, bot-herding and automating will be able to keep on top of all the pull requests and issues in a repository, and it’s time to split things up into more manageable pieces again. More importantly, with a certain size and age of a project different parts need different rules and processes: The shiny new experimental library has different stability and CI criteria than the main code, and maybe you have some dumpster pile of deprecated plugins that aren’t supported, but which you can’t yet delete: You need to split up your humongous project into sub-projects, each with their own flavour of process and merge criteria and their own repo with their own pull request and issue tracking. Generally it takes a few tens to a few hundreds of full-time contributors until the pain is big enough that such a huge reorganization is necessary.

Almost all projects hosted on github do this by splitting up their monorepo source tree into lots of different projects, each with its distinct set of functionality. Usually that results in a bunch of things that are considered the core, plus piles of plugins and libraries and extensions. All tied together with some kind of plugin or package manager, which in some cases directly fetches stuff from github repos.

Since almost every big project works like this I don’t think it’s necessary to dwell on the benefits. But I’d like to highlight some of the issues this is causing:

Interlude: Why Pull Requests Exist

The linux kernel is one of the few projects I’m aware of which isn’t split up like this. Before we look at how that works - the kernel is a huge project and simply can’t be run without some sub-project structure - I think it’s interesting to look at why git does pull requests: On github, the pull request is the one true way for contributors to get their changes merged. But in the kernel changes are submitted as patches sent to mailing lists, even long after git has been widely adopted.

But the very first version of git supported pull requests. The audience of these first, rather rough, releases was kernel maintainers; git was written to solve Linus Torvalds’ maintainer problems. Clearly it was needed and useful, but not to handle changes from individual contributors: Even today, and much more back then, pull requests are used to forward the changes of an entire subsystem, or to synchronize code refactorings or similar cross-cutting changes across different sub-projects. As an example, take the 4.12 network pull request from David S. Miller, committed by Linus: It contains 2k+ commits from 600 contributors and a bunch of merges for pull requests from subordinate maintainers. But almost all the patches themselves are committed by maintainers after picking the patches up from mailing lists, not by the authors themselves. This kernel process peculiarity that authors generally don’t commit into shared repositories is also why git tracks the committer and author separately.
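The author/committer split is easy to see with stock git. Here’s a small sketch (all names, emails, and file names are invented for illustration) of what happens when a maintainer applies someone else’s patch, much as git am does after picking it off a mailing list:

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q .

# The maintainer's identity, used for the committer field:
git config user.name  "Some Maintainer"
git config user.email "maintainer@example.org"

# Apply a change whose authorship belongs to the contributor:
echo "fix" > driver.c
git add driver.c
GIT_AUTHOR_NAME="Patch Author" GIT_AUTHOR_EMAIL="author@example.org" \
    git commit -q -m "driver: fix the thing"

# git records author and committer separately:
git log -1 --format='author: %an%ncommitter: %cn'
```

The last command shows the contributor as author and the maintainer as committer, which is exactly the distinction the kernel workflow relies on.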

Github’s innovation and improvement was then to use pull requests for everything, down to individual contributions. But that wasn’t what they were originally created for.

Scaling, the Linux Kernel Way

At first glance the kernel looks like a monorepo, with everything smashed into one place in Linus’ main repo. But that’s very far from it:

At first this just looks like a complicated way to fill everyone’s disk space with lots of stuff they don’t care about, but there’s a pile of compounding minor benefits that add up:

In short, I think this is a strictly more powerful model, since you can always fall back to doing things exactly like you would with multiple disjoint repositories. Heck, there are even kernel drivers which are in their own repository, disjoint from the main kernel tree, like the proprietary Nvidia driver. Well, it’s just a bit of source code glue around a blob, but since it can’t contain anything from the kernel for legal reasons it is the perfect example.

This looks like a monorepo horror show!

Yes and no.

At first glance the linux kernel looks like a monorepo because it contains everything. And lots of people learned that monorepos are really painful, because past a certain size they just stop scaling.

But looking closer, it’s very, very far away from a single git repository. Just looking at the upstream subsystem and driver repositories gives you a few hundred. If you look at the entire ecosystem, including hardware vendors, distributions, other linux-based OS and individual products, you easily have a few thousand major repositories, and many, many more in total. Not counting any git repo that’s just for private use by individual contributors.

The crucial distinction is that linux has one single file hierarchy as the shared namespace across everything, but lots and lots of different repos for all the different pieces and concerns. It’s a monotree with multiple repositories, not a monorepo.
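The monotree-with-many-repos model can be sketched with nothing but stock git: one working tree, several remotes, each remote standing in for a subsystem repository. The repo and file names below are invented for illustration:

```shell
set -e
top=$(mktemp -d); cd "$top"

# Two "subsystem" repositories sharing the same file hierarchy:
git init -q --bare sound.git
git init -q --bare graphics.git

# One working tree, with both subsystem repos as remotes:
git init -q work && cd work
git config user.name  "Dev"
git config user.email "dev@example.org"
echo "obj-y += core.o" > Makefile
git add Makefile && git commit -q -m "shared base tree"

git remote add sound    ../sound.git
git remote add graphics ../graphics.git
git push -q sound    HEAD:refs/heads/master
git push -q graphics HEAD:refs/heads/master
git fetch -q sound
git fetch -q graphics

git remote   # one checkout, multiple upstreams: a monotree, not a monorepo
```

Every repo holds the whole tree, but pull requests, branches and maintainership stay per-repository.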

Examples, please!

Before I go into explaining why github cannot currently support this workflow, at least if you want to retain the benefits of the github UI and integration, we need some examples of how this works in practice. The short summary is that it’s all done with git pull requests between maintainers.

The simple case is percolating changes up the maintainer hierarchy, until it eventually lands in a tree somewhere that is shipped. This is easy, because the pull request only ever goes from one repository to the next, and so could be done already using the current github UI.

Much more fun are cross-subsystem changes, because then the pull request flow stops being an acyclic graph and morphs into a mesh. The first step is to get the changes reviewed and tested by all the involved subsystems and their maintainers. In the github flow this would be a pull request submitted to multiple repositories simultaneously, with the one single discussion stream shared among them all. Since this is the kernel, this step is done through patch submission with a pile of different mailing lists and maintainers as recipients.

The way it’s reviewed is usually not the way it’s merged, instead one of the subsystems is selected as the leading one and takes the pull requests, as long as all other maintainers agree to that merge path. Usually it’s the subsystem most affected by a set of changes, but sometimes also the one that already has some other work in-flight which conflicts with the pull request. Sometimes also an entirely new repository and maintainer crew is created, this often happens for functionality which spans the entire tree and isn’t neatly contained to a few files and directories in one place. A recent example is the DMA mapping tree, which tries to consolidate work that thus far has been spread across drivers, platform maintainers and architecture support groups.

But sometimes there’s multiple subsystems which would both conflict with a set of changes, and which would all need to resolve some non-trivial merge conflict. In that case the patches aren’t just directly applied (a rebasing pull request on github), but instead the pull request with just the necessary patches, based on a commit common to all subsystems, is merged into all subsystem trees. The common baseline is important to avoid polluting a subsystem tree with unrelated changes. Since the pull is for a specific topic only, these branches are commonly called topic branches.
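The topic-branch mechanics can be sketched in a few git commands. The subsystem and file names below are invented, but the flow mirrors the description above: a branch containing only the necessary patches, started from a commit common to all the subsystem trees, merged into each of them:

```shell
set -e
d=$(mktemp -d); cd "$d"
git init -q .
git config user.name  "Dev"
git config user.email "dev@example.org"

# A commit every subsystem already has -- the common baseline:
echo "4.12-rc1" > VERSION
git add VERSION && git commit -q -m "common base"
git branch sound
git branch graphics

# Topic branch with only the cross-subsystem patches, based on the
# shared baseline so no unrelated changes leak between trees:
git checkout -q -b topic/hdmi-audio
echo "audio-over-HDMI glue" > hdmi_audio.c
git add hdmi_audio.c && git commit -q -m "add HDMI audio glue"

# Both subsystem trees merge the identical topic branch:
git checkout -q sound    && git merge -q --no-ff --no-edit topic/hdmi-audio
git checkout -q graphics && git merge -q --no-ff --no-edit topic/hdmi-audio

# The very same commits now live in both trees:
git branch --contains topic/hdmi-audio
```

Because the topic branch is the merge base of the two subsystem branches, a later merge between them resolves cleanly instead of conflicting.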

One example I was involved with added code for audio-over-HDMI support, which spanned both the graphics and sound driver subsystems. The same commits from the same pull request were merged into both the Intel graphics driver tree and the sound subsystem tree.

An entirely different data point suggesting this isn’t insane: the only other relevant general-purpose large-scale OS project in the world also decided to have a monotree, with a commit flow modelled similarly to what’s going on in linux. I’m talking about the folks with such a huge tree that they had to write an entirely new GVFS virtual filesystem provider to support it …

Dear Github

Unfortunately github doesn’t support this workflow, at least not natively in the github UI. It can of course be done with just plain git tooling, but then you’re back to patches on mailing lists and pull requests over email, applied manually. In my opinion that’s the one single reason why the kernel community cannot benefit from moving to github. There’s also the minor issue of a few top maintainers being extremely outspoken against github in general, but that’s not really a technical issue. And it’s not just the linux kernel, it’s all huge projects on github in general which struggle with scaling, because github doesn’t really give them the option to scale to multiple repositories while sticking with a monotree.

In short, I have one simple feature request to github:

Please support pull requests and issue tracking spanning different repos of a monotree.

Simple idea, huge implications.

Repositories and Organizations

First, it needs to be possible to have multiple forks of the same repo in one organization. Just look at git.kernel.org, most of these repositories are not personal. And even if you might have different organizations for e.g. different subsystems, requiring an organization for each repo is silly amounts of overkill and just makes access and user management unnecessarily painful. In graphics for example we’d have 1 repo each for the userspace test suite, the shared userspace library, and a common set of tools and scripts used by maintainers and developers, which would work in github. But then we’d have the overall subsystem repo, plus a repository for core subsystem work and additional repositories for each big driver. Those would all be forks, which github doesn’t do. And each of these repos has a bunch of branches, at least one for feature work, and another one for bugfixes for the current release cycle.

Combining all branches into one repository wouldn’t do, since the point of splitting repos is that pull requests and issues are separated, too.

Related, it needs to be possible to establish the fork relationship after the fact. For new projects that have always been on github this isn’t a big deal. But linux will be able to move at most a subsystem at a time, and there are already tons of linux repositories on github which aren’t proper github forks of one another.

Pull Requests

Pull requests need to be attached to multiple repos at the same time, while keeping one unified discussion stream. You can already reassign a pull request to a different branch of a repo, but not to multiple repositories at the same time. Reassigning pull requests is really important, since new contributors will just create pull requests against what they think is the main repo. Bots can then shuffle those around to all the repos listed in e.g. a MAINTAINERS file for a given set of files and changes a pull request contains. When I chatted with githubbers I originally suggested they’d implement this directly. But I think as long as it’s all scriptable that’s better left to individual projects, since there’s no real standard.
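Such a bot really is just scripting over MAINTAINERS-style metadata (the kernel itself ships scripts/get_maintainer.pl for the file-to-maintainer lookup). Here’s a hypothetical sketch: the file fragment, tree URLs and prefix-matching rule are all simplified inventions, not the real MAINTAINERS grammar:

```shell
set -e
d=$(mktemp -d); cd "$d"

# A made-up MAINTAINERS fragment: T: is the subsystem tree,
# F: is a path prefix the subsystem owns.
cat > MAINTAINERS <<'EOF'
SOUND SUBSYSTEM
T: git://example.org/sound.git
F: sound/

GRAPHICS SUBSYSTEM
T: git://example.org/graphics.git
F: drivers/gpu/
EOF

# Files touched by an incoming pull request:
changed="sound/hdmi_codec.c drivers/gpu/hdmi_audio.c"

# Map each changed file to the tree(s) whose F: prefix matches it:
repos=$(for f in $changed; do
    awk -v file="$f" '
        /^T:/ { tree = $2 }                         # remember current tree
        /^F:/ { if (index(file, $2) == 1) print tree }
    ' MAINTAINERS
done | sort -u)
echo "$repos"
```

A bot built along these lines could reassign (or duplicate) the pull request to every repository the lookup returns.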

There’s a pretty funky UI challenge here since the patch list might be different depending upon the branch the pull request is against. But that’s not always a user error, one repo might simply have merged a few patches already.

Also, the pull request status needs to be different for each repo. One maintainer might close it without merging, since they agreed that the other subsystem will pull it in, while the other maintainer will merge and close the pull. Another tree might even close the pull request as invalid, since it doesn’t apply to that older version or vendor fork. Even more fun, a pull request might get merged multiple times, in each subsystem with a different merge commit.

Issues

Like pull requests, issues can be relevant for multiple repos, and might need to be moved around. An example would be a bug that’s first reported against a distribution’s kernel repository. After triage it’s clear it’s a driver bug still present in the latest development branch and hence also relevant for that repo, plus the main upstream branch and maybe a few more.

Status should again be separate, since a push to one repo doesn’t instantly make the bugfix available in all of them. It might even need additional work to get backported to older kernels or distributions, and some might even decide that’s not worth it and close it as WONTFIX, even though it’s marked as successfully resolved in the relevant subsystem repository.

Summary: Monotree, not Monorepo

The Linux Kernel is not going to move to github. But moving the Linux way of scaling with a monotree, but multiple repos, to github as a concept will be really beneficial for all the huge projects already there: It’ll give them a new, and in my opinion, more powerful way to handle their unique challenges.

August 08, 2017 12:00 AM

August 07, 2017

Paul E. Mc Kenney: Book review: "Antifragile: Things That Gain From Disorder"

This is the fourth and final book in Nassim Taleb's Incerto series, which makes a case for antifragility as a key component of design, taking the art of design one step beyond robustness. An antifragile system is one where variation, chaos, stress, and errors improve the results. For example, within limits, stressing muscles and bones makes them stronger. In contrast, stressing a device made of (say) aluminum will eventually cause it to fail. Taleb gives a lengthy list of examples in Table 1 starting on page 23, some of which seem more plausible than others. An example implausible entry lists rule-based systems as fragile, principles-based systems as robust, and virtue-based systems as antifragile. Although I can imagine a viewpoint where this makes sense, any expectation that a significantly large swath of present-day society will agree on a set of principles (never mind virtues!) seems insanely optimistic. The table nevertheless provides much good food for thought.

Taleb states that he has constructed antifragile financial strategies using insurance to control downside risks. But he also states on page 6 “Thou shalt not have antifragility at the expense of the fragility of others.” Perhaps Taleb figures that few will shed tears for any difficulties that insurance companies might get into, perhaps he is taking out policies that are too small to have material effect on the insurance company in question, or perhaps his policies are counter to the insurance company's main business, so that payouts to Taleb are anticorrelated with payouts to the company's other customers. One presumes that he has thought this through carefully, because a bankrupt insurance company might not be all that effective at controlling his downside risks.

Appendix I beginning on page 435 gives a graphical summary of the book’s main messages. Figure 28 on page 441 is good grist for the mills of those who would like humanity to become an intergalactic species: After all, confining the human race seems likely to limit its upside. (One counterargument would posit that a finite object might have unbounded value, but such counterarguments typically rely on there being a very large number of human beings interested in that finite object, which some would consider to counter this counterargument.)

The right-hand portion of Figure 30 on page 442 illustrates what the author calls local antifragility and global fragility. To see this, imagine that the x-axis represents variation from nominal conditions, and the y-axis represents payoff, with large positive payoffs being highly desired. The right-hand portion shows something not unrelated to the function x^2-x^4, which gives higher payoffs as you move in either direction from x=0, peaking when x reaches one divided by the square root of two (either positive or negative), dropping back to zero when x reaches +1 or -1, and dropping like a rock as one ventures further away from x=0. The author states that this local antifragility and global fragility is the most dangerous of all, but given that he repeatedly stresses that antifragile systems are antifragile only up to a point, this dangerous situation would seem to be the common case. Those of us who believe that life is inherently dangerous should have no problem with this apparent contradiction.
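Reading the sketched payoff as f(x) = x² − x⁴ (my reconstruction of the figure, not a formula from the book), the locally antifragile, globally fragile shape falls out directly:

```latex
f(x) = x^2 - x^4, \qquad
f'(x) = 2x - 4x^3 = 2x\,(1 - 2x^2) = 0
\;\Rightarrow\; x = 0 \ \text{or}\ x = \pm\tfrac{1}{\sqrt{2}},
\qquad f\!\left(\pm\tfrac{1}{\sqrt{2}}\right) = \tfrac{1}{4}.
```

So the payoff improves as variation grows out to |x| = 1/√2, returns to zero at |x| = 1, and collapses without bound beyond that: small stresses help, large ones destroy.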

But what does all of this have to do with parallel programming???

Well, how about “Is RCU antifragile?”

One case for RCU antifragility is the batching optimizations that allow many (as in thousands of) concurrent requests to share the same grace-period computation. Therefore, the heavier the update-side load on RCU, the more efficiently RCU operates.

However, load is but one of many aspects of RCU's environment that might be varied. For an extreme example, RCU is exceedingly fragile with respect to small perturbations of the program counter, as Peter Sewell so ably demonstrated, by running emacs, no less. RCU is also fragile with respect to timekeeping anomalies, for example, it can emit false-positive RCU CPU stall warnings if different CPUs have tens-of-seconds disagreements as to the current time. However, the aforementioned bones and muscles are similarly fragile with respect to any number of chemical substances (AKA “poisons”), to say nothing of well-known natural phenomena such as lightning bolts and landslides.

Even when excluding hardware misbehavior such as auto-perturbing program counters and unsynchronized clocks, RCU would still be subject to software aging, and RCU has in fact required multiple interventions from its developers and maintainer in order to keep up with changing hardware, workloads, and usage. One could therefore argue that RCU is fragile with respect to perturbations of time, although the combination of RCU and its developers, reviewers, and maintainer seem to have kept up reasonably well thus far.

On the other hand, perhaps it is unrealistic to evaluate the antifragility of software without including black-hat hackers. Achieving antifragility in that sort of environment is still very much a grand challenge problem, but a challenge that must be faced. Oh, you think RCU is too low-level for this sort of attack? There was a time when I thought so. And then came rowhammer.

So please be careful, and, where possible, antifragile! It is after all a real world out there!!!

August 07, 2017 04:36 AM

August 03, 2017

Linux Plumbers Conference: Book Your Hotel for Plumbers by 18 August

As a reminder, we have a block of rooms at the JW Marriott LA Live
available to attendees at the discounted conference rate of $259/night
(plus applicable taxes). High speed internet is included in the room rate.

Our discounted room rate expires at 5:00 pm PST on August 18. We encourage
you to book today!

Visit our Attend page for additional details.

August 03, 2017 10:15 PM

July 29, 2017

Linux Plumbers Conference: Late Registration Begins Soon

Late registration for the Linux Plumbers Conference begins on 31 July. If you want to take advantage of the standard registration fee, register now via this link.

Standard registration is $550, late registration will be $650.

July 29, 2017 07:21 PM

Linux Plumbers Conference: Checkpoint-Restart Microconference Accepted into the Linux Plumbers Conference

Following on from the successful Checkpoint-Restart Microconference
last year, we’re pleased to announce that there will be another at
Plumbers in Los Angeles this year.

The agenda this year will focus on specific use cases of Checkpoint-
Restart, such as High Performance Computing, and state-saving uses such
as job scheduling and hot standby.  In addition we’ll be looking at
enhancements such as performance and using userfaultfd for dirty memory
tracking in iterative migration and what it would take to have
unprivileged checkpoint-restart.  Finally, we’ll have discussions on
checkpoint-restart aware applications and what sort of testing needs to
be applied to the upstream kernel to prevent any checkpoint-restore API
breakage as it evolves.

For more details on this, please see this microconference’s wiki page.

We hope to see you there!

July 29, 2017 04:42 PM

July 21, 2017

Michael Kerrisk (manpages): man-pages-4.12 is released

I've released man-pages-4.12. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

This release resulted from patches, bug reports, reviews, and comments from around 30 contributors. It includes just under 200 commits changing around 90 pages. This is a relatively small release, with one new manual page, ioctl_getfsmap(2). The most significant change in the release consists of a number of additions and improvements in the ld.so(8) page.

July 21, 2017 06:53 PM

July 20, 2017

Paul E. Mc Kenney: Parallel Programming: Getting the English text out of the way

We have been making good progress on the next release of Is Parallel Programming Hard, And, If So, What Can You Do About It?, and hope to have a new release out soonish.

In the meantime, for those of you for whom the English text in this book has simply gotten in the way, there is now an alternative:

[Cover image of the Chinese translation of perfbook]

On the off-chance that any of you are seriously interested, this is available from
Amazon China, JD.com, Taobao.com, and Dangdang.com. For the rest of you, you have at least seen the picture.  ;-)

July 20, 2017 02:37 AM

July 18, 2017

Matthew Garrett: Avoiding TPM PCR fragility using Secure Boot

In measured boot, each component of the boot process is "measured" (ie, hashed and that hash recorded) in a register in the Trusted Platform Module (TPM) built into the system. The TPM has several different registers (Platform Configuration Registers, or PCRs) which are typically used for different purposes - for instance, PCR0 contains measurements of various system firmware components, PCR2 contains any option ROMs, PCR4 contains information about the partition table and the bootloader. The allocation of these is defined by the PC Client working group of the Trusted Computing Group. However, once the boot loader takes over, we're outside the spec[1].

One important thing to note here is that the TPM doesn't actually have any ability to directly interfere with the boot process. If you try to boot modified code on a system, the TPM will contain different measurements but boot will still succeed. What the TPM can do is refuse to hand over secrets unless the measurements are correct. This allows for configurations where your disk encryption key can be stored in the TPM and then handed over automatically if the measurements are unaltered. If anybody interferes with your boot process then the measurements will be different, the TPM will refuse to hand over the key, your disk will remain encrypted and whoever's trying to compromise your machine will be sad.

The problem here is that a lot of things can affect the measurements. Upgrading your bootloader or kernel will do so. At that point, if you reboot, your disk fails to unlock and you become unhappy. To get around this your update system needs to notice that a new component is about to be installed, generate the new expected hashes and re-seal the secret to the TPM using the new hashes. If there are several different points in the update where this can happen, this can quite easily go wrong. And if it goes wrong, you're back to being unhappy.

Is there a way to improve this? Surprisingly, the answer is "yes" and the people to thank are Microsoft. Appendix A of a basically entirely unrelated spec defines a mechanism for storing the UEFI Secure Boot policy and used keys in PCR 7 of the TPM. The idea here is that you trust your OS vendor (since otherwise they could just backdoor your system anyway), so anything signed by your OS vendor is acceptable. If someone tries to boot something signed by a different vendor then PCR 7 will be different. If someone disables secure boot, PCR 7 will be different. If you upgrade your bootloader or kernel, PCR 7 will be the same. This simplifies things significantly.

I've put together a (not well-tested) patchset for Shim that adds support for including Shim's measurements in PCR 7. In conjunction with appropriate firmware, it should then be straightforward to seal secrets to PCR 7 and not worry about things breaking over system updates. This makes tying things like disk encryption keys to the TPM much more reasonable.

However, there's still one pretty major problem, which is that the initramfs (ie, the component responsible for setting up the disk encryption in the first place) isn't signed and isn't included in PCR 7[2]. An attacker can simply modify it to stash any TPM-backed secrets or mount the encrypted filesystem and then drop to a root prompt. This, uh, reduces the utility of the entire exercise.

The simplest solution to this that I've come up with depends on how Linux implements initramfs files. In its simplest form, an initramfs is just a cpio archive. In its slightly more complicated form, it's a compressed cpio archive. And in its peak form of evolution, it's a series of compressed cpio archives concatenated together. As the kernel reads each one in turn, it extracts it over the previous ones. That means that any files in the final archive will overwrite files of the same name in previous archives.

My proposal is to generate a small initramfs whose sole job is to get secrets from the TPM and stash them in the kernel keyring, and then measure an additional value into PCR 7 in order to ensure that the secrets can't be obtained again. Later disk encryption setup will then be able to set up dm-crypt using the secret already stored within the kernel. This small initramfs will be built into the signed kernel image, and the bootloader will be responsible for appending it to the end of any user-provided initramfs. This means that the TPM will only grant access to the secrets while trustworthy code is running - once the secret is in the kernel it will only be available for in-kernel use, and once PCR 7 has been modified the TPM won't give it to anyone else. A similar approach for some kernel command-line arguments (the kernel, module-init-tools and systemd all interpret the kernel command line left-to-right, with later arguments overriding earlier ones) would make it possible to ensure that certain kernel configuration options (such as the iommu) weren't overridable by an attacker.

There's obviously a few things that have to be done here (standardise how to embed such an initramfs in the kernel image, ensure that luks knows how to use the kernel keyring, teach all relevant bootloaders how to handle these images), but overall this should make it practical to use PCR 7 as a mechanism for supporting TPM-backed disk encryption secrets on Linux without introducing a huge support burden in the process.

[1] The patchset I've posted to add measured boot support to Grub uses PCRs 8 and 9 to measure various components during the boot process, but other bootloaders may have different policies.

[2] This is because most Linux systems generate the initramfs locally rather than shipping it pre-built. It may also get rebuilt on various userspace updates, even if the kernel hasn't changed. Including it in PCR 7 would entirely break the fragility guarantees and defeat the point of all of this.


July 18, 2017 06:48 AM

July 13, 2017

Linux Plumbers Conference: VFIO/IOMMU/PCI Microconference Accepted into Linux Plumbers Conference

Following on from the successful PCI Microconference at Plumbers last year, we’re pleased to announce a follow-on this year with an expanded scope.

The agenda this year will focus on overlap and common development between the VFIO/IOMMU/PCI subsystems, and in particular how consolidation of the shared virtual memory (SVM) API can drive an even tighter coupling between them.

This year we will also focus on user visible aspects such as using SVM to share page tables with devices and reporting I/O page faults to userspace in addition to discussing PCI and IOMMU interfaces and potential improvements.

For more details on this, please see this microconference’s wiki page.

We hope to see you there!

July 13, 2017 05:20 PM

July 11, 2017

Linux Plumbers Conference: Power Management and Energy-awareness Microconference Accepted into Linux Plumbers Conference

Following on from the successful Power Management and Energy-awareness Microconference at Plumbers last year, we’re pleased to announce a follow-on this year.

The agenda this year will focus on a range of topics including CPUfreq core improvements and schedutil governor extensions, how best to use scheduler signals to balance energy consumption and performance, and user space interfaces to control capacity and utilization estimates.  We’ll also discuss selective throttling in thermally constrained systems, runtime PM for ACPI, CPU cluster idling and the possibility to implement resume from hibernation in a bootloader.

For more details on this, please see this microconference’s wiki page.

We hope to see you there!

July 11, 2017 04:15 PM

James Morris: Linux Security Summit 2017 Schedule Published

The schedule for the 2017 Linux Security Summit (LSS) is now published.

LSS will be held on September 14th and 15th in Los Angeles, CA, co-located with the new Open Source Summit (which includes LinuxCon, ContainerCon, and CloudCon).

The cost of LSS for attendees is $100 USD. Register here.

Highlights from the schedule include the following refereed presentations:

There’ll also be the usual Linux kernel security subsystem updates, and BoF sessions (with LSM namespacing and LSM stacking sessions already planned).

See the schedule for full details of the program, and follow the twitter feed for the event.

This year, we’ll also be co-located with the Linux Plumbers Conference, which will include a containers microconference with several security development topics, and likely also a TPMs microconference.

A good critical mass of Linux security folk should be present across all of these events!

Thanks to the LSS program committee for carefully reviewing all of the submissions, and to the event staff at Linux Foundation for expertly planning the logistics of the event.

See you in Los Angeles!

July 11, 2017 11:30 AM

July 10, 2017

Kees Cook: security things in Linux v4.12

Previously: v4.11.

Here’s a quick summary of some of the interesting security things in last week’s v4.12 release of the Linux kernel:

x86 read-only and fixed-location GDT
With kernel memory base randomization, it was still possible to figure out the per-cpu base address via the “sgdt” instruction, since it would reveal the per-cpu GDT location. To solve this, Thomas Garnier moved the GDT to a fixed location. And to solve the risk of an attacker targeting the GDT directly with a kernel bug, he also made it read-only.

usercopy consolidation
After hardened usercopy landed, Al Viro decided to take a closer look at all the usercopy routines and then consolidated the per-architecture uaccess code into a single implementation. The per-architecture code was functionally very similar to each other, so it made sense to remove the redundancy. In the process, he uncovered a number of unhandled corner cases in various architectures (that got fixed by the consolidation), and made hardened usercopy available on all remaining architectures.

ASLR entropy sysctl on PowerPC
Continuing to expand architecture support for the ASLR entropy sysctl, Michael Ellerman implemented the calculations needed for PowerPC. This lets userspace choose to crank up the entropy used for memory layouts.
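
For reference, the knob is readable like any other sysctl; a minimal sketch (the path is the one used by the existing x86 and arm64 implementations, and may be absent on kernels or architectures without the feature):

```python
from pathlib import Path

# vm.mmap_rnd_bits controls how many bits of entropy go into mmap layout
# randomization; the file only exists where the sysctl is supported.
rnd = Path("/proc/sys/vm/mmap_rnd_bits")
bits = int(rnd.read_text()) if rnd.exists() else None
```

Writing a larger value to the same path (as root) is what “cranking up the entropy” means in practice.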

LSM structures read-only
James Morris used __ro_after_init to make the LSM structures read-only after boot. This removes them as a desirable target for attackers. Since the hooks are called from all kinds of places in the kernel, this was a favorite method for attackers to hijack execution of the kernel. (A similar target used to be the system call table, but that has long since been made read-only.) Be wary that CONFIG_SECURITY_SELINUX_DISABLE removes this protection, so make sure that config stays disabled.
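
The idea behind __ro_after_init can be sketched from userspace, with mprotect() standing in for the kernel’s page-table change at the end of boot (this is an illustration, not kernel code):

```python
import ctypes
import mmap

libc = ctypes.CDLL(None, use_errno=True)  # symbols of the current process
PAGE = mmap.PAGESIZE

# Allocate one writable page and "initialize" it, mimicking boot-time
# setup of a structure like the LSM hook lists.
buf = mmap.mmap(-1, PAGE, prot=mmap.PROT_READ | mmap.PROT_WRITE)
buf[:5] = b"hooks"

# Once initialization is done, flip the page read-only, as the kernel
# does for the .data..ro_after_init section when boot completes.
addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
assert libc.mprotect(ctypes.c_void_p(addr), PAGE, mmap.PROT_READ) == 0

data = bytes(buf[:5])  # reads still succeed; a write here would now SIGSEGV
```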

KASLR enabled by default on x86
With many distros already enabling KASLR on x86 with CONFIG_RANDOMIZE_BASE and CONFIG_RANDOMIZE_MEMORY, Ingo Molnar felt the feature was mature enough to be enabled by default.

Expand stack canary to 64 bits on 64-bit systems
The stack canary value used by CONFIG_CC_STACKPROTECTOR is most powerful on x86 since it is different per task. (Other architectures run with a single canary for all tasks.) While the first canary chosen on x86 (and other architectures) was a full unsigned long, the subsequent per-task canaries on x86 were being truncated to 32 bits. Daniel Micay fixed this, so now x86 (and future architectures that gain per-task canary support) have significantly increased entropy for stack-protector.
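
The truncation is easy to picture; this toy snippet (the mask is illustrative, not the kernel’s actual code) shows how much entropy was being discarded per task:

```python
import secrets

# A full unsigned-long canary, as originally chosen...
canary = secrets.randbits(64)

# ...but per-task canaries on x86 were then stored through a 32-bit
# truncation, throwing away half of the bits stack-protector relies on.
truncated = canary & 0xFFFFFFFF
entropy_lost_bits = 64 - 32
```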

Expanded stack/heap gap
Hugh Dickins, with input from many other folks, improved the kernel’s mitigation against having the stack and heap crash into each other. This is a stop-gap measure to help defend against the Stack Clash attacks. Additional hardening needs to come from the compiler, which should produce “stack probes” when doing large stack expansions: any Variable Length Arrays on the stack or alloca() usage need machine code generated to touch each page of memory within those areas, letting the kernel see that the stack is expanding with single-page granularity.

That’s it for now; please let me know if I missed anything. The v4.13 merge window is open!

Edit: Brad Spengler pointed out that I failed to mention the CONFIG_SECURITY_SELINUX_DISABLE issue with read-only LSM structures. This has now been added.

© 2017, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.
Creative Commons License

July 10, 2017 08:24 AM

Dave Airlie (blogspot): radv and the vulkan deferred demo - no fps left behind!

A little while back I took to wondering why one particular demo from the Sascha Willems vulkan demos was a lot slower on radv compared to amdgpu-pro. Like half the speed slow.

I internally titled this my "no fps left behind" project.

The deferred demo does offscreen rendering to three 2048x2048 color attachments and one 2048x2048 D32S8 depth attachment. It then renders using those down to a 1280x720 screen image.

Bas identified that the first cause was probably the fact we were doing clear color eliminations on the offscreen surfaces when we didn't need to. AMD GPUs have a delta-color compression feature, and with certain clear values you don't need to do the clear color elimination step. This brought me back from about 1/2 the FPS to about 3/4; however, it took me quite a while to figure out where the rest of the FPS were hiding.

I took a few diversions in my testing. I pulled in some experimental patches to allow the depth buffer to be texture-cache compatible, so we could bypass the depth decompression pass; however, this didn't seem to budge the numbers much.

I found a bunch of registers we were setting to different values from -pro; nothing much came of these.

I found some places where we were using a compute shader to fill some DCC or htile surfaces with a value, then doing a clear and overwriting the values. Not much help.

I noticed the vertex descriptions and buffer attachments on amdgpu-pro were done quite differently to how radv does them. With vulkan you have vertex descriptors and bindings; with radv we generate a set of hw descriptors from the combination of both. The pro driver uses typed buffer loads in the shader to embed the descriptor contents in the shader, then only updates the hw descriptors for the buffer bindings. This seems like it might be more efficient; guess what, no help. (LLVM just grew support for typed buffer loads, so we could probably move to this scheme now if we wished.)

I dug out some patches that inline all the push constants and some descriptors so our shaders have less overhead (this really helps our meta shaders have less impact). No help.

I noticed they export the shader results from the fragment shader in a different order, and always at the end (no help). Their vertex shader emits pos first (no help). Their vertex shader turns off exports for unused channels (no help).

I went on holidays for a week and came back to stare at the traces again, when my brain finally noticed something I'd missed. When binding the 3 color buffers, the addresses given as the base address were unusual. A surface has a 40-bit address; normally for alignment and tiling the bottom 16 bits are 0, and we shift 8 of those off completely before writing it. That leaves the bottom 8 bits of the written base address, which should be 0, and the CIK docs from AMD say as much. However, the pro traces didn't have these at 0. It appears from the earlier evergreen/cayman documents that these register bits control some tiling offset. After writing a hacky patch to set the values, I managed to get back the rest of the FPS I was missing in the deferred demo. I discussed it with AMD developers, and we worked out that the addrlib library has an API for working out these values, and it seems that using them allows better memory bandwidth utilisation. I've written a patch to try to use these values correctly and sent it out along with the DCC avoidance patch.

Now I'm not sure this will help any real apps; we may not be hitting limitations in that area, and I'm never happy with the benchmarks I run myself. I thought I saw some FPS difference with some madmax scenes, but I might be lying to myself. Once the patches land in mesa I'm sure others will run benchmarks and we can see if there is any use case where they have an effect. The AMD radeonsi OpenGL driver can also apply the same tweaks, so hopefully there will be some benefit there as well.

Otherwise I can just write this off as getting deferred to run at parity, removing at least one of the deltas that radv has compared to the pro driver. Some of the other differences I discovered along the way might also have some promise in other scenarios, so I'll keep an eye on them.

Thanks to Bas, Marek and Christian for looking into what the magic meant!

July 10, 2017 08:08 AM

Dave Airlie: Migrating to blogspot

Due to lots of people telling me LJ is bad, mm'kay, I've migrated to blogspot.

New blog is/will be here: https://airlied.blogspot.com

July 10, 2017 06:36 AM

Dave Airlie (blogspot): Migrating my blog here

I'm moving my blog from LJ to blogspot, because people keep telling me LJ is up to no good, like hacking DNC servers and interfering in elections.

July 10, 2017 06:29 AM

July 08, 2017

Kernel Podcast: Linux Kernel Podcast for 2017/07/07

Audio: http://traffic.libsyn.com/jcm/20170707.mp3

Linux 4.12 final is released, the 4.13 merge window opens, and various assorted ongoing kernel development is described in detail.

Editorial note

Reports of this podcast’s demise are greatly exaggerated. But it is worth noting that recording this weekly is HARD. That said, I am going to work on automation (I want the podcast to effectively write itself by providing a web UI over LKML threads that allows anyone to write summaries, add author bios, links, etc. – and expand this to other communities), but that will all take some time. Until that happens, we’ll just have to live with some breaks.

Announcements

Linus Torvalds announced Linux 4.12 final. In his announcement mail, Linus reflects that “4.12 is just plain big”, noting that, this was “one of the bigger releases historically, and I think only 4.9 ends up having had more commits. And 4.9 was big at least partly because Greg announced it was an LTS [Long Term Support – receiving updates for several years] kernel”. In pure numbers, 4.12 adds over a million lines of code over 4.11, about half of which can be attributed to enablement for the AMD Vega GPU support. As usual, both Linux Weekly News (LWN) and KernelNewbies have excellent, and highly detailed summaries. Listeners are encouraged to support real kernel journalism by subscribing to Linux Weekly News and visiting lwn.net.

Theodore (Ted) Ts’o posted “Next steps and plans for the 2017 Maintainer and Kernel Summits”. He reminds everyone of the (slightly) revised format for this year’s Kernel Summit (which is, as is often the case, co-located with a Linux Foundation event, in the form of the Open Source Summit Prague in October). Notably, a program committee has been established to help encourage submissions from those who feel they should be present at the event. To learn more, see the mailing list archives containing the announcement: https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss (technically the deadline has already passed, or is tomorrow, depending)

Greg K-H (Kroah-Hartman) announced Linux 4.4.76, 4.9.36, and 4.11.9.

Willy Tarreau announced Linux 3.10.106, including a reminder that this “LTS” [Long Term Stable] kernel is “scheduled for end of life on end of October”.

Steven Rostedt released preempt-rt (“Real Time”) kernels 3.10.107-rt122, 3.18.59-rt65, 4.4.75-rt88, and 4.9.35-rt25, all of which were simply rebases to stable kernel updates and had “no RT specific changes”. It will be interesting to see if some of the hotplug fixes Thomas Gleixner has sent for Linux 4.13 will resolve issues seen by some RT users when doing hotplug.

Sebastian Andrzej Siewior announced preempt-rt (“Real time”) kernels v4.9.33-rt23, and v4.11.7-rt3, which still notes potential for a deadlock under CPU hotplug.

Stephen Hemminger announced iproute2 version 4.12.0, matching Linux 4.12. This includes support for features present in the new kernel, including flower support and enhancements to the TC (Traffic Control) code: https://www.kernel.org/pub/linux/utils/net/iproute2/iproute2-4.12.0.tar.gz

Bartosz Golaszewski posted libgpiod v0.3:
https://github.com/brgl/libgpiod/releases/tag/v0.3

Mathieu Desnoyers announced LTTng modules 2.10.0-rc2, 2.9.3, 2.8.6, including support for “4.12 release candidate kernels”.

The 4.13 merge window

With the opening of the 4.13 merge window, many pull requests have begun flowing for what will become the new hotness in another couple of months. We won’t summarize each in detail (that resulted in a one hour long podcast the last time…) but will instead call out a few “interesting” changes of note. Stephen Rothwell also promptly updated his daily linux-next tree with the usual disclaimer that “Please do not add any v4.14 material to you[r] linux-next included branches until after v4.13-rc1 has been released”.

ACPI. Rafael J. Wysocki posted “ACPI updates for v4.13-rc1”, which includes an update to the ACPICA (ACPI Component Architecture) release of 20170531 that adds support to the OS-independent ACPICA layer for ACPI 6.2. This includes a number of new tables, including the PPTT (Processor Properties and Topology Table) that some of us have wanted to see for many years (as a means to more fully describe the NUMA properties of ARM servers, as just a random example…). In addition, Kees Cook has done some work to clean up the use of function pointer structures in ACPICA to use “designated initializers” so as “to make the structure layout randomization GCC plugin work with it”. All in all, this is a nice set of updates for all architectures.

AppArmor. John Johansen noted in his earlier pull request (to James Morris, who owns overall security subsystem pull requests headed to Linus) that an attempt was being made to get many of the Ubuntu specific AppArmor patches upstreamed. The 4.13 patches “introduces the domain labeling base code that Ubuntu has been carrying for several years”. He then plans to begin to RFC other Ubuntu-specific patches in later cycles.

ARM. Arnd Bergmann notes a number of changes to 64-bit ARM platforms, including work done by Timur Tabi to change kernel def(ault)config files to “enable a number of options that are typically required for server platforms”. It’s only been many years since this should have been the case in upstream Linux. Meanwhile, in a separate pull for “ARM: 64-bit DT [DeviceTree] updates”, support is added for many new boards (“For the first time I can remember, this is actually larger than the corresponding branch for 32-bit platforms”), including new varieties of “OrangePi” based on Allwinner chipsets.

Docs. Jon(athan) Corbet had noted that “You’ll also encounter more than the usual number of conflicts, which is saying something”. Linus “fixed the ones that were actual data conflicts”, but he had some suggestions for how Kbuild could be modified so that a “make allmodconfig” checked for the existence of the various files being referenced in the rst documentation source files. He also noted that he was happy to see docbook “finally gone”, but that sphinx, the tool now used to generate documentation, “isn’t exactly a speed demon”.

Hotplug. As noted elsewhere, Thomas Gleixner posted a pull request for various smp hotplug fixes that includes replacing an “open coded RWSEM [Read Write Semaphore] with a percpu RWSEM”. This is done to enable full coverage by the kernel’s “lockdep” locking dependency checker in order to catch hotplug deadlocks that have been seen on certain RT (Real Time) systems.

IRQ. Thomas Gleixner posted “irq updates for 4.13”, which includes “Expand the generic infrastructure handling the irq migration on CPU hotplug and convert X86 over to it” in preparation for cleaning up affinity management on blk multiqueue devices (preventing interrupts being moved around during hotplug by instead shutting down affine interrupts intended to be always routed to a specific CPU). Thomas notes that “Jens [the blk maintainer] acked them and agreed that they should go with the irq changes”, but Linus later pushed back strongly after hitting merge conflicts that made him feel that some of these changes should have gone in via the blk tree instead of clashing with it. Linus also questioned whether the onlining code worked at all.

Objtool. Ingo Molnar posted a pull request including changes to the “objtool” tool intended to allow the tracking of stack pointer modifications through “machine instructions of disassembled functions found in kernel .o files”. The idea is to remove a dependency upon compiling the kernel with the CONFIG_FRAME_POINTERS=y option (which causes a larger stack frame and possible additional register pressure on some architectures) while still retaining the ability to generate correct kernel debuginfo data in the future.

PCI. Thomas Gleixner posted “x86/PCI updates for 4.13”, which includes work to separate PCI config space accessors from using a global PCI lock. Apparently, x86 already had an additional PCI config lock and so two layers of redundant locking were being employed, while neither was strictly necessary in the case of ECAM (“mmconfig”) based configuration, since “access to the extended configuration space [MMIO based configuration in PCIe] does not require locking”. Thomas also notes that a commit which had switched x86 to use ECAM [the MMIO mode] by default was removed so it will still use “type1 accessors” (the “old fashioned way” that Linus is so happy with) serialized by x86 internal locking for primary configuration space. This set of patches came in through x86 via Thomas with Bjorn Helgaas’s (PCI maintainer) permission.

RCU. Ingo Molnar noted that “The sole purpose of these changes is to shrink and simplify the RCU code base, which has suffered from creeping bloat”.

Scheduler. Ingo Molnar posted a pull request that included a number of changes, among them being NUMA scheduling improvements to address regressions seen when comparing 4.11 based kernels to older ones, from Rik van Riel.

VFS. Al Viro went to town with VFS updates split into more than 10 parts (yes, really, actually 11 as of this writing). These are caused by various intrusive changes which impact many parts of the kernel tree. Linus said he would “*much* rather do five separate pull requests where each pull has a stated reason and target, than do one big mixed-up one”. Which is good because Viro promised many more than 5. Patch series number 11 got the most feedback so far.

X86. Ingo Molnar also went to town, in typical fashion, with many different updates to the kernel. These included mm changes enabling more Intel 5-level paging features (switching the “GUP” or “Get User Pages” code over to the newer generic kernel implementation shared by other architectures), and “[C]ontinued work to add PCID [Process Context ID] support”. Per-process context IDs allow for TLB (Translation Lookaside Buffer – the micro caches that store virtual to physical memory translations following page table walks by the hardware walkers) flush infrastructure optimizations on legacy architectures such as x86 that do not have certain TLB hardware optimizations. Ingo also posted microcode updates that include support for saving microcode pointers and wiring them up for use early in the “resume-from-RAM” case, and fixes to the Hyper-V guest support that add a synthetic CPU MSR (Model Specific Register) providing the CPU TSC frequency to the guest.

Ongoing Development

ARM. Will Deacon posted the fifth version of a patch series entitled “Add support for the ARMv8.3 Statistical Profiling Extension”, which provides a linear, virtually addressed memory buffer containing statistical samples (subject to various filtering) related to processor operations of interest that are performed by running (application) code. Sample records take the form of “packets”, which contain very detailed amounts of information, such as the virtual PC (Program Counter) address of a branch instruction, its type (conditional, unconditional, etc.), number of cycles waiting for the instruction to issue, the target, cycles spent executing the branch instruction, associated events (e.g. misprediction), and so on. Detailed information about the new extension is available in the ARM ARM, and is summarized in a blog post, here: https://community.arm.com/processors/b/blog/posts/statistical-profiling-extension-for-armv8-a

RISC-V. Palmer Dabbelt posted v4 of the enablement patch series adding support for the Open Source RISC-V architecture (which will then require various enablement for specific platforms that implement the architecture). In his patch posting, he notes changes from the previous version 3 that include disabling cmpxchg64 (a 64-bit instruction that performs an “atomic” compare and exchange operation, but which isn’t atomic on 32-bit systems) on 32-bit, adding an ELF_HWCAP (hardware capability) within binaries in order for users to determine the ISA of the machine, and various other miscellaneous changes. He asks for consideration that this be merged during the ongoing 4.13 merge window; whether it will be remains to be seen. We will track this in future episodes.

FOLL_FORCE. Keno Fischer noted that “Yes, people use FOLL_FORCE”, referencing a commit from Linus in which an effort had been made to “try to remove use of FOLL_FORCE entirely” from the procfs (/proc) filesystem. Keno says, “We used these semantics as a hardening mechanism in the julia JIT. By opening /proc/self/mem and using these semantics, we could avoid needing RWX pages, or a dual mapping approach”. In other words, they cheat and don’t set up direct RWX mappings ahead of time, but instead get access to them via the back door, using the kernel’s “/proc/self/mem” interface directly. Linus replied, “Oh, we’ll just re-instate the kernel behavior, it was more an optimistic “maybe nobody will notice” thing, and apparently people did notice”.
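
The /proc/self/mem mechanism Keno describes can be seen in a few lines of Python (writing to our own writable buffer here; FOLL_FORCE is what additionally lets this interface write through mappings that are not otherwise writable, such as JIT code pages):

```python
import ctypes

# A small buffer in our own address space to act as the patch target.
buf = ctypes.create_string_buffer(b"AAAA")
addr = ctypes.addressof(buf)

# /proc/self/mem is a seekable window onto our own address space.
with open("/proc/self/mem", "r+b", buffering=0) as mem:
    mem.seek(addr)
    assert mem.read(4) == b"AAAA"
    mem.seek(addr)
    mem.write(b"JIT!")  # patch memory "behind the back" of normal stores

patched = buf.value
```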

GICv4. Marc Zyngier posted version 2 of a patch series entitled “irqchip: KVM: Add support for GICv4”, a “(monster of a) series [that] implements full support for GICv4, bringing direct injection of MSIs [Message Signalled Interrupts] to KVM on arm and arm64, assuming you have the right hardware (which is quite unlikely)”. Marc says that the “stack has been *very lightly* tested on an arm64 model, with a PCI virtio block device passed from the host to a guest (using kvmtool and Jean-Philippe Brucker’s excellent VFIO support patches). As it has never seen any HW, I expect things to be subtly broken, so go forward and test if you can, though I’m mostly interested in people reviewing the code at the moment”. It’s awesome to see 64-bit ARM systems on par with legacy architectures when it comes to VM interrupt injection.

GPIO. Andy Shevchenko posted a patch (with Linus Walleij’s approval) noting that Intel would help maintain GPIO ACPI support in the GPIO subsystem.

Hardlockup. Nicholas Piggin posted “[RFC] arch hardlockup detector interfaces improvement” which aims to “make it easier for architectures that have their own NMI / hard lockup detector to reuse various configuration interfaces that are provided by generic detectors (cmdline, sysctl, suspend/resume calls)”. He “do[es] this by adding a separate CONFIG_SOFTLOCKUP_DETECTOR [kernel configuration option], and juggling around what goes under config options. HAVE_NMI_WATCHDOG continues to be the config for arch to override the hard lockup detector, which is expanded to cover a few more cases”.

HMM. Jérôme Glisse posted “Cache coherent device memory (CDM) with HMM”, which layers above his previous HMM (Heterogeneous Memory Management) work to provide a generic means to manage device memory that behaves much like regular system memory but may still need managing “in isolation from regular memory” (for any number of reasons, including NUMA effects). This is particularly useful in the case of a coherently attached system bus being used to connect on-device memory, such as CAPI or CCIX. [disclaimer: this author chairs the CCIX software working group]

Hyper-V. KY Srinivasan posted an updated version of his “Hyper-V: paravirtualized remote TLB flushing and hypercall improvements” patches, which aim to optimize the case of remote TLB flushing on other vCPUs within a guest. TLBs are micro caches that store VA (Virtual Address) to PA (Physical Address) translations for VMAs (Virtual Memory Areas) that need to be invalidated during a context switch operation from one process to another. Typically, an Operating System may either utilize an IPI (Inter-Processor-Interrupt) to schedule a remote function on other CPUs that will tear down their TLB entries, or – on more enlightened and sophisticated modern computer architectures – may perform a hardware broadcast invalidation instruction that achieves the same without the gratuitous overhead. On x86 systems, IPIs are commonly used by guest operating systems, and their impact can be reduced by providing special guest hypercalls allowing for hypervisor assistance in place of broadcast IPIs. Jork Loeser also posted a patch updating the Hyper-V vPCI driver to “use the Server-2016 version of the vPCI protocol, fixing MSI creation”.

ILP32. Yury Norov posted version 8 of a patch series entitled “ILP32 for ARM64”, which aims to enable support for the optional ILP32 (32-bit Integer, Long, and Pointer) userspace ABI on 64-bit ARM processors. In ways similar to “x32” on 64-bit “x86” systems, ILP32 aims to provide the benefits of the new ARMv8 ISA without having to use 64-bit data types and pointers for code that doesn’t actually require such large data or a large address space. Pointers (pun intended) are provided to an example kernel, GLIBC, and an OpenSuSE-based Linux distribution built against the newer ABI.

IMC Instrumentation Support. Madhavan Srinivasan posted version 10 of a patch series entitled “IMC Instrumentation Support” which aims to provide support for “In-Memory-Collection” infrastructure present in IBM POWER9 processors. IMC apparently “contains various Performance Monitoring Units (PMUs) at Nest level (these are on-chip but off-core), Core level and Thread level. The Nest PMU counters are handled by a Nest IMC microcode which runs in the OCC (On-Chip Controller) complex. The microcode collects the counter data and moves the nest IMC counter data to memory”. This effectively seems to be a microcontroller managed mechanism for providing certain core and uncore counter data using a standardized interface.

Intel FPGA Device Drivers. Wu Hao posted version 2 of a patch series entitled “Intel FPGA Device Drivers”, which “provides interfaces for userspace applications to configure, enumerate, open and access FPGA accelerators on platforms equipped with Intel(R) PCIe based FPGA solutions and enables system level management functions such as FPGA partial reconfiguration, power management and virtualization”. In other words, many of the capabilities required for datacenter level deployment of PCIe-attached FPGA accelerators.

Interconnects. Georgi Djakov posted version 2 of a patch series entitled “Introduce on-chip interconnect API”, which aims to provide a generic API to help manage the many varied high performance interconnects present on modern high-end System-on-Chip “processors”. As he notes, “Modern SoCs have multiple processors and various dedicated cores (video, gpu, graphics, modem). These cores are talking to each other and can generate a lot of data flowing through the on-chip interconnects. These interconnect buses could form different topologies such as crossbar, point to point buses, hierarchical buses or use the network-on-chip concept”. The API provides an ability (subject to hardware support thereof) to control bandwidth use, QoS (Quality-of-Service), and other settings. It also includes code to enable the Qualcomm msm8916 interconnect with a layered driver.

IRQs. Daniel Lezcano posted version 10 of a patch series entitled “irq: next irq tracking”, which aims to predict future IRQ occurrences based upon previous system behavior. “As previously discussed the code is not enabled by default, hence compiled out”. A small circular buffer is used to keep track of non-timer interrupt sources. “A third patch provides the mathematics to compute the regular intervals”. The goal is to predict future expected system wakeups, which is useful from a latency perspective, as well as for various scheduling or energy calculations later on.
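
As a rough illustration of the approach (names and thresholds invented here, not taken from Daniel’s patches): keep recent timestamps per interrupt source in a small circular buffer and extrapolate when the intervals look regular.

```python
from collections import deque

class IrqIntervalPredictor:
    """Toy model of interval-based IRQ prediction."""

    def __init__(self, depth=4):
        # Small circular buffer of recent timestamps for one IRQ source.
        self.stamps = deque(maxlen=depth)

    def record(self, now):
        self.stamps.append(now)

    def predict_next(self):
        if len(self.stamps) < 2:
            return None  # not enough history yet
        ts = list(self.stamps)
        gaps = [b - a for a, b in zip(ts, ts[1:])]
        mean = sum(gaps) / len(gaps)
        # Only trust sources firing at roughly regular intervals.
        if max(gaps) - min(gaps) > 0.25 * mean:
            return None
        return ts[-1] + mean
```

An idle governor could use such a prediction to pick a shallower sleep state when a wakeup is expected soon.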

Memory Allocation Watchdog. Tetsuo Handa posted version 9 of a patch series entitled “mm: Add memory allocation watchdog kernel thread”, which “adds a watchdog which periodically reports number of memory allocating tasks, dying tasks and OOM victim tasks when some task is spending too long time inside __alloc_pages_slowpath() [the code path called when a running program – known as a task within the kernel – must synchronously block and wait for new memory pages to become available for allocation]”. Tetsuo adds, “Thanks to OOM [Out-Of-Memory] reaper which can guarantee forward progress (by selecting the next OOM victim) as long as the OOM killer can be invoked, we can start testing low memory situations which were previously too difficult to test. And we are now aware that there are still corner cases remaining where the system hangs without invoking the OOM killer”. The patch aims to help determine whether long hangs can be explained by memory allocation stalls.

Memory Protection Keys. Ram Pai posted version 5 of a patch series entitled “powerpc: Memory Protection Keys”, which aims to enable a feature in future ISA3.0 compliant POWER architecture platforms comparable to the “memory protection keys” added by Intel to their Intel x64 Architecture (“x86” variant). As Ram notes, “The overall idea: A process allocates a key and associates it with an address range within its address space. The process then can dynamically set read/write permissions on the key without involving the kernel. Any code that violates the permissions of the address space, as defined by its associated key, will receive a segmentation fault”. The patches enable support on the “PPC64 HPTE platform” and are noted to have passed all of the same tests as on x86.

Modules. Djalal Harouni posted version 4 of a patch series entitled “modules: automatic module loading restrictions”, which adds a new global sysctl flag, as well as a per-task one, called “modules_autoload_mode”. “This new flag allows to control only automatic module loading [the kernel-invoked auto loading of certain modules in response to user or system actions] and if it is allowed or not, aligning in the process the implicit operation with the explicit [existing option to disable all module loading] one where both are now covered by capabilities checks”. The idea is to prevent certain classes of security exploit wherein – for example – a system can be caused to load a vulnerable network module by sending it a certain packet, or by an application calling a certain kernel function. Other such classes of attack exist against automatic module loading, and have been the subject of a number of CVE [Common Vulnerabilities and Exposures] releases requiring frantic system patching. This feature will allow sysadmins to limit module auto loading on some classes of systems (especially embedded/IoT devices).

Network filtering. Shubham Bansal posted an RFC patch entitled “RFC: arm eBPF JIT compiler” which “is the first implementation of eBPF JIT for [32-bit] ARM”. Russell King had various questions, including whether the code handled “endian issues” well, to which Shubham replied that he had not tested it with BE (Big Endian) but was interested in setting up qemu to run Big Endian ARM models and would welcome help improving the code.

NMI. Adrien Mahieux posted “x86/kernel: Add generic handler for NMI events” which “adds a generic handler where sysadmins can specify the behavior to adopt for each NMI event code. List of events is provided at module load or on kernel cmdline, so can also generic kdump upon boot error”. The options include silently ignoring NMIs (which actually passes them through to the next handler), drop NMIs (actually discard them), or to panic the kernel immediately. An example given is using the drop parameter during kdump in order to prevent a second NMI from triggering a panic while another crash dump is already capturing from the first.

Randomness. Jason A. Donenfeld posted version 4 of a patch series entitled “Unseeded In-Kernel Randomness Fixes”, which aims to address “a problem with get_random_bytes being used before the RNG [Random Number Generator] has actually been seeded [given an initial set of values following boot time]. The solution for fixing this appears to be multi-pronged. One of those prongs involves adding a simple blocking API so that modules that use the RNG in process context can just sleep (in an interruptible manner) until the RNG is ready to be used. This winds up being a very useful API that covers a few use cases, several of which are included in this patch set”.
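
The userspace analogue of this blocking behavior is the getrandom(2) syscall, reachable from Python via os.getrandom (Linux only); a small sketch:

```python
import os

# By default getrandom() blocks until the kernel's CRNG has been seeded,
# mirroring the in-kernel "sleep until the RNG is ready" API described
# in the patch series.
key = os.getrandom(16)

# With GRND_NONBLOCK the call fails (BlockingIOError) instead of
# sleeping if the pool is not yet initialized.
nonce = os.getrandom(16, os.GRND_NONBLOCK)
```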

Scheduler. Nico[las] Pitre posted “scheduler tinification” which “makes it possible to configure out some parts of the scheduler such as the deadline and realtime scheduler classes. The saving in kernel footprint is non negligible”. In the examples cited, kernel text shrinks by almost 8K, which is significant in some very small Linux systems, such as in IoT.

S.A.R.A. Salvatore Mesoraca posted “S.A.R.A. a new stacked LSM” (which your author is choosing to pronounce as in “Sarah”, for various reasons, and apparently actually stands for “S.A.R.A is Another Recursive Acronym”). This is “a stacked Linux Security Module that aims to collect heterogeneous security measures, providing a common interface to manage them. It can be useful to allow minor security features to use advanced management options, like user-space configuration files and tools, without too much overhead”.

Secure Memory Encryption (SME). Tom Lendacky posted version 8 of a patch series that implements support in Linux for this feature of certain future AMD CPUs. “SME can be used to mark individual pages of memory as encrypted through the page tables. A page of memory that is marked encrypted will be automatically decrypted when read from DRAM and will be automatically encrypted when written to DRAM”. In other words, SME allows a datacenter operator to build systems in which all data leaving the SoC is encrypted either at rest (on disk), or when hitting external memory buses that might (theoretically) be monitored. When combined with other features, such as “another AMD processor feature called Secure Encrypted Virtualization (SEV)”, it becomes possible to protect user data from intrusive monitoring by hypervisor operators (whether malicious or coerced). This is the correct way to provide memory encryption. While others have built the nonsense known as “enclaves”, the AMD approach correctly solves a more general problem. The AMD patches update various pieces of kernel infrastructure, from the UEFI code to IOMMU support, in order to carry page encryption state through.

SMIs. Kan Liang posted version 2 of a patch entitled “measure SMI cost (user)” which adds a “new sysfs entry /sys/device/cpu/freeze_on_smi” which will cause the “FREEZE_WHILE_SMM” bit in the Intel “IA32_DEBUGCTL” processor control register to be set. Once it is set, “the PMU core counters will freeze on SMI handler”. This can be used with a “new --smi-cost mode in perf stat…to measure the SMI cost by calculating unhalted core cycles and aperf results”. SMIs, or “System Management Interrupts”, are also referred to as “cycle stealing”, in that they are used by platform firmware to perform various housekeeping tasks on the application processor cores, usually without either the Operating System’s or the user’s knowledge. SMIs are used by OEMs and ODMs to “add value”, but they are also used for such things as system fan control and other essentials. What should happen, of course, is that a generic management controller should be defined to handle this, but it was easier for the industry to build the mess that is SMIs, and for Intel to then add tracking so that users can see where bad latencies come from.

Speculative Page Faults. Laurent Dufour posted version 5 of a patch series entitled “Speculative page faults”, which is “a port on kernel 4.12 of the work done by Peter Zijlstra to handle page fault without holding the mm semaphore”. As he notes, “The idea is to try to handle user space page faults without holding the mmap_sem [a semaphore in each process’s memory descriptor that is shared by all threads within the process]. This should allow better concurrency for massively threaded processes since the page fault handler will not wait for other threads[‘] memory layout change to be done, assuming that this change is done in another part of the process’s memory space. This type of page fault is named speculative page fault. If the speculative page fault fails because concurrency is detected or because underlying PMD [Page Middle Directory] or PTE [Page Table Entry] tables are not yet allocat[ed], it [fails] its processing and a classic page fault is then tried”.

THP. Kirill A. Shutemov posted a “HELP-NEEDED” thread entitled “Do not lose dirty bit on THP pages”, in which he notes that Vlastimil Babka “noted that pmdp_invalidate [Page Middle Directory Pointer invalidate] is not atomic and we can loose dirty and access bits if CPU sets them after pmdp dereference, but before set_pmd_at()”. Kirill notes that this doesn’t currently lead to user-visible problems, but “fixing this would be critical for future work on THP: both huge-ext4 and THP [Transparent Huge Pages] swap out rely on proper dirty tracking”. By access and dirty tracking, Kirill means the page table bits that indicate whether a page has been accessed, or contains dirty data that should be written back to storage. Such bits are updated automatically by hardware on memory access. He adds that “Unfortunately, there’s no way to address the issue in a generic way. We need to fix all architectures that support THP one-by-one”. Hence the topic of the thread containing the words “HELP-NEEDED”. Martin Schwidefsky had some feedback that the proposed solution would not work on s390, but that if pmdp_invalidate returned the old entry, that could be used to update certain logic based on the dirty bits. Andrea Arcangeli replied to Martin, “That to me seems the simplest fix”. Separately, Kirill posted the “Last bits for initial 5-level paging” on x86.

Timers. Christoph Hellwig posted “RFC: better timer interface”, a patch series which “attempts to provide a ‘modern’ timer interface where the callback gets the timer_list structure as an argument so that it can use container_of instead of having to cast to/from unsigned long all the time”. Arnd Bergmann noted that “This looks really nice, but what is the long-term plan for the interface? Do you expect that we will eventually change all 700+ users of timer_list to the new type, or do we keep both variants around indefinitely to avoid having to do mass-conversions?”. Christoph thought it was possible to perform a wholesale conversion, but that “it might take some time”.

Thunderbolt. Mika Westerberg posted version 3 of a patch series implementing “Thunderbolt security levels and NVM firmware upgrade”. Apparently, “PCs running Intel Falcon Ridge or newer need these in order to connect devices if the security level is set to ‘user (SL1)’ or ‘secure (SL2)’ from BIOS” and “The security levels were added to prevent DMA attacks when PCIe is tunneled over Thunderbolt fabric where IOMMU is not available or cannot be enabled for different reasons”. While cool, it is slightly saddening that some of the awesome demos from recent DEFCONs will be slightly harder to reproduce by nation state actors and those who really need to get outside more often.

VAS. Sukadev Bhattiprolu posted version 5 of a patch series entitled “Enable VAS”, a “hardware subsystem referred to as the Virtual Accelerator Switchboard” in the IBM POWER9 architecture. According to Sukadev, “VAS allows kernel subsystems and user space processes to directly access the Nest Accelerator (NX) engines which implement compression and encryption algorithms in the hardware”. In other words, these are simple workload acceleration engines that were previously only available using special (“icswx”) privileged instructions in earlier versions of POWER machines and are now to be available to userspace applications through a multiplexing API.

WMI. Darren Hart posted an updated “Convert WMI to a proper bus” patch series, which “converts WMI [Windows Management Instrumentation] into a proper bus, adds some useful information via sysfs, and exposes the embedded MOF binary. It converts dell-wmi to use the WMI bus architecture”. WMI is required to manage various contemporary (especially laptop) hardware, including backlights.

Xen. Juergen Gross posted “xen: add sysfs node for guest type” which provides information known to the guest kernel but not previously exposed to userspace, including the type of virtualization in use (HVM, PV, or PVH), and so on.

zRam. Minchan Kim posted an RFC patch entitled “writeback incompressible pages to storage”, which seeks to have the best of both worlds – the RAM savings of compression, while handling the cases where pages are incompressible. If an admin sets up a suitable backing block device, incompressible pages can be written out to storage instead of being stored uncompressed in RAM.

zswap. Srividya Desireddy posted version 2 of a patch that seeks to explicitly test for so-called “zero-filled” pages before submitting them for compression. This saves time and energy, and reduces application startup time (on the order of about 3% in the example given).

 

July 08, 2017 09:31 PM

July 06, 2017

Rusty Russell: Broadband Speeds, 2 Years Later

Two years ago, considering the blocksize debate, I made two attempts to measure average bandwidth growth, first using Akamai serving numbers (which gave an answer of 17% per year), and then using fixed-line broadband data from OFCOM UK, which gave an answer of 30% per annum.

We have two years more of data since then, so let’s take another look.

OFCOM (UK) Fixed Broadband Data

First, the OFCOM data:

So in the last two years, we’ve seen 26% increase in download speed, and 22% increase in upload, bringing us down from 36/37% to 33% over the 8 years. The divergence of download and upload improvements is concerning (I previously assumed they were the same, but we have to design for the lesser of the two for a peer-to-peer system).

The idea that upload speed may be topping out is reflected in the Nov-2016 report, which notes only an 8% upload increase in services advertised as “30Mbit” or above.

Akamai’s State Of The Internet Reports

Now let’s look at Akamai’s Q1 2016 and Q1 2017 reports.

This gives an estimate of 19% per annum in the last two years. Reassuringly, the US and UK (both fairly high-bandwidth countries, considered in my previous post to be a good estimate for the future of other countries) have increased by 26% and 19% respectively in the last two years, indicating there’s no immediate ceiling to bandwidth.

You can play with the numbers for different geographies on the Akamai site.

Conclusion: 19% Is A Conservative Estimate

17% growth now seems a little pessimistic: in the last 9 years the US Akamai numbers suggest the US has increased by 19% per annum, the UK by almost 21%.  The gloss seems to be coming off the UK fixed-broadband numbers, but they’re still 22% upload increase for the last two years.  Even Australia and the Philippines have managed almost 21%.

July 06, 2017 10:01 AM

June 29, 2017

Linux Plumbers Conference: Containers Microconference accepted into Linux Plumbers Conference

Following on from the Containers Microconference last year, we’re pleased to announce there will be a follow-on at Plumbers in Los Angeles this year.

The agenda for this year will focus on unsolved issues and other problem areas in the Linux kernel container interfaces, with the goal of allowing all container runtimes and orchestration systems to provide enhanced services. Of particular interest is unprivileged use of the container APIs, which can be used both to enable self-containerising applications and to deprivilege (make more secure) container orchestration systems. In addition we will be discussing the potential addition of new namespaces: an LSM namespace for per-container security modules; an IMA namespace for per-container integrity and appraisal; and file capabilities to allow setcap binaries to run within unprivileged containers.

For more details on this, please see this microconference’s wiki page.

We hope to see you there!

June 29, 2017 05:59 PM

June 20, 2017

Arnaldo Carvalho de Melo: Pahole in the news

Found another interesting article, this time mentioning a tool I wrote long ago and that, at least for kernel object files, has been working for a long time without much care on my part: pahole, go read a bit about it at Will Cohen’s “How to avoid wasting megabytes of memory a few bytes at a time” article.

Guess I should try running a companion script that tries to process all .o files in debuginfo packages to see how bad it is for non-kernel files, with all the DWARF changes over these years…


June 20, 2017 03:49 PM

June 15, 2017

Linux Plumbers Conference: Early Bird Rate Registration Ending Soon

A reminder that our Early Bird registration rate is ending soon. The last day at the Early Bird rate of $400 is Sunday, June 18th. We are also almost sold out of Early Bird slots (only 15% of our quota remains). Get yours soon!
Starting June 19th, registration will be at the regular rate of $550.
Please see the Attend page for info.

June 15, 2017 11:20 PM

June 14, 2017

Paul E. Mc Kenney: Stupid RCU Tricks: Simplifying Linux-kernel RCU

The last month or two has seen a lot of work simplifying the Linux-kernel RCU implementation, with more than 2700 net lines of code removed. The remainder of this post lists the user-visible changes, along with alternative ways to get the corresponding job done.


  1. The infamous CONFIG_RCU_KTHREAD_PRIO Kconfig parameter is now defunct, but the rcutree.kthread_prio kernel boot parameter gets the job done.
  2. The CONFIG_NO_HZ_FULL_SYSIDLE Kconfig parameter has kicked the bucket. There is no replacement because no one was using it. If you need it, revert the -rcu commit tagged by sysidle.2017.05.11a.
  3. The CONFIG_PROVE_RCU_REPEATEDLY Kconfig parameter is no more. There is no replacement because as far as I know, no one has used it for many years. It was a great help in tracking down lockdep-RCU warnings back in the day, but these warnings are now sufficiently rare that finding them one boot at a time is no longer a problem. If you need it, do the obvious hacking on Kconfig and lockdep.c.
  4. The CONFIG_SPARSE_RCU_POINTER Kconfig parameter now rests in peace. There is no replacement because there doesn't seem to be any reason for RCU's sparse checking to be the only such checking that is optional. If you really need to disable RCU's sparse checking, hand-edit the definition as needed.
  5. The CONFIG_CLASSIC_SRCU Kconfig parameter bought the farm. This was only present to handle massive failures of the new Tree/Tiny SRCU implementations, but these appear to be quite reliable and should be used instead of Classic SRCU.
  6. RCU's debugfs tracing is done for. As far as I know, I was the only real user, and I haven't used it in years. If you need it, revert the -rcu commit tagged by debugfs.2017.05.15a.
  7. The CONFIG_RCU_NOCB_CPU_NONE, CONFIG_RCU_NOCB_CPU_ZERO, and CONFIG_RCU_NOCB_CPU_ALL Kconfig parameters have departed. Use the rcu_nocbs kernel boot parameter instead, which can do quite a bit more than those Kconfig parameters ever could.
  8. Tiny RCU's event tracing and RCU CPU stall warnings are now pushing up daisies. The point of Tiny RCU is to be tiny and educational, and these added features were not helping reach either of these two goals. The replacement is to reproduce the problem with Tree RCU.
  9. These changes should matter only to people running rcutorture:

    1. The CONFIG_RCU_TORTURE_TEST_SLOW_PREINIT and CONFIG_RCU_TORTURE_TEST_SLOW_PREINIT_DELAY Kconfig parameters have been entombed: Use the rcutree.gp_preinit_delay kernel boot parameter instead.
    2. The CONFIG_RCU_TORTURE_TEST_SLOW_INIT and CONFIG_RCU_TORTURE_TEST_SLOW_INIT_DELAY Kconfig parameters have given up the ghost: Use the rcutree.gp_init_delay kernel boot parameter instead.
    3. The CONFIG_RCU_TORTURE_TEST_SLOW_CLEANUP and CONFIG_RCU_TORTURE_TEST_SLOW_CLEANUP_DELAY Kconfig parameters have passed on: Use the rcutree.gp_cleanup_delay kernel boot parameter instead.
There will probably be a few more simplifications in the near future, but this should be at least enough for one merge window!

June 14, 2017 09:03 PM

June 12, 2017

Linux Plumbers Conference: RDMA Microconference Accepted into the Linux Plumbers Conference

Following on from the successful RDMA Microconference last year, which resulted in a lot of fruitful discussions, we’re pleased to announce there will be a follow-on at Plumbers in Los Angeles this year.

In addition to looking at the usual kernel core gaps, ABI issues, documentation, and testing, we’ll also be looking at new fabrics (including NVMe), the challenges of implementing virtual RDMA devices, and integration possibilities with netdev.

For more details on this, please see this microconference’s wiki page.

June 12, 2017 09:53 PM

Arnaldo Carvalho de Melo: Article about ‘perf annotate’

Just found out about Ravi’s article about ‘perf annotate’, concise yet covers most features, including cross-annotation, go read it!


June 12, 2017 07:21 PM

June 09, 2017

Paul E. Mc Kenney: Stupid RCU Tricks: rcutorture Accidentally Catches an RCU Bug

With the Linux-kernel v4.13 merge window coming up, it is time to do at least a little heavy-duty testing of the patches destined for v4.14, which had been but lightly tested on my laptop. An overnight run on a larger test machine looked very good—with the exception of scenario TREE01 (defined by tools/testing/selftests/rcutorture/configs/rcu/TREE01{.boot,} in the Linux-kernel source tree), which got no fewer than 190 failures in a half-hour run. In other words, rcutorture saw 190 too-short grace periods in 30 minutes, for about one every 20 seconds.

This is not just bad. This is RCU completely and utterly failing to be RCU.

My first action was to re-run the tests on the commits slated for v4.13. You can imagine my relief to see them pass on all scenarios, including TREE01.

Then it was time for bisection. I have been burned many times by false bisections due to RCU's probabilistic failure modes, so I ran 24 30-minute tests on each commit. Fortunately, I could run six in parallel, so that each commit only consumed about two hours of test time. The bisection converged on a commit that adds a --kconfig argument to the rcutorture scripts, which allows me to do things like force lockdep to run in all scenarios. However, this commit should have absolutely no effect on the inner workings of RCU.

OK, perhaps this commit managed to fatally mess up the .config file. But no, the .config files from this commit compare equal to those from the preceding commit. Some additional poking gives me confidence that the kernels being built are also identical. Still, the one fails and the other does not.

The next step is to look very carefully at the console output from the failing runs, most of which contain many complaints about RCU grace periods being too short. Except that one of them also contains RCU CPU stall warnings. In fact, one of the stall warnings lists no fewer than 26 CPUs as stalling the current RCU grace period.

This came as a bit of a surprise, partly because I don't recall ever seeing that many CPUs stalling a single grace period, but mostly because the test was only supposed to use eight CPUs.

A look at the beginning of the console output showed that RCU was inexplicably prepared to deal with 43 CPUs instead of the expected eight. A bit more digging showed that the qemu command used to run the failing test had “-smp 43”, while the qemu command for the successful test instead had “-smp 8”. In both cases, the qemu command also included the kernel boot parameter “maxcpus=8”. And a very stupid bug in the --kconfig change to the scripts turned out to be responsible for the bogus -smp argument.

The next step is to swap the values of qemu's -smp argument. And the failure follows the “-smp 43” setting. This means that it is possible that the RCU failures are due to a latent timing bug in RCU. After all, the test system has only 64 CPUs, and I was running 43*6=258 CPUs worth of tests on it. But running six concurrent rcutorture tests with both -smp and maxcpus set to 43 passes with flying colors. So RCU must be suffering from some other problem.

The next question is exactly what is supposed to happen when qemu and the kernel have very different ideas of how many CPUs there are. The ever-helpful Documentation/admin-guide/kernel-parameters.txt file states that maxcpus= limits not the overall number of CPUs, but rather the number that are brought up at boot time. Another look at the console output confirms that in the failing case, eight CPUs are brought up at boot time. However, the other 35 come online some time after boot, sometimes taking a few minutes to come up. Which explains another anomaly I noticed while bisecting, namely that about half the tests ran 30 minutes without failure, but the ones that failed did so within the first five minutes of the run. Apparently the RCU failures are connected somehow to the late arrival of the extra 35 CPUs.

Except that RCU configured itself for the full 43 CPUs, and RCU is supposed to be able to handle CPUs coming and going. In fact, RCU has repeatedly demonstrated its ability to handle CPUs coming and going for more than a decade. So it is time to enable event tracing on a failure scenario (thank you, Steve!). One of the traces shows that there is no RCU callback connected with the first failure, which points the finger of suspicion at RCU expedited grace periods.

A quick inspection of the expedited code shows missing synchronization for the case where a CPU makes its very first appearance just as an expedited grace period starts. Oh, the leaf rcu_node structure's ->lock is held both when updating the number of CPUs that have ever been seen (which is the rcu_state structure's ->ncpus field) and when updating the bitmasks indicating exactly which CPUs have ever been seen (which is the leaf rcu_node structure's ->expmaskinitnext field), but it drops that lock between those two updates.

This means that the expedited grace period might sample the ->ncpus field, notice the change, and therefore check all the ->expmaskinitnext fields—but before those fields had been updated. Not a problem for this grace period, since the new CPUs haven't yet started and thus cannot yet be running any RCU read-side critical sections, which means that there is no reason whatsoever for this grace period to pay any attention to them. However, the next expedited grace period would again sample the ->ncpus field, see no change, and thus not bother checking the ->expmaskinitnext fields. Thus, this grace period would also ignore the new CPUs, which by this time could be very much alive and running RCU read-side critical sections. Hence the too-short grace periods, and hence them showing up within the first few minutes of the run, during the time that the extra 35 CPUs are in the process of coming online.

The fix is easy: Just move the update of ->ncpus to the same critical section as the update of ->expmaskinitnext. With this fix, rcutorture passes the TREE01 scenario even with bogus -smp arguments to qemu. There is therefore once again a bug in rcutorture: There are still bugs in RCU somewhere, and rcutorture is failing to find them!

Strangely enough, I might never have noticed the bug in expedited grace periods had I not made a stupid mistake in the scripting. Sometimes it takes a bug to locate a bug!

June 09, 2017 08:49 PM

June 07, 2017

Paul E. Mc Kenney: Verification Challenge 6: Linux-Kernel Tree RCU

It has been more than two years since I posted my last verification challenge, so it is only natural to ask what, if anything, has happened in the meantime. The answer is “Quite a bit!”

I had the privilege of attending The Royal Society Verified Trustworthy Software Systems Meeting, where I was on a panel on “Verification in Industry”. I also attended the follow-on Verified Trustworthy Software Systems Specialist Meeting, where I presented on formal verification and RCU. There were many interesting presentations (see slides 9-12 of this presentation), the most memorable being a grand challenge to apply formal verification to machine-learning systems. If you think that challenge is not all that grand, you should watch this video, which provides an entertaining demonstration of a few of the difficulties.

Closer to home, in the past year there have been three successful applications of automated formal-verification tools to Linux-kernel RCU:


  1. Lihao Liang applied the C Bounded Model Checker (CBMC) to Tree RCU (draft paper). This was a bit of a tour de force, converting Linux-kernel C code (with a bit of manual preprocessing) to a logic expression, then invoking a SAT solver on that expression. The expression's variables correspond to the inputs to the code, the possible multiprocessor scheduling decisions, and the possible memory-model reorderings. The expression evaluates to true if there exists an execution that triggers an assertion. The largest expression had 90 million Boolean variables and 450 million clauses, occupied some tens of gigabytes of memory, and, stunningly, was solved in less than 80 hours of CPU time. Pretty amazing considering that SAT is NP-complete and that two to the ninety millionth power is an excessively large number!!!
  2. Michalis Kokologiannakis applied Nidhugg to Tree RCU (draft paper). Nidhugg might not make quite as macho an attack on an NP-complete problem as does CBMC, but there is some reason to believe that it can handle larger chunks of the Linux-kernel RCU code.
  3. Lance Roy applied CBMC to Classic SRCU. Interestingly enough, this verification could be carried out completely automatically, without manual preprocessing. This approach is therefore available in my -rcu tree (git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git) on the branch formal.2017.06.07a.


In all three cases, the tools verified portions of RCU and SRCU as correct, and in all three cases, the tools successfully located injected bugs. (Hey, any of us could write a program that consisted of “printf("Validated\n");”, so you do have to validate the verifier!) And yes, I have given these guys trouble about the fact that their tools didn't find any bugs that I didn't already know about, but these results are nevertheless extremely impressive. Had you told me ten years ago that this would happen, I have no idea how I would have responded, but I most certainly would not have believed you.

In theory, Nidhugg is more scalable but less thorough than CBMC. In practice, it is too early to tell.

So what is the status of the first five verification challenges?


  1. rcu_preempt_offline_tasks(): Still open. That said, Michalis found the infamous RCU bug at Linux-kernel commit 281d150c5f88 and further showed that my analysis of the bug was incorrect, though my fixes did actually fix the bug. So this challenge is still open, but the tools have proven their ability to diagnose rather ornate concurrency bugs.
  2. RCU NO_HZ_FULL_SYSIDLE: Still open. Perhaps less pressing given that it will soon be removed from the kernel, but the challenge still stands!
  3. Apply CBMC to something: This is an ongoing challenge to developers to give CBMC a try on concurrent code. And why not also try Nidhugg?
  4. Tiny RCU: This self-directed challenge was “born surmounted”.
  5. Uses of RCU: Lihao, Michalis, and Lance verified some simple RCU uses as part of their work, but this is an ongoing challenge. If you are a formal-verification researcher and really want to prove your tool's mettle, take on the Linux kernel's dcache subsystem!


But enough about the past! Given the progress over the past two years, a new verification challenge is clearly needed!

And this sixth challenge is available on 20 branches whose names start with “Mutation” at https://github.com/paulmckrcu/linux.git. Some of these branches are harmless transformations, but others inject bugs. These bugs range from deterministic failures to concurrent data races to forward-progress failures. Can your tool tell which is which?

If you give any of these a try, please let me know how it goes!

June 07, 2017 08:41 PM

June 04, 2017

Kernel Podcast: Catching up on podcasts…new one drops Monday!

Sorry for the delay with getting podcasts out. I’m working on a new one! Coming Monday!

June 04, 2017 06:44 AM

May 31, 2017

Eric Sandeen: 2012 Nissan LEAF battery deathwatch

First of all – I think EVs are great.  They are the future of personal transportation.  But this is the story of a first-gen EV battery with some … issues.

I bought a used 2012 Nissan LEAF with about 38k miles for a great price – in part because it started life as a leased car in Texas, and the early LEAF batteries didn’t much like the heat.  As a result, the battery is not super healthy, with only about 60 miles of range on a full charge on a balmy day.  While this is enough to get me around on most days, there are times when a bit more range would be nice.  Thankfully, Nissan has retroactively warrantied LEAF batteries to retain 70% of their capacity (really, closer to 66%) for the first 5 years or 60,000 miles.

The LEAF dash shows remaining battery capacity (as opposed to current charge) on a 12-bar scale; when new, it showed 12 bars, and Nissan will warranty the battery if it gets to 8 bars or less.  My car currently has 9 bars.  1 to go.

So this was a gamble.  I’d actually like my battery to lose enough capacity before January 2018 to get a warranty replacement.

Thanks to a cool app called LeafSpy, I can monitor battery health, and correlate it to what others have said about when they dropped that 9th bar.  I’ll try to remember to update this periodically, but here are the readings so far, with trend lines and “target” values based on when The Internet said they lost their 9th bar, on average.  The aHr metric seems most relevant. With luck, it looks like I may make it, though I can’t explain the recent plateau after the initial steady decline…

I’ll try to remember to update this occasionally as time goes by.
Update: Here’s a constantly updated version of my stats:

May 31, 2017 01:41 AM

May 24, 2017

Pete Zaitcev: Community Meeting

<notmyname> first, the idea of having a regular meeting in addition to this one for people in different timezones
<cschwede_> +2!
<notmyname> specifically, mahatic and pavel/onovy/seznam. but of course we've all seen various chinese contributors too
<notmyname> but the point is that it's a place to bring up stuff that those in the other time zones are working on
<mattoliverau> Cool
<notmyname> I think it's a terrific idea
<tdasilva> i bet the guys working on tape would like that too
<notmyname> my goal is to find a time for it that is so horrible for US timezones that it will be obvious that not everyone needs to be there
<zaitcev> Yeah, if only there was a way to send a message... like a mail... to a list of people. And then it could be stored on a computer somewhere, ready to be read in any timezone recepient is in.
<notmyname> zaitcev: crazytown!
<mattoliverau> zaitcev: now your just talkin crazy

May 24, 2017 09:52 PM

May 23, 2017

Michael Kerrisk (manpages): Linux Shared Libraries course, Munich, Germany, 20 July 2017

I've scheduled a public instance of my "Building and Using Shared Libraries on Linux" course to take place in Munich, Germany on 20 July 2017.  This one-day course provides a thorough introduction to building and using shared libraries, covering topics such as: the basics of creating, installing, and using shared libraries; shared library versioning and naming conventions; the role of the dynamic linker; run-time symbol resolution; controlling symbol visibility; symbol versioning; preloading shared libraries; and dynamically loaded libraries (dlopen). The course format is a mixture of theory and practical.

The course is aimed at programmers who create and use shared libraries. Systems administrators who are managing and troubleshooting applications that use shared libraries will also find the course useful.

You can find out more about the course (such as expected background and course pricing) at http://man7.org/training/shlib/ and see a detailed course outline at
http://man7.org/training/shlib/shlib_course_outline.html.

May 23, 2017 02:14 PM

Michael Kerrisk (manpages): Cgroups/namespaces/seccomp/capabilities course

There are still some places available on my "Linux Security and Isolation APIs" that will take place in Munich, Germany on 17-19 July 2017.  This three-day course provides a deep understanding of the low-level Linux features (set-UID/set-GID programs, capabilities, namespaces, cgroups, and seccomp) used to implement privileged applications and build container, virtualization, and sandboxing technologies. The course format is a mixture of theory and practical.

The course is aimed at designers and programmers building privileged applications, container applications, and sandboxing applications. Systems administrators who are managing such applications are also likely to find the course of benefit.

You can find out more about the course (such as expected background and course pricing) at
http://man7.org/training/sec_isol_apis/
and see a detailed course outline at
http://man7.org/training/sec_isol_apis/sec_isol_apis_course_outline.html

May 23, 2017 02:01 PM

May 18, 2017

Linux Plumbers Conference: Linux Kernel Memory Model Workshop Accepted into Linux Plumbers Conference

A good understanding of the Linux kernel memory model is essential for a great many kernel-hacking and code-review tasks.  Unfortunately, the current documentation (memory-barriers.txt) has been said to frighten small children, so this workshop’s goal is to demystify this memory model, including hands-on demos of the tools, help installing/running the tools, and help constructing appropriate litmus tests.  These tools should go a long way toward the ultimate goal of automating the process of using memory models to frighten small children.

For more information, please see this microconference’s wiki page.  For those who like getting a head start, this page also includes information on downloading and installing the tools, the memory model, and thousands of pre-existing litmus tests.  (Collect the whole set!!!) We also welcome experience reports from early adopters of these tools.

We hope to see you there!

May 18, 2017 05:04 PM

May 15, 2017

Kernel Podcast: Linux Kernel Podcast for 2017/05/14

Audio: http://traffic.libsyn.com/jcm/20170514.mp3

In this week’s catchup mega-issue: Linux 4.12-rc1 (including a full summary of the 4.12 merge window), Linux 4.11 final is released, saving TLB flushes, various ongoing development, and a bunch of announcements.

Editorial Note

This podcast is a free service that I provide to the community in my spare time. It takes many, many hours to prepare and produce a single episode, much more during the merge window. This means that when I have major events (such as Red Hat Summit followed by OpenStack Summit) it will be delayed, as was the case this past week. Over the coming months, I hope to automate the production in order to reduce the overhead, but there will be some weeks where I need to skip a show. I am however covering the whole 4.12 merge window regardless. So while I would usually have just moved on, the circumstance warrants a mega-length catchup episode. I hope you’re still awake by the end.

Linux 4.12-rc1

Linus Torvalds announced Linux 4.12-rc1, “one day early, because I don’t like last-minute pull requests during the merge window anyway, and tomorrow is mother’s day [in the US], so I may end up being roped into various happenings”. He also noted “Besides, this has actually been a pretty large merge window, so despite there technically being time for one more day of pulls, I actually do have enough changes already. So there.” In his announcement, he says things look smooth so far, but calls those also “Famous last words”. Finally, he calls out the “odd” diffstat which is dominated by the AMD Vega10 headers. As was noted in the pull requests, unlike certain other graphics companies, AMD actually provides nice automatically generated headers and other information about their graphics chipsets, which is why the Vega10 update is plentiful.

Later in the day yesterday, following the 4.12-rc1 announcement, Guenter Roeck posted “watchdog updates for v4.12”, and Jon Mason posted “NTB bug fixes for v4.12”, along with an apology for tardiness.

Linux 4.11

Linus Torvalds announced Linux 4.11, noting that the extra week due to a (rare-ish) “rc8” (Release Candidate 8) meant that he felt “much happier releasing a final 4.11 now”. As usual, Linux Kernel Newbies has a writeup of 4.11, here: https://kernelnewbies.org/Linux_4.11

Announcements

Greg K-H (Kroah-Hartman) announced Linux 4.4.68, 4.9.28, 4.10.16, and 4.11.1. He later sent “Bad signatures on recent stable updates”, in which he noted that “The stable kernels I just released have had bad signatures due to a mixup using pixz in the new kernel.org backend. It will be fixed soon…”; the signatures were later corrected. He would like to hear from anyone still seeing problems.

Greg also announced (separately) Linux 3.18.52, while Jiri Slaby announced Linux 3.12.74.

Stephen Hemminger announced iproute2 4.11 matching the new kernel release.

Michael Kerrisk announced man-pages-4.11.

Steven Rostedt announced trace-cmd 2.6.1.

Steven also announced Linux 4.4.66-rt79, 3.18.51-rt57, and 3.12.73-rt98 (preempt-rt) kernels.

Con Kolivas posted an updated version of his (renamed) “MuQSS CPU scheduler” [renamed from the BFS – Brain F*** Scheduler] in Linux 4.11-ck1.

Karel Zak announced util-linux v2.30-rc1, which includes a fix to libblkid that “has been fixed to extract LABEL= and UUID= from UDF rather than ISO9660 header on hybrid CDROM/DVD media. This change[] makes UDF media on Linux user-space more compatible with another operation systems.” but he calls it out since it could also introduce regressions for some other users.

Junio C Hamano announced Git version 2.13.0. Separately, he released maintenance versions of “Git v2.12.3 and others”, which include fixes for “a recently disclosed problem with “git shell”, which may allow a user who comes over SSH to run an interactive pager by causing it to spawn “git upload-pack –help” (CVE-2017-8386).”

Jan Kiszka announced version 0.7 of the Jailhouse hypervisor, which includes various debug and regular console driver updates and gcov debug statistics.

Bartosz Golaszewski announced libgpiod v0.2: “The most prominent new feature is the test suite working together with the gpio-mockup module”.

Christoph Hellwig notes that the Open OSD [an in-kernel OSD – Object-Based Storage Device] SCSI initiator library for Linux seems to be dead. He does this by posting a patch to the MAINTAINERS file “update OSD entries” in which he removes the (now defunct) open-osd.org website, and the bouncing email address for Benny Halevy. Benny appeared and ACKed.

In a similar vein, Ben Hutchings pondered aloud about the “Future of liblockdep”, which apparently “hasn’t been buildable since (I think) Linux 4.6”. Sasha Levin said things would be cleaned up promptly. And they were, with a pull request soon following with fixes for Linux 4.12.

Masahiro Yamada posted an RFC patch entitled “Increase Minimal GNU Make version for Linux Kernel from 3.80 to 3.81” in which he essentially noted that the kernel hadn’t actually worked with 3.80 (which is 15 years old!) in a very long time, but instead actually really needs 3.81 (which was itself released in 2006). It was apparently “broken” 3 years ago, but nobody noticed. Neither Greg K-H (Kroah-Hartman) nor Linus seemed to lose any sleep over this, with Linus saying “you make a strong case of “it hasn’t worked for a while already and nobody even noticed””.

Paolo Bonzini posted “CFP: KVM Forum 2017” announcing that the KVM Forum will be held October 25-27 at the Hilton in Prague, CZ, and that all submissions for proposed topics must be made by midnight June 15.

Thomas Gleixner announced “[CFP] RT-Summit – Call for Presentations” noting that the Real-Time Summit 2017 is being organized by the Linux Foundation Real-Time Linux (RTL) collaborative project in cooperation with OSADL/RTLWS and will be held also in Prague on October 21st. The cutoff for submissions is July 14th via rt-cfp@linutronix.de.

4.12 Merge Window

In his 4.11 announcement, Linus reminded us that the release of 4.11 meant that “the merge window [for kernel 4.12] is obviously open. I already have two pull request[s] for 4.12 in my inbox, I expect that overnight I’ll get a lot more.” He wasn’t disappointed. The flood gates well and truly opened. And they continued going for the entire two week (less one day) period. Let’s dive into what has been posted so far for 4.12 during the (now closed) merge window.

Stephen Rothwell [linux-next pre-merge development kernel tree maintainer] noted in a heads-up that Linus was going to see a “Large new drm driver” [drm – Direct Rendering Manager, not the “digital rights” technology]. Dave Airlie (the drm maintainer) had a reply, but Stephen said everything was just fine and he was simply seeking to avoid surprising Linus (again). Once the pull came in, and Linus had pulled it, he quickly followed up to note that he was getting a lot of warnings about “Atomic update on pipe (A) took”. Daniel Vetter followed up to say that “We [Intel] did improve evasion a lot to the point that it didn’t show up in our CI machines anymore, so we felt we could risk enabling this everywhere. But of course it pops up all over the place as soon as drm-next hits mainline”.

4.12 git Pulls for existing subsystems

Hans-Christian Noren Egtvedt posted “AVR32 change for 4.12 – architecture removal” in which he removes AVR32 and “clean away the most obvious architecture related parts”. He posted followups to pick off more leftovers.

Ingo Molnar posted “RCU changes for 4.12” which includes “Parallelize SRCU callback handling”, performance improvements, documentation updates, and various other fixes. Linus pulled it. But then, “after looking at it, ended up un-pulling it again”. He posted a rant about a new header file (linux/rcu_segcblist.h) which was a “header file from hell”, saying “I see absolutely no point in taking a header file of several hundred lines of code”, along with more venting about the use of too much inline code (code that is always expanded in-place rather than called as a function – sometimes leading to a larger footprint). Finally, Linus said “The RCU code needs to start showing some good taste”. Sir Paul McKenney, the one and only author of RCU, followed up swiftly, apologizing for the transgression in attempting to model “the various *list*.h header files” and proposing a fix, which Linus liked. Ingo Molnar implemented the suggestions in “srcu: Debloat the <linux/rcu_segcblist.h> header”, against which Paul provided a minor fix for the case of !SMP (non-multi-processor kernel) builds.

Ingo Molnar also posted “EFI changes for 4.12” including fixes to the BGRT ACPI table (used for boottime graphics information) to allow it to be shared between x86 and ARM systems, an update to the arm64 boot protocol, improvements to the EFI stub’s command line parsing, and support for randomizing the virtual mapping of UEFI runtime services on arm64. The latter means that the function pointers for UEFI Runtime Services callbacks will be placed into random virtual address locations during the call to ExitBootServices which sets up the new mappings – it’s a good way to look for problems with platforms containing broken firmware that doesn’t correctly handle the change in location of runtime service calls.

Ingo Molnar also posted “x86/process changes for 4.12” which includes a new ARCH_[GET|SET]_CPUID prctl (process control) ABI extension that a running process can use to determine whether it has access to call the CPUID instruction directly. This is to support a userspace debugger known as “rr” that would like to trap and emulate calls to “CPUID”, which are otherwise normally unprivileged on x86 systems.

Separately, Ingo posted “x86 fixes”, which includes “mostly misc fixes” for such things as “two boot crash fixes”, etc.

Ingo Molnar also posted “perf changes for 4.12” which includes updates to kprobes and uprobes, making their trampolines (the codepaths jumped through when executing the probe sequence) read-only while they are used, changing UPROBES_EVENTS to be default yes in the Kconfig (since distros do this), and various other fixes. He also includes support for AMD IOMMU events, and new events for Intel Goldmont CPUs. The perf tooling itself gets many more updates, including PERF_RECORD_NAMESPACES, which allows the kernel to record information “required to associate samples to namespaces”.

Separately, Ingo posted “perf fixes”, which includes “mostly tooling updates”.

Ingo Molnar also posted “RAS changes for v4.12” which includes a “Correctable Errors Collector” kernel feature that will gather statistics about correctable errors affecting physical memory pages. Once a certain watermark is reached, pages generating many correctable errors will be permanently offlined [this is useful both for DDR and NV-DIMMs]. Finally, he deprecates the existing /dev/mcelog driver and includes cleanups for MCE (Machine Check Exception) errors during kexec on x86 (which we covered in previous editions of this podcast).

Ingo Molnar also posted “x86/asm changes for v4.12”, which includes various fixes, among which are cleanups to stack trace unwinding.

Ingo Molnar also posted “x86/cpu changes for v4.12”, which includes support for “an extension of the Intel RDT code to extend it with Intel Memory Bandwidth Allocation CPU support: MBA allows bandwidth allocation between cores, while CBM (already upstream) allows CPU cache partitioning”. Effectively, Intel incorporates changes to their memory controller’s hardware scheduling algorithms as part of RDT. These allow the DDR interface to manage bandwidth for specific cores, which will almost certainly include both explicit data operations, as well as separate algorithms for prefetching and speculative fetching of instructions and data. [This author has spent many hours reading about memory controller scheduling over the past year]

Ingo Molnar also posted “x86/debug changes for v4.12”, which includes support for the USB3 “debug port” based early console. As we have mentioned previously, USB3 includes a built-in “debug port” which no longer requires a special dongle to connect a remote machine for debug. It’s common in Windows kernel development to use a debug port, and since USB3 includes baseline support without the need for additional hardware, serial over USB3 is likely to become more common when developing for Linux – especially with the demise of DB9 headers on systems or even IDC10 headers on motherboards internally (to say nothing of laptop systems). As a reminder, with debug ports, usually only one USB port will support debug mode. I guess my old USB debug port dongle can go in the pile of obsolete gear.

Ingo Molnar also posted “x86/platform changes for v4.12” which includes “continued SGI UV4 hardware-enablement changes, plus there’s also new Bluetooth support for the Intel Edison [a low cost IoT board] platform”.

Ingo Molnar also posted “x86/vdso changes for v4.12” which includes support for a “hyper-V TSC page” which is what it sounds like – a special shared page made available to guests under Microsoft’s Hyper-V hypervisor and providing a fast means to enumerate the current time. This is plumbed into the kernel’s vDSO mechanism (Virtual Dynamic Shared Objects look a bit like software libraries that are automatically linked against every running program when it launches) to allow fast clock reading.

Ingo Molnar also posted “x86/mm changes for v4.12”, which includes yet more work toward Intel 5-level paging among many other updates.

Separately Ingo posted a single “core kernel fix” to “increase stackprotector canary randomness on 64-bit kernels with very little cost”.

Thomas Gleixner posted “irq updates for 4.12”, which include a new driver for a MediaTek SoC, ACPI support for ITS (Interrupt Translation Services) when using a GICv3 on ARM systems, support for shared nested interrupts, and “the usual pile of fixes and updates all over t[h]e place”.

Thomas Gleixner also posted “timer updates for 4.12” that include more reworking of year 2038 support (the infamous wrap of the Unix epoch), a “massive rework of the arm architected timer”, and various other work.

Separately, Ingo Molnar followed up with “timer fix” including “A single ARM Juno clocksource driver fix”.

Corey Minyard posted “4.12 for IPMI” including a watchdog fix. He “switched over to github at Stephen Rothwell’s [linux-next maintainer] request”.

Jonathan Corbet posted “Docs for 4.12” which includes “a new guide for user-space API documents” along with many other updates. Anil Nair noted “Missing File REPORTING-BUGS in Linux Kernel” which suggests that the Debian kernel package tools need to be taught about the recent changes in the kernel’s documentation layout. Separately, Jonathan replied to a thread entitled “Find more sane first words we have to say about Linux” noting that the kernel’s documentation files might not be the first place that someone completely new to Linux is going to go looking for information: “So I don’t doubt we could put something better there, but can we think for a moment about who the audience is here? If you’re “completely new to Linux”, will you really start by jumping into the kernel source tree?” The guy should do kernel standup in addition to LWN. It’d be hilarious.

Later, Jon posted “A few small documentation updates” which “Connect the newly RST-formatted documentation to the rest; this had to wait until the input pull was done. There’s also a few small fixes that wandered in”.

Tejun Heo posted “libata changes for 4.12-rc1” which includes “removal of SCT WRITE SAME support, which never worked properly”. SCT stands for “SMART [Self Monitoring And Reporting Technology – an error management mechanism common in contemporary disks] Command Transport”. The “write same” part means setting the drive content to a specific pattern (e.g. to zero it out) in cases where TRIM is not available. One wonders if that is also a feature used during destruction, though apparently the only (NSA) trusted way to destroy disks today is shredding and burning after zeroing.

Tejun Heo also posted “workqueue changes for v4.12-rc1”, which includes “One trivial patch to use setup_deferrable_timer() instead of open-coding the initialization”.

Tejun Heo also posted “cgroup changes for v4.12-rc1”, which includes a “second stab at fixing the long-standing race condition in the mount path and suppression of spurious warning from cgroup_get”.

Rafael J. Wysocki posted “Power management updates for v4.12-rc1, part 1” which includes many updates to the cpufreq subsystem and “to the intel_pstate driver in particular”. Its sysfs interface has apparently also been reworked to be more consistent with general expectations. He adds “Apart from that, the AnalyzeSuspend utility for system suspend profiling gets a companion called AnalyzeBoot for the analogous profiling of system boot and they both go into one place”.

Separately, he posted “Power management updates for v4.12-rc1, part 2”, which “add new CPU IDs [Intel Gemini Lake] to a couple of drivers [intel_idle and intel_rapl – Running Average Power Limit], fix a possible NULL pointer dereference in the cpuidle core, update DT [DeviceTree]-related things in the generic power domains framework and finally update the suspend/resume infrastructure to improve the handling of wakeups from suspend-to-idle”.

Rafael J. Wysocki also posted “ACPI updates for v4.12-rc1, part 1”, which includes a new Operation Region driver for the Intel CHT [Cherry Trail] Whiskey Cove PMIC [Power Management Integrated Circuit], and new sysfs entries for CPPC [Collaborative Processor Performance Control], which is a much more fine grained means for OS and firmware to coordinate on power management and CPU frequency/performance state transitions.

Separately, he posted “ACPI updates for v4.12-rc1, part 2”, which “update the ACPICA [ACPI – Advanced Configuration and Power Interface – Component Architecture, the cross-Operating System reference code]” to “add a few minor fixes and improvements”, and also “update ACPI SoC drivers with new device IDs, platform-related information and similar, fix the register information in xpower PMIC [Power Management IC] driver, introduce a concept of “always present” devices to the ACPI device enumeration code and use it to fix a problem with one platform [INT0002, Intel Cherry Trail], and fix a system resume issue related to power resources”.

Separately, Benjamin Tissoires posted a patch reverting some ACPI laptop lid logic that had been introduced in Linux 4.10 but was preventing laptops from booting with the lid closed (a feature folks especially in QE use).

Rafael J. Wysocki also posted “Generic device properties framework updates for v4.12-rc1”, which includes various updates to the ACPI _DSD [Device Properties] method call to recognize “ports and endpoints”.

Shaohua Li posted “MD update for 4.12” which includes support for the “Partial Parity Log” feature present on the Intel IMSM RAID array, and a rewrite of the underlying MD bio (the basic storage IO concept used in Linux) handling. He notes “Now MD doesn’t directly access bio bvec, bi_phys_segments and uses modern bio API for bio split”.

Ulf Hansson posted “MMC for v[.]4.12” which includes many driver updates as well as refactoring of the code to “prepare for eMMC CMDQ and blkmq”. This is the planned transition to blkmq (block-multiqueue) for such storage devices. Previously it had stalled due to the performance hit when trying to use a multi-queue approach on legacy and contemporary non-mq devices.

Linus Walleij posted “pin control bulk changes for v4.12” in which he notes that “The extra week before the merge window actually resulted in some of the type of fixes that usually arrive after the merge window already starting to trickle in from eager developers using -next, I’m impressed”. He’s also impressed with the new “Samsung subsystem maintainer” (Krzysztof). Of the many changes, he says “The most pleasing to see is Julia Cartwright[‘]s work to audit the irqchip-providing drivers for realtime locking compliance. It’s one of those “I should really get around to looking into that” things that have been on my TODO list since forever”.

Linus Walleij also posted “Bulk GPIO changes for v4.12”, which has “Nothing really exciting goes on here this time, the most exciting for me is the same as for pin control: realtime is advancing thanks [t]o Julia Cartwright”.

Petr Mladek posted “printk for 4.12” which includes a fix for the “situation when early console is not deregistered because the preferred one matches a wrong entry. It caused messages to appear twice”.

Jiri Kosina posted “HID for 4.12” which includes various fixes, amongst them an inversion of the HID_QUIRK_NO_INIT_REPORTS quirk, since it is apparently easier to whitelist working devices.

Jiri Kosina also posted “livepatching for 4.12” which includes a new “per-task consistency model” that is “being added for architectures that support reliable stack dumping”, which apparently “extends the nature of the types of patches that can be applied by live patching”.

Lee Jones posted “Backlight for 4.12” which includes various fixes.

Lee Jones also posted “MFD for v4.12” which includes some new drivers, new device support, and various new functionality and fixes.

Juergen Gross posted “xen: fixes and features for 4.12” which includes support for building the kernel with Xen enabled but without enabling paravirtualization, a new 9pfs xen frontend driver(!), and support for EFI “reset_system” (needed for an ARMv8 Dom0 host to reboot), among various other fixes and cleanups.

Alex Williamson posted “VFIO updates for v4.12-rc1”.

Joerg Roedel posted “IOMMU Updates for Linux v4.12”, which includes some code optimizations for the Intel VT-d driver, code to “switch off a previously enabled Intel IOMMU” (presumably in order to place it into bypass mode for performance or other reasons?), and “ACPI/IORT updates and fixes” (which enable full support for the ACPI IORT on 64-bit ARM).

Dmitry Torokhov posted “Input updates for v.4.11-rc0” which includes a documentation conversion to ReST (RST, the new kernel doc format), an update to the venerable Synaptics “PS/2” driver to be aware of companion “SMBus” devices, and various other miscellaneous fixes.

Darren Hart posted “platform-drivers-x86 for 4.12-1” which includes “a significantly larger and more complex set of changes than those of prior merge windows”. These include “several changes with dependencies on other subsystems which we felt were best managed through merges of immutable branches”.

James Bottomley posted “first round of SCSI updates for the 4.11+ merge window”, which includes many driver updates, but also comes with a warning to Linus that “The major thing you should be aware of is that there’s a clash between a char dev change in the char-misc tree (adding the new cdev_device_add)” and the change that makes checking the return value of scsi_device_get() mandatory. Linus and Greg would later clarify what cdev_device_add does in response to Greg’s request to pull “Char/Misc driver patches for 4.12-rc1”.

David Miller posted “Networking” which includes many fixes.

David also posted “Sparc”, which includes a “bug fix for handling exceptions during bzero on some sparc64 cpus”.

David also posted “IDE”, which includes “two small cleanups”.

Greg K-H (Kroah-Hartman) posted “USB driver patches for 4.12-rc1”, which includes “Lots of good stuff here, after many many many attempts, the kernel finally has a working typeC interface, many thanks to Heikki and Guenter and others who have taken the time to get this merged. It wasn’t an easy path for them at all.” It will be interesting to test that out!

Greg K-H also posted “Driver core patches for 4.12-rc1”, which is “very tiny” this time around and consists mostly of documentation fixes, etc.

Greg K-H also posted “Char/Misc driver patches for 4.12-rc1” which features “lots of new drivers” including Google firmware drivers, FPGA drivers, etc. This led to a reaction from Linus about how the tree conflicted with James Bottomley’s tree (which he had already pulled, “as per James’ suggestion”), a back and forth between James and Greg about how to better handle such a conflict next time, and Linus noting that he prefers to fix merge conflicts himself but would “*also* really really prefer the two sides of the conflict having been more aware of the clash” and providing him with a heads-up in the pull.

Greg K-H also posted “Staging/IIO driver fixes for 4.12-rc1”, which adds “about 350k new lines of crap^Wcode, mostly all in a big dump of media drivers from Intel”. He notes that the Android low memory killer driver has finally been deleted “much to the celebration of the -mm developers”.

Greg K-H also posted “TTY patches for 4.12-rc1”, which wasn’t big.

Dan Williams posted “libnvdimm for 4.12” which includes “Region media error reporting [a generic interface more friendly to use with multiple namespaces]”, a new “struct dax_device” to allow drivers to have their own custom direct access operations, and various other updates. Dan also posted “libnvdimm: band aid btt vs clear poison locking”, a patch which “continues the 4.11 status quo of disabling of error clearing from the BTT [Block Translation Table] I/O path” and notes that “A solution for tracking and handling media errors natively in the BTT is needed”. The BTT or Block Translation Table is a mechanism used by NV-DIMMs to handle “torn sectors” (partially complete writes) in hardware during error or power failure. As the “btt.txt” in the kernel documentation notes, NV-DIMMs do not have the same atomicity guarantees as regular flash drives do. Flash drives have internal logic and store enough energy in capacitors to complete outstanding writes during a power failure (rotational drives have similar for flushing their memory based caches and syncing remap block state), but NV-DIMMs are designed differently. Thus the BTT provides a level of indirection that is used to provide for atomic sector semantics.

Separately, Dan posted “libnvdimm fixes for 4.12-rc1” which includes “incremental fixes and a small feature addition relative to the libnvdimm 4.12 pull request”. Geert had “noticed that tinyconfig was bloated by BLOCK selecting DAX [Direct Access]”, while “Vishal adds a feature that missed the initial pull due to pending review feedback. It allows the kernel to clear media errors when initializing a BTT (atomic sector update driver) instance on a pmem namespace”.

Dave Airlie posted “drm tegra for 4.12-rc1” containing additional updates because he missed a pull from Thierry Reding for NVidia Tegra patches. He also followed up with a “drm document code of conduct” patch that describes a code of conduct for graphics written by freedesktop.org.

Stafford Horne posted “Initramfs fix for 4.12-rc1” containing a fix “for an issue that has caused 4.11 to not boot on OpenRISC”.

Catalin Marinas posted “arm64 updates for 4.12” including kdump support, “ARMv8.3 HWCAP bits for JavaScript conversion instructions, complex numbers and weaker release consistency [memory ordering]”, and support for platform MSI (for non-enumerated buses) when using ACPI, among other patches. He also removes support for ASID-tagged VIVT [Virtually Indexed, Virtually Tagged] caches since “no ARMv8 implementation using it and deprecated in the architecture” [caches are PIPT – Physically Indexed, Physically Tagged – except that an implementation might do VIPT or otherwise internally using various CAM optimizations].

Catalin later posted “arm64 2nd set of updates for 4.12”, which include “Silence module allocation failures when CONFIG_ARM*_MODULE_PLTS is enabled”.

Olof Johansson posted “ARM: SoC contents for 4.12 merge window”. In his pull request, Olof notes that “It’s been a relatively quiet release cycle here. Patch count is about the usual (818 commits, which includes merges).” He goes on to add, “Besides dts [DeviceTree files], the mach-gemini cleanup by Linus Walleij is the only platform that pops up on its own”. He called out the separate post for the TEE [Trusted Execution Environment] subsystem. Olof also removed Alexandre Courbot and Stephen Warren from NVidia Tegra maintainership, and added Jon Hunter in their place.

Rob Herring posted “DeviceTree for 4.12”, which includes updates to the Device Tree Compiler (dtc), and more DeviceTree overlay unit tests, among various other changes.

Darrick J. Wong posted “xfs: updates for 4.12”, which includes the “big new feature for this release” of a “new space mapping ioctl that we’ve been discussing since LSF2016 [Linux Storage and Filesystem conference]”.

Max Filippov posted “Xtensa improvements for 4.12”.

Ted Ts’o posted “ext4 updates for 4.12”, which adds “GETFSMAP support” (discussed previously in this podcast) among other new features.

Ted also posted “fscrypt updates for 4.12” which has “only bug fixes”.

Paul Moore posted “Audit patches for 4.12” which includes 14 patches that “span the full range of fixes, new features, and internal cleanups”. These include a move to 64-bit timestamps, converting refcounts to the new refcount_t type from atomic_t, and so on.

Wolfram Sang posted “i2c for 4.12”.

Mark Brown posted “regulator updates for 4.12”, which includes “Quite a lot going on with the regulator API for this release, much more in the core than in the drivers for a change”. This includes “Fixes for voltage change propagation through dumb power switches, a notification when regulators are enabled, a new settling time property for regulators where the time taken to move to a new voltage is not related to the size of the change”, etc.

Mark also posted “SPI updates for 4.12”, which includes “quite a lot of small driver specific fixes and enhancements”.

Jessica Yu posted “module updates for 4.12”, containing minor fixes.

Mauro Carvalho Chehab posted “media updates” including mostly driver updates and the removal of “two staging LIRC drivers for obscure hardware”. He also posted a 5-part patch series entitled “Convert more books to ReST”, which converted three kernel DocBook format documentation file sets to RST, the new format being used for kernel documentation (on the kernel-doc mailing list, and maintained by Jonathan Corbet of LWN): librs, mtdnand, and sh. He noted that, after this series, there will be just one DocBook pending conversion: lsm (Linux Security Modules). He also notes that the existing LSM documentation is very out of date and no longer describes the current API.

Michael Ellerman posted “Please pull powerpc/linux.git powerpc-4.12-1 tag”, which includes support for “Larger virtual address space on 64-bit server CPUs. By default we use a 128TB virtual address space, but a process can request access to the full 512TB by passing a hint to mmap() [this seems very similar to the 56-bit la57 feature from Intel]”. It also includes “TLB flushing optimisations for the radix MMU on Power9” and “Support for CAPI cards on Power9, using the “Coherent Accelerator Interface Architecture 2.0″ [which definitely sounds like juicy reading]”.

Separately Michael Ellerman posted “Please pull powerpc-linux.git powerpc-4.12-2 tag” which includes “rework the Linux page table geometry to lower memory usage on 64-bit Book3S (IBM chips) using the Hash MMU [IBM uses a special inverse page tables “reverse lookup” hashing format]”.

Eric W. Biederman posted “namespace related changes for v4.12-rc1”, which includes a “set of small fixes that were mostly stumbled over during more significant development. This proc fix and the fix to posix-timers are the most significant of the lot. There is a lot of good development going on but unfortunately it didn’t quite make the merge window”.

Takashi Iwai posted “sound updates for 4.12-rc1”, noting that it was “a relatively calm development cycle, no scaring changes are seen”.

Steven Rostedt posted “tracing: Updates for v4.12” which includes “Pretty much a full rewrite of the process of function probes”. He followed up with “Three more updates for 4.12” that contained “three simple changes”.

Martin Schwidefsky posted “s390 patches for 4.12 merge window” which includes improvements to VFIO support on mainframe(!) [this author was recently amazed to see there are also DPDK ports for s390x], a new true random number generator, perf counters for the new z13 CPU, and many others besides.

Geert Uytterhoeven posted “m68k updates for 4.12” with a couple fixes.

Jacek Anaszewski posted “LED updates for 4.12” with various fixes.

Kees Cook posted “usercopy updates for v4.12-rc1” with a couple fixes.

Kees also posted “pstore updates for v4.12-rc1”, which included “large internal refactoring along with several smaller fixes”.

James Morris posted “Security subsystem updates for v4.12”.

Sebastian Reichel posted “hsi changes for hsi-4.12”.

Sebastian also posted “power-supply changes for 4.12”, which includes a couple of new drivers and various fixes.

Separately, Sebastian posted “power-supply changes for 4.12 (part 2)”, which includes some new drivers and some fixes.

Paolo Bonzini posted “First batch of KVM changes for 4.12 merge window” which includes kexec/kdump support on 32-bit ARM, support for a userspace virtual interrupt controller to handle the “weird” Raspberry Pi 3, in-kernel acceleration for VFIO on POWER, nested EPT support for accessed and dirty bits on x86, and many other fixes and improvements besides.

Separately Paolo posted “Second round of KVM changes for 4.12”, which include various ARM (32 and 64-bit) cleanups, support for PPC [POWER] XIVE (eXternal Interrupt Virtualization Engine), and “x86: nVMX improvements, including emulated page modification logging (PML) which brings nice performance improvements [under nested virtualization] on some workloads”.

Ilya Dryomov posted “Ceph updates for 4.12-rc1”, which include “support for disabling automatic rbd [RADOS block device] exclusive lock transfers” and “the long awaited -ENOSPC [no space] handling series”. The latter finally handles out of space situations by aborting with -ENOSPC rather than “having them [writers] block indefinitely”.

Miklos Szeredi posted “fuse updates for 4.12”, which “contains support for pid namespaces from Seth and refcount_t work from Elena”.

Miklos also posted “overlayfs update for 4.12”, which includes “making st_dev/st_ino on the overlay behave like a normal filesystem”. “Currently this only works if all layers are on the same filesystem, but future work will move the general case towards more sane behavior”.

Bjorn Helgaas posted “PCI changes for v4.12” which includes a framework for supporting PCIe devices in Endpoint mode from Kishon Vijay Abraham, fixes for using non-posted PCI config space on ARM from Lorenzo Pieralisi, allowing slots below PCI-to-PCIe “reverse bridges”, a bunch of quirks, and many other fixes and enhancements.

Jaegeuk Kim posted “f2fs for 4.12-rc1”, which “focused on enhancing performance with regards to block allocation, GC [Garbage Collection], and discard/in-place-update IO controls”.

Shuah Khan posted “Kselftest update for 4.12-rc1” with a few fixes.

Richard Weinberger posted “UML changes for v4.12-rc1” which includes “No new stuff, just fixes” to the “User Mode Linux” architecture in 4.12. Separately, Masami Hiramatsu posted an RFC patch entitled “Output messages to stderr and support quiet option” intended to “fix[] some boot time printf output to stderr by adding os_info() and os_warn(). The information-level messages via os_info() are suppressed when “quiet” kernel option is specified”.

Richard also posted “UBI/UBIFS updates for 4.12-rc1”, which “contains updates for both UBI and UBIFS”. It has a new CONFIG_UBIFS_FS_SECURITY option, among “minor improvements” and “random fixes”.

Thierry Reding posted “pwm: Changes for v4.12-rc1”, which amongst other things includes “a new driver for the PWM controller found on MediaTek SoCs”.

Vinod Koul posted “dmaengine updates” which includes “a smaller update consisting of support for TI DA8xx dma controller” among others.

Chris Mason posted “Btrfs” which “Has fixes and cleanups” as well as “The biggest functional fixes [being] between btrfs raid5/6 and scrub”.

Trond Myklebust posted “Please pull NFS client fixes for 4.12”, which includes various fixes, and new features (such as “Remove the v3-only data server limitation on pNFS/flexfiles”).

J. Bruce Fields posted “nfsd changes for 4.12”, which includes various RDMA updates from Chuck Lever.

Stephen Boyd posted “clk changes for v4.12”. Of the changes, the “biggest things are the TI clk driver rework to lay the groundwork for clkctrl support in the next merge window and the AmLogic audio/graphics clk support”.

Alexandre Belloni posted “RTC [Real Time Clock] for 4.12”, which uses a new GPG subkey that he also let Linus know about at the same time.

Nicholas A. Bellinger posted “target updates for v4.12-rc1”, which was “a lot more calm than previously expected. It’s primarily fixes in various areas, with most of the new functionality centering around TCMU [TCM – Linux iSCSI Target Support in Userspace] backend work [that] Xiubo Li has been driving”.

Zhang Rui posted “Thermal management updates for v4.12-rc1”, which includes a number of fixes, as well as some new drivers, and a new interface in “thermal devfreq_cooling code so that the driver can provide more precise data regarding actual power to the thermal governor every time the power budget is calculated”.

4.12 git pulls for new subsystems and features

David Howells posted “Hardware module parameter annotation for secure boot” in which he requested that Linus pull in new “kmod” macros (the same name is used for the userspace module tooling, but in this case refers to the in-kernel kernel module infrastructure of the same name). The new macros add annotations to “module_param” of the new form “module_param_hw” with a “hwtype” such as “ioport” or “iomem”, and so forth. These are used by the kernel to prevent those parameters from being used under a UEFI Secure Boot situation in which the kernel is “locked down” (to prevent someone from loading a signed kernel image and then compromising it to circumvent the secure boot mechanism).

Arnd Bergmann sent a special pull request to Linus Torvalds for “TEE driver infrastructure and OP-TEE drivers”, which “introduces a generic TEE [Trusted Execution Environment] framework in the kernel, to handle trusted environ[ments] (security coprocessor or software implementations such as OP-TEE/TrustZone)”. He sent the pull separately from the other arm-soc pull specifically to call it out, and to make sure everyone knew that this was finally headed upstream, but he noted it would probably be maintained through the arm-soc kernel tree. He included a lengthy defense of why now was the right time to merge TEE support into upstream Linux.

Saving TLB flushes on Intel x86 Architecture

Andy Lutomirski posted an RFC patch series entitled “x86 TLB flush cleanups, moving toward PCID support”. Modern (non-legacy) architectures implement a per-process context identifier that can be used to tag VMA (Virtual Memory Area) translations that end up in the TLB (Translation Lookaside Buffer) caches within the microprocessor core. The processor’s hardware (page table) “walkers” navigate the page tables for a process and populate the TLBs (except on certain mostly embedded architectures, such as some PowerPC and MIPS processors, in which the kernel contains special assembly routines to perform this in software). On legacy architectures, the TLB is fairly simple, containing plain virtual-to-physical (or intermediate, in the case of virtualization) address mappings. But on more sophisticated architectures, the TLB includes address space identification information that allows it to distinguish between hits to the same virtual address from two different processes (known as tasks within the kernel). Using additional tagging in the TLB avoids the traditional need to invalidate the entire TLB on process context switch.

Modern architectures, such as AArch64, have implemented context tagging support in their architecture code for some time, and now x86 is finally set to follow, enabling a feature that has actually been present in x86 for some time (but was not wired up), thanks to Andy’s work on PCID (Process Context IDentifier) support. In his patch series, Andy notes that as he has been “polishing [his] PCID code, a major problem [he’s] encountered is that there are too many x86 TLB flushing code paths and that they have too many inconsequential differences”. This patch series aims to “clean up the mess”. Now if x86 finally gains hardware broadcast TLB invalidations it will also be able to remove the wasted IPIs (Inter-Processor-Interrupts) that it implements to cause remote processors to invalidate TLB entries, too. Linus liked Andy’s initial work, but said he is “always a bit nervous about TLB changes like this just because any potential bugs tend to be really really hard to see and catch”. Those of us who have debugged nasty TLB issues on other architectures would be inclined to agree with him.
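The benefit of tagging can be seen in a toy model (purely illustrative; real PCID behavior lives in hardware and the names here are invented): a tagged TLB keyed by (context, virtual page) keeps entries alive across context switches, while a legacy untagged TLB must flush everything each time.

```python
class TaggedTLB:
    """TLB whose entries are keyed by (context_id, virtual_page)."""
    def __init__(self):
        self.entries = {}

    def fill(self, ctx, vpage, ppage):
        self.entries[(ctx, vpage)] = ppage

    def lookup(self, ctx, vpage):
        return self.entries.get((ctx, vpage))  # None == TLB miss

    def context_switch(self, new_ctx):
        pass  # tagged entries survive; no flush needed


class UntaggedTLB:
    """Legacy TLB keyed only by virtual page: must flush on every switch."""
    def __init__(self):
        self.entries = {}

    def fill(self, vpage, ppage):
        self.entries[vpage] = ppage

    def lookup(self, vpage):
        return self.entries.get(vpage)

    def context_switch(self):
        self.entries.clear()  # same vaddr may map differently now


tagged = TaggedTLB()
tagged.fill(ctx=1, vpage=0x400, ppage=0x9000)
tagged.context_switch(2)           # switch to another process...
tagged.context_switch(1)           # ...and back
print(tagged.lookup(1, 0x400))     # entry survived both switches

untagged = UntaggedTLB()
untagged.fill(0x400, 0x9000)
untagged.context_switch()
print(untagged.lookup(0x400))      # None: lost to the flush
```

The model also hints at why the real code is subtle: a tagged entry that outlives a context switch is only safe if the kernel invalidates it when that context’s page tables actually change, which is exactly the kind of flush path Andy’s series is cleaning up.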

Ongoing Development

Laurent Dufour posted version 3 of a patch series entitled “Speculative page faults”. This is a contemporary development inspired by Peter Zijlstra’s earlier work, which was based upon ideas of still others. The whole concept dates back to at least 2009 and generally involves removing the traditional locking constraints on updates to the VMAs (Virtual Memory Areas) used by Linux tasks (processes) to represent the memory of running programs. Essentially, a “speculative fault” means “not holding mmap_sem” (a semaphore guarding a task’s current memory map). Laurent (and Peter) make VMA lookups lockless, and perform updates speculatively, using a seqlock to detect a change to the underlying VMA during the fault. “Once we’ve obtained the page and are ready to update the PTE, we validate if the state we started the fault with is still valid, if not, we’ll fail the fault with VM_FAULT_RETRY, otherwise we update the PTE and we’re done”. Earlier testing showed very significant performance upside to this work due to the reduced lock contention.
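The seqlock pattern at the heart of this can be sketched in a few lines (a hypothetical simplification, not the kernel’s implementation): read the sequence count, do the work locklessly, and retry if a writer bumped the count in the meantime.

```python
import threading

class SeqCount:
    """Minimal seqlock: writers bump the count twice (odd = in progress)."""
    def __init__(self):
        self.seq = 0
        self._lock = threading.Lock()

    def read_begin(self):
        return self.seq

    def read_retry(self, start):
        return self.seq != start  # changed underneath us => retry

    def write(self, fn):
        with self._lock:
            self.seq += 1   # odd: writer in progress
            fn()
            self.seq += 1   # even again

vma_seq = SeqCount()
vma = {"start": 0x1000, "prot": "rw"}

def speculative_fault(addr):
    while True:
        start = vma_seq.read_begin()
        if start & 1:               # writer mid-update: try again
            continue
        snapshot = dict(vma)        # lockless read of the VMA
        pte = (addr & ~0xFFF, snapshot["prot"])  # "install" a PTE
        if not vma_seq.read_retry(start):
            return pte              # state unchanged: fault succeeds
        # else: the equivalent of VM_FAULT_RETRY -- loop and redo

print(speculative_fault(0x1234))
```

In the uncontended common case the fault completes without ever taking mmap_sem, which is where the reported performance upside comes from.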

Aaron Lu posted “smp: do not send IPI if call_single_queue not empty”. The Linux kernel (and most others) uses a construct known as an IPI – or Inter-Processor-Interrupt – a form of software generated interrupt that a processor will send to one or more others when it needs them to perform some housekeeping work on the kernel’s behalf. Usually, this is to handle such things as a TLB shootdown (invalidating a virtual address translation in a remote processor due to a virtual address space being removed), especially on less sophisticated legacy architectures that do not feature invalidation of TLBs through hardware broadcast, though there are many other uses for IPIs. Aaron’s patch realizes, effectively, that if a remote processor is already going to process a queue of CSD (call_single_data) function calls it has been asked to via IPI then there is no need to send another IPI and generate additional interrupts – the queue will be drained of this entry as well as existing entries by the IPI management code.
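The optimization amounts to sending an IPI only on the empty-to-non-empty transition of the remote queue. A toy single-threaded model (names invented for illustration; the kernel does this with a lockless llist and real interrupts):

```python
from collections import deque

class RemoteCPU:
    def __init__(self):
        self.queue = deque()
        self.ipis_received = 0

    def queue_call(self, fn):
        was_empty = not self.queue
        self.queue.append(fn)
        if was_empty:
            self.send_ipi()   # first entry: remote CPU must be poked

    def send_ipi(self):
        self.ipis_received += 1

    def handle_ipi(self):
        while self.queue:     # drain everything, including late additions
            self.queue.popleft()()

cpu = RemoteCPU()
results = []
for i in range(5):
    cpu.queue_call(lambda i=i: results.append(i))
cpu.handle_ipi()
print(cpu.ipis_received)      # only the first call raised an IPI
print(results)                # but all five calls still ran
```

Because the IPI handler drains the whole queue, any entries queued while an IPI is already pending get processed for free, so the extra interrupts were pure overhead.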

Romain Perier posted version 8 of “Replace PCI pool by DMA pool API” which realizes that the current PCI pool API uses “simple macro functions direct expanded to the appropriate dma pool functions”, so it simply replaces them with a direct use of the corresponding DMA pool API instead.

Sandhya Bankar posted “vfs: Convert file allocation code to use the IDR”. This replaces existing filesystem code that allocates file descriptors using a custom allocator with Matthew (Willy) Wilcox’s idr (ID Radix) tree allocator.
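What an IDR-style allocator provides can be modeled roughly as follows (a userspace sketch with invented names, not the kernel API; the real IDR finds the lowest free ID via a radix tree rather than a linear scan):

```python
class IdAllocator:
    """Toy model of IDR-backed fd allocation: lowest free integer wins."""
    def __init__(self):
        self.table = {}

    def alloc(self, obj):
        fd = 0
        while fd in self.table:   # the real IDR does this in O(log n)
            fd += 1
        self.table[fd] = obj
        return fd

    def remove(self, fd):
        return self.table.pop(fd)

files = IdAllocator()
a = files.alloc("file-a")   # lowest free ID: 0
b = files.alloc("file-b")   # next: 1
files.remove(a)
c = files.alloc("file-c")   # 0 again: freed IDs are reused lowest-first
print(a, b, c)
```

The lowest-free-ID-first behavior matters for file descriptors because POSIX requires open() to return the lowest available descriptor.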

Serge E. Hallyn posted a resend of version 2 of a patch series entitled “Introduce v3 namespaced file capabilities”. We covered this last time.

Heinrich Schuchardt posted “arm64: Always provide “model name” in /proc/cpuinfo”, which was quickly shot down (for the moment).

Christian König posted version 5 of his “Resizeable PCI BAR support” patch series. We have featured this in a previous episode of the podcast.

Prakash Sangappa posted “hugetlbfs ‘noautofill’ mount option” which aims to allow (optionally) for hugetlbfs pseudo-filesystems to be mounted with an option that will not automatically populate holes in files with zeros during a page fault when the file is accessed through the mapped address. This is intended to benefit applications such as Oracle databases, which make heavy use of such mechanisms but don’t take kindly to the kernel having side effects that change on-disk files, even if only to zero-fill them. Dave Hansen pushed back against this change, saying that it was “further specializing hugetlbfs” and that Oracle should be using userfaultfd or “an madvise() option that disallows backing allocations”. Prakash replied that they had considered those, but with a database there are such a large number of single threaded processes that “The concern with using userfaultfd is the overhead of setup and having an additional thread per process”.

Sameer Goel posted “arm64: Add translation functions for /dev/mem read/write” which “Port[s] architecture specific xlate [translate] and unxlate [untranslate] functions for /dev/mem read/write. This sets up the mapping for a valid physical address if a kernel direct mapping is not already present”. Depending upon the ARM platform, access to a bad address in /dev/mem could result in a synchronous exception in the core, or a System Error (SError) generated by a system memory controller interface. In either case, it is handled as a fatal error, where the same is not true on x86. While access to /dev/mem is restricted, increasingly being deprecated, and has other semantics to prevent its use on 64-bit ARM systems, it still exists and is used – in this case, to read the ACPI FPDT table, which provides performance pointer records. Nevertheless, both Will Deacon and Leif Lindholm objected to the reasoning given here, saying that the kernel should instead be taught how to parse this table and expose its information via /sys rather than having userspace tools go poking in /dev/mem to try to read from the table directly.

Minchan Kim posted “vmscan: scan pages until it f[inds] eligible pages” in which he notes that “There are premature OOM [Out Of Memory killer invocations] happening. Although there are ton of free swap and anonymous LRU list of eligible zones, OOM happened. With investigation, skipping page of isolate_lru_pages makes reclaim void because it returns zero nr_taken easily so LRU shrinking is effectively nothing and just increases priority aggressively. Finally, OOM happens”.

Julius Werner posted version 3 of his “Memconsole changes for new coreboot format” which teaches the Google firmware driver for their memconsole to deal with the newer type of persistent ring buffer console they introduced.

Olliver Schinagl and Jamie Iles had a back and forth about the latter’s work on “glue-code” (generic handling code) for the DW (DesignWare) 8250 (a type of serial port interface made popular by PC) IP block as used in many different designs. Depending upon how the block is configured, it can behave differently, and there was some discussion about how to handle that. In particular the location of the UART_USR register.

Xiao Guangrong posted “KVM: MMU: fast write protect” which “introduces a[n] extremely fast way to write protect all the guest memory. Comparing with the ordinary algorithm which write protects last level sptes [the page table entries used by the guest] based on the rmap [the “reverse” map, the means that Linux uses to encode page table information within the kernel] one by one, it just simply updates the generation number to ask all vCPUs to reload its root page table, particularly it can be done out of mmu-lock”. The idea was apparently originally proposed by Avi (Kivity). Paolo Bonzini thought “This is clever” and wondered “how the alternative write protection mechanism would affect performance of the dirty page ring buffer patches”. Xiao thought it could be used to speed up those patches after merging, too [Paolo noted that he aims to merge these early in 4.13 development].
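The generation-number trick is a general invalidation pattern, sketched here as a toy cache (illustrative only, not KVM code): instead of walking every entry to revoke it, bump a single counter and let stale entries be discarded lazily on their next lookup.

```python
class GenCache:
    """Cache where bumping one generation counter invalidates everything."""
    def __init__(self):
        self.generation = 0
        self.cache = {}   # key -> (generation, value)

    def insert(self, key, value):
        self.cache[key] = (self.generation, value)

    def lookup(self, key):
        entry = self.cache.get(key)
        if entry is None:
            return None
        gen, value = entry
        if gen != self.generation:   # stale: drop it, rebuild on demand
            del self.cache[key]
            return None
        return value

    def invalidate_all(self):
        self.generation += 1         # O(1), regardless of cache size

gc = GenCache()
for page in range(1000):
    gc.insert(page, "writable")
gc.invalidate_all()                  # "write protect" all 1000 in one step
print(gc.lookup(42))                 # None: the entry is now stale
```

The win is the same as in the patch: revocation cost no longer scales with the number of mappings, and it can happen without holding the lock that guards individual entries.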

Bogdan Mirea posted version 2 of”Add “Preserve Boot Time Support””, which follows up on a previous discussion about retaining “Boot Time Preservation between Bootloader and Linux Kernel. It is based on the idea that the Bootloader (or any other early firmware) will start the HW Timer and Linux Kernel will count the time starting with the cycles elapsed since timer start”. By “Bootloader” he means “firmware” to those who live in x86-land.

Igor Stoppa posted “post-init-read-only protection for data allocated dynamically” which aims to provide a mechanism for dynamically allocated data which is similar to the “__read_only” special linker section that certain annotated (using special GCC directives) code will be placed into. That works great for read-only data (which is protected by the MMU locking down the corresponding region early in boot). His “wish” is to start with the “policy DB of SE Linux and the LSM Hooks, but eventually I would like to extend the protection also to other subsystems, in a way that can [be] merged into mainline.” His patch includes an analysis of how he feels he can be as “little invasive as possible”, noting that “In most, if not all, the cases that could be enhanced, the code will be calling kmalloc/vmalloc, including GFP_KERNEL [Get Free Pages of Kernel Type Memory] as the desired type of memory”. Consequently, he says, “I suspect/hope that the various maintainer[s] won’t object too much if my changes are limited to replacing GFP_KERNEL with some other macro, for example what I previously called GFP_LOCKABLE”. Michal Hocko had some feedback, largely along the lines that a “master toggle” (that would allow protection to be disabled for small periods in order to make changes to “read only” data) was largely pointless – due to it re-exposing the data. Instead, he wanted to see the protection being done at kmem_cache_create time by adding a “SLAB_SEAL” parameter that would later be enabled on a per kmem_cache basis using “kmem_cache_seal(cache)” or a similar mechanism.
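The seal-after-init idea Michal favored can be modeled abstractly (note that SLAB_SEAL and kmem_cache_seal() were proposals in this discussion, not existing kernel API, and everything below is an invented illustration): objects are writable during setup, then the whole cache is sealed in one irreversible step.

```python
class SealableCache:
    """Toy model of a per-cache seal: write-once-then-read-only data."""
    def __init__(self, name):
        self.name = name
        self.sealed = False

    def alloc(self, data):
        return dict(data)

    def seal(self):
        self.sealed = True   # models the MMU marking the pages read-only

    def write(self, obj, key, value):
        if self.sealed:
            raise PermissionError(self.name + " is sealed")
        obj[key] = value

policy_db = SealableCache("selinux_policy")
rule = policy_db.alloc({"source": "init_t", "action": "allow"})
policy_db.write(rule, "target", "etc_t")   # fine during initialization
policy_db.seal()
try:
    policy_db.write(rule, "action", "deny")  # an attacker's attempt
except PermissionError as e:
    print(e)
```

The design point Michal was making falls out of the model: a reversible “master toggle” would add an unseal() method, and any attacker who can call write() can call unseal() first, which is why sealing per-cache and permanently is the safer shape.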

Bharat Bhushan posted “ARM64/PCI: Allow userspace to mmap PCI resources”, which Lorenzo Pieralisi noted was already implemented by another patch.

A lengthy, and “spirited” discussion took place between Timur Tabi and the various maintainers of the 64-bit ARM Architecture and SoC platform trees over the desire for the maintainers to have changes to “defconfigs” for the architecture go through a special “arm@kernel.org” alias. Except that after they had told Timur to use that, they objected to him posting a patch informing others of this alias in the kernel documentation. Instead, as Timur put it “without a MAINTAINERS entry, how would anyone know to CC: that address? I posted 3 versions of my defconfig patchset before someone told me that I had to send it to arm@kernel.org.” The discussion thread is entitled “MAINTAINERS: add arm@kernel.org as the list for arm64 defconfig changes”.

Xunlei Pang posted version 3 of his “x86/mm/ident_map: Add PUD level 1GB page support” which helps “kernel_ident_mapping_init” to create a single and very large identity page mapping in order to reduce TLB (Translation Lookaside Buffer – the caches that store virtual to physical memory lookups performed by hardware) pressure on an architecture that is currently using many 2MB (PMD – Page Middle Directory) level pages for this process.

Anju T Sudhakar posted version 8 of “IMC Instrumentation Support”, which provides support for POWER9’s “In-Memory-Collection” or IMC infrastructure, which “contains various Performance Monitoring Units (PMUs) at Nest level (these are on-chip but off-core), Core level and Thread level.”

Greg K-H (Kroah-Hartman) posted an RFC patch entitled “add more new kernel pointer filter options” which “implement[s] some new restrictions when printing out kernel pointers, as well as the ability to whitelist kernel pointers where needed.”

Kees Cook posted “x86/refcount: Implement fast refcount overflow protection”, which seeks to upstream a “modified version of the x86 PAX_REFCOUNT defense from PaX/grsecurity. This speeds up the refcount_t API by duplicating the existing atomic_t implementation with a single instruction added to detect if the refcount has wrapped past INT_MAX (or below 0) resulting in a negative value, where the handler then restores the refcount_t to INT_MAX”.
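The saturation semantics can be illustrated with a toy counter (a sketch of the concept, not the kernel implementation, which does this with one added x86 instruction and an exception handler): on overflow the counter pins at INT_MAX instead of wrapping negative, and a saturated object is simply never freed, converting a use-after-free primitive into a bounded memory leak.

```python
INT_MAX = 2**31 - 1

class ProtectedRefcount:
    """Toy refcount that saturates at INT_MAX instead of wrapping."""
    def __init__(self, value=1):
        self.value = value
        self.saturated = False

    def inc(self):
        if self.value >= INT_MAX:
            self.value = INT_MAX   # saturate rather than wrap below 0
            self.saturated = True  # the real handler would also WARN()
        else:
            self.value += 1

    def dec_and_test(self):
        if self.saturated:
            return False           # saturated objects are never freed
        self.value -= 1
        return self.value == 0     # True => last reference, safe to free

r = ProtectedRefcount(INT_MAX)
r.inc()                            # an ordinary int would wrap negative here
print(r.value == INT_MAX, r.saturated)
```

An attacker who forces billions of increments now gets a stuck-but-valid object rather than a reference count of zero on memory still in use.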

David Howells posted an RFC patch entitled “VFS: Introduce superblock configuration context” which is a “set of patches to create a superblock configuration context prior to setting up a new mount, populating it with the parsed options/binary data, creating the superblock and then effecting the mount. This allows namespaces and other information to be conveyed through the mount procedure. It also allows extra error information”.

The Google Chromebook team let folks know that they were (rarely, like one in a million) seeing “Threads stuck in zap_pid_ns_processes()”. Guenter Roeck noted that the “Problem is that if the main task [which has children that are being ptraced] doesn’t exit, it [the child] hangs forever. Chrome OS (where we see the problem in the field, and the application is chrome) is configured to reboot on hung tasks – if a task is hung for 120 seconds on those systems, it tends to be in a bad shape. This makes it a quite severe problem for us”. He asked “Are there other conditions besides ptrace where a task isn’t reaped?”. Reaping refers to the behavior in which tasks, when they exit, are reparented to the init task, which “reaps” them (cleans up and makes sure the state they exit with is seen), except under ptrace in this case, where the parent task spawning the children “was outside of the pid namespace and was choosing not to reap the child”. Various proposals as to how to deal with this in the namespace code were discussed.

Mahesh Bandewar posted “kmod: don’t load module unless req process has CAP_SYS_MODULE” which notes that “A process inside random user-ns [a user namespace] should not load a module, which is currently possible”. He shows how a user namespace can be created that causes the kernel to load a module upon access to a file node indirectly. This could be a security risk if this approach were used to cause a host kernel to load a vulnerable but otherwise not loaded kernel driver through the privileged permissions in the namespace.

Marc Zyngier posted “irqdomain: Improve irq_domain_mapping facility” in which he “Update[s] IRQ-domain.txt to document irq_domain_mapping” among otherwise seeking to make it easier to access and understand this kernel feature.

Jens Axboe accepted a patch from Ulf Hansson adding Paolo Valente as a MAINTAINER of the BFQ I/O scheduler.

Cyrille Pitchen updated the git repos for the SPI NOR subsystem, which is “now hosted on MTD repos, spi-nor/next is on l2-mtd and spi-nor/fixes will be on linux-mtd”.

Alexandre Courbot posted “MAINTAINERS: remove self from GPIO maintainers”.

The folks at Codeaurora posted a lengthy analysis of the Linux kernel scheduler and specific problems with load_balance that will be covered next time around, along with work by Peter Zijlstra on the “cgroup/PELT overhaul (again)”.

Finally, Paul McKenney previously posted “Make SRCU be once again optional”, after having noted that the need to build it in by default (caused by other recent changes in header files) increased the kernel by 2K. Nico(las) Pitre was happy to hear this, saying “If every maintainer finds a way to (optionally) reduce the size of the code they maintain by 2K then we’ll get a much smaller kernel pretty soon”.

May 15, 2017 04:07 AM