Kernel Planet

October 16, 2018

Matthew Garrett: Initial thoughts on MongoDB's new Server Side Public License

MongoDB just announced that they were relicensing under their new Server Side Public License. This is basically the Affero GPL except with section 13 largely replaced with new text, as follows:

If you make the functionality of the Program or a modified version available to third parties as a service, you must make the Service Source Code available via network download to everyone at no charge, under the terms of this License. Making the functionality of the Program or modified version available to third parties as a service includes, without limitation, enabling third parties to interact with the functionality of the Program or modified version remotely through a computer network, offering a service the value of which entirely or primarily derives from the value of the Program or modified version, or offering a service that accomplishes for users the primary purpose of the Software or modified version.

“Service Source Code” means the Corresponding Source for the Program or the modified version, and the Corresponding Source for all programs that you use to make the Program or modified version available as a service, including, without limitation, management software, user interfaces, application program interfaces, automation software, monitoring software, backup software, storage software and hosting software, all such that a user could run an instance of the service using the Service Source Code you make available.


MongoDB admit that this license is not currently open source in the sense of being approved by the Open Source Initiative, but say: "We believe that the SSPL meets the standards for an open source license and are working to have it approved by the OSI."

At the broadest level, AGPL requires you to distribute the source code to the AGPLed work[1] while the SSPL requires you to distribute the source code to everything involved in providing the service. Having a license place requirements around things that aren't derived works of the covered code is unusual but not entirely unheard of - the GPL requires you to provide build scripts even if they're not strictly derived works, and you could probably make an argument that the anti-Tivoisation provisions of GPL3 fall into this category.

A stranger point is that you're required to provide all of this under the terms of the SSPL. If you have any code in your stack that can't be released under those terms then it's literally impossible for you to comply with this license. I'm not a lawyer, so I'll leave it up to them to figure out whether this means you're now only allowed to deploy MongoDB on BSD because the license would require you to relicense Linux away from the GPL. This feels sloppy rather than deliberate, but if it is deliberate then it's a massively greater reach than any existing copyleft license.

You can definitely make arguments that this is just a maximalist copyleft license, the AGPL taken to an extreme, and therefore that it fits the open source criteria. But there's a point where something is so far from the previously accepted scenarios that it's actually something different, and should be examined as a new category rather than under the already approved ones. I suspect that this license has been written to conform to a strict reading of the Open Source Definition, and that any attempt by OSI to declare it as not being open source will receive pushback. But definitions don't exist to be weaponised against the communities that they seek to protect, and a license that has overly onerous terms should be rejected even if that means changing the definition.

In general I am strongly in favour of licenses ensuring that users have the freedom to take advantage of modifications that people have made to free software, and I'm a fan of the AGPL. But my initial feeling is that this license is a deliberate attempt to make it practically impossible to take advantage of the freedoms that the license nominally grants, and this impression is strengthened by it being something that's been announced with immediate effect rather than something that's been developed with community input. I think there's a bunch of worthwhile discussion to have about whether the AGPL is strong and clear enough to achieve its goals, but I don't think that this SSPL is the answer to that - and I lean towards thinking that it's not a good faith attempt to produce a usable open source license.

(It should go without saying that this is my personal opinion as a member of the free software community, and not that of my employer)

[1] There's some complexities around GPL3 code that's incorporated into the AGPLed work, but if it's not part of the AGPLed work then it's not covered


October 16, 2018 10:44 PM

October 15, 2018

Davidlohr Bueso: Linux v4.18: Performance Goodies

Linux v4.18 has been out for two months now, making this post a bit late, but still in time before the next release. There was also so much drama around the CoC that it was hard to care about performance topics :P As always, the release comes with a series of performance enhancements and optimizations across subsystems.

locking: avoid pointless TEST instructions

A number of places within the locking primitives have been optimized to avoid superfluous TEST instructions on the CAS return value by relying on try_cmpxchg, generating slightly better code for x86-64 (for arm64 there is really no difference). This was done for the mutex fastpath (the uncontended case) and for queued spinlocks.
[Commit c427f69564e2, ae75d9089ff7]

locking/mcs: optimize cpu spinning

Some architectures, such as arm64, can enter a low-power standby state while spin-waiting instead of purely spinning on a condition. This is now applied to the MCS spin loop, which in turn directly helps queued spinlocks. On x86, this can also be cheaper than spinning on smp_load_acquire().
[Commit 7f56b58a92aa]

mm/mremap: reduce amount of TLB shootdowns

It was discovered that on a workload heavily dominated by mremap, the number of TLB flushes was excessive, causing overall performance issues. By removing the LATENCY_LIMIT magic number and handling TLB flushes on a PMD boundary instead of every 64 pages, the number of shootdowns can be reduced by a factor of 8 in the ideal case. The LATENCY_LIMIT was almost certainly used originally to limit PTL hold times, but the latency savings are likely shadowed by the cost of IPIs in many cases.
[Commit 37a4094e828f]

mm: replace mmap_sem to protect cmdline and environ procfs files

Reducing (ab)users of the mmap_sem is always good for general address space performance. A new mm->arg_lock is introduced to protect against races when handling the /proc/$PID/{cmdline,environ} files, (mostly) removing the need for the semaphore there.
[Commit 88aa7cc688d4]

mm/hugetlb: make better use of page clearing optimization

Pass the fault address (the address of the sub-page being accessed) to the no-page fault handler to make better use of the general huge page clearing optimization. This allows the sub-page being accessed to be cleared last, so that its cache lines are not evicted while the other sub-pages are cleared. Performance improvements were reported for the vm-scalability anon-w-seq workload under hugetlbfs, improving throughput by ~30%.
[Commit 285b8dcaacfc]

sched: don't schedule threads on pre-empted vCPUs

It is now possible to determine whether a vCPU is running, and to prioritize such CPUs when scheduling threads. If a vCPU has been preempted, waking a thread on it incurs the extra cost of a VMENTER plus the wait until it actually gets to run on a host CPU again. If other vCPUs are idle but actually running on host CPUs, threads should be scheduled there instead.
[Commit 247f2f6f3c70, 943d355d7fee]

sched/numa: Stagger NUMA balancing scan periods for new threads

It is redundant and counter-productive for all threads sharing an address space to change the page protections to trap NUMA faults. Potentially only one thread is required, but that thread may be idle, or it may have no locality concerns and pick an unsuitable scan rate. This patch keeps the scan periods independent, but staggers them based on the number of address-space users when a thread is created.

The intent is that threads will avoid scanning at the same time, while still having a chance to adapt their scan rate later if necessary. This reduces total scan activity early in the threads' lifetime. The difference in headline performance across a range of machines and workloads is marginal, but system CPU usage is reduced, as is overall scan activity.
[Commit 137844759843]

block/bfq: postpone rq preparation to insert or merge

A lock contention point is removed (see the patch for details and justification) by postponing request preparation to insertion or merging time, as the lock no longer needs to be grabbed in the prepare_request hook.
[Commit 18e5a57d7987]

btrfs: improve rmdir performance for large directories

When checking whether a directory can be deleted, instead of ensuring that all of its children have been processed, this optimization keeps track of the directory index offset of the last child checked in the previous call to can_rmdir(), and uses it as the starting point for future calls. The change was shown to yield massive performance benefits: for a test directory with two million files being deleted, the runtime is reduced from half an hour to less than two seconds.
[Commit 0f96f517dcaa]

KVM: VMX: Optimize tscdeadline timer latency

Add support for advancing the tscdeadline expiration when the tscdeadline timer is emulated via the VMX preemption timer, to reduce hypervisor latency (handle_preemption_timer -> vmentry). The guest can also set a very small expiration; in that case delta_tsc is set to 0, leading to an immediate vmexit when delta_tsc is not bigger than the advance time. This patch can reduce latency by ~63% for kvm-unit-tests/tscdeadline_latency when testing busy waits.
[Commit c5ce8235cffa]

net/sched: NOLOCK qdisc performance enhancements and fixes

There have been various performance-related core changes to the NOLOCK qdisc code, beginning with a reduction in the atomic operations on __QDISC_STATE_RUNNING. The bit was flipped twice per packet in the uncontended scenario with a packet rate below line rate: on packet dequeue and on the next, failing dequeue attempt. The changes move the bit manipulation into the qdisc_run_{begin,end} helpers, simplifying the qdisc so that the bit is now flipped only once per packet, with a measurable performance improvement in the uncontended scenario.

Later, the above was actually replaced by a sequence spinlock instead of the atomic approach, to address pfifo_fast performance regressions. There is also a reduction in the Qdisc struct's memory footprint (it spans one less cacheline).
[Commit 96009c7d500e, 021a17ed796b, e9be0e993d95]

lib/idr: improve scalability by reducing IDA lock granularity

Improve the scalability of the IDA by using the per-IDA xa_lock rather than the global simple_ida_lock. IDAs are not typically used in performance-sensitive locations, but since we have this lock anyway, we can use it.
[Commit b94078e69533]

x86-64: micro-optimize __clear_user()

Use immediate constants, saving two registers.
[Commit 1153933703d9]

arm64: select ARCH_HAS_FAST_MULTIPLIER

It is probably safe to assume that all Armv8-A implementations have a multiplier whose efficiency is comparable or better than a sequence of three or so register-dependent arithmetic instructions. Select ARCH_HAS_FAST_MULTIPLIER to get ever-so-slightly nicer codegen in the few dusty old corners which care.
[Commit e75bef2a4fe2]

October 15, 2018 08:19 PM

October 11, 2018

Pete Zaitcev: I'd like to interject for a moment

In a comment on the death of G+, elisteran brought up something that long annoyed me out of all proportion with its actual significance. What do you call a collection of servers communicating through NNTP? You don't call them "INN", you call them "Usenet". The system of hosts communicating through SMTP is not called "Exim", it is called "e-mail". But when someone wants to escape G+, they often consider "Mastodon". Isn't it odd?

Mastodon is merely an implementation of Fediverse. As it happens, only one of my Fediverse channels runs on Mastodon (the Japanese language one at Pawoo). Main one still uses Gnusocial, the anime one was on Gnusocial and migrated to Pleroma a few months ago. All of them are communicating using the OStatus protocol, although a movement is afoot to switch to ActivityPub. Hopefully it's more successful than the migration from RSS to Atom was.

Yet, I noticed that a lot of people fall for the idea that Mastodon is an exclusive brand. Rarely does one have to know or care what MTA someone else uses. Microsoft was somewhat successful in establishing Outlook as such a powerful brand, to the exclusion of compatible e-mail software. The maintainer of Mastodon is doing his hardest to present it as a similar brand, and regrettably, he's very successful at that.

I guess what really drives me mad about this is how Eugen uses his mindshare advantage to drive protocol extensions. All Fediverse implementations generally communicate freely with one another, but as Pleroma and Mastodon develop, they gradually leave Gnusocial behind in features. In particular, Eugen found a loophole in the protocol which allows attaching pictures without using up space in the message for the URL. When Gnusocial displays a message with such an attachment, it only displays the text, not the picture. This actually used to be a server setting, in case you wanted to protect your instance from NSFW imagery and save a little bandwidth. But these days pictures are so prevalent that it's pretty much impossible to live without receiving them. In this, Eugen has completed the "extend" phase and is moving on to "extinguish".

I'm not sure if this is a lost cause by now. At least I hope that members of my social circle migrate to Fediverse in general, and not to Mastodon from the outset. Of course, the implementation does matter when they make choices. As I mentioned, for anything but Linux discussions, pictures are essential, so one cannot reasonably use a Gnusocial instance for anime, for example. And, I can see some users liking Mastodon's UI. And, Mastodon's native app support is better (or not). So yes, by all means, if you want to install Mastodon, or join an instance that's running Mastodon, be my guest. Just realize that Mastodon is an implementation of Fediverse and not the Fediverse itself.

October 11, 2018 01:16 PM

October 09, 2018

Pete Zaitcev: Ding-dong, the witch is dead

Reactions by G+ inhabitants were better than expected at times. Here's Jon Masters:

For the three people who care about G+: it's closing down. This is actually a good thing. If you work in kernel or other nerdy computery circles and this is your social media platform, I have news for you...there's a world outside where actual other people exist. Try it. You can then follow me on Twitter at @jonmasters when you get bored.

Rock on. Although LJ was designed as a shitty silo, it wasn't powerful enough to make itself useless. For example, outgoing links aren't limited. That said, LJ isn't bulletproof: the management is pushing the "new" editor that does not allow HTML. The point is though, there's a real world out there.

And, some people are afraid of it, and aren't ashamed to admit it. Here's Steven Rostedt in Jon's comments:

In other words, we are very aware of the world outside of here. This is where we avoided that world ;-)

So weak. Jon is a titan among his entourage.

Kir enumerated escape plans thus (in my translation):

Where to run, unclear. Not wanting to Facebook, Telegram is kinda a marginal platform (although Google+ marginal too), too lazy to stand up a standalone. Nothing but LJ comes to mind.

One thing that comes across very strongly is how reluctant people are to run their own infrastructure. For one thing, the danger of a devastating DDoS is absolutely real. And then you have to deal with spam. Those who do not have the experience also tend to over-estimate the amount of effort you have to put into running "dnf update" once in a while.

Personally, I think that although of course it's annoying, the time wasted on the infra is not that great, or at least it wasn't for me. The spam can be kept under control with a minimal effort. Or, could be addressed in drastic ways. For example, my anime blog simply does not have comments at all. As far as DoS goes, yes, it's a lottery. But then the silo platform can easily die (like G+), or ban you. This actually happens a lot more than those hiding their heads in the sand like to admit. And you don't need to go as far as to admit to your support of President Trump in order to get banned. Anything can trigger it, and the same crazies that DoS you will also try to deplatform you.

One other idea I was very successful with, and that many people have trouble accepting, is having several channels for social posting (obviously CKS was ahead of the times in separating pro and hobby). Lots and lots of G+ posters insist on dumping all the garbage into one bin, instead of separating their output. Perhaps now they'll find a client or device that allows them to switch accounts easily.

October 09, 2018 02:46 PM

October 07, 2018

Pete Zaitcev: Python and journalism

Back in July, the Economist wanted to judge the popularity of programming languages and used ... Google Trends. Python is rocketing up, BTW. Go is not even mentioned.

October 07, 2018 08:04 PM

October 04, 2018

Andy Grover: Stratis 1.0 released!

We just tagged Stratis 1.0.

I can’t believe I haven’t blogged about Stratis before, although I’ve written in other places about it. We’ve been working on it for two years.

Basically, it’s a fancy manager of device-mapper and XFS configuration, to provide a similar experience as ZFS and Btrfs, but completely different under the hood.

Four things that took the most development time (so far)
  1. Writing the design doc. Early on, much of the work was convincing people the approach we wanted was a good one. We spent a lot of time discussing details among ourselves and winning over internal stakeholders (or not), but most of all, showing that we had given serious thought to various alternatives, and had spent some time to comprehend the consequences of initial design choices. Having the design doc made these discussions easier, and solicited feedback that resulted in a much better design than what we started with.
  2. Implementing on-disk metadata formats and algorithms to protect maximally against corruption and over-write. People said it would take more time than we thought and they…weren’t wrong! I still think implementing this was the right call, however.
  3. The hordes of range lists Stratis manages internally. It was probably inevitable that using multiple device-mapper layers involves a lot of range mapping. Stratis does a lot of it now, and it will be doing way more in the future, once we start using DM devices like integrity, raid, and compression. Rust really came through for us here I think. Rust’s functional aspects work very well for things like mapping and allocating.
  4. The D-Bus interface was a big effort in the pre-0.5 timeframe, but now that it is up and running it’s easy to maintain and update. We owe much of this to the quality of the dbus-rs library, and the receptivity of its author, diwic, to help us understand how to use it, and also helping to add small bits that aided our usage of D-Bus.
People to thank

Thanks to Igor Gnatenko and Josh Stone, two people who played a large part in making Rust on Fedora a reality. When I started writing the prototype for Stratis, this was a big question mark! I just hoped that the value of Rust would ensure that sooner or later Rust would be supported on Fedora and RHEL, and thanks to these two (and others, and, oh, you know, Firefox needing it…) it worked out.

I’d also like to thank the Rust community, for making such a compelling, productive systems language through friendliness and respect, sweating the details, and sharing! Like I alluded to before, Rust’s functional style was a good match for our problem space, and Rust’s intense focus on error handling also was perfect for a critical piece of software like stratisd, where what to do about errors is the most important part of what it does.

Finally, I’d like to thank the other members of the Stratis core team: Todd, Mulhern, and Tony. Stratis 1.0 is immeasurably better because of the different backgrounds and strengths we each brought to bear on developing this new piece of software. Thanks, everybody. You made 1.0 happen.

The Future

The 1.0 release marks the end of the beginning, so to speak. We just left the Shire, Frodo! Stratis is a viable product, but there’s so much more to do. Integrating more high-value device-mapper layers, more integration with other storage APIs (both “above” and “below”), more flexibility around adding and removing storage devices, while keeping the UI clean and the admin work low, is the challenge.

Stratis is going to need some major help to get there. For people interested in doing development, testing, packaging, or using Stratis, I invite you to visit our website and GitHub, or just keep tabs by following the project on Google Plus or Twitter.

October 04, 2018 10:44 PM

October 02, 2018

Linux Plumbers Conference: 2018 Linux Plumbers Conference is almost completely full

Due to overwhelming demand for tickets to the Linux Plumbers Conference, there are no additional registrations available at this time.

As we finalize the makeup of microconferences, refereed talks, and so on, there will be some spots available. We will be making them available to those who have expressed interest as fairly as we can and as soon as we can. We plan to contact the recipients of the first batch of released slots by October 8. There may be another, likely smaller, batch notified thereafter.

Those interested in attending the conference should send a request to contact@linuxplumbersconf.org to get on the waiting list. In the unlikely event that the waiting list has been exhausted, we will release any remaining registrations on a first-come-first-served basis by mid-to-late October.

LPC [1] will be held in Vancouver, British Columbia, Canada from Tuesday, November 13 through Thursday, November 15.

[1] https://linuxplumbersconf.org/

October 02, 2018 09:34 PM

Linux Plumbers Conference: CLANG/GCC/GLIBC Toolchain Microconference Accepted into 2018 Linux Plumbers Conference

The interaction of toolchain components such as GCC, GLIBC, and CLANG/LLVM with the Linux kernel and with the underlying hardware has evolved rapidly. The corresponding communities continue to push on the limits of what is possible, due to new silicon as well as the performance and security changes of the past year.

Specific topics include support for control-flow enforcement technologies (CET), loop-nest optimization flag changes, optimized x86_64 math functions, unified API for new ports, emulation fallback for system calls, handling deprecated kernel support (such as PowerPC HTM support), building the Linux kernel with CLANG, and ARMv8.5 features.

If you would like to contribute to this discussion, please feel free to contact Victor Rodriguez (vm.rod25atgmail.com), H.J. Lu (hjl.toolsatgmail.com), Adhemerval Zanella (adhemerval.zanellaatlinaro.org), David Edelsohn (dje.gccatgmail.com), or Siddhesh Poyarekar (siddheshatgotplt.org).

We hope to see you there!

October 02, 2018 03:46 AM

September 27, 2018

James Morris: 2018 Linux Security Summit North America: Wrapup

The 2018 Linux Security Summit North America (LSS-NA) was held last month in Vancouver, BC.

Attendance continued to grow this year, with a record of 220+ attendees.  Our room was upgraded as a result, with spectacular views.

Linux Security Summit NA 2018, Vancouver, BC

We also had many great proposals and the schedule ended up being a very tight fit.  We’ve asked for an extra day for LSS-NA next year — here’s hoping.

Slides of all presentations are available here: https://events.linuxfoundation.org/events/linux-security-summit-north-america-2018/program/slides/

Videos may be found in this youtube playlist.

Once again, as is typical, the conference was focused around development, somewhat uniquely in the world of security conferences.  It’s interesting to see more attention seemingly being paid to the lower parts of the stack: secure booting, firmware, and hardware roots of trust, as well as the continued efforts in hardening the kernel.

LWN provided some excellent coverage of LSS-NA:

Paul Moore has a brief writeup here.

Thanks to everyone involved in the event for 2018: the speakers, attendees, the program committee, the sponsors, and the organizing team at the Linux Foundation.  LSS-NA would not be possible without all of you!

September 27, 2018 08:06 PM

September 26, 2018

Pete Zaitcev: Postgres vs MySQL

Unexpectedly in the fediverse:

[...] in my experience, postgres crashes less, and the results are less devastating if it does crash. I've had a mysql crash make the data unrecoverable. On the other hand I have a production postgres 8.1 installation (don't ask) that has been running without problems for over 10 years.

There is more community information and more third-party tools that require mysql, it has that advantage. the client tools for mysql are easier to use because the commands are in plain english ("show tables") unlike postgres that are commands like "\dt+". but if I'm doing my own thing though, I use postgres.

Reiser, move over. There's a new game in town.

September 26, 2018 01:15 AM

September 25, 2018

Linux Plumbers Conference: Regular Registration Quota Reached

Thank you all for the extremely strong interest in participation to the 2018 Linux Plumbers Conference this year.

At this point, all of the regular registration slots for LPC 2018 have sold out.

There will be a very limited number of registrations available on a first come first serve basis going forward.

Those interested in attending the conference should send a request to contact@linuxplumbersconf.org to get on the waiting list.

We will process people as quickly as possible as slots initially allocated to sponsors, microconferences and speakers get released.

September 25, 2018 08:52 PM

Pete Zaitcev: Huawei UI/UX fail

The Huawei M3 gave me an unpleasant surprise a short time ago. I had it in my hands while doing something and my daughter (age 30) offered to hold it for me. When I received it back and turned it on, it was factory reset. What happened?

It turned out that it's possible to reset the blasted thing merely by holding it. If someone grabs it and pays no attention to what's on the screen, it's easy to press and hold the edge power button inadvertently. That brings up a dialog with two touch buttons, for power off and reset. The same hand that's holding the power button touches the screen and causes the reset (the knuckle where the finger meets the palm does that perfectly).

The following combination of factors makes this happen:
  1. The power button is on the edge, and it sticks out. Some tablets, like the Kindle or Nexus, have somewhat slanted edges, so the buttons are somewhat protected. Holding the tablet across the face engages the power button. They could at least have placed the power button on the short edge of the device.
  2. The size of the tablet is just large enough that a normal person can hold it with one hand, but has to stretch. Therefore, the base knuckles touch the surface. On a larger tablet, a human hand is not large enough to hold it like that, and on a phone-sized device the palm cups, so it does not touch the center of the screen.
  3. The protection against accidental reset is essentially absent.

Huawei, not even once.

Google, bring back the Nexus 7, please.

September 25, 2018 01:20 AM

September 20, 2018

Valerie Aurora: Something is rotten in the Linux Foundation

When I agreed to talk about the management problems at the Linux Foundation to Noam Cohen, the reporter who wrote this story on Linux for the New Yorker, I expected to wait at least a year to see any significant change in the Linux community.

Instead, before the story was even published, the Linux project leader Linus Torvalds suddenly announced that he was temporarily stepping down from his leadership role. He also instituted a new code of conduct for the Linux kernel community after resisting years of requests for one.

I was (and am) astonished. So is everyone else. Now that I’ve read the New Yorker story, I am even more surprised–everything in it is public knowledge. Here’s why I don’t think the story explains why he stepped down.

Torvalds has been in charge of Linux for 27 years, and he’s been verbally abusive most of that time. I know, I personally spent more than 15 years struggling to change the Linux community for the better, first as a Linux kernel developer for more than 7 years, then as co-founder and executive director of a non-profit working to make things better for my fellow kernel developers. In 2016 I sent a letter to the Linux Foundation board of directors detailing pervasive mismanagement at the foundation. Nothing I or anyone else did changed the culture of Linux.

I finally realized why the Linux community was enduringly toxic and resistant to change: because Torvalds likes it that way, and he can inflict millions of dollars of losses on anyone who tries to stop him.

How? Well, if Torvalds’ employer, the Linux Foundation, pressures him, he can quit and they will lose millions of dollars in revenue, because paying Torvalds is the main reason sponsors give the foundation money. If a Linux Foundation sponsor tries to make Torvalds change, he can retaliate by refusing to integrate the sponsor’s code into the Linux kernel, forcing that sponsor to pay millions of dollars in software maintenance costs. If an individual Linux developer confronts Torvalds about his abusive behavior, their Linux career will end.

Torvalds also fostered a cult of personality whose central tenet is that Linux will fail if Torvalds is not its leader. In this system, Torvalds has little incentive to stop doing anything he enjoys, including verbally abusing other Linux developers.

My hope was that if a news story exposed this underlying power structure and showed how Linux Foundation sponsors such as Google, Intel, and HP are paying millions of dollars to fund toxic harassment of their own employees, the sponsors would act in concert to force some change, hopefully sometime in the next year. Instead, the usually intractable Torvalds abruptly stepped down before the story was even published.

I can’t think of anything I told Cohen that would result in anyone risking millions of dollars to confront Torvalds this quickly and forcefully. Maybe it’s a coincidence; when the New Yorker reached out for comment, Linux developers were also angry about another issue. It’s possible Torvalds took other developers’ feedback about his abusive behavior seriously for the first time–in 27 years. But the announcement seemed weirdly rushed even to the developers asking for change.

I don’t know what the real explanation is. I suspect the foundation’s board of directors doesn’t know either; a 22-person board is usually purely ceremonial. (Did you know that the larger a board is, the less likely it is to fire the CEO?)

But you know who probably does know the explanation? Senior former Linux Foundation employees, and with the recent high turnover rate at the foundation there are quite a few.

Here’s what I suggest: Linux Foundation sponsors should demand that the Linux Foundation release all former employees from their non-disparagement agreements, then interview them one-on-one, without anyone currently working at the foundation present. At a minimum, the sponsors should insist on seeing a complete list of ex-employee NDAs and all funds paid to them during and after their tenure. If current Linux Foundation management balks at doing even that, well, won’t that be interesting?

If you’d like to support people working to fix harmful workplace conditions, please donate to BetterBrave, which helps employees fight workplace harassment, including sexual harassment and discrimination. Thank you!

If you’re being abused at work, I hope you will keep meticulous documentation, pay attention to statutes of limitation, talk to a lawyer, and reach out to a reporter sooner rather than later. As the stories about Uber, CBS, and The Weinstein Company show, many boards of directors just rubber-stamp the abuses of the CEO and upper management until someone talks to a reporter. I’m also happy to listen to your story, confidentially.

September 20, 2018 07:04 PM

Linux Plumbers Conference: Thermal Microconference Accepted into 2018 Linux Plumbers Conference

As the energy density of computer systems has increased, thermal issues have become an increasingly hot topic across the spectrum from hand-held systems to internet datacenters. Because the need for thermal management is relatively new, there is a wide variety of hardware and firmware mechanisms, to say nothing of a wide variety of independently developed software to interact with these mechanisms. This in turn results in complex and almost-duplicate code to manage and control thermal excursions. This microconference will therefore look to see if it is possible to consolidate or at least to better align the Linux kernel’s thermal subsystems.

This microconference will discuss better handling of low ambient temperatures, userspace thermal control, improvements to thermal zone mode, better support for indirect (virtual) temperature measurement, sensor hierarchy, scheduler interactions with thermal management, and improvements to idle injection as a way to cool a core.

If you are hacking on thermal-related topics and would like to contribute to the discussion, feel free to contact Eduardo Valentin (edubezval@gmail.com) or Amit Kucheria (amit.kucheria@gmail.com).

Please join us for an interesting and important discussion!

September 20, 2018 04:55 PM

September 17, 2018

Pete Zaitcev: Robots on TV

Usually I do not watch TV, but I traveled and saw a few of them in public food intake places and such. What caught my attention were ads for robotics companies, aimed at business customers. IIRC, the companies were called generic names like "Universal Robotics" and "Reach Robotics". Or so I recall, but on second thought, Reach Robotics is a thing, but it focuses on gaming, not traditional robotics. But the ads depicted robots doing some unspecified stuff: moving objects from place to place. Not dueling bots. Anyway, what's up with this? Is there some sort of revolution going on? What was the enabler? Don't tell me it's all the money released by the end of Moore's Law, seeking random fields of application.

P.S. I know about the "Pentagon's Evil Mechanical Dogs" by Boston Dynamics. These were different, manipulating objects in the environment.

September 17, 2018 07:35 PM

Linux Plumbers Conference: RISC-V microconference accepted for Linux Plumbers Conference

The open nature of the RISC-V ecosystem has allowed contributions from both academia and industry to lead to an unprecedented number of new hardware design proposals in a very short time span. Linux support is the key to enabling these new hardware options.

The primary objective of the RISC-V microconference at Plumbers is to initiate a community-wide discussion about the design problems/ideas for different Linux kernel features that will lead to a better, stable kernel for RISC-V.

Topics for this microconference include:

If you’re interested in participating in this microconference or have other topics to propose, please contact Palmer Dabbelt (palmer@sifive.com) or Atish Patra (atish.patra@wdc.com).

LPC will be held in Vancouver, British Columbia, Canada from Tuesday, November 13 through Thursday, November 15.

We hope to see you there!

September 17, 2018 04:16 PM

September 12, 2018

Gustavo F. Padovan: linuxdev-br: a Linux international conference in Brazil

The second edition of linuxdev-br happened at the end of last month in Campinas, Brazil. We have put a nice write-up about the conference at the link below. Soon we will start planning next year’s event. Come and join our community!

linuxdev-br: a Linux international conference in Brazil

The post linuxdev-br: a Linux international conference in Brazil appeared first on Gustavo Padovan.

September 12, 2018 11:48 AM

September 11, 2018

Linux Plumbers Conference: Looking forward to the Kernel Summit at LPC 2018

The LPC 2018 program committee would like to reiterate that the Kernel Summit is going ahead as planned as a track within the Linux Plumbers Conference in Vancouver, BC, November 13th through 15th. However, the Maintainers Summit half day, which is by invitation only, has been rescheduled to be colocated with OSS Europe in Edinburgh, Scotland on October 22nd. Attendees of the Maintainers Summit, once known, will still receive free passes to LPC and thus will probably be present in Vancouver as well.

Also a reminder that the CFP for the Kernel Summit is still open until September 21st 2018: to submit a discussion topic, please use a separate email for each topic with each subject line tagged with [TECH TOPIC], and send these emails to:  ksummit-discuss@lists.linuxfoundation.org

Looking forward to seeing you all in Vancouver!


September 11, 2018 08:20 PM

Linux Plumbers Conference: Tech Topics for Kernel Summit

If you missed the refereed-track deadline and you have a kernel-related topic (or, for that matter, if you just now thought of a kernel-related topic), please consider submitting it for the Kernel Summit.  To do this, please use a separate email for each topic with each subject line tagged with [TECH TOPIC], and send these emails to:

ksummit-discuss@lists.linuxfoundation.org

If you submit your topic suggestions before September 21st, and if one of your suggestions is accepted, then you will be given free admission to the Linux Plumbers Conference.

September 11, 2018 04:40 PM

September 10, 2018

Matthew Garrett: The Commons Clause doesn't help the commons

The Commons Clause was announced recently, along with several projects moving portions of their codebase under it. It's an additional restriction intended to be applied to existing open source licenses with the effect of preventing the work from being sold[1], where the definition of being sold includes being used as a component of an online pay-for service. As described in the FAQ, this changes the effective license of the work from an open source license to a source-available license. However, the site doesn't go into a great deal of detail as to why you'd want to do that.

Fortunately one of the VCs behind this move wrote an opinion article that goes into more detail. The central argument is that Amazon make use of a great deal of open source software and integrate it into commercial products that are incredibly lucrative, but give little back to the community in return. By adopting the commons clause, Amazon will be forced to negotiate with the projects before being able to use covered versions of the software. This will, apparently, prevent behaviour that is not conducive to sustainable open-source communities.

But this is where things get somewhat confusing. The author continues:

Our view is that open-source software was never intended for cloud infrastructure companies to take and sell. That is not the original ethos of open source.

which is a pretty astonishingly unsupported argument. Open source code has been incorporated into proprietary applications without giving back to the originating community since before the term open source even existed. MIT-licensed X11 became part of not only multiple Unixes, but also a variety of proprietary commercial products for non-Unix platforms. Large portions of BSD ended up in a whole range of proprietary operating systems (including older versions of Windows). The only argument in favour of this assertion is that cloud infrastructure companies didn't exist at that point in time, so they weren't taken into consideration[2] - but no argument is made as to why cloud infrastructure companies are fundamentally different to proprietary operating system companies in this respect. Both took open source code, incorporated it into other products and sold them on without (in most cases) giving anything back.

There's one counter-argument. When companies sold products based on open source code, they distributed it. Copyleft licenses like the GPL trigger on distribution, and as a result selling products based on copyleft code meant that the community would gain access to any modifications the vendor had made - improvements could be incorporated back into the original work, and everyone benefited. Incorporating open source code into a cloud product generally doesn't count as distribution, and so the source code disclosure requirements don't trigger. So perhaps that's the distinction being made?

Well, no. The GNU Affero GPL has a clause that covers this case - if you provide a network service based on AGPLed code then you must provide the source code in a similar way to if you distributed it under a more traditional copyleft license. But the article's author goes on to say:

AGPL makes it inconvenient but does not prevent cloud infrastructure providers from engaging in the abusive behavior described above. It simply says that they must release any modifications they make while engaging in such behavior.

IE, the problem isn't that cloud providers aren't giving back code, it's that they're using the code without contributing financially. There's no difference between what cloud providers are doing now and what proprietary operating system vendors were doing 30 years ago. The argument that "open source" was never intended to permit this sort of behaviour is simply untrue. The use of permissive licenses has always allowed large companies to benefit disproportionately when compared to the authors of said code. There's nothing new to see here.

But that doesn't mean that the status quo is good - the argument for why the commons clause is required may be specious, but that doesn't mean it's bad. We've seen multiple cases of open source projects struggling to obtain the resources required to make a project sustainable, even as many large companies make significant amounts of money off that work. Does the commons clause help us here?

As hinted at in the title, the answer's no. The commons clause attempts to change the power dynamic of the author/user role, but it does so in a way that's fundamentally tied to a business model and in a way that prevents many of the things that make open source software interesting to begin with. Let's talk about some problems.

The power dynamic still doesn't favour contributors

The commons clause only really works if there's a single copyright holder - if not, selling the code requires you to get permission from multiple people. But the clause does nothing to guarantee that the people who actually write the code benefit, merely that whoever holds the copyright does. If I rewrite a large part of a covered work and that code is merged (presumably after I've signed a CLA that assigns a copyright grant to the project owners), I have no power in any negotiations with any cloud providers. There's no guarantee that the project stewards will choose to reward me in any way. I contribute to them but get nothing back in return - instead, my improved code allows the project owners to charge more and provide stronger returns for the VCs. The inequity has shifted, but individual contributors still lose out.

It discourages use of covered projects

One of the benefits of being able to use open source software is that you don't need to fill out purchase orders or start commercial negotiations before you're able to deploy. Turns out the project doesn't actually fill your needs? Revert it, and all you've lost is some development time. Adding additional barriers is going to reduce uptake of covered projects, and that does nothing to benefit the contributors.

You can no longer meaningfully fork a project

One of the strengths of open source projects is that if the original project stewards turn out to violate the trust of their community, someone can fork it and provide a reasonable alternative. But if the project is released with the commons clause, it's impossible to sell any forked versions - anyone who wishes to do so would still need the permission of the original copyright holder, and they can refuse that in order to prevent a fork from gaining any significant uptake.

It doesn't inherently benefit the commons

The entire argument here is that the cloud providers are exploiting the commons, and by forcing them to pay for a license that allows them to make use of that software, the commons will benefit. But there's no obvious link between these things. Maybe extra money will result in more development work being done and the commons benefiting, but maybe extra money will instead just result in greater payout to shareholders. Forcing cloud providers to release their modifications to the wider world would be of benefit to the commons, but this is explicitly ruled out as a goal. The clause isn't inherently incompatible with this - the negotiations between a vendor and a project to obtain a license to be permitted to sell the code could include a commitment to provide patches rather than money, for instance, but the focus on money makes it clear that this wasn't the authors' priority.

What we're left with is a license condition that does nothing to benefit individual contributors or other users, and costs us the opportunity to fork projects in response to disagreements over design decisions or governance. What it does is ensure that a range of VC-backed projects are in a better position to improve their returns, without any guarantee that the commons will be left better off. It's an attempt to solve a problem that's existed since before the term "open source" was even coined, by simply layering on a business model that's also existed since before the term "open source" was even coined[3]. It's not anything new, and open source derives from an explicit rejection of this sort of business model.

That's not to say we're in a good place at the moment. It's clear that there is a giant level of power disparity between many projects and the consumers of those projects. But we're not going to fix that by simply discarding many of the benefits of open source and going back to an older way of doing things. Companies like Tidelift[4] are trying to identify ways of making this sustainable without losing the things that make open source a better way of doing software development in the first place, and that's what we should be focusing on rather than just admitting defeat to satisfy a small number of VC-backed firms that have otherwise failed to develop a sustainable business model.

[1] It is unclear how this interacts with licenses that include clauses that assert you can remove any additional restrictions that have been applied
[2] Although companies like Hotmail were making money from running open source software before the open source definition existed, so this still seems like a reach
[3] "Source available" predates my existence, let alone any existing open source licenses
[4] Disclosure: I know several people involved in Tidelift, but have no financial involvement in the company


September 10, 2018 11:38 PM

September 08, 2018

Paul E. Mc Kenney: Ancient Hardware I Have Hacked: Back to Basics!

My return to the IBM mainframe was delayed by my high school's acquisition of a teletype connected via a 110-baud serial line to a timesharing system featuring the BASIC language. I was quite impressed with this teletype because it could type quite a bit faster than I could. But this is not as good as it might sound, given that I came in dead last in every test of manual dexterity that the school ever ran us through. In fact, on a good day, I might have been able to type 20 words a minute, and it took decades of constant practice to eventually get above 70 words a minute. In contrast, one of the teachers could type 160 words a minute, more than half again faster than the teletype could!

Aside from output speed, I remained unimpressed with computers compared to paper and pencil, let alone compared to my pocket calculator. And given that this was old-school BASIC, there was much to be unimpressed about. You could name your arrays anything you wanted, as long as that name was a single upper-case character. Similarly, you could name your scalar variables anything you wanted, as long as that name was either a single upper-case character or a single upper-case character followed by a single digit. This allowed you to use up to 286 variables, up to 26 of which could be arrays. If you felt that GOTO was harmful, too bad. If you wanted a while loop, you could make one out of IF statements. Not only did IF statements have no else clause, the only thing that could be in the THEN clause was the number of the line to which control would transfer when the IF condition evaluated to true. And each line had to be numbered, and the numbers had to be monotonically increasing, that is, in the absence of control-flow statements, the program would execute the lines of code in numerical order, regardless of the order in which you typed those lines of code. Definitely a step down, even from FORTRAN.

But then the teacher showed the class a documentary movie showing several problems that could be solved by computer. I was unimpressed by most of the problems: Printing out prime numbers was impressive but pointless, and maximizing the volume of a box given limited materials was a simple pencil-and-paper exercise in calculus. But the finite-element analysis fluid-flow problem did get my attention. This featured a rectangular aquarium with a glass divider, so that initially the right-hand half of the aquarium was full of water and the left-hand half was full of air. They abruptly removed the glass divider, causing the water to slosh back and forth. They then showed a video of a computer simulation of the water flow, which matched the actual water flow quite well. There was no way I could imagine doing anything like that by hand, and was thus inspired to continue studying computer programming.

We students therefore searched out things that the computer could do that we were unwilling or unable to. One of my classmates ran the teletype's punch-tape output through its punch-tape reader, thus giving us all great insight as to why teletypes on television shows appeared to be so busy. For some reason, our teacher felt that this project was a waste of both punched tape and paper. He was more impressed with the work of another classmate, who calculated and ASCII-art printed magnetic lines of force. Despite the teletype's use of eight-bit ASCII, its print head was quite innocent of lower-case characters.

I coded up a project that plotted the zeroes of functions of two variables as ASCII art on the teletype. My teacher expressed some disappointment in my brute-force approach to locating the zeroes, but as far as I could see the bottleneck was the teletype, not the CPU. Besides, the timesharing service charged only for connect time, so CPU time was free, and why conserve a zero-cost resource?

I worked around the computer's limited arithmetic using crude multi-precision code with the goal of computing one thousand factorial. In this case, CPU was definitely the bottleneck, especially given my naive multiplication algorithm. The largest timeslot I could reserve on the teletype was an hour, and during that time, the computer was only able to make it to 659 factorial. In contrast, Maxima takes a few tens of milliseconds to compute 1000 factorial on my laptop. What a difference four decades makes!
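The general technique can be sketched as follows (a modern Python sketch of the approach, not the original BASIC program; Python of course has arbitrary-precision integers built in, so the manual digit juggling here is purely illustrative). The number is kept as a list of decimal digits, and every multiplication re-walks the entire list with carries, which is why the naive approach bogs down as the factorial grows:

```python
def multiply(digits, n):
    """Multiply a little-endian list of decimal digits by a small integer n."""
    carry = 0
    result = []
    for d in digits:
        carry, digit = divmod(d * n + carry, 10)
        result.append(digit)
    while carry:          # drain any remaining carry into new digits
        carry, digit = divmod(carry, 10)
        result.append(digit)
    return result

def factorial_digits(n):
    """Compute n! as a little-endian list of decimal digits."""
    digits = [1]
    for i in range(2, n + 1):
        digits = multiply(digits, i)
    return digits

# 1000! has 2568 decimal digits.
print(len(factorial_digits(1000)))
```

Each call to multiply() is O(number of digits), so the total work grows roughly quadratically, which goes some way toward explaining why an hour only got as far as 659 factorial.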

I wrote my first professional program on this computer, a pro bono effort for a charity fundraiser. This charity was the work of the local branch of the National Honor Society, and the fundraiser was a computer-dating dance. Given that I was 160 pounds (73 kilograms) of computer-geeky social plutonium, I felt the need to consult an expert. The expert I chose was the home-economics teacher, who unfortunately seemed much more interested in working out why I was such a hopeless geek than in helping with matching criteria. I nevertheless extracted sufficient information to construct a simple Hamming-distance matcher. Fortunately most people seemed reasonably satisfied with their computer-chosen dance partners, the most notable exception being a senior girl who objected strenuously to having been matched only with freshmen boys. Further investigation determined that this mismatch was due to a data-entry error. Apparently, even Cupid is subject to Murphy's Law.
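A simple Hamming-distance matcher of this kind can be sketched in a few lines (a hypothetical reconstruction; the names, questionnaire encoding, and answer data below are invented for illustration). Each person's questionnaire is reduced to a list of categorical answers, and the best match is whoever's answers differ in the fewest positions:

```python
def hamming(a, b):
    """Number of positions at which two equal-length answer lists differ."""
    return sum(x != y for x, y in zip(a, b))

def best_match(person, candidates):
    """Return the candidate whose answers are closest to person's answers."""
    return min(candidates, key=lambda c: hamming(person[1], c[1]))

# Hypothetical questionnaire data: (name, list of categorical answers).
girls = [("Alice", [1, 3, 2, 2, 1]), ("Beth", [2, 1, 1, 3, 2])]
boys = [("Carl", [1, 3, 1, 2, 1]), ("Dave", [2, 2, 1, 3, 2])]

for name, answers in girls:
    match_name, _ = best_match((name, answers), boys)
    print(name, "->", match_name)
```

As the anecdote about the senior girl shows, a matcher like this is only as good as its input data: a single data-entry error shifts the distances and silently produces confident nonsense.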

I also did my one and (thus far) only stint of white-hat hacking. In those trusting times, the school-administration software printed the user's password in cleartext as it was typed. But it was not necessary to memorize the characters that the user typed. You see, this teletype had what is called a ``HERE IS'' key. When this key was pressed, the teletype would send a 20-character sequence recorded on a mechanical drum located inside the teletype. And the sequence recorded on this particular teletype's mechanical drum was, you guessed it, the password to the school-administration software. I demonstrated this to my teacher, which resulted in the teletype being under continuous guard by a school official until such time as the mechanical drum could be replaced with one containing 20 ASCII NUL characters. (And here you thought that security theater was a recent phenomenon!)

Despite its limitations, my two years with this system were quite entertaining and educational. But then it was time to move on to college.

September 08, 2018 10:29 PM

September 06, 2018

Linux Plumbers Conference: Devicetree Microconference Accepted into 2018 Linux Plumbers Conference

We are pleased to announce that the Devicetree Microconference has been accepted into the 2018 Linux Plumbers Conference!

Devicetree provides hardware description for many platforms, such as Linux [1], U-Boot [2], BSD [3], and Zephyr [4]. Devicetree continues to evolve, becoming more robust and attempting to provide the features desired by its varied users.

Some of the overlay-related needs are now being addressed by U-Boot, but there remain use cases for run-time overlay management in the Linux kernel. Support for run-time overlay management in the Linux kernel is slowly moving forward, but significant issues remain [5].

Devicetree verification has been an ongoing project for several years, with the most recent in person discussion occurring at the Devicetree Workshop [6] at Kernel Summit 2017. Progress continues on mail lists, and will be an important topic at the microconference.

Other Devicetree related tools, such as the dtc compiler and libfdt [7] continue to see active development.

Additional possible issues to be discussed may include potential changes to the Flattened Device Tree (FDT) format, reducing the Devicetree memory and storage size in the Linux kernel, creating new architecture to provide solutions to current problems, updating the Devicetree Specification, and using devicetrees in constrained contexts.

If you would like to contribute to the discussion, please feel free to contact Frank (frowand@gmail.com) or Sean (darknighte@linux.com).

LPC [8] will be held in Vancouver, British Columbia, Canada from Tuesday, November 13 through Thursday, November 15.

[1] https://elinux.org/Device_Tree_Reference
[2] https://github.com/lentinj/u-boot/blob/master/doc/README.fdt-control
[3] https://wiki.freebsd.org/FlattenedDeviceTree
[4] http://docs.zephyrproject.org/devices/dts/device_tree.html
[5] https://elinux.org/Frank%27s_Evolving_Overlay_Thoughts
[6] https://elinux.org/Device_tree_future#Kernel_Summit_2017.2C_Devicetree_Workshop
[7] https://elinux.org/Device_Tree_Reference#dtc_.28upstream_project.29
[8] https://linuxplumbersconf.org/

September 06, 2018 12:50 PM

September 04, 2018

Paul E. Mc Kenney: Ancient Hardware I Have Hacked: My First Computer

For the first couple of decades of my life, computers as we know them today were exotic beasts that filled rooms, each requiring the care of a cadre of what were then called systems programmers. Therefore, in my single-digit years the closest thing to a computer that I laid my hands on was a typewriter-sized electromechanical calculator that did addition, subtraction, multiplication, and division. I had the privilege of using this captivating device when helping out with accounting at the small firm at which my mother and father worked.

I was an early fan of hand-held computing devices. In fact, I was in the last math class in my high school that was required to master a slide rule, of which I still have several. I also learned how to use an abacus, including not only addition and subtraction, but multiplication and division as well. Finally, I had the privilege of living through the advent of the electronic pocket calculator. My first pocket calculator was a TI SR-50, which put me firmly on the infix side of the ensuing infix/Polish religious wars.

But none of these qualified as “real computers”.

Unusually for an early 1970s rural-Oregon high school, mine offered computer programming courses. About the only thing I knew about computers were that they would be important in the future, so I signed up. Even more unusually for that time and place, we got to use a real computer, namely an IBM 360. This room-filling monster was located fourteen miles (23 kilometers) away at Chemeketa Community College. As far as I know, this was the closest computer to my home and school. Somehow my math teacher managed to wangle use of this machine on Tuesday and Thursday evenings, and he bussed us there and back.

This computer used punched cards and a state-of-the-art chain lineprinter. We were allowed to feed the card reader ourselves, but operating the lineprinter required special training. This machine's console had an attractive red button labeled EMERGENCY PULL. The computer's operator, who would later distinguish himself by creating a full-motion video on an Apple II, quite emphatically stated that this button should be pulled only in case of a bona fide emergency. He also gave us a simple definition of “emergency” that featured flames shooting out of the top of the computer. I never did see any flames anywhere near the computer, much less shooting out of its top, so I never had occasion to pull that button. But perhaps the manufacturers of certain incendiary laptops should have equipped each of them with an attractive red EMERGENCY PULL button.

Having provided us the necessary hardware training, the operator then gave us a sample card deck. We were to put our program at one specific spot in the deck, and our input data in another. Those of us wishing more information about how this worked were directed to an impressively large JCL manual.

The language of the class was FORTRAN, except that FORTRAN was deemed too difficult an initial language for our tender high-school minds. They therefore warmed us up with assembly language. Not IBM's celebrated Basic Assembly Language (BAL), but a simulated assembly language featuring base-10 arithmetic. After a couple of sessions with the simulated assembly, we moved up to FORTRAN, and even used PL/1 for one of our assignments. There were no error messages: instead, there were error numbers that you looked up in a thick printed manual located in the same bookcase containing the JCL manual.

I was surprised by the computer's limitations, especially the 6-to-7 digit limits for single-precision floating point. After all, even my TI SR-50 pocket calculator did ten digits! That said, the computer could also do alphabetic characters (but only upper case) and a few symbols—though the exclamation point was notably missing. The state-of-the-art 029 keypunches were happy to punch an exclamation mark, but alas! It printed as “0” (zero) on the lineprinter.

I must confess that I was not impressed with the computer. In addition to its arithmetic limitations, its memory was quite small. Most of our assignments were small exercises in arithmetic that I could complete much more quickly using paper and pencil. In retrospect, this is not too surprising, given that my early laissez-faire programming methodology invariably resulted in interminable debugging sessions. However, it was quite clear that computers were becoming increasingly important, and I therefore resolved to take the class again the following year.

So, the last time I walked out of that machine room in Spring of 1974, I fully expected to walk back the following Fall. Little did I know that it would be almost 30 years before I would once again write code for an IBM mainframe. Nor did I suspect that it would be more than 15 years before work started on the operating system that was to be running on that 30-years-hence mainframe.

My limited foresight notwithstanding, somewhere in Finland a small boy was growing up.

September 04, 2018 12:55 AM

September 03, 2018

Linux Plumbers Conference: CfP extended to Sunday September 9th

Happy Labor Day to those celebrating today!

We have had great response to our call for 2018 Linux Plumbers Conference refereed-track submissions.

However, it would seem that we are attracting a lot of procrastinators, given the number of emails we have received requesting an extension.

With the long weekend in North America, we are moving the deadline to Sunday September 9th at 10:59 PM (PST).

Now really is your last chance to make your great submission! Do not delay, submit your proposal now!


September 03, 2018 01:59 PM

Pete Zaitcev: gai.conf

A couple Fedora releases back, I noticed that my laptop stopped using IPv6 to access dual-hosted services. I gave RFC-6724 a read, but it was much too involved for my small mind. Fortunately, it contained a simplified explanation:

Another effect of the default policy table is to prefer communication using IPv6 addresses to communication using IPv4 addresses, if matching source addresses are available.

My IPv6 is NAT-ed, so the laptop sees an RFC 4193 unique-local address in fc00::/7. This does not match the globally assigned address of the external service. Therefore, a matching source address is not available, and things develop from there.

For now, I forced RFC 3484 behavior with gai.conf, basically reverting to the Fedora 26 behavior.
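For reference, such a gai.conf might look like the following (a sketch; the commented-out example table shipped in /etc/gai.conf varies by distribution, so check your local file). With glibc, the presence of any label line replaces the entire built-in policy table rather than amending it, so spelling out the RFC 3484 defaults effectively reverts address selection:

```
# /etc/gai.conf - restore the RFC 3484 default policy table.
# Note: any "label" line replaces the whole built-in table.
label  ::1/128        0
label  ::/0           1
label  2002::/16      2
label  ::/96          3
label  ::ffff:0:0/96  4

precedence  ::1/128        50
precedence  ::/0           40
precedence  2002::/16      30
precedence  ::/96          20
precedence  ::ffff:0:0/96  10
```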

September 03, 2018 03:01 AM

September 01, 2018

Pete Zaitcev: Vladimir Butenko 1962-2018

Butenko was simply the most capable programmer that I've ever worked with. He was also very accomplished. I'm sure everyone has an idea what UNIX v7 was. Although BSD, sockets, and VFS were still in the future, it was a sophisticated OS for its time. Butenko wrote his own OS that was roughly a peer of v7 in features (including vi). He also wrote a Fortran 77 compiler with an IDE, an SQL database, and a myriad of other things. Applications, too: games, communications, industrial control.

I still remember one of our first meetings in late 1983. I wanted someone to explain to me the instruction set of the Mitra-15, a French 16-bit mini. Documentation was practically impossible to get back then, especially for undergrads. Someone referred me to him, and I received a lecture at a smoking area near an elevator, which founded my understanding of computer architecture.

The only time I ever got one up on him was when I wrote a utility to monitor processes (years later, top(1) would do the same thing). Apparently the concept never occurred to Butenko, who was perfectly capable of analyzing the system with a debugger and profiler. Seeing just my UI, he knocked out a clone in a couple of days. Of course, it was superior in every respect.

Butenko worked a lot. The combination of genius and workaholic was unstoppable. Or maybe they were sides of the same coin.

Unfortunately, Butenko was never on board with open source. He used to post to Usenet, lampooning and dismissing Linux. I suspect once you can code your own Linux any time you want, your perspective changes a bit. This was a part of the way we drifted apart later on. I was plugging away at my little corner of Linux, while Butenko was somewhere out in the larger world, revolutionizing computer-mediated communications.

He died suddenly, of heart failure. Way too early, I think.

September 01, 2018 04:17 AM

August 31, 2018

Linux Plumbers Conference: Android Microconference Accepted into 2018 Linux Plumbers Conference

Android continues to find interesting new applications and problems to solve, both within and outside the mobile arena. Mainlining continues to be an area of focus, as do a number of areas of core Android functionality, including the kernel. Other topics include the low memory killer [1], dynamically-allocated Binder devices [2], kernel namespaces [3], EAS [4], userdata filesystem checkpointing, and DT [5].

We hope to see you there!

[1]    https://lwn.net/Articles/761118/

[2]    https://developer.android.com/reference/android/os/Binder

[3]    https://lwn.net/Articles/531114/

[4]    https://lwn.net/Articles/749738/

[5]    https://source.android.com/devices/architecture/dto/

If you would like to contribute to the discussion, please feel free to contact Karim (karim.yaghmour@opersys.com), Todd (tkjos@google.com), Vishal Bhoj (vishal.bhoj@linaro.org), Amit Pundir (amit.pundir@linaro.org), or Kevin Brodsky (Kevin.Brodsky@arm.com).

August 31, 2018 09:49 PM

August 30, 2018

Pete Zaitcev: The shutdown of the project Hummingbird at Rackspace

Wait, wasn't this supposed to be our future?

The abridged history, as I recall, was as follows. Mike Burton started the work to port Swift to Go in early 2016, inside the Swift tree. As such, it was a community effort. There was even a discussion at the OpenStack Technical Committee about allowing development in Go (the TC disallowed it, but later posted some conditions). At the end of the year, I managed to write an object with MIME and collapsed the stock Swift auditor (the one in Python). That was an impetus for PUT+POST, BTW. But in 2017, the RAX cabal - creiht, redbo, and later gholt - weren't very happy with trying to supplicate to the TC, as well as with the massive overhead of developing in the established community, and went their own way. In addition to the TC shenanigans, the upstream Swift at SwiftStack and Red Hat needed a migration path. A Hummingbird without support for Erasure Coding was a non-starter, and the RAX group wasn't interested in accommodating that. By the end of 2017, they were completely on their own, and started going off the deep end by adding database servers and such. They managed to throw off some good ideas about what the next-generation replication ought to look like. But by cutting themselves off from Swift they committed to re-capturing the lightning in the bottle anew, and they just could not pull it off.

On reflection, I suspect their chances would have been better if they were serious about interoperating with Swift. The performance gains that they demonstrated were quite impressive. But their paymasters at RAX weren't into this community development and open-source toys (note that RAX went through a change of ownership while Hummingbird was going on).

I think a port of Swift to Go is still on the agenda, but it's unknown at this point whether it's going to happen.

August 30, 2018 04:38 AM

August 29, 2018

Linux Plumbers Conference: Power Management and Energy-awareness Microconference Accepted into 2018 Linux Plumbers Conference

Use of Linux on battery-powered systems continues to grow, and general energy-efficiency concerns are not going away any time soon. The Power Management and Energy-awareness micro-conference therefore continues a Linux Plumbers Conference tradition of looking into ways to improve energy efficiency.

Significant progress has been made over the last year on multiple fronts. That includes enhancements to the scheduler’s load-tracking facility, giving it improved awareness of the amount of time taken by realtime processes, deadline processes, and interrupt handling in order to improve CPU performance scaling; the work on implementing energy-aware scheduling on asymmetric systems in the kernel (https://lwn.net/Articles/749900/); and the process utilization clamping patch series (https://lwn.net/Articles/762043/). In spite of all that, there still are open issues to be discussed and new ideas to consider. This year, the focus is on energy-optimized task scheduling, user space interfaces for passing power/performance hints to the kernel, platform power management mechanisms, and power management frameworks.

Specific topics include energy-aware scheduling, per-task and per-cgroup performance hints, timer granularity issues in the runtime PM framework, generic power domains (genpd) framework enhancements, firmware-based and direct control of low-level power management features of computing platforms, a proposed on-chip interconnect API, and improving selection of CPU idle states.

If you would like to contribute to the discussion, please feel free to contact Rafael (rafael@kernel.org) or Morten (morten.rasmussen@arm.com).

We hope to see you there!

 

August 29, 2018 03:40 PM

August 27, 2018

Daniel Vetter: Why no 2D Userspace API in DRM?

The DRM (direct rendering manager, not the content protection stuff) graphics subsystem in the linux kernel does not have a generic 2D acceleration API, despite an awful lot of GPUs having more or less featureful blitter units. And many systems need them for a lot of use-cases, because the 3D engine is a bit too slow or too power hungry for just rendering desktops.

It’s a FAQ why this doesn’t exist and why it won’t get added, so I figured I’ll answer this once and for all.

A bit of nomenclature upfront: A 2D engine (or blitter) is a bit of hardware that can copy stuff with some knowledge of the 2D layout usually used for pixel buffers. Some blitters can also do more, like basic blending, converting color spaces, or stretching/scaling. A 3D engine, on the other hand, is the fancy high-performance compute block, which runs small programs (called shaders) on a massively parallel architecture, generally with huge memory bandwidth and a dedicated controller to feed this beast through an asynchronous command buffer. 3D engines happen to be really good at rendering the pixels for 3D action games, among other things.

There’s no 2D Acceleration Standard

3D has it easy: There’s OpenGL and Vulkan and DirectX that require a certain feature set. And huge market forces that make sure if you use these features like a game would, rendering is fast.

Aside: This means the 2D engine in a browser actually needs to work like a 3D action game, or the GPU will crawl. The impedance mismatch compared to traditional 2D rendering designs is huge.

On the 2D side there’s no such thing: Every blitter engine is its own bespoke thing, with its own features, limitations and performance characteristics. There are also no standard benchmarks that would drive common performance characteristics - today blitters are needed mostly in small systems, with very specific use cases. Anything big enough to run more generic workloads will have a 3D rendering block anyway. These systems still have blitters, but mostly just to help move data in and out of VRAM for the 3D engine to consume.

Now the huge problem here is that you need to fill these gaps in various hardware 2D engines using CPU side software rendering. The crux with any 2D render design is that transferring buffers and data too often between the GPU and CPU will kill performance. Usually the cliff is so steep that pure CPU rendering using only software easily beats any simplistic 2D acceleration design.

The only way to fix this is to be really careful when moving data between the CPU and GPU for different rendering operations. Sticking to one side, even if it’s a bit slower, tends to be an overall win. But these decisions highly depend upon the exact features and performance characteristics of your 2D engine. Putting a generic abstraction layer in the middle of this stack, where it’s guaranteed to be if you make it a part of the kernel/userspace interface, will not result in actual acceleration.

So either you make your 2D rendering look like it’s a 3D game, using 3D interfaces like OpenGL or Vulkan. Or you need a software stack that’s bespoke to your use-case and the specific hardware you want to run on.

2D Acceleration is Really Hard

This is the primary reason really. If you don’t believe that, look at all the tricks a browser employs to render CSS and HTML and text really fast, while still animating all that stuff smoothly. Yes, a web-browser is the pinnacle of current 2D acceleration tech, and you really need all the things in there for decent performance: Scene graphs, clever render culling, massive batching and huge amounts of pains to make sure you don’t have to fall back to CPU based software rendering at the wrong point in a rendering pipeline. Plus managing all kinds of assorted caches to balance reuse against running out of memory.

Unfortunately lots of people assume 2D must be a lot simpler than 3D rendering, and therefore they can design a 2D API that’s fast enough for everyone. No one jumps in and suggests we’ll have a generic 3D interface at the kernel level, because the lessons there are very clear:

There are a bunch of DRM drivers which have support for 2D render engines exposed to userspace. But they all use highly hardware-specific interfaces, fully streamlined for the specific engine. And they all require a decently sized chunk of driver code in userspace to translate from a generic API to the hardware formats. This is what DRM maintainers will recommend you do, if you submit a patch to add a generic 2D acceleration API.

Exactly like a 3D driver.

If All Else Fails, There’s Options

Now if you don’t care about the last bit of performance, and your use-case is limited, and your blitter engine is limited, then there’s already options:

You can take whatever pixel buffer you have, export it as a dma-buf, and then import it into some other subsystem which already has some kind of limited 2D acceleration support. Depending upon your blitter engine, that could be a v4l2 mem2m device, or for simpler things there are also dmaengines.

On top of that, the DRM subsystem does allow you to implement the traditional acceleration methods exposed by the fbdev subsystem, in case you have userspace that really insists on using them; it’s not recommended for anything new.

What about KMS?

The above is kind of a lie, since the KMS (kernel modesetting) IOCTL userspace API is a fairly full-featured 2D rendering interface. The aim of course is to render different pixel buffers onto a screen. With the recently added writeback support, operations targeting memory are now possible. This could be used to expose a traditional blitter, if you only expose writeback support and no other outputs in your KMS driver.

There are a few downsides:

So all together this isn’t the high-speed 2D acceleration API you’re looking for either. It is a valid alternative to the options above though, e.g. instead of a v4l2 mem2m device.

FAQ for the FAQ, or: OpenVG?

OpenVG isn’t the standard you’re looking for either. For one, it’s a userspace API, like OpenGL. All the same reasons for not implementing a generic OpenGL interface at the kernel/userspace boundary apply to OpenVG, too.

Second, the Mesa3D userspace library did support OpenVG once. It didn’t gain traction, and got canned. Just because it calls itself a standard doesn’t make it a widely adopted industry default, unlike OpenGL/Vulkan/DirectX on the 3D side.

Thanks to Dave Airlie and Daniel Stone for reading and commenting on drafts of this text.

August 27, 2018 12:00 AM

August 24, 2018

Linux Plumbers Conference: Performance and Scalability Systems Microconference Accepted into 2018 Linux Plumbers Conference

Core counts keep rising, and that means that the Linux kernel continues to encounter interesting performance and scalability issues. Which is not a bad thing, since it has been fifteen years since the “free lunch” of exponential CPU-clock frequency increases came to an abrupt end. During that time, the number of hardware threads per socket has risen sharply, approaching 100 for some high-end implementations. In addition, there is much more to scaling than simply larger numbers of CPUs.

Proposed topics for this microconference include optimizations for mmap_sem range locking; clearly defining what mmap_sem protects; scalability of page allocation, zone->lock, and lru_lock; swap scalability; variable hotpatching (self-modifying code!); multithreading kernel work; improved workqueue interaction with CPU hotplug events; proper (and optimized) cgroup accounting for workqueue threads; and automatically scaling the threshold values for per-CPU counters.

We are also accepting additional topics. In particular, we are curious to hear about real-world bottlenecks that people are running into, as well as scalability work-in-progress that needs face-to-face discussion.

If you would like to contribute to the discussion, please feel free to contact Daniel (lkmldmj@gmail.com), Pavel (pavel.tatashin@microsoft.com), or Ying (ying.huang@intel.com).

We hope to see you there!

August 24, 2018 05:55 PM

Greg Kroah-Hartman: What stable kernel should I use?

I get a lot of questions from people asking me what stable kernel they should be using for their product/device/laptop/server/etc. Especially given the now-extended length of time that some kernels are being supported by me and others, this isn’t always a very obvious thing to determine. So this post is an attempt to write down my opinions on the matter. Of course, you are free to use whatever kernel version you want, but here’s what I recommend.

As always, the opinions written here are my own, I speak for no one but myself.

What kernel to pick

Here’s my short list of what kernel you should use, ranked from best to worst options. I’ll go into the details of all of these below, but if you just want the summary of all of this, here it is:

Hierarchy of what kernel to use, from best solution to worst:

What kernel to never use:

To give numbers to the above, today, as of August 24, 2018, the front page of kernel.org looks like this:

So, based on the above list that would mean that:

Quite easy, right?

Ok, now for some justification for all of this:

Distribution kernels

The best solution for almost all Linux users is to just use the kernel from your favorite Linux distribution. Personally, I prefer the community based Linux distributions that constantly roll along with the latest updated kernel and it is supported by that developer community. Distributions in this category are Fedora, openSUSE, Arch, Gentoo, CoreOS, and others.

All of these distributions use the latest stable upstream kernel release and make sure that any needed bugfixes are applied on a regular basis. That makes them some of the most solid and best kernels that you can use when it comes to having the latest fixes (remember, all fixes are security fixes) in them.

There are some community distributions that take a bit longer to move to a new kernel release, but eventually get there and support the kernel they currently have quite well. Those are also great to use, and examples of these are Debian and Ubuntu.

Just because I did not list your favorite distro here does not mean its kernel is not good. Look on the web site for the distro and make sure that the kernel package is constantly updated with the latest security patches, and all should be well.

Lots of people seem to like the old, “traditional” model of a distribution and use RHEL, SLES, CentOS or the “LTS” Ubuntu release. Those distros pick a specific kernel version and then camp out on it for years, if not decades. They do loads of work backporting the latest bugfixes and sometimes new features to these kernels, all in a quixotic quest to keep the version number from ever changing, despite having many thousands of changes on top of that older kernel version. This work is a truly thankless job, and the developers assigned to these tasks do some wonderful work in order to achieve these goals. If you like never seeing your kernel version number change, then use these distributions. They usually cost some money to use, but the support you get from these companies is worth it when something goes wrong.

So again, the best kernel you can use is one that someone else supports, and you can turn to for help. Use that support, usually you are already paying for it (for the enterprise distributions), and those companies know what they are doing.

But, if you do not want to trust someone else to manage your kernel for you, or you have hardware that a distribution does not support, then you want to run the Latest stable release:

Latest stable release

This kernel is the latest one from the Linux kernel developer community that they declare as “stable”. About every three months, the community releases a new stable kernel that contains all of the newest hardware support, the latest performance improvements, as well as the latest bugfixes for all parts of the kernel. Over the next 3 months, bugfixes that go into the next kernel release being developed are backported into this stable release, so that any users of this kernel are sure to get them as soon as possible.

This is usually the kernel that most community distributions use as well, so you can be sure it is tested and has a large audience of users. Also, the kernel community (all 4000+ developers) are willing to help support users of this release, as it is the latest one that they made.

After 3 months, a new kernel is released and you should move to it to ensure that you stay up to date, as support for this kernel is usually dropped a few weeks after the newer release happens.

If you have new hardware that was purchased after the last LTS release came out, you are almost guaranteed to have to run this kernel in order to have it supported. So for desktops or new servers, this is usually the recommended kernel to be running.

Latest LTS release

If your hardware relies on a vendor’s out-of-tree patch to work properly (like almost all embedded devices these days), then the next best kernel to use is the latest LTS release. That release gets all of the latest kernel fixes that go into the stable releases where applicable, and lots of users test and use it.

Note, no new features and almost no new hardware support is ever added to these kernels, so if you need to use a new device, it is better to use the latest stable release, not this release.

Also this release is common for users that do not like to worry about “major” upgrades happening on them every 3 months. So they stick to this release and upgrade every year instead, which is a fine practice to follow.

The downside of using this release is that you do not get the performance improvements that happen in newer kernels, except when you update to the next LTS kernel, potentially a year in the future. That could be significant for some workloads, so be very aware of this.

Also, if you have problems with this kernel release, the first thing that any developer you report the issue to will ask is, “does the latest stable release have this problem?” So be aware that support might not be as easy to get as with the latest stable releases.

Now if you are stuck with a large patchset and can not update to a new LTS kernel once a year, perhaps you want the older LTS releases:

Older LTS release

These releases have traditionally been supported by the community for 2 years, sometimes longer when a major distribution relies on them (like Debian or SLES). However, in the past year, thanks to a lot of support and investment in testing and infrastructure from Google, Linaro, Linaro member companies, kernelci.org, and others, these kernels are starting to be supported for much longer.

Here’s the latest LTS releases and how long they will be supported for, as shown at kernel.org/category/releases.html on August 24, 2018:

The reason that Google and other companies want to have these kernels live longer is due to the crazy (some will say broken) development model of almost all SoC chips these days. Those devices start their development lifecycle a few years before the chip is released, however that code is never merged upstream, resulting in a brand new chip being released based on a 2 year old kernel. These SoC trees usually have over 2 million lines added to them, making them something that I have started calling “Linux-like” kernels.

If the LTS releases stop happening after 2 years, then support from the community instantly stops, and no one ends up doing bugfixes for them. This results in millions of very insecure devices floating around in the world, not something that is good for any ecosystem.

Because of this dependency, these companies now require new devices to constantly update to the latest LTS releases as they happen for their specific release version (i.e. every 4.9.y release that happens). One example of this is the Android kernel requirements for new devices shipping for the “O” and now “P” releases, which specified the minimum kernel version allowed; Android security releases might start to require those “.y” releases to happen more frequently on devices.

I will note that some manufacturers are already doing this today. Sony is one great example, updating to the latest 4.4.y release on many of their new phones for their quarterly security release. Another good example is the small company Essential, which has been tracking the 4.4.y releases faster than anyone else that I know of.

There is one huge caveat when using a kernel like this. The number of security fixes that get backported is not as great as with the latest LTS release, because the traditional model of the devices that use these older LTS kernels is a much more reduced usage model. These kernels are not to be used in any type of “general computing” model where you have untrusted users or virtual machines, as the ability to do some of the recent Spectre-type fixes for older releases is greatly reduced, if present at all in some branches.

So again, only use older LTS releases in a device that you fully control, or lock down with a very strong security model (like Android enforces using SELinux and application isolation). Never use these releases on a server with untrusted users, programs, or virtual machines.

Also, support from the community for these older LTS releases is greatly reduced even from the normal LTS releases, if available at all. If you use these kernels, you really are on your own, and need to be able to support the kernel yourself, or rely on your SoC vendor to provide that support for you (note that almost none of them do provide that support, so beware…)

Unmaintained kernel release

Surprisingly, many companies do just grab a random kernel release, slap it into their product and proceed to ship it in hundreds of thousands of units without a second thought. One crazy example of this would be the Lego Mindstorm systems that shipped a random -rc release of a kernel in their device for some unknown reason. A -rc release is a development release that not even the Linux kernel developers feel is ready for everyone to use just yet, let alone millions of users.

You are of course free to do this if you want, but note that you really are on your own here. The community can not support you as no one is watching all kernel versions for specific issues, so you will have to rely on in-house support for everything that could go wrong. Which for some companies and systems, could be just fine, but be aware of the “hidden” cost this might cause if you do not plan for this up front.

Summary

So, here’s a short list of different types of devices, and what I would recommend for their kernels:

And as for me, what do I run on my machines? My laptops run the latest development kernel (i.e. Linus’s development tree) plus whatever kernel changes I am currently working on and my servers run the latest stable release. So despite being in charge of the LTS releases, I don’t run them myself, except in testing systems. I rely on the development and latest stable releases to ensure that my machines are running the fastest and most secure releases that we know how to create at this point in time.

August 24, 2018 04:11 PM

August 22, 2018

Linux Plumbers Conference: RT Microconference Accepted into 2018 Linux Plumbers Conference

We are pleased to announce that the RT Microconference has been accepted into the 2018 Linux Plumbers Conference! The Real-Time patch (also known as PREEMPT_RT) has been developed out of tree since 2004. Although it hasn’t yet been fully merged, several enhancements came to the Linux kernel directly as the result of the RT patch. These include mutexes, high-resolution timers, lockdep, ftrace, RT scheduling, SCHED_DEADLINE, RCU_PREEMPT, cross-arch generic interrupt logic, priority-inheritance futexes, and threaded interrupt handlers, to name a few. All that is left is the conversion of the kernel’s spinning locks into mutexes, and the transformation will be complete. There’s talk about that happening by the end of this year or early next year.

Topics proposed for this year’s event include how PREEMPT_RT will be maintained when it gets into the kernel, who’s going to maintain it, how do we catch when it breaks, updates to lockdep, addition of selftests, discussions of RT related failures, stable backports, safety critical domains, and more.

We hope to see you there!

 

August 22, 2018 09:46 PM

August 20, 2018

Paul E. Mc Kenney: Performance and Scalability Systems Microconference Accepted into 2018 Linux Plumbers Conference

Core counts keep rising, and that means that the Linux kernel continues to encounter interesting performance and scalability issues. Which is not a bad thing, since it has been fifteen years since the “free lunch” of exponential CPU-clock frequency increases came to an abrupt end. During that time, the number of hardware threads per socket has risen sharply, approaching 100 for some high-end implementations. In addition, there is much more to scaling than simply larger numbers of CPUs.

Proposed topics for this microconference include optimizations for mmap_sem range locking; clearly defining what mmap_sem protects; scalability of page allocation, zone->lock, and lru_lock; swap scalability; variable hotpatching (self-modifying code!); multithreading kernel work; improved workqueue interaction with CPU hotplug events; proper (and optimized) cgroup accounting for workqueue threads; and automatically scaling the threshold values for per-CPU counters.

We are also accepting additional topics. In particular, we are curious to hear about real-world bottlenecks that people are running into, as well as scalability work-in-progress that needs face-to-face discussion.

We hope to see you there!

August 20, 2018 09:04 PM

Kees Cook: security things in Linux v4.18

Previously: v4.17.

Linux kernel v4.18 was released last week. Here are details on some of the security things I found interesting:

allocation overflow detection helpers
One of the many ways C can be dangerous to use is that it lacks strong primitives to deal with arithmetic overflow. A developer can’t just wrap a series of calculations in a try/catch block to trap any calculations that might overflow (or underflow). Instead, C will happily wrap values back around, causing all kinds of flaws. Some time ago GCC added a set of single-operation helpers that will efficiently detect overflow, so Rasmus Villemoes suggested implementing these (with fallbacks) in the kernel. While it still requires explicit use by developers, it’s much more fool-proof than doing open-coded type-sensitive bounds checking before every calculation. As a first-use of these routines, Matthew Wilcox created wrappers for common size calculations, mainly for use during memory allocations.

removing open-coded multiplication from memory allocation arguments
A common flaw in the kernel is integer overflow during memory allocation size calculations. As mentioned above, C doesn’t provide much in the way of protection, so it’s on the developer to get it right. In an effort to reduce the frequency of these bugs, and inspired by a couple flaws found by Silvio Cesare, I did a first-pass sweep of the kernel to move from open-coded multiplications during memory allocations into either their 2-factor API counterparts (e.g. kmalloc(a * b, GFP...) -> kmalloc_array(a, b, GFP...)), or to use the new overflow-checking helpers (e.g. vmalloc(a * b) -> vmalloc(array_size(a, b))). There’s still lots more work to be done here, since frequently an allocation size will be calculated earlier in a variable rather than in the allocation arguments, and overflows happen in way more places than just memory allocation. Better yet would be to have exceptions raised on overflows where no wrap-around was expected (e.g. Emese Revfy’s size_overflow GCC plugin).

Variable Length Array removals, part 2
As discussed previously, VLAs continue to get removed from the kernel. For v4.18, we continued to get help from a bunch of lovely folks: Andreas Christoforou, Antoine Tenart, Chris Wilson, Gustavo A. R. Silva, Kyle Spiers, Laura Abbott, Salvatore Mesoraca, Stephan Wahren, Thomas Gleixner, Tobin C. Harding, and Tycho Andersen. Almost all the rest of the VLA removals have been queued for v4.19, but it looks like the very last of them (deep in the crypto subsystem) won’t land until v4.20. I’m so looking forward to being able to add -Wvla globally to the kernel build so we can be free from the classes of flaws that VLAs enable, like stack exhaustion and stack guard page jumping. Eliminating VLAs also simplifies the porting work of the stackleak GCC plugin from grsecurity, since it no longer has to hook and check VLA creation.

Kconfig compiler detection
While not strictly a security thing, Masahiro Yamada made giant improvements to the kernel’s Kconfig subsystem so that kernel build configuration now knows what compiler you’re using (among other things) so that configuration is no longer separate from the compiler features. For example, in the past, one could select CONFIG_CC_STACKPROTECTOR_STRONG even if the compiler didn’t support it, and later the build would fail. Or in other cases, configurations would silently down-grade to what was available, potentially leading to confusing kernel images where the compiler would change the meaning of a configuration. Going forward now, configurations that aren’t available to the compiler will simply be unselectable in Kconfig. This makes configuration much more consistent, though in some cases, it makes it harder to discover why some configuration is missing (e.g. CONFIG_GCC_PLUGINS no longer gives you a hint about needing to install the plugin development packages).

That’s it for now! Please let me know if you think I missed anything. Stay tuned for v4.19; the merge window is open. :)

© 2018, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.
Creative Commons License

August 20, 2018 06:29 PM

Linux Plumbers Conference: Testing and Fuzzing Microconference Accepted into 2018 Linux Plumbers Conference


Testing, fuzzing, and other diagnostics have greatly increased the robustness of the Linux ecosystem, but embarrassing bugs still escape to end users. Furthermore, a million-year bug would happen several tens of times per day across Linux’s installed base (said to number more than 20 billion), so the best we can possibly do is hardly good enough.

The Testing and Fuzzing Microconference intends to raise the bar with further progress on syzbot/syzkaller, distribution/stable testing, kernel continuous integration (https://kernelci.org/), and unit testing (https://media.readthedocs.org/pdf/ktf/latest/ktf.pdf and https://01.org/lkp). The best evidence of progress in these efforts will of course be the plethora of bug reports produced by these and similar tools!

Join us for an important and spirited discussion!

August 20, 2018 03:45 PM

August 17, 2018

Paul E. Mc Kenney: Testing & Fuzzing Microconference Accepted into 2018 Linux Plumbers Conference

Testing, fuzzing, and other diagnostics have greatly increased the robustness of the Linux ecosystem, but embarrassing bugs still escape to end users. Furthermore, a million-year bug would happen several tens of times per day across Linux's installed base (said to number more than 20 billion), so the best we can possibly do is hardly good enough.

The Testing and Fuzzing Microconference intends to raise the bar with further progress on syzbot/syzkaller, distribution/stable testing, kernel continuous integration, and unit testing. The best evidence of progress in these efforts will of course be the plethora of bug reports produced by these and similar tools!

Join us for an important and spirited discussion!

August 17, 2018 07:38 PM

August 11, 2018

Linux Plumbers Conference: Early Registration Ending Soon!

The early registration deadline is August 18, 2018, after which the regular-registration period will begin.  So to save $150, register for the Linux Plumbers Conference before August 18th!

August 11, 2018 06:25 PM

August 08, 2018

Linux Plumbers Conference: Containers Microconference Accepted into 2018 Linux Plumbers Conference

The Containers Micro-conference at Linux Plumbers is the yearly gathering of container runtime developers, kernel developers and container users. It is the one opportunity to have everyone in the same room to both look back at the past year in the container space and discuss the year ahead.

In the past, topics such as use of cgroups by containers, system call filtering and interception (Seccomp), improvements/additions of kernel namespaces, interaction with the Linux Security Modules (AppArmor, SELinux, SMACK), TPM based validation (IMA), mount propagation and mount API changes, uevent isolation, unprivileged filesystem mounts and more have been discussed in this micro-conference.

There will also no doubt be discussions around recovering the performance lost to the recent Spectre and Meltdown mitigations, which in some cases have had a significant impact on container runtimes.

This year’s edition will be combined with what was formerly the Checkpoint-Restart micro-conference. Expect continued discussion about integration of CRIU with the container runtimes, addressing performance issues of checkpoint and restart and possible optimizations, as well as the (in)stability of rarely used kernel ABIs. Another hot new topic will be time namespacing and its use for container snapshotting and migration.

August 08, 2018 04:23 PM

August 03, 2018

Linux Plumbers Conference: BPF Microconference Accepted into 2018 Linux Plumbers Conference

We are pleased to announce that the BPF Microconference has been accepted into the 2018 Linux Plumbers Conference!

BPF (Berkeley Packet Filter) is one of the fastest emerging technologies of the Linux kernel and plays a major role in networking (XDP (eXpress Data Path), tc/BPF, etc), tracing (kprobes, uprobes, tracepoints) and security (seccomp, landlock) thanks to its versatility and efficiency.

BPF has seen a lot of progress since last year’s Plumbers conference and many of the discussed BPF tracing Microconference improvements have been tackled since then such as the introduction of BPF type format (BTF) to name one. This year’s BPF Microconference event focuses on the core BPF infrastructure as well as its subsystems, therefore topics proposed for this year’s event include improving verifier scalability, next steps on BPF type format, dynamic tracing without on the fly compilation, string and loop support, reuse of host JITs for offloads, LRU heuristics and timers, syscall interception, microkernels, and many more.

August 03, 2018 09:17 PM

August 01, 2018

Dave Airlie (blogspot): virgl - exposes GLES3.1/3.2 and GL4.3

I'd had a bit of a break from adding features to virgl while I was working on radv, but recently Google and Collabora have started to invest in virgl as a solution. A number of developers from both companies have joined the project.

This meant trying to get virgl to pass their dEQP suite and adding support for newer GL/GLES feature levels. They also have a goal for the renderer to run on a host GLES implementation, whereas it previously ran only on a host GL implementation.

Over the past few months I've worked with the group to add support for all the necessary features needed for guest GLES3.1 support (for them) and GL4.3 (for me).

The feature list was roughly:
tessellation shaders
fp64 support
ARB_gpu_shader5 support
Shader buffer objects
Shader image objects
Compute shaders
Copy Image
Texture views

With this list implemented we achieved GL4.3 and GLES3.1.

However, Marek@AMD did some work on exposing ASTC for gallium drivers, and with some extra work on EXT_shader_framebuffer_fetch, we now expose GLES3.2.

There was also plenty of work done on avoiding crashes from rogue guests (rewrote the whole feature/capability bit handling), and lots of bug fixes. There are still ongoing fixes to finish the dEQP tests, but it looks like all the feature work should be landed now.

What next?

Well there is one big problem facing virgl in exposing GL 4.4. GL_ARB_buffer_storage requires exposing coherent buffer memory, and with the virgl architecture, we currently don't have a way to map the pages behind a host GL buffer mapping into a guest GL buffer mapping in order to achieve coherency. This is going to require some thought and it may even require exposing some new GL extensions to export a buffer to a dma-buf.

There has also been a GSoC student, Nathan, working on Vulkan/virgl support. He's made some initial progress; however, Vulkan also has requirements on coherent memory, so that tricky problem needs to be solved.

Thanks again to all the contributors to the virgl project.

August 01, 2018 04:26 AM

July 31, 2018

Matthew Garrett: Porting Coreboot to the 51NB X210

The X210 is a strange machine. A set of Chinese enthusiasts developed a series of motherboards that slot into old Thinkpad chassis, providing significantly more up to date hardware. The X210 has a Kabylake CPU, supports up to 32GB of RAM, has an NVMe-capable M.2 slot and has eDP support - and it fits into an X200 or X201 chassis, which means it also comes with a classic Thinkpad keyboard. We ordered some from a Facebook page (a process that involved wiring a large chunk of money to a Chinese bank which wasn't at all stressful), and a couple of weeks later they arrived. Once I'd put mine together I had a quad-core i7-8550U with 16GB of RAM, a 512GB NVMe drive and a 1920x1200 display. I'd transplanted over the drive from my XPS13, so I was running stock Fedora for most of this development process.

The other fun thing about it is that none of the firmware flashing protection is enabled, including Intel Boot Guard. This means running a custom firmware image is possible, and what would a ridiculous custom Thinkpad be without ridiculous custom firmware? A shadow of its potential, that's what. So, I read the Coreboot[1] motherboard porting guide and set to.

My life was made a great deal easier by the existence of a port for the Purism Librem 13v2. This is a Skylake system, and Skylake and Kabylake are very similar platforms. So, the first job was to just copy that into a new directory and start from there. The first step was to update the Inteltool utility so it understood the chipset - this commit shows what was necessary there. It's mostly just adding new PCI IDs, but it also needed some adjustment to account for the GPIO allocation being different on mobile parts when compared to desktop ones. One thing that bit me - Inteltool relies on being able to mmap() arbitrary bits of physical address space, and the kernel doesn't allow that if CONFIG_STRICT_DEVMEM is enabled. I had to disable that first.

The GPIO pins got dropped into gpio.h. I ended up just pushing the raw values into there rather than parsing them back into more semantically meaningful definitions, partly because I don't understand what these things do that well and largely because I'm lazy. Once that was done, on to the next step.

High Definition Audio devices (or HDA) have a standard interface, but the codecs attached to the HDA device vary - both in terms of their own configuration, and in terms of dealing with how the board designer may have laid things out. Thankfully the existing configuration could be copied from /sys/class/sound/card0/hwC0D0/init_pin_configs[2] and then hda_verb.h could be updated.

One more piece of hardware-specific configuration is the Video BIOS Table, or VBT. This contains information used by the graphics drivers (firmware or OS-level) to configure the display correctly, and again is somewhat system-specific. This can be grabbed from /sys/kernel/debug/dri/0/i915_vbt.

A lot of the remaining platform-specific configuration has been split out into board-specific config files. and this also needed updating. Most stuff was the same, but I confirmed the GPE and genx_dec register values by using Inteltool to dump them from the vendor system and copy them over. lspci -t gave me the bus topology and told me which PCIe root ports were in use, and lsusb -t gave me port numbers for USB. That let me update the root port and USB tables.

The final code update required was to tell the OS how to communicate with the embedded controller. Various ACPI functions are actually handled by this autonomous device, but it's still necessary for the OS to know how to obtain information from it. This involves writing some ACPI code, but that's largely a matter of cutting and pasting from the vendor firmware - the EC layout depends on the EC firmware rather than the system firmware, and we weren't planning on changing the EC firmware in any way. Using ifdtool told me that the vendor firmware image wasn't using the EC region of the flash, so my assumption was that the EC had its own firmware stored somewhere else. I was ready to flash.

The first attempt involved isis' machine, using their Beaglebone Black as a flashing device - the lack of protection in the firmware meant we ought to be able to get away with using flashrom directly on the host SPI controller, but using an external flasher meant we stood a better chance of being able to recover if something went wrong. We flashed, plugged in the power and… nothing. Literally. The power LED didn't turn on. The machine was very, very dead.

Things like managing battery charging and status indicators are up to the EC, and the complete absence of anything going on here meant that the EC wasn't running. The most likely reason for that was that the system flash did contain the EC's firmware even though the descriptor said it didn't, and now the system was very unhappy. Worse, the flash wouldn't speak to us any more - the power supply from the Beaglebone to the flash chip was sufficient to power up the EC, and the EC was then holding onto the SPI bus desperately trying to read its firmware. Bother. This was made rather more embarrassing because isis had explicitly raised concern about flashing an image that didn't contain any EC firmware, and now I'd killed their laptop.

After some digging I was able to find EC firmware for a related 51NB system, and looking at that gave me a bunch of strings that seemed reasonably identifiable. Looking at the original vendor ROM showed very similar code located at offset 0x00200000 into the image, so I added a small tool to inject the EC firmware (basing it on an existing tool that does something similar for the EC in some HP laptops). I now had an image that I was reasonably confident would get further, but we couldn't flash it. Next step seemed like it was going to involve desoldering the flash from the board, which is a colossal pain. Time to sleep on the problem.

The next morning we were able to borrow a Dediprog SPI flasher. These are much faster than doing SPI over GPIO lines, and also support running the flash at different voltages. At 3.5V the behaviour was the same as we'd seen the previous night - nothing. According to the datasheet, the flash required at least 2.7V to run, but flashrom listed 1.8V as the next lower voltage so we tried. And, amazingly, it worked - not reliably, but sufficiently. Our hypothesis is that the chip is marginally able to run at that voltage, but that the EC isn't - we were no longer powering the EC up, so we could communicate with the flash. After a couple of attempts we were able to write enough that we had EC firmware on there, at which point we could shift back to flashing at 3.5V because the EC was leaving the flash alone.

So, we flashed again. And, amazingly, we ended up staring at a UEFI shell prompt[3]. USB wasn't working, and nor was the onboard keyboard, but we had graphics and were executing actual firmware code. I was able to get USB working fairly quickly - it turns out that Linux numbers USB ports from 1 and the FSP numbers them from 0, and fixing that up gave us working USB. We were able to boot Linux! Except there were a whole bunch of errors complaining about EC timeouts, and also we only had half the RAM we should.

After some discussion on the Coreboot IRC channel, we figured out the RAM issue - the Librem13 only has one DIMM slot. The FSP expects to be given a set of i2c addresses to probe, one for each DIMM socket. It is then able to read back the DIMM configuration and configure the memory controller appropriately. Running i2cdetect against the system SMBus gave us a range of devices, including one at 0x50 and one at 0x52. The detected DIMM was at 0x50, which made 0x52 seem like a reasonable bet - and grepping the tree showed that several other systems used 0x52 as the address for their second socket. Adding that to the list of addresses and passing it to the FSP gave us all our RAM.

So, now we just had to deal with the EC. One thing we noticed was that if we flashed the vendor firmware, ran it, flashed Coreboot and then rebooted without cutting the power, the EC worked. This strongly suggested that there was some setup code happening in the vendor firmware that configured the EC appropriately, and if we duplicated that it would probably work. Unfortunately, figuring out what that code was was difficult. I ended up dumping the PCI device configuration for the vendor firmware and for Coreboot in case that would give us any clues, but the only thing that seemed relevant at all was that the LPC controller was configured to pass io ports 0x4e and 0x4f to the LPC bus with the vendor firmware, but not with Coreboot. Unfortunately the EC was supposed to be listening on 0x62 and 0x66, so this wasn't the problem.

I ended up solving this by using UEFITool to extract all the code from the vendor firmware, and then disassembled every object and grepped them for port io. x86 systems have two separate io buses - memory and port IO. Port IO is well suited to simple devices that don't need a lot of bandwidth, and the EC is definitely one of these - there's no way to talk to it other than using port IO, so any configuration was almost certainly happening that way. I found a whole bunch of stuff that touched the EC, but was clearly depending on it already having been enabled. I found a wide range of cases where port IO was being used for early PCI configuration. And, finally, I found some code that reconfigured the LPC bridge to route 0x4e and 0x4f to the LPC bus (explaining the configuration change I'd seen earlier), and then wrote a bunch of values to those addresses. I mimicked those, and suddenly the EC started responding.

It turns out that the writes that made this work weren't terribly magic. PCs used to have a SuperIO chip that provided most of the legacy port functionality, including the floppy drive controller and parallel and serial ports. Individual components (called logical devices, or LDNs) could be enabled and disabled using a sequence of writes that was fairly consistent between vendors. Someone on the Coreboot IRC channel recognised that the writes that enabled the EC were simply using that protocol to enable a series of LDNs, which apparently correspond to things like "Working EC" and "Working keyboard". And with that, we were done.
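The index/data protocol in question can be sketched as follows. This is a hypothetical illustration, not the X210's actual sequence: the register numbers follow the common SuperIO convention (0x07 selects the LDN, 0x30 activates it), and the writes are modelled as a recorded list of (port, value) pairs rather than real outb() calls, since the real unlock values and LDN numbers are whatever the vendor firmware writes to 0x4e/0x4f.

```c
#include <assert.h>
#include <stddef.h>

struct io_write {
	unsigned short port;
	unsigned char value;
};

/* Build the write sequence that activates one SuperIO logical device. */
static size_t enable_ldn(unsigned short index_port, unsigned char ldn,
			 struct io_write *seq)
{
	unsigned short data_port = index_port + 1;	/* 0x4e/0x4f style pair */
	size_t n = 0;

	seq[n++] = (struct io_write){ index_port, 0x07 };	/* LDN select register */
	seq[n++] = (struct io_write){ data_port, ldn };		/* choose the device */
	seq[n++] = (struct io_write){ index_port, 0x30 };	/* activate register */
	seq[n++] = (struct io_write){ data_port, 0x01 };	/* enable it */
	return n;
}
```

Because the protocol is fairly consistent between vendors, recognising this four-write pattern in a disassembly is often enough to identify SuperIO-style device enablement, which is exactly what happened on IRC.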

Coreboot doesn't currently have ACPI support for the latest Intel graphics chipsets, so right now my image doesn't have working backlight control. Backlight control also turned out to be interesting. Most modern Intel systems handle the backlight via registers in the GPU, but the X210 uses the embedded controller (possibly because it supports both LVDS and eDP panels). This means that adding a simple display stub is sufficient - all we have to do on a backlight set request is store the value in the EC, and it does the rest.

Other than that, everything seems to work (although there's probably a bunch of power management optimisation to do). I started this process knowing almost nothing about Coreboot, but thanks to the help of people on IRC I was able to get things working in about two days of work[4] and now have firmware that's about as custom as my laptop.

[1] Why not Libreboot? Because modern Intel SoCs haven't had their memory initialisation code reverse engineered, so the only way to boot them is to use the proprietary Intel Firmware Support Package.
[2] Card 0, device 0
[3] After a few false starts - it turns out that the initial memory training can take a surprisingly long time, and we kept giving up before that had happened
[4] Spread over 5 or so days of real time

comment count unavailable comments

July 31, 2018 08:44 AM

July 29, 2018

Pavel Machek: Pretty big side-effect

Timing and side-channels are not normally considered side-effects, meaning compilers and CPUs feel free to do whatever they want. And they do. Unfortunately, I consider leaking my passwords to remote attackers a pretty significant side-effect... Imagine a simple function.

void handle(char secret) {}
That's obviously safe, right? And now
void handle(char secret) { int i; for (i=0; i<secret*1000000; i++) ; }
That's obviously a bad idea, because now the secret is exposed via timing. Now, that used to be the only side channel for a while, but then caches were invented. These days,
static char font[16*256]; void handle(char secret) { font[secret*16]; }
may be a bad idea. But C has not changed: it knows nothing about caches, and nothing about side-channels. Caches are old news. But today we have complex branch predictors, and speculative execution. It is called Spectre. This is a bad idea:
static char font[16*256]; void handle(char secret) { if (0) font[secret*16]; }
as is this:
static char small[16], big[256]; void foo(int untrusted) { if (untrusted<16) big[small[untrusted]]; }
A CPU bug... unfortunately it is tricky to fix, and the bug only affects caches / timing, so it "does not exist" as far as C is concerned. The canonical fix is something like
static char small[16], big[256]; void foo(int untrusted) { if (untrusted<16) { asm volatile("lfence"); big[small[untrusted]]; }}
which is okay as long as the compiler compiles it the obvious way. But again, the compiler knows nothing about caches / side channels, so it may do something unexpected and re-introduce the bug. Unfortunately, it seems that there's not even agreement whose bug it is. Is it time C was extended to know about side-channels? What about
void handle(int please_do_not_leak_this secret) {}
? Do we need new language to handle modern (speculative, multi-core, fast, side-channels all around) CPUs?
(Now, you may say that it is impossible to eliminate all the side-channels. I believe eliminating most of them is well possible, if we are willing to live with ... huge slowdown. You can store each variable twice, to at least detect Rowhammer. Caches can still be disabled -- the row buffer in DRAM will be more problematic, and if you disable hyperthreading and make every second instruction lfence, you can get rid of Spectre-like problems. You may get 100x? 1000x? slowdown, but if that's applied only to software that needs it, it may be acceptable. You probably want your editors & mail readers protected. You probably don't need gcc to be protected. No, running a modern web browser does not look sustainable.)
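Besides lfence, there is a branchless alternative worth mentioning: clamp the index arithmetically so that even a mispredicted bounds check cannot form an out-of-bounds address. This is the idea behind the kernel's array_index_nospec(); the sketch below is my own illustration of the principle, not the kernel's exact implementation (it also assumes the compiler's arithmetic right shift of a negative value, which GCC and Clang provide).

```c
#include <assert.h>
#include <stddef.h>

/* Clamp index to 0 when index >= size, without a branch the CPU could
 * mispredict. For index < size (and size below LONG_MAX), index - size is
 * negative when viewed as signed; an arithmetic right shift smears the sign
 * bit into an all-ones mask. Otherwise the mask is zero. */
static size_t clamp_index(size_t index, size_t size)
{
	long diff = (long)(index - size);
	size_t mask = (size_t)(diff >> (sizeof(long) * 8 - 1));

	return index & mask;
}
```

Used as `big[small[clamp_index(untrusted, 16)]]`, an out-of-bounds untrusted value collapses to index 0 in all executions, speculative or not - but the same caveat applies: a compiler that knows nothing about side-channels is free to transform this back into a branch.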

July 29, 2018 06:20 AM

July 24, 2018

Linux Plumbers Conference: RDMA Microconference Accepted into 2018 Linux Plumbers Conference

We are pleased to announce that the RDMA Microconference has been accepted into the 2018 Linux Plumbers Conference!

RDMA (remote direct memory access) is a well-established technology that is used in environments requiring both maximum throughputs and minimum latencies. For a long time, this technology was used primarily in high-performance computing, high frequency trading, and supercomputing. For example, the three most powerful computers are based on Linux and RDMA (in the guise of Infiniband).

However, the latest trends in cloud computing (more bandwidth at larger scales) and storage (more IOPS) make RDMA increasingly important outside of its initial niches. Therefore, clean integration between RDMA and various kernel subsystems is paramount. We are thus looking to build on previous years’ successful RDMA microconferences, this year discussing our 2018-2019 plans and roadmap.

Topics proposed for this year’s event include the interaction between RDMA and DAX (direct access for files), how to solve the get_user_pages() problem (see https://lwn.net/Articles/753027/ and https://lwn.net/Articles/753272/), IOMMU and PCI-E issues, continuous integration, python integration, and Syzkaller testing.

July 24, 2018 05:40 PM

July 23, 2018

Paul E. Mc Kenney: RDMA Microconference Accepted into 2018 Linux Plumbers Conference

We are pleased to announce that the RDMA Microconference has been accepted into the 2018 Linux Plumbers Conference!

RDMA (remote direct memory access) is a well-established technology that is used in environments requiring both maximum throughputs and minimum latencies. For a long time, this technology was used primarily in high-performance computing, high frequency trading, and supercomputing. For example, the three most powerful computers are based on Linux and RDMA (in the guise of Infiniband).

However, the latest trends in cloud computing (more bandwidth at larger scales) and storage (more IOPS) make RDMA increasingly important outside of its initial niches. Therefore, clean integration between RDMA and various kernel subsystems is paramount. We are thus looking to build on previous years' successful RDMA microconferences, this year discussing our 2018-2019 plans and roadmap.

Topics proposed for this year's event include the interaction between RDMA and DAX (direct access for files), how to solve the get_user_pages() problem (see https://lwn.net/Articles/753027/ and https://lwn.net/Articles/753272/), IOMMU and PCI-E issues, continuous integration, python integration, and Syzkaller testing.

July 23, 2018 08:07 PM

Linux Plumbers Conference: Two-day Networking Track added to LPC

A two-day Networking Track will be featured at this year’s Linux Plumbers Conference; it will run the first two days of LPC, November 13-14. The track will consist of a series of talks, including a keynote from David Miller: “This talk is not about XDP: From Resource Limits to SKB Lists”. Talk proposals on a variety of networking topics are now under consideration; that page will be updated with the accepted talks soon. The Networking Track will be open to all LPC attendees.

LPC will be held in Vancouver, British Columbia, Canada from Tuesday, November 13 through Thursday, November 15. We look forward to the Networking Track as well as the rest of the LPC content (microconferences, Kernel Summit Track, refereed talks, and BoFs) and hope to see you there.

July 23, 2018 06:32 PM

Paul E. Mc Kenney: Verification Challenge 7: Heavy Modifications to Linux-Kernel Tree RCU

There was a time when I felt that Linux-kernel RCU was too low-level to possibly be the subject of a security exploit, but Rowhammer put paid to that naive notion. And it finally happened earlier this year. Now, I could claim that I did nothing wrong. After all, RCU worked as advertised. The issue was instead that RCU has multiple flavors:



  1. RCU-bh for code that is subject to network-based denial-of-service attacks.
  2. RCU-sched for code that must interact with interrupt/NMI handlers or with preemption-disabled regions of code, and for general-purpose use in CONFIG_PREEMPT=n kernels.
  3. RCU-preempt for general-purpose use in CONFIG_PREEMPT=y kernels.


The real problem was that someone used one flavor in one part of their RCU algorithm, and another flavor in another part. This has roughly the same effect on your kernel's health and well-being as does acquiring the wrong lock. And, as luck would have it, the resulting bug proved to be exploitable. To his credit, Linus Torvalds noted that having multiple RCU flavors was a root cause, and so he asked that I do something to prevent future similar security-exploitable confusion. After some discussion, it was decided that I try to merge the three flavors of RCU into “one flavor to rule them all”.

Which I have now done in the relative privacy of my -rcu git tree (as in “git clone https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git” followed by “git checkout dev”).

So what has this got to do with validation in general or formal verification in particular?

Just this: Over the past few months, I have taken a meataxe to Linux-kernel RCU, which implies the injection of any number of bugs. If you would like your formal-verification tool/methodology to be the first to find a bug in Linux-kernel RCU that I don't already know about, this would be an excellent time to give it a try. And yes, all those qualifiers are necessary, as several groups have used formal-verification tools to find bugs in Linux-kernel RCU that I did already know about.

More generally, given the large number of swings I took with said meataxe, if your formal verification tool cannot find bugs in the current dev version of RCU, you might need to entertain the possibility that your formal verification tool cannot find bugs!

July 23, 2018 04:18 PM

July 16, 2018

Pete Zaitcev: Finally a use for code 451

Saw today at a respectable news site, which does not even nag about adblock:

451

We recognise you are attempting to access this website from a country belonging to the European Economic Area (EEA) including the EU which enforces the General Data Protection Regulation (GDPR) and therefore cannot grant you access at this time. For any issues, e-mail us at xxxxxxxx@xxxxxx.com or call us at xxx-xxx-4000.

What a way to brighten one's day. The phone without a country code is a cherry on top.

P.S. The only fly in this ointment is, I wasn't accessing it from the GDPR area. It was a geolocation failure.

July 16, 2018 12:46 AM

July 15, 2018

James Bottomley: Measuring the Horizontal Attack Profile of Nabla Containers

One of the biggest problems with the current debate about Container vs Hypervisor security is that no-one has actually developed a way of measuring security, so the debate is all in qualitative terms (hypervisors “feel” more secure than containers because of the interface breadth) but no-one actually has done a quantitative comparison.  The purpose of this blog post is to move the debate forwards by suggesting a quantitative methodology for measuring the Horizontal Attack Profile (HAP).  For more details about Attack Profiles, see this blog post.  I don’t expect this will be the final word in the debate, but by describing how we did it I hope others can develop quantitative measurements as well.

We'll begin by looking at the Nabla technology through the relatively uncontroversial metric of performance.  In most security debates, it’s acceptable that some performance is lost by securing the application.  As a rule of thumb, placing an application in a hypervisor loses anywhere between 10-30% of the native performance.  Our goal here is to show that, for a variety of web tasks, the Nabla containers mechanism has an acceptable performance penalty.

Performance Measurements

We took some standard benchmarks: redis-bench-set, redis-bench-get, python-tornado and node-express and in the latter two we loaded up the web servers with simple external transactional clients.  We then performed the same test for docker, gVisor, Kata Containers (as our benchmark for hypervisor containment) and nabla.  In all the figures, higher is better (meaning more throughput):

The red Docker measure is included to show the benchmark.  As expected, the Kata Containers measure is around 10-30% down on the docker one in each case because of the hypervisor penalty.  However, in each case the Nabla performance is the same or higher than the Kata one, showing we pay less performance overhead for our security.  A final note is that since the benchmarks are network ones, there’s somewhat of a penalty paid by the userspace networking stack (which nabla necessarily has) for plugging into the docker network, so we show two values, one for the bridging plug in (nabla-containers) required to orchestrate nabla with kubernetes and one as a direct connection (nabla-raw) showing where the performance would be without the network penalty.

One final note is that, as expected, gVisor sucks because ptrace is a really inefficient way of connecting the syscalls to the sandbox.  However, it is more surprising that gVisor-kvm (where the sandbox connects to the system calls of the container using hypercalls instead) is also pretty lacking in performance.  I speculate this is likely because hypercalls exact their own penalty and hypervisors usually try to minimise them, which using them to replace system calls really doesn’t do.

HAP Measurement Methodology

The Quantitative approach to measuring the Horizontal Attack Profile (HAP) says that we take the bug density of the Linux Kernel code  and multiply it by the amount of unique code traversed by the running system after it has reached a steady state (meaning that it doesn’t appear to be traversing any new kernel paths). For the sake of this method, we assume the bug density to be uniform and thus the HAP is approximated by the amount of code traversed in the steady state.  Measuring this for a running system is another matter entirely, but, fortunately, the kernel has a mechanism called ftrace which can be used to provide a trace of all of the functions called by a given userspace process and thus gives a reasonable approximation of the number of lines of code traversed (note this is an approximation because we measure the total number of lines in the function taking no account of internal code flow, primarily because ftrace doesn’t give that much detail).  Additionally, this methodology works very well for containers where all of the control flow emanates from a well known group of processes via the system call information, but it works less well for hypervisors where, in addition to the direct hypercall interface, you also have to add traces from the back end daemons (like the kvm vhost kernel threads or dom0 in the case of Xen).
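The aggregation step of this methodology can be sketched in a few lines. This is a toy illustration of the approximation described above, not the actual measurement tooling: the HAP proxy is the sum of the line counts of the *unique* functions seen in an ftrace log once the system is in steady state (function names and line counts below are made up).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct traced_fn {
	const char *name;	/* function name from the ftrace log */
	int lines;		/* total lines in that function */
};

/* Sum line counts over unique function names (O(n^2) dedup is fine for a sketch). */
static int hap_lines(const struct traced_fn *trace, size_t n)
{
	int total = 0;

	for (size_t i = 0; i < n; i++) {
		int seen = 0;

		for (size_t j = 0; j < i; j++)
			if (strcmp(trace[i].name, trace[j].name) == 0) {
				seen = 1;
				break;
			}
		if (!seen)
			total += trace[i].lines;	/* count each function once */
	}
	return total;
}
```

Note this deliberately counts whole functions rather than executed lines, matching the stated approximation that ftrace reports function entries but not internal control flow.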

HAP Results

The results are for the same set of tests as the performance ones except that this time we measure the amount of code traversed in the host kernel:

As stated in our methodology, the height of the bar should be directly proportional to the HAP where lower is obviously better.  On these results we can say that in all cases the Nabla runtime tender actually has a better HAP than the hypervisor contained Kata technology, meaning that we’ve achieved a container system with better HAP (i.e. more secure) than hypervisors.

Some of the other results in this set also bear discussing.  For instance the Docker result certainly isn’t 10x the Kata result as a naive analysis would suggest.  In fact, the containment provided by docker looks to be only marginally worse than that provided by the hypervisor.  Given all the hoopla about hypervisors being much more secure than containers this result looks surprising but you have to consider what’s going on: what we’re measuring in the docker case is the system call penetration of normal execution of the systems.  Clearly anything malicious could explode this result by exercising all sorts of system calls that the application doesn’t normally use.  However, this does show clearly that a docker container with a well crafted seccomp profile (which blocks unexpected system calls) provides roughly equivalent security to a hypervisor.
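To make “well crafted seccomp profile” concrete: docker accepts a JSON profile via --security-opt seccomp=..., and a deny-by-default allowlist could be generated along these lines (the syscall list here is purely illustrative, not a vetted profile for any real workload):

```python
import json

# Hedged sketch: build a minimal Docker seccomp profile that denies
# every syscall by default and allowlists only what the application
# is known to use.  The syscall list below is illustrative only.
ALLOWED = ["read", "write", "openat", "close", "mmap",
           "brk", "exit_group", "futex", "epoll_wait"]

def make_profile(allowed):
    return {
        "defaultAction": "SCMP_ACT_ERRNO",   # deny everything else
        "architectures": ["SCMP_ARCH_X86_64"],
        "syscalls": [{
            "names": sorted(allowed),
            "action": "SCMP_ACT_ALLOW",
        }],
    }

if __name__ == "__main__":
    # Written out to a file, this would be passed as:
    #   docker run --security-opt seccomp=profile.json ...
    print(json.dumps(make_profile(ALLOWED), indent=2))
```

The narrower this allowlist, the closer the docker HAP gets to the hypervisor case in the measurements above.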

The other surprising result is that, in spite of its claims to reduce exposure to Linux system calls, gVisor is actually either equivalent to the docker case or, for the python tornado test, significantly worse.  This too is explicable in terms of what’s going on under the covers: gVisor tries to improve containment by rewriting the Linux system call interface in Go.  However, no-one has paid any attention to the number of system calls the Go runtime itself uses, which is what these results are really showing.  Thus, while gVisor doesn’t currently achieve any containment improvement on this methodology, it’s not impossible that a future version of the Go runtime could be far less profligate in its use of system calls, perhaps by developing a Secure Go using the same methodology we used to develop Nabla.

Conclusions

On both tests, Nabla is far and away the best containment technology for secure workloads given that it sacrifices the least performance over docker to achieve the containment and, on the published results, is 2x more secure even than using hypervisor based containment.

Hopefully these results show that it is perfectly possible to have containers that are more secure than hypervisors and lay to rest, finally, the arguments about which is the more secure technology.  The next step, of course, is establishing the full extent of exposure to a malicious application and to do that, some type of fuzz testing needs to be employed.  Unfortunately, right at the moment, gVisor simply crashes when subjected to fuzz testing, so it needs to become more robust before realistic measurements can be taken.

July 15, 2018 05:54 AM

James Bottomley: A New Method of Containment: IBM Nabla Containers

In the previous post about Containers and Cloud Security, I noted that most of the tenants of a Cloud Service Provider (CSP) could safely not worry about the Horizontal Attack Profile (HAP) and leave the CSP to manage the risk.  However, there is a small category of jobs (mostly in the financial and allied industries) where the damage done by a Horizontal Breach of the container cannot be adequately compensated by contractual remedies.  For these cases, a team at IBM research has been looking at ways of reducing the HAP with a view to making containers more secure than hypervisors.  For the impatient, the full open source release of the Nabla Containers technology is here and here, but for the more patient, let me explain what we did and why.  We’ll have a follow on post about the measurement methodology for the HAP and how we proved better containment than even hypervisor solutions.

The essence of the quest is a sandbox that emulates the interface between the runtime and the kernel (usually dubbed the syscall interface) with as little code as possible and a very narrow interface into the kernel itself.

The Basics: Looking for Better Containment

The HAP attack worry with standard containers is shown on the left: a malicious application can breach the containment wall and attack an innocent application.  This attack is thought to be facilitated by the breadth of the syscall interface in standard containers, so the guiding star in developing Nabla Containers was a methodology for measuring the reduction in the HAP (and hence the improvement in containment).  The initial impetus, though, came from the observation that unikernel systems are nicely modular in the libOS approach, can be used to emulate system calls and, thanks to rumprun, support a wide set of modern web friendly languages (like python, node.js and go) with a fairly thin glue layer.  Additionally they have a fairly narrow set of hypercalls that are actually used in practice (meaning they can be made more secure than conventional hypervisors).  Code coverage measurements of standard unikernel based kvm images confirmed that they did indeed use a far narrower interface.

Replacing the Hypervisor Interface

One of the main elements of the hypervisor interface is the transition from a less privileged guest kernel to a more privileged host one via hypercalls and vmexits.  These CPU mediated events are actually quite expensive, certainly a lot more expensive than a simple system call, which merely involves a change of privilege level without the cost of a full world switch.  It turns out that the unikernel based kvm interface is really only nine hypercalls, all of which are capable of being rewritten as syscalls, so the approach to running this new sandbox as a container is to do this rewrite and seccomp restrict the interface to only what the rewritten unikernel runtime actually needs (meaning that the seccomp profile is now CSP enforced).  This vision, by the way, of a broad runtime above being mediated down to a narrow interface is where the name Nabla comes from: the symbol for Nabla is an inverted triangle (∇), broad at the top and narrowing to a point at the base.

Using this formulation means that the nabla runtime (or nabla tender) can be run as a single process within a standard container and the narrowness of the interface to the host kernel prevents most of the attacks that a malicious application would be able to perform.

DevOps and the ParaVirt conundrum

Back at the dawn of virtualization, there were arguments between Xen and VMware over whether a hypervisor should be fully virtual (capable of running any system supported by the virtual hardware description) or paravirtual (the system had to be modified to run on the virtualization system and would thus be incapable of running on physical hardware).  Today, thanks in large part to CPU support for virtualization primitives, fully paravirtual systems have long since gone the way of the dodo and everyone nowadays expects any OS running on a hypervisor to be capable of running on physical hardware[1].  The death of paravirt also left the industry with an aversion to ever reviving it, which explains why most sandbox containment systems (gVisor, Kata) try to require no modifications to the image.

With DevOps, the requirement is that images be immutable and that to change an image you must take it through the full develop, build, test, deploy cycle.  This development centric view means that, provided there’s no impact to the images you use as the basis for your development, you can easily craft your final image to suit the deployment environment, which means a step like linking with the nabla tender is very easy.  Essentially, this comes down to whether you take the Dev (we can rebuild to suit the environment) or the Ops (the deployment environment needs to accept arbitrary images) view.  Most solutions take the Ops view because of the anti-paravirt bias; for the Nabla tender, we take the Dev view, which is borne out by the performance figures.

Conclusion

Like most sandbox models, the Nabla containers approach is an alternative to namespacing for containment, but it still requires cgroups for resource management.  The figures show that the containment HAP is actually better than that achieved with a hypervisor and the performance, while being marginally less than a namespaced container, is greater than that obtained by running a container inside a hypervisor.  Thus we conclude that for tenants who have a real need for HAP reduction, this is a viable technology.

July 15, 2018 05:54 AM

July 12, 2018

Pete Zaitcev: Guido van Rossum steps down

See a mailing list message:

I would like to remove myself entirely from the decision process. // I am not going to appoint a successor.

July 12, 2018 06:01 PM

June 29, 2018

Pete Zaitcev: The Proprietary Mind

Regarding the Huston missive, two quotes jumped at me the most. The first is just beautiful:

It may be slightly more disconcerting to realise that your electronic wallet is on a device that is using a massive compilation of open source software of largely unknown origin [...]

Yeah, baby. This moldy canard is still operational.

The second is from the narrative of the smartphone revolution:

Apple’s iPhone, released in 2007, was a revolutionary device. [...] Apple’s early lead was rapidly emulated by Windows and Nokia with their own offerings. Google’s position was more as an active disruptor, using an open licensing framework for the Android platform [...]

Again, it's not like he's actually lying. He merely implies heavily that Nokia came next. I don't think the Nokia blunder even deserves a footnote, but to Huston, Google was too open. Google, Carl!

June 29, 2018 12:58 PM

June 26, 2018

James Morris: Linux Security Summit North America 2018: Schedule Published

The schedule for the Linux Security Summit North America (LSS-NA) 2018 is now published.

Highlights include:

and much more!

LSS-NA 2018 will be co-located with the Open Source Summit, and held over 27th-28th August, in Vancouver, Canada.  The attendance fee is $100 USD.  Register here.

See you there!

June 26, 2018 09:11 PM

June 25, 2018

Vegard Nossum: Compiler fuzzing, part 1

Much has been written about fuzzing compilers already, but there is not a lot that I could find about fuzzing compilers using more modern fuzzing techniques where coverage information is fed back into the fuzzer to find more bugs.

If you know me at all, you know I'll throw anything I can get my hands on at AFL. So I tried gcc. (And clang, and rustc -- but more about Rust in a later post.)

Levels of fuzzing


First let me summarise a post by John Regehr called Levels of Fuzzing, which my approach builds heavily on. Regehr presents a very important idea (which stems from earlier research/papers by others), namely that fuzzing can operate at different "levels". These levels correspond somewhat loosely to the different stages of compilation, i.e. lexing, parsing, type checking, code generation, and optimisation. In terms of fuzzing, the source code that you pass to the compiler has to "pass" one stage before it can enter the next; if you give the compiler a completely random binary file, it is unlikely to even get past the lexing stage, never mind to the point where the compiler is actually generating code. So it is in our interest (assuming we want to fuzz more than just the lexer) to generate test cases more intelligently than just using random binary data.

If we simply try to compile random data, we're not going to get very far.
 
In a "naïve" approach, we simply compile gcc with AFL instrumentation and run afl-fuzz on it as usual. If we give it a reasonable corpus of existing C code, it is possible that the fuzzer will find something interesting by randomly mutating the test cases. But more likely than not, it will mostly end up with random garbage like what we see above, and never actually progress to more interesting stages of compilation. I did try this -- and the results were as expected. It takes a long time before the fuzzer hits anything interesting at all. Now, Sami Liedes did this with clang back in 2014 and obtained some impressive results ("34 distinct assertion failures in the first 11 hours"). So clearly it was possible to find bugs in this way. When I tried this myself for GCC, I did not find a single crash within a day or so of fuzzing. And looking at the queue of distinct test cases it had found, it was very clear that it was merely scratching the very outermost surface of the input handling in the compiler -- it was not able to produce a single program that would make it past the parsing stage.

AFL has a few built-in mutation strategies: bit flips, "byte flips", arithmetic on byte, 2-byte, and 4-byte values, insertion of common boundary values (like 0, 1, powers of 2, -1, etc.), insertion of and substitution by "dictionary strings" (basically user-provided lists of strings), along with random splicing of test cases. We can already sort of guess that most of these strategies will not be useful for C and C++ source code. Perhaps the "dictionary strings" strategy is the most promising for source code, as it allows you to insert keywords and snippets of code that have at least some chance of ending up as a valid program. For the other strategies, single bit flips can change variable names, but changing variable names is not that interesting unless you change one variable into another (which both have to exist, as otherwise you would hit a trivial "undeclared" error). They can also create expressions, but if you somehow managed to change a 'h' into a '(', source code with this mutation would always fail unless you also inserted a ')' somewhere else to balance the expression. Source code has a lot of these "correspondences" where changing one thing also requires changing another thing somewhere else in the program if you want it to still compile (even though you don't generate an equivalent program -- that's not what we're trying to do here). Variable uses match up with variable declarations. Parentheses, braces, and brackets must all match up (and in the right order too!).
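A toy rendering of a few of these strategies (helper names are mine, not AFL's) shows why they operate well below the level of C syntax:

```python
import random

# Toy versions of a few AFL-style mutation strategies (the helper
# names are illustrative, not AFL's).  Real AFL applies these to raw
# bytes with no knowledge of C syntax, which is exactly why they
# rarely produce compilable source.

INTERESTING = [0, 1, 2, 4, 8, 16, 32, 64, 128, 255]  # boundary values

def bit_flip(data: bytearray, rng):
    i = rng.randrange(len(data) * 8)
    data[i // 8] ^= 1 << (i % 8)

def insert_interesting(data: bytearray, rng):
    data.insert(rng.randrange(len(data) + 1), rng.choice(INTERESTING))

def dictionary_splice(data: bytearray, rng, dictionary):
    tok = rng.choice(dictionary)
    pos = rng.randrange(len(data) + 1)
    data[pos:pos] = tok

rng = random.Random(0)
buf = bytearray(b"int main(void) { return 0; }")
dictionary = [b"while", b"sizeof", b"(", b")"]
for mutate in (bit_flip, insert_interesting):
    mutate(buf, rng)
dictionary_splice(buf, rng, dictionary)
```

Only the dictionary splice has any real chance of landing a token in a place where it keeps the program parseable.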

These "correspondences" remind me a lot of CRCs and checksums in other file formats, and they give the fuzzer problems for the exact same reason: without extra code it's hard to overcome having to change the test case simultaneously in two or more places, never mind making the exact change that will preserve the relationship between these two values. It's a game of combinatorics; the more things we have to change at once and the more possibilities we have for those changes, the harder it will be to get that exact combination when you're working completely at random. For checksums the answer is easy, and there are two very good strategies: either you disable the checksum verification in the code you're fuzzing, or you write a small wrapper to "fix up" your test case so that the checksum always matches the data it protects (of course, after mutating an input you may not really know where in the file the checksum will be located anymore, but that's a different problem).
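For a format that ends in a CRC32 of its payload, the fix-up wrapper strategy is a few lines (the file format here is invented purely for illustration):

```python
import struct
import zlib

# Hedged sketch of the checksum "fix-up" wrapper described above, for
# an imaginary file format: <payload><crc32-of-payload (LE u32)>.
# After the fuzzer mutates the payload, we recompute the trailing
# checksum so the target's verification always passes.

def fix_up(blob: bytes) -> bytes:
    payload = blob[:-4]
    return payload + struct.pack("<I", zlib.crc32(payload))

def verify(blob: bytes) -> bool:
    payload, (crc,) = blob[:-4], struct.unpack("<I", blob[-4:])
    return zlib.crc32(payload) == crc

original = fix_up(b"hello world" + b"\0\0\0\0")
mutated = b"jello world" + original[-4:]   # fuzzer flipped a byte
assert not verify(mutated)                 # checksum now fails...
assert verify(fix_up(mutated))             # ...until we fix it up
```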

For C and C++ source code it's not so obvious how to help the fuzzer overcome this. You can of course generate programs with a grammar (and some heuristics), which is what several C random code generators such as Csmith, ccg, and yarpgen do. This is in a sense on the completely opposite side of the spectrum when it comes to the levels of fuzzing. By generating programs that you know are completely valid (and correct, and free of undefined behaviour), you will breeze through the lexing, the parsing, and the type checking and target the code generation and optimization stages. This is what Regehr et al. did in "Taming compiler fuzzers", another very interesting read. (Their approach does not include instrumentation feedback, however, so it is more of a traditional black-box fuzzing approach than AFL, which is considered grey-box fuzzing.)

But if you use a C++ grammar to generate C++ programs, that will also exclude a lot of inputs that are not valid but are nevertheless accepted by the compiler. This approach relies on our ability to express all programs that should be valid, but there may also be non-valid programs that crash the compiler. As an example, if our generator knows that you cannot add an integer to a function, or assign a value to a constant, then the code paths checking for those conditions in the compiler would never be exercised, despite the fact that those errors are more interesting than mere syntax errors. In other words, there is a whole range of "interesting" test cases which we will never be able to generate if we restrict ourselves only to programs that are actually valid code.

Please note that I am not saying that one approach is better than the other! I believe we need all of them to successfully find bugs in all the areas of the compiler. By realising exactly what the limits of each method are, we can try to find other ways to fill the gaps.

Fuzzing with a loose grammar


So how can we fill the gap between the shallow syntax errors in the front end and the very deep code generation bugs in the back end? There are several things we can do.

The main feature of my solution is to use a "loose" grammar. As opposed to a "strict" grammar which would follow the C/C++ specs to the dot, the loose grammar really only has one type of symbol, and all the production rules in the grammar create this type of symbol. As a simple example, a traditional C grammar will not allow you to put a statement where an expression is expected, whereas the loose grammar has no restrictions on that. It does, however, take care that your parentheses and braces match up. My grammar file therefore looks something like this (also see the full grammar if you're curious!):
"[void] [f] []([]) { [] }"
"[]; []"
"{ [] }"
"[0] + [0]"
...
Here, anything between "[" and "]" (call it a placeholder) can be substituted by any other line from the grammar file. An evolution of a program could therefore plausibly look like this:
void f () { }            // using the "[void] [f] []([]) { [] }" rule
void f () { ; }          // using the "[]; []" rule
void f () { 0 + 0; }     // using the "[0] + [0]" rule
void f ({ }) { 0 + 0; }  // using the "{ [] }" rule
...
Wait, what happened at the end there? That's not valid C. No -- but it could still be an interesting thing to try to pass to the compiler. We did have a placeholder where the arguments usually go, and according to the grammar we can put any of the other rules in there. This does quickly generate a lot of nonsensical programs that stop the compiler completely dead in its track at the parsing stage. We do have another trick to help things along, though...
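A minimal expander for this sort of loose grammar (my own sketch, not the actual prog-fuzz implementation) can be written as:

```python
import random
import re

# Minimal sketch of a "loose grammar" expander (not the prog-fuzz
# implementation).  Every [tok] placeholder either keeps its default
# text tok or is replaced by any rule from the grammar; past
# max_depth we always take the default so expansion terminates.

RULES = [
    "[void] [f] []([]) { [] }",
    "[]; []",
    "{ [] }",
    "[0] + [0]",
]

PLACEHOLDER = re.compile(r"\[([^\[\]]*)\]")

def expand(rule, rng, depth=0, max_depth=3):
    def sub(m):
        if depth < max_depth and rng.random() < 0.3:
            return expand(rng.choice(RULES), rng, depth + 1, max_depth)
        return m.group(1)          # default: the placeholder's own text
    return PLACEHOLDER.sub(sub, rule)

# With recursion disabled every placeholder takes its default,
# reproducing the first program of the evolution above (the double
# space is where the empty [] argument placeholder sat):
assert expand(RULES[0], random.Random(0), max_depth=0) == "void f () {  }"
```

Each expansion is a plausible-looking, brace-balanced, but not necessarily valid C fragment, which is exactly the middle ground we are after.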

AFL doesn't care at all whether what we pass it is accepted by the compiler or not; it doesn't distinguish between success and failure, only between graceful termination and crashes. However, all we have to do is teach the fuzzer about the difference between exit codes 0 and 1; a 0 means the program passed all of gcc's checks and actually resulted in an object file. Then we can discard all the test cases that result in an error, and keep a corpus of test cases which compile successfully. It's really a no-brainer, but makes such a big difference in what the fuzzer can generate/find.
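The exit-code feedback is easy to express as a corpus filter; this sketch (my names, not prog-fuzz internals) treats exit code 0 from the configured compiler command as "keep":

```python
import subprocess
import tempfile

# Sketch of the exit-code feedback described above (illustrative
# names, not prog-fuzz internals): keep only the mutated test cases
# that the compiler fully accepts, and let them seed further rounds.

def compiles(source: str, cc=("gcc", "-c", "-o", "/dev/null")) -> bool:
    """True iff the compiler exits 0, i.e. produced an object file."""
    with tempfile.NamedTemporaryFile(suffix=".c", mode="w") as f:
        f.write(source)
        f.flush()
        return subprocess.run([*cc, f.name],
                              capture_output=True).returncode == 0

def filter_corpus(candidates, check=compiles):
    """Split candidates into (keep, discard) by compile success."""
    keep, discard = [], []
    for src in candidates:
        (keep if check(src) else discard).append(src)
    return keep, discard
```

filter_corpus is parameterised on check purely so the logic can be exercised without a compiler installed.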

Enter prog-fuzz


prog-fuzz output


If it's not clear by now, I'm not using afl-fuzz to drive the main fuzzing process for the techniques above. I decided it was easier to write a fuzzer from scratch, just reusing the AFL instrumentation and some of the setup code to collect the coverage information. Without the fork server, it's surprisingly little code, on the order of 15-20 lines! (I do have support for the fork server on a different branch and it's not THAT much harder to implement, but I simply haven't gotten around to it yet; it also wasn't really needed to find a lot of bugs.)

You can find prog-fuzz on GitHub: https://github.com/vegard/prog-fuzz

The code is not particularly clean, it's a hacked-up fuzzer that gets the job done. I'll want to clean that up at some point, document all the steps to build gcc with AFL instrumentation, etc., and merge a proper fork server. I just want the code to be out there in case somebody else wants to have a poke around.

Results


From the end of February until some time in April I ran the fuzzer on and off and reported just over 100 distinct gcc bugs in total (32 of them fixed so far, by my count):
Now, there are a few things to be said about these bugs.

First, these bugs are mostly crashes: internal compiler errors ("ICEs"), assertion failures, and segfaults. Compiler crashes are usually not very high priority bugs -- especially when you are dealing with invalid programs. Most of the crashes would never occur "naturally" (i.e. as the result of a programmer trying to write some program). They represent very specific edge cases that may not be important at all in normal usage. So I am under no delusions about the relative importance of these bugs; a compiler crash is hardly a security risk.

However, I still think there is value in fuzzing compilers. Personally I find it very interesting that the same technique on rustc, the Rust compiler, only found 8 bugs in a couple of weeks of fuzzing, and not a single one of them was an actual segfault. I think it does say something about the nature of the code base, code quality, and the relative dangers of different programming languages, in case it was not clear already. In addition, compilers (and compiler writers) should have these fuzz testing techniques available to them, because it clearly finds bugs. Some of these bugs also point to underlying weaknesses or to general cases where something really could go wrong in a real program. In all, knowing about the bugs, even if they are relatively unimportant, will not hurt us.

Second, I should also note that I did have conversations with the gcc devs while fuzzing. I asked if I should open new bugs or attach more test cases to existing reports if I thought the area of the crash looked similar, even if it wasn't the exact same stack trace, etc., and they always told me to file a new report. In fact, I would like to praise the gcc developer community: I have never had such a pleasant bug-reporting experience. Within a day of reporting a new bug, somebody (usually Martin Liška or Marek Polacek) would run the test case and mark the bug as confirmed as well as bisect it using their huge library of precompiled gcc binaries to find the exact revision where the bug was introduced. This is something that I think all projects should strive to do -- the small feedback of having somebody acknowledge the bug is a huge encouragement to continue the process. Other gcc developers were also very active on IRC and answered almost all my questions, ranging from silly "Is this undefined behaviour?" to "Is this worth reporting?". In summary, I have nothing but praise for the gcc community.

I should also add that I played briefly with LLVM/clang, and prog-fuzz found 9 new bugs (2 of them fixed so far):
In addition to those, I also found a few other bugs that had already been reported by Sami Liedes back in 2014 which remain unfixed.

For rustc, I will write a more detailed blog post about how to set it up, as compiling rustc itself with AFL instrumentation is non-trivial and it makes more sense to detail those exact steps apart from this post.

What next?


I mentioned the efforts by Regehr et al. and Dmitry Babokin et al. on Csmith and yarpgen, respectively, as fuzzers that generate valid (UB-free) C/C++ programs for finding code generation bugs. I think there is work to be done here to find more code generation bugs; as far as I can tell, nobody has yet combined instrumentation feedback (grey-box fuzzing) with this kind of test case generator. Well, I tried to do it, but it requires a lot of effort to generate valid programs that are also interesting, and I stopped before finding any actual bugs. But I really think this is the future of compiler fuzzing, and I will outline the ideas that I think will have to go into it:
I don't have the time to continue working on this at the moment, but please do let me know if you would like to give it a try and I'll do my best to answer any questions about the code or the approach.

Acknowledgements


Thanks to John Regehr, Martin Liška, Marek Polacek, Jakub Jelinek, Richard Guenther, David Malcolm, Segher Boessenkool, and Martin Jambor for responding to my questions and bug reports!

Thanks to my employer, Oracle, for allowing me to do part of this fuzzing effort using company time and resources.

June 25, 2018 07:35 AM

June 22, 2018

Paul E. Mc Kenney: Stupid RCU Tricks: Changes to -rcu Workflow

The -rcu tree also takes LKMM patches, and I have been handling these completely separately, with one branch for RCU and another for LKMM. But this can be a bit inconvenient, and more important, can delay my response to patches to (say) LKMM if I am doing (say) extended in-tree RCU testing. So it is time to try something a bit different.

My current thought is to continue to have separate LKMM and RCU branches (or, more often, sets of branches) containing the commits to be offered up to the next merge window. The -rcu branch lkmm would flag the LKMM branch (or, more often, merge commit) and a new -rcu branch rcu would flag the RCU branch (or, again more often, merge commit). Then the lkmm and rcu merge commits would be merged, with new commits on top. These new commits would be intermixed RCU and LKMM commits.

The tip of the -rcu development effort (both LKMM and RCU) would be flagged with a new dev branch, with the old rcu/dev branch being retired. The rcu/next branch will continue to mark the commit to be pulled into the -next tree, and will point to the merge of the rcu and lkmm branches during the merge window.

I will create the next-merge-window branches sometime around -rc1 or -rc2, as I have in the past. I will send RFC patches to LKML shortly thereafter. I will send a pull request for the rcu branch around -rc5, and will send final patches from the lkmm branch at about that same time.

Should continue to be fun! :-)

June 22, 2018 09:17 PM

June 21, 2018

James Bottomley: Containers and Cloud Security

Introduction

The idea behind this blog post is to take a new look at how cloud security is measured and what its impact is on the various actors in the cloud ecosystem.  From the measurement point of view, we look at the vertical stack: all code that is traversed to provide a service, all the way from input web request to database update to output response, potentially contains bugs; the bug density is variable for the different components, but the more code you traverse the higher your chance of exposure to exploitable vulnerabilities.  We’ll call this the Vertical Attack Profile (VAP) of the stack.  However, even this axis is too narrow because the primary actors are the cloud tenant and the cloud service provider (CSP).  In an IaaS cloud, part of the vertical profile belongs to the tenant (the guest kernel, guest OS and application) and part (the hypervisor and host OS) belongs to the CSP.  However, the CSP vertical has the additional problem that any exploit in this piece of the stack can be used to jump into either the host itself or any of the other tenant virtual machines running on the host.  We’ll call this exploit causing a failure of containment the Horizontal Attack Profile (HAP).  We should also note that any Horizontal Security failure is a potentially business destroying event for the CSP, so they care deeply about preventing them.  Conversely, any exploit occurring in the VAP owned by the Tenant can be seen by the CSP as a tenant only problem, one which the Tenant is responsible for locating and fixing.  We correlate size of profile with attack risk, so the larger the profile the greater the probability of being exploited.

From the Tenant point of view, improving security can be done in one of two ways, the first (and mostly aspirational) is to improve the security and monitoring of the part of the Vertical the Tenant is responsible for and the second is to shift responsibility to the CSP, so make the CSP responsible for more of the Vertical.  Additionally, for most Tenants, a Horizontal failure mostly just means they lose trust in the CSP, unless the Tenant is trusting the CSP with sensitive data which can be exfiltrated by the Horizontal exploit.  In this latter case, the Tenant still cannot do anything to protect the CSP part of the Security Profile, so it’s mostly a contractual problem: SLAs and penalties for SLA failures.

Examples

To see how these interpretations apply to the various cloud environments, lets look at some of the Cloud (and pre-Cloud) models:

Physical Infrastructure

The left hand diagram shows a standard IaaS rented physical system.  Since the Tenant rents the hardware, it is shown in red indicating CSP ownership, and the two Tenants are shown in green and yellow.  In this model, barring attacks from the actual hardware, the Tenant owns the entirety of the VAP.  The nice thing for the CSP is that hardware provides air gap security, so there is no HAP, which means it is incredibly secure.

However, there is another (much older) model shown on the right, called the shared login model, where the Tenant only rents a login on the physical system.  In this model, only the application belongs to the Tenant, so the CSP is responsible for much of the VAP (the expanded red area).  Here the total VAP is the same, but the Tenant’s VAP is much smaller: the CSP is responsible for maintaining and securing everything apart from the application.  From the Tenant point of view this is a much more secure system since they’re responsible for much less of the security.  From the CSP point of view there is now a HAP, because a tenant compromising the kernel can control the entire system and jump to other tenant processes.  This is actually the worst HAP of all the systems considered in this blog.

Hypervisor based Virtual Infrastructure

In this model, the total VAP is unquestionably larger (worse) than the physical system above because there’s simply more code to traverse (a guest and a host kernel).  However, from the Tenant’s point of view, the VAP should be identical to that of unshared physical hardware because the CSP owns all the additional parts.  However, there is the possibility that the Tenant may be compromised by vulnerabilities in the Virtual Hardware Emulation.  This can be a worry because an exploit here doesn’t lead to a Horizontal security problem, so the CSP is apt to pay less attention to vulnerabilities in the Virtual Hardware simply because each guest has its own copy (even though that copy is wholly under the control of the CSP).

The HAP is definitely larger (worse) than the physical host because of the shared code in the Host Kernel/Hypervisor, but it has often been argued that because this is so deep in the vertical stack, the chances of exploit are practically zero (although VENOM gave the lie to this hope: stack depth represents obscurity, not security).

However, there is another way of improving the VAP and that’s to reduce the number of vulnerabilities that can be hit.  One way that this can be done is to reduce the bug density (the argument for rewriting code in safer languages) but another is to restrict the amount of code which can be traversed by narrowing the interface (for example, see arguments in this hotcloud paper).  On this latter argument, the host kernel or hypervisor does have a much lower VAP than the guest kernel because the hypercall interface used for emulating the virtual hardware is very narrow (much narrower than the syscall interface).

The important takeaways here are firstly that simply transferring ownership of elements in the VAP doesn’t necessarily improve the Tenant VAP unless you have some assurance that the CSP is actively monitoring and fixing them.  Conversely, when the threat is great enough (Horizontal Exploit), you can trust to the natural preservation instincts of the CSP to ensure correct monitoring and remediation because a successful Horizontal attack can be a business destroying event for the CSP.

Container Based Virtual Infrastructure

The total VAP here is identical to that of physical infrastructure.  However, the Tenant component is much smaller (the kernel accounts for around 50% of all vulnerabilities).  It is this reduction in the Tenant VAP that makes containers so appealing: the CSP is now responsible for monitoring and remediating about half of the physical system VAP, which is a great improvement for the Tenant.  Plus, when the CSP remediates on the host, every container benefits at once, which is much better than having to crack open every virtual machine image to do it.  Best of all, the Tenant images don’t have to be modified to benefit from these fixes; simply running on an updated CSP host is enough.

However, the cost for this is that the HAP is the entire Linux kernel syscall interface, meaning the HAP is much larger than in the hypervisor virtual infrastructure case because the latter benefits from interface narrowing to only the hypercalls (qualitatively, assuming the hypercall interface is ~30 calls and the syscall interface is ~300 calls, the HAP is 10x larger in the container case than in the hypervisor case); however, thanks to the protections of the kernel namespace code, the HAP is smaller than in the shared login server case.  From the Tenant point of view, this entire HAP cost is borne by the CSP, which makes this an incredible deal: not only does the Tenant get a significant reduction in their VAP, but the CSP is hugely motivated to stay on top of all vulnerabilities in their part of the VAP and to remediate very fast, because of the business implications of a successful Horizontal attack.  The flip side of this is that a large number of the world’s CSPs are very unhappy about these potential risks and costs and actually try to shift responsibility (and risk) back to the Tenant by advocating nested virtualization solutions like running containers in hypervisors.

So remember: you’re only benefiting from the CSP’s motivation to actively maintain their share of the VAP if your CSP runs bare metal containers, because otherwise they’ve quietly palmed the problem back off on you.
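Qualitatively, the interface-narrowing arithmetic above can be sketched as a toy calculation.  The call counts are the illustrative ~30/~300 figures from the text, and the assumption that exposure scales linearly with interface width is a simplification, not a measurement:

```python
# Toy model of Horizontal Attack Profile (HAP) size based on interface width.
# Assumes (illustratively) that exploitable-defect exposure scales linearly
# with the number of entry points in the shared interface.

HYPERCALLS = 30    # rough width of a hypervisor's hypercall interface
SYSCALLS = 300     # rough width of the Linux syscall interface

def hap_ratio(shared_interface_calls, baseline_calls):
    """Relative HAP size of one shared interface versus another."""
    return shared_interface_calls / baseline_calls

# Container HAP (full syscall interface) vs hypervisor HAP (hypercalls only)
print(hap_ratio(SYSCALLS, HYPERCALLS))  # → 10.0
```

The same function shows why seccomp-style narrowing (discussed below) helps: cutting the reachable syscalls from 300 to, say, 60 shrinks the modelled HAP fivefold.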

Other Avenues for Controlling Attack Profiles

The assumption above was that defect density per component is roughly constant, so, effectively, the more code, the more defects.  However, it is definitely true that different code bases have different defect densities, so one way of minimizing your VAP is to choose the code you rely on carefully and, of course, to follow bug reduction techniques in the code you write.
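As a sketch of this assumption, the VAP can be modelled as the sum over stack components of code size times defect density.  All the numbers below are invented for illustration (chosen so the kernel accounts for about half the total, matching the container discussion above); they are not measurements:

```python
# Toy model: VAP as the sum over stack components of (code size x defect density).
# All figures are invented for illustration; real defect densities vary widely.

components = {
    # name: (KLOC, defects per KLOC, owner in a bare metal container model)
    "application":  (50,    0.5,  "tenant"),
    "runtime/libs": (500,   0.35, "tenant"),
    "kernel":       (20000, 0.01, "csp"),
}

def vap(parts, owner=None):
    """Expected defect count traversed, optionally for a single owner."""
    return sum(kloc * density
               for kloc, density, who in parts.values()
               if owner is None or who == owner)

total = vap(components)
tenant_share = vap(components, "tenant") / total
print(f"total VAP: {total:.0f} defects, tenant share: {tenant_share:.0%}")
# → total VAP: 400 defects, tenant share: 50%
```

Swapping ownership labels (e.g. moving the kernel back to the tenant, as in nested virtualization) changes the tenant share without changing the total, which is exactly the point the takeaways above make.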

Density Reduction

The simplest way of reducing defects is to find and fix the ones in the existing code base (while additionally being careful about introducing new ones).  This means it is important to know how actively defects are being searched for and how quickly they are being remediated.  In general, the greater the user base for a component, the larger the pool of people searching for its defects and the faster the remediation, which means that although the Linux kernel is a big component in the VAP and HAP, a diligent patch routine is a reasonable line of defence, because a fixed bug is not an exploitable bug.

Another way of reducing defect density is to write (or rewrite) the component in a language which is less prone to exploitable defects.  While this approach has many advocates, particularly among language partisans, it suffers from the defect decay issue: the maximum number of defects occurs in freshly minted code, and the number goes down over time, because the longer the code has been released, the more chance its defects have had to be found.  This means that a newly rewritten component, even in a shiny bug-reducing language, can still contain more bugs than an older component written in a more exploitable language, simply because a significant number of the bugs introduced at creation have already been found in the latter.
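A toy model of defect decay makes the point concrete.  The initial defect counts and the two-year half-life below are invented for illustration; the only claim is the shape of the curve:

```python
import math

# Toy defect-decay model: defects remaining after t years, assuming defects
# are found (and fixed) at a rate proportional to how many remain, i.e.
# exponential decay. Initial counts and the half-life are invented.

def defects_remaining(initial_defects, years_since_release, half_life=2.0):
    return initial_defects * math.exp(-math.log(2) * years_since_release / half_life)

old_c_component = defects_remaining(initial_defects=1000, years_since_release=10)
fresh_safe_lang_rewrite = defects_remaining(initial_defects=300, years_since_release=0)

# Despite starting with 3x fewer defects, the fresh rewrite currently carries
# more unfound bugs (300) than the decade-old component (~31).
print(old_c_component < fresh_safe_lang_rewrite)  # → True
```

Under this model the rewrite only wins after its own bugs have had a few years to be shaken out, which is the defect decay argument in a nutshell.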

Code Reduction (Minimization Techniques)

It also stands to reason that, for a complex component, simply reducing the amount of code that is accessible to the upper components reduces the VAP, because it directly reduces the number of reachable defects.  However, reducing the amount of code isn’t as simple as it sounds: it can only really be done for components that are configurable, and then only if you’re not using the features you eliminate.  Elimination may be done in two ways: either physically, by actually removing the code from the component, or virtually, by blocking access using a guard (see below).

Guarding and Sandboxing

Guarding is mostly used to do virtual code elimination by blocking access to code paths that the upper layers do not use.  For instance, seccomp in the Linux kernel can be used to block access to system calls you know the application doesn’t use, meaning it also blocks any attempt to exploit code that would be reached through those system calls, thus reducing the VAP (and also reducing the HAP if the kernel is shared).
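Conceptually, a guard works like the sketch below.  This is a toy in-process allowlist, not a real seccomp filter (seccomp is enforced by the kernel as a BPF program installed via seccomp(2)/prctl(2)); it only illustrates how refusing un-allowlisted entry points means their implementation code, bugs included, is never traversed:

```python
# Conceptual sketch of guarding as virtual code elimination: operations not on
# the allowlist are refused before any of their implementation code can run.
# This is an in-process toy, not an actual seccomp filter.

ALLOWED = {"read", "write", "close"}  # syscalls the application is known to use

def guarded(op_name, operation, *args):
    """Run operation only if op_name is allowlisted; otherwise refuse."""
    if op_name not in ALLOWED:
        # Analogous to SECCOMP_RET_ERRNO: the caller sees an error, and the
        # blocked code path is never entered, so its bugs cannot be exploited.
        raise PermissionError(f"{op_name}: blocked by guard")
    return operation(*args)

print(guarded("write", lambda data: len(data), b"hello"))  # → 5
try:
    guarded("chroot", lambda path: None, "/")
except PermissionError as e:
    print(e)  # → chroot: blocked by guard
```

In the real mechanism the check runs in the kernel on the syscall number (and optionally its arguments), which is what makes the filtering paragraph below about complex seccomp policies bite.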

The deficiencies in the above are obvious: if the application needs to use a system call, you cannot block it (although you can filter it), which leads to huge and ever more complex seccomp policies.  For the system calls an application has to use, the solution can sometimes be emulation in the guard.  In this mode the guard code emulates all the effects of the system call without ever making the real system call into the kernel.  This approach, often called sandboxing, is certainly effective at reducing the HAP, since the guards usually run in their own address space, which cannot be used to launch a Horizontal attack.  However, the sandbox may or may not reduce the VAP, depending on the bugs in the emulation code versus the bugs in the original.  One of the biggest potential disadvantages to watch out for with sandboxing is that the address space the sandbox runs in is often that of the Tenant, meaning the CSP has quietly switched ownership of that component back to the Tenant as well.

Conclusions

First and foremost: security is hard.  As a cloud Tenant, you really want to offload as much of it as possible to people who are much more motivated to actually do it than you are (i.e. the Cloud Service Provider).

The complete Vertical Attack Profile of a bare metal container system in the cloud is identical to that of a physical system and better than that of a hypervisor-based system; plus, the Tenant-owned portion is roughly 50% of the total VAP, meaning that containers are, from the Tenant perspective, by far the most secure virtualization technology available today.

The increased Horizontal Attack Profile that containers bring rightly belongs to the Cloud Service Provider.  However, CSPs are apt to shirk this responsibility and find creative ways to shift it back to the Tenant, including spreading misinformation about container Attack Profiles to make Tenants demand nested solutions.

Before you, as a Tenant, start worrying about the CSP-owned Horizontal Attack Profile, make sure that contractual remedies (like SLAs, or reputational damage to the CSP) would be insufficient to cover the consequences of any data loss that might result from a containment breach.  Also remember that unless you, as the Tenant, are under external compliance obligations like HIPAA or PCI, contractual remedies for a containment failure are likely sufficient, and you should keep responsibility for the HAP where it belongs: with the CSP.

June 21, 2018 05:31 AM

June 19, 2018

Pete Zaitcev: Slasti py3

Got Slasti 2.1 released today, the main feature being support for Python 3. Some of the changes were somewhat... horrifying, maybe? I tried to adhere to a general plan where the whole of the application operates in unicode, and the UTF-8 data is encoded/decoded at the boundary. Unfortunately, in practice the boundary was rather leaky, so in several places I had to resort to isinstance(). I expected to always assign a type to all variables and fields, and then rigidly convert as needed. But WSGI had its own ideas.
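The general pattern described here — text everywhere inside the application, with encode/decode plus an isinstance() escape hatch at the leaky boundary — looks something like this sketch (the helper names are mine, not Slasti's):

```python
# Decode-at-the-boundary pattern: everything inside the application is text;
# bytes are converted exactly once at the WSGI boundary. The isinstance()
# check is the escape hatch for the places where the boundary leaks.

def to_text(value, encoding="utf-8"):
    """Accept str or bytes and always return text."""
    if isinstance(value, bytes):       # leaked through the boundary raw
        return value.decode(encoding)
    return value                       # already text: pass through

def to_bytes(value, encoding="utf-8"):
    """Accept str or bytes and always return bytes (e.g. for a WSGI body)."""
    if isinstance(value, str):
        return value.encode(encoding)
    return value

tag = to_text(b"\xc3\xa9tiquette")     # UTF-8 bytes arriving from the boundary
assert tag == "étiquette"              # unicode inside the application
body = to_bytes(tag)                   # back to bytes on the way out
print(body)  # → b'\xc3\xa9tiquette'
```

The pain point the post describes is precisely that WSGI hands you a mix of native strings and bytes, so the isinstance() branches end up scattered rather than confined to two tidy helpers like these.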

Overall, the biggest source of issues was not the py3 model, but trying to make the code compatible. I'm not going to do that again if I can help it: either py2 or py3, but not both.

UPDATE: Looks like CKS agrees that compatible code is usually too hard. I'm glad the recommendation to avoid Python 3 entirely is no longer operational.

June 19, 2018 02:54 AM

June 18, 2018

James Morris: Linux Security BoF at Open Source Summit Japan

This is a reminder for folks attending OSS Japan this week that I’ll be leading a Linux Security BoF session on Wednesday at 6pm.

If you’ve been working on a Linux security project, feel welcome to discuss it with the group.  We will have a whiteboard and projector.  This is also a good opportunity to raise topics for discussion, and to ask questions about Linux security.

See you then!

June 18, 2018 08:26 AM

June 16, 2018

Linux Plumbers Conference: Registration for Linux Plumbers Conference is Now Open

The 2018 Linux Plumbers Conference organizing committee is pleased to announce that registration for this year’s conference is now open. Information on how to register can be found here. Registration prices and cutoff dates are published on the ATTEND page. A reminder that we are following a quota system to release registration slots; therefore, the early registration rate will remain in effect until early registration closes on 10 August 2018 or the quota limit is reached, whichever comes first. As usual, contact us if you have questions.

June 16, 2018 05:57 PM

June 15, 2018

Pete Zaitcev: Fedora 28 and IPv6 Neighbor Discovery

Finally updated my laptop to F28, and ssh connections started hanging. They hang for 15-20 seconds, then unstick for a few seconds, then hang again, and so on, cycling. I thought it was a WiFi problem at first, but eventually I narrowed it down to IPv6 Neighbor Discovery being busted.

A packet trace on the laptop shows that traffic flows until the laptop issues a neighbor solicitation. The router replies with an advertisement, which I presume is getting dropped. Traffic stops — although what's strange, tcpdump still captures outgoing packets that the laptop sends. In a few seconds, the router sends a neighbor solicitation, but the laptop never replies. Presumably, dropped as well. This continues until a router advertisement resets the cycle.

Stopping firewalld lets the solicitations in and the traffic resumes, so obviously a rule is busted somewhere. ICMPv6 appears to be allowed, but the ip6tables rules generated by firewalld are fairly opaque, so I cannot be sure. I ended up filing bug 1591867 for the time being and forcing ssh -4.

UPDATE: Looks like the problem is a "reverse path filter". Setting IPv6_rpfilter=no in /etc/firewalld/firewalld.conf fixes the issue (thanks to Victor for the tip). Here's the associated comment in the configuration file:

# Performs a reverse path filter test on a packet for IPv6. If a reply to the
# packet would be sent via the same interface that the packet arrived on, the
# packet will match and be accepted, otherwise dropped.
# The rp_filter for IPv4 is controlled using sysctl.

Indeed, there's no such sysctl for v6. Obviously the problem is that packets with a link-local source in fe80::/10 are mistakenly assumed to be martians and dropped. That's easy enough to fix, I hope. But it's fascinating that we have an alternative configuration method nowadays, exposed only by certain specialist tools. If I don't have firewalld installed and want this setting changed, what then?

Remarkably, the problem was first reported in March (it's June now). This tells me that most likely the erroneous check itself is in the kernel somewhere and firewalld is not at fault, which is why Erik isn't fixing it. He should've reassigned the bug to the kernel, if so, but...

The commit cede24d1b21d68d84ac5a36c44f7d37daadcc258 looks like the fix. Unfortunately, it just missed 4.17.

June 15, 2018 05:39 PM