Kernel Planet

May 30, 2016

Andi Kleen: Literature survey for practical TSX lock elision applications

Introduction

This is a short non comprehensive (and somewhat biased) literature survey on how lock elision with Intel TSX can be used to improve performance to programs by increasing parallelism. Focus is on practical incremental improvements in existing software.

A basic introduction of TSX lock elision is available in Scaling existing lock based applications with Lock elision.

Lock libraries

The papers below are on actual code using lock elision, not just how to implement lock elision itself. The basic rules of how to implement lock elision in a locking library are in chapter 12 of the Intel Optimization manual. Common anti-patterns (mistakes) while doing so are described in TSX Anti-patterns.

Existing lock libraries that implement lock elision using TSX include glibc pthreads, Thread Building Blocks, tsx-tools and ConcurrencyKit

Databases

Lock elision has been used widely to speed up both production and research databases. A lot of work has been done on the SAP HANA in memory data which uses TSX in production today to improve performance of the B+Tree index and the delta tree data structures. This is described in Improving In-Memory Database Index Performance with TSX.

Several research databases went beyond just using TSX to speed up specific data structures, but map complete database transactions to hardware transaction. This requires much more changes to the databases, and careful memory layout, but can also give more gains. It generally goes beyond simple lock elision. This approach has been implemented by Leis in Exploiting HTM in Main memory databases. A similar approach is used in Using Restricted Transactional Memory to Build a Scalable In-Memory Database. Both see large gains on TPC-C style workloads.

For non in memory databases, a group at Berkeley used lock elision to improve parallelism in LevelDB and getting a 25% speedup on a 4 core system. Standard LevelDB is essentially a single lock system, so this was a nice speedup with only minor work compared to other efforts that used manual fine grained locking to improve parallelism in LevelDB. However it required special handling of condition variables, which are used for commit. For simpler key value stores Diegues used an automatic tuner with TSX transactions to get a 2x gain with memcached.

DrTM uses TSX together with RDMA to implement a fast distributed transaction manager using 2PL. TSX provides local isolation, while RDMA (which aborts transactions) provides remote isolation.

Languages

An attractive use of lock elision is to speed up locks implicit in language runtimes. Yoo implemented transparent. support for using TSX for Java synchronized sections in Early Experience on Transactional
Execution of Java TM Programs
. The runtime automatically detects when a synchronized region does not benefit from TSX and disables lock elision then. This works transparently, but using it successfully may still need some changes in the program to minimize conflicts and other aborts, as described by Gil Tene in Understanding Hardware Transactional Memory. This support is available in JDK 8u40 and can be enabled with the -XX:+UseRTMLocking option.

Another interesting case for lock elision is improving parallelism for the Great Interpreter Lock (GIL) used in interpreters. Odaiara implemented this for Ruby in Eliminating Global Interpreter Locks in Ruby through HTM. They saw a 4.4x speedup on Ruby NPB, 1.6x in WEBrick and 1.2x speedup in Ruby on Rails on a 12 core system.

Hardware transactions can be used for auto-parallelizing existing loops, even when the compiler cannot prove that iterations are independent, by using the transactions to isolate individual iterations. Odaira implemented TSX loop speculation manually for workloads in SPECCpu and report a 11% speedup on a 4 core system. There are some limitations to this technique due to the lazy subscription problem described by Dice, but in principle it can be directly implemented in compilers.

Salamanca used TSX to implement tracing recovery code for Speculative Trace Optimization (STO) for loops. The basic principles are similar to the previous paper, but they implemented an automated prototype. They report 9% improvement on 4 cores with a number of benchmarks.

Murphy’s dissertation implements loop speculation using TSX for DOACCROSS loops in the HELIX research compiler.

An older paper, which predates TSX, by Dice describes how to use Hardware Transactional Memory to simplify Work Stealing schedulers.

High Performance Computing

Yoo.et.al. use TSX lock elision to benchmark a number of HPC applications in Performance Evaluation of Intel TSX for High-Performance Computing. They report an average 41% speedup on a 4 core system. They also report an average 31% improvement in bandwidth when applying TSX to a user space TCP stack.

Data structures

Lock elision has been used widely to speed up parallel data structures. Normally applying lock elision to an existing data structure is very simple — elide the lock protecting it — but some tweaking of the data structure can often give better performance. Dementiev explores TSX for general fast scalable hash tables. Li uses TSX to implement scalable Cuckoo hash tables. Using TSX for hash tables is generally very straight forward. For tree data structures one need to be careful that tree rebalancing does not overwhelm the write set capacity of the hardware transactions., Repetti uses TSX to scale Patricia Tries. Siakavas explores TSX usage for scalable Red-Black Trees, similar in this paper. Bonnichsen uses HTM to improve BT Trees, reporting a speed up of 2x to 3x compared to earlier implementations. The database papers described above describe the rules needed to successfully elide BTrees.

Memory allocation and Garbage Collection

One challenge with using garbage collection is that the worst case “stop the world” pauses from parallel garbage collectors limit the total heap size that can be supported. Ritson reports using TSX to implement the JIKES RVM parallel garbage collector, further improved in the Chihua GC paper They report upto 101% speedup in concurrent copying speed, and show that a simple parallel garbage collector can be implemented with limited efforts.

Another dog-themed GC, the Collie garbage collector (original paper predates TSX) presents a production quality parallel collector that minimizes pauses and allows scaling to large heaps. Opdahl has another description of the Collie algorithm It is presumably deployed now for TSX in Azul’s commercial ZING JVM product which claims to scale upto 2TB of heap memory.

StackTrack is an efficient algorithm to do automatic memory reclaim for parallel data structures using hardware transactions, out performing existing techniques such as hazard pointers.

Kuzmaul uses TSX to implement a scalable SuperMalloc and reports good performance combined with relatively simple code. Dice et all report how an cache index aware malloc can improve TSX performance by improving utilization of the L1 cache.

Other usages

Peters use lock elision to parallelize a micro kernel For a micro kernel a big lock is fine and finds that RTM lock elision outperforms fine grained locking due to less single thread overhead.

Hay explored lock elision for improving Parallel Discrete Event Simulations. He reports a speed ups of up to 28%.

May 30, 2016 04:40 AM

May 29, 2016

Pete Zaitcev: Encrypt everything? Please reconsider.

Somehow it became fashionable among site admins to set it up so that accessing over HTTP was immediately redirected to https://. But doing that adds new ways to fail, such as certificate expired:

Notice that Firefox provides no way to ignore the problem and access the website (which was supposed to be accessible over HTTP to begin with). Solution? Use Chrome, which does:

Or, disable NTP and change your PC's clock two days back (be careful with running make while doing so).

This was discussed by CKS previously (of course), and he seems to think that the benefits outweigh the downsides of an occasional fuck-up, like the only website in the world that has the information that I want right now is suddenly unavailable with no recourse.

May 29, 2016 05:16 PM

May 26, 2016

Vegard Nossum: Writing a reverb filter from first principles

WARNING/DISCLAIMER: Audio programming always carries the risk of damaging your speakers and/or your ears if you make a mistake. Therefore, remember to always turn down the volume completely before and after testing your program. And whatever you do, don't use headphones or earphones. I take no responsibility for damage that may occur as a result of this blog post!

Have you ever wondered how a reverb filter works? I have... and here's what I came up with.

Reverb is the sound effect you commonly get when you make sound inside a room or building, as opposed to when you are outdoors. The stairwell in my old apartment building had an excellent reverb. Most live musicians hate reverb because it muddles the sound they're trying to create and can even throw them off while playing. On the other hand, reverb is very often used (and overused) in the studio vocals because it also has the effect of smoothing out rough edges and imperfections in a recording.

We typically distinguish reverb from echo in that an echo is a single delayed "replay" of the original sound you made. The delay is also typically rather large (think yelling into a distant hill- or mountainside and hearing your HEY! come back a second or more later). In more detail, the two things that distinguish reverb from an echo are:

  1. The reverb inside a room or a hall has a much shorter delay than an echo. The speed of sound is roughly 340 meters/second, so if you're in the middle of a room that is 20 meters by 20 meters, the sound will come back to you (from one wall) after (20 / 2) / 340 = ~0.029 seconds, which is such a short duration of time that we can hardly notice it (by comparison, a 30 FPS video would display each frame for ~0.033 seconds).
  2. After bouncing off one wall, the sound reflects back and reflects off the other wall. It also reflects off the perpendicular walls and any and all objects that are in the room. Even more, the sound has to travel slightly longer to reach the corners of the room (~14 meters instead of 10). All these echoes themselves go on to combine and echo off all the other surfaces in the room until all the energy of the original sound has dissipated.

Intuitively, it should be possible to use multiple echoes at different delays to simulate reverb.

We can implement a single echo using a very simple ring buffer:

    class FeedbackBuffer {
    public:
        unsigned int nr_samples;
        int16_t *samples;

        unsigned int pos;

        FeedbackBuffer(unsigned int nr_samples):
            nr_samples(nr_samples),
            samples(new int16_t[nr_samples]),
            pos(0)
        {
        }

        ~FeedbackBuffer()
        {
            delete[] samples;
        }

        int16_t get() const
        {
            return samples[pos];
        }

        void add(int16_t sample)
        {
            samples[pos] = sample;

            /* If we reach the end of the buffer, wrap around */
            if (++pos == nr_samples)
                pos = 0;
        }
    };

The constructor takes one argument: the number of samples in the buffer, which is exactly how much time we will delay the signal by; when we write a sample to the buffer using the add() function, it will come back after a delay of exactly nr_samples using the get() function. Easy, right?

Since this is an audio filter, we need to be able to read an input signal and write an output signal. For simplicity, I'm going to use stdin and stdout for this -- we will read 8 KiB at a time using read(), process that, and then use write() to output the result. It will look something like this:

    #include <cstdio>
    #include <cstdint>
    #include <cstdlib>
    #include <cstring>
    #include <unistd.h>


    int main(int argc, char *argv[])
    {
        while (true) {
            int16_t buf[8192];
            ssize_t in = read(STDIN_FILENO, buf, sizeof(buf));
            if (in == -1) {
                /* Error */
                return 1;
            }
            if (in == 0) {
                /* EOF */
                break;
            }

            for (unsigned int j = 0; j < in / sizeof(*buf); ++j) {
                /* TODO: Apply filter to each sample here */
            }

            write(STDOUT_FILENO, buf, in);
        }

        return 0;
    }

On Linux you can use e.g. 'arecord' to get samples from the microphone and 'aplay' to play samples on the speakers, and you can do the whole thing on the command line:

    $ arecord -t raw -c 1 -f s16 -r 44100 |\
        ./reverb | aplay -t raw -c 1 -f s16 -r 44100

(-c means 1 channel; -f s16 means "signed 16-bit" which corresponds to the int16_t type we've used for our buffers; -r 44100 means a sample rate of 44100 samples per second; and ./reverb is the name of our executable.)

So how do we use class FeedbackBuffer to generate the reverb effect?

Remember how I said that reverb is essentially many echoes? Let's add a few of them at the top of main():

    FeedbackBuffer fb0(1229);
    FeedbackBuffer fb1(1559);
    FeedbackBuffer fb2(1907);
    FeedbackBuffer fb3(4057);
    FeedbackBuffer fb4(8117);
    FeedbackBuffer fb5(8311);
    FeedbackBuffer fb6(9931);

The buffer sizes that I've chosen here are somewhat arbitrary (I played with a bunch of different combinations and this sounded okay to me). But I used this as a rough guideline: simulating the 20m-by-20m room at a sample rate of 44100 samples per second means we would need delays roughly on the order of 44100 / (20 / 340) = 2594 samples.

Another thing to keep in mind is that we generally do not want our feedback buffers to be multiples of each other. The reason for this is that it creates a consonance between them and will cause certain frequencies to be amplified much more than others. As an example, if you count from 1 to 500 (and continue again from 1), and you have a friend who counts from 1 to 1000 (and continues again from 1), then you would start out 1-1, 2-2, 3-3, etc. up to 500-500, then you would go 1-501, 2-502, 3-504, etc. up to 500-1000. But then, as you both wrap around, you start at 1-1 again. And your friend will always be on 1 when you are on 1. This has everything to do with periodicity and -- in fact -- prime numbers! If you want to maximise the combined period of two counters, you have to make sure that they are relatively coprime, i.e. that they don't share any common factors. The easiest way to achieve this is to only pick prime numbers to start with, so that's what I did for my feedback buffers above.

Having created the feedback buffers (which each represent one echo of the original sound), it's time to put them to use. The effect I want to create is not simply overlaying echoes at fixed intervals, but to have the echos bounce off each other and feed back into each other. The way we do this is by first combining them into the output signal... (since we have 8 signals to combine including the original one, I give each one a 1/8 weight)

    float x = .125 * buf[j];
    x += .125 * fb0.get();
    x += .125 * fb1.get();
    x += .125 * fb2.get();
    x += .125 * fb3.get();
    x += .125 * fb4.get();
    x += .125 * fb5.get();
    x += .125 * fb6.get();
    int16_t out = x;

...then feeding the result back into each of them:

    fb0.add(out);
    fb1.add(out);
    fb2.add(out);
    fb3.add(out);
    fb4.add(out);
    fb5.add(out);
    fb6.add(out);

And finally we also write the result back into the buffer. I found that the original signal loses some of its power, so I use a factor 4 gain to bring it roughly back to its original strength; this number is an arbitrary choice by me, I don't have any specific calculations to support it:

    buf[j] = 4 * out;

That's it! 88 lines of code is enough to write a very basic reverb filter from first principles. Be careful when you run it, though, even the smallest mistake could cause very loud and unpleasant sounds to be played.

If you play with different buffer sizes or a different number of feedback buffers, let me know if you discover anything interesting :-)

May 26, 2016 07:05 PM

May 25, 2016

Pete Zaitcev: Russian Joke

In a quik translation from Bash:

XXX: Still writing that profiler?
YYY: Naah, reading books now
XXX: Like what books?
YYY: TCP Illustrated, Understanding the Linux kernel, Linux kernel development
XXX: And I read "The Never-ending Path Of Hatred".
YYY: That's about Node.js, right?

May 25, 2016 07:15 PM

May 22, 2016

Pete Zaitcev: Dell, why u no VPNC

Yo, we heard you liked remote desktops, so we put remote desktop into a remote desktop, now you can remote desktop while remote desktop.

I remember how IBM simply put VPNC interface in their Bladecenter. It was so nice. Unfortunately, vendors never want to be too nice to users, so their next release switched to a Java applet. Dell copied their approach for DRAC5. In theory, this should be future-proof, all hail WORA. In practice, it only worked with a specific version of Java, which was current when Dell shipped R905 ten years ago. You know, back then Windows XP was new and hot.

Fortunately, by a magic of KVM, libvirt, and Qemu, it's possible to create a virtual machine, install Fedora 10 on it, and then run Firefox with the stupid Java applet on it. Also, the Firefox and Java have to run in 32-bit mode.

When I did it for the first time, I ran Firefox through X11 redirection. That was quite inconvenient: I had to stop the Firefox running on the host desktop, because one cannot run 2 firefoxes painting to the same $DISPLAY. The reason that happens is, well, Mozilla Foundation is evil, basically. The remote Firefox finds the running Firefox through X11 properties and then some crapmagic happens and everything crashes and burns. So, it's much easier just to hook to the VM with Vinagre and run Firefox with DISPLAY=:0 in there.

Those old Fedoras were so nice, BTW. Funnily enough, that VM with 1 CPU and 1.5 GB starts quicker than the host laptop with the benefit of SystemD and its ability to run tasks in parallel. Of course, the handing of WiFi in Fedora 20+ is light years ahead of nm-applet in Fedora 10. There was some less noticeable progress elsewhere as well. But in the same time, the bloat was phenomenal.

UPDATE: Java does not work. Running the JNLP simply fails after downloading the applets, without any error messages. To set the plugin type of "native", ssh to DRAC, then "racadm config -g cfgRacTuning -o cfgRacTunePluginType 0". No kidding.

May 22, 2016 03:46 AM

May 20, 2016

Matthew Garrett: Your project's RCS history affects ease of contribution (or: don't squash PRs)

Github recently introduced the option to squash commits on merge, and even before then several projects requested that contributors squash their commits after review but before merge. This is a terrible idea that makes it more difficult for people to contribute to projects.

I'm spending today working on reworking some code to integrate with a new feature that was just integrated into Kubernetes. The PR in question was absolutely fine, but just before it was merged the entire commit history was squashed down to a single commit at the request of the reviewer. This single commit contains type declarations, the functionality itself, the integration of that functionality into the scheduler, the client code and a large pile of autogenerated code.

I've got some familiarity with Kubernetes, but even then this commit is difficult for me to read. It doesn't tell a story. I can't see its growth. Looking at a single hunk of this diff doesn't tell me whether it's infrastructural or part of the integration. Given time I can (and have) figured it out, but it's an unnecessary waste of effort that could have gone towards something else. For someone who's less used to working on large projects, it'd be even worse. I'm paid to deal with this. For someone who isn't, the probability that they'll give up and do something else entirely is even greater.

I don't want to pick on Kubernetes here - the fact that this Github feature exists makes it clear that a lot of people feel that this kind of merge is a good idea. And there are certainly cases where squashing commits makes sense. Commits that add broken code and which are immediately followed by a series of "Make this work" commits also impair readability and distract from the narrative that your RCS history should present, and Github present this feature as a way to get rid of them. But that ends up being a false dichotomy. A history that looks like "Commit", "Revert Commit", "Revert Revert Commit", "Fix broken revert", "Revert fix broken revert" is a bad history, as is a history that looks like "Add 20,000 line feature A", "Add 20,000 line feature B".

When you're crafting commits for merge, think about your commit history as a textbook. Start with the building blocks of your feature and make them one commit. Build your functionality on top of them in another. Tie that functionality into the core project and make another commit. Add client support. Add docs. Include your tests. Allow someone to follow the growth of your feature over time, with each commit being a chapter of that story. And never, ever, put autogenerated code in the same commit as an actual functional change.

People can't contribute to your project unless they can understand your code. Writing clear, well commented code is a big part of that. But so is showing the evolution of your features in an understandable way. Make sure your RCS history shows that, otherwise people will go and find another project that doesn't make them feel frustrated.

(Edit to add: Sarah Sharp wrote on the same topic a couple of years ago)

comment count unavailable comments

May 20, 2016 12:06 AM

May 17, 2016

Gustavo F. Padovan: Collabora contributions to Linux Kernel 4.6

Linux Kernel 4.6 was released this week, and a total of 9 Collabora engineers took part in its development, Collabora’s highest number of engineers contributing to a single Linux Kernel release yet. In total Collabora contributed 42 patches.

As part of Collabora’s continued commitment to further increase its participation to the Linux Kernel, Collabora is actively looking to expand its team of core software engineers. If you’d like to learn more, follow this link.

Here are some highlights of Collabora’s participation in Kernel 4.6:

Andrew Shadura fixed the number of buttons reported on the Pemount 6000 USB touchscreen controller, while Daniel Stone enabled BCM283x familiy devices in the ARM multi_v7_defconfig and Emilio López added module autoloading for a few sunxi devices.

Enric Balletbo i Serra added boot console output to AM335X(Sitara) and OMAP3-IGEP and fixed audio codec setup on AM335X using the right external clock. Martyn Welch added the USB device ID for the GE Healthcare cp210x serial device and renamed the reset reason of the Zodiac Watchdog.

Gustavo Padovan cleaned up the Android Sync Framework on the staging tree for further de-staging of the Sync File infrastructure, which will land in 4.7. Most of the work was removing interfaces that won’t be used in mainline. He also added vblank event support for atomic commits in the virtio DRM driver.

Peter Senna improved an error path and added some style fixes to the sisusbvga driver. While Sjoerd Simons enabled wireless on radxa Rock2 boards, fixed an issue withthe brcmfmac sdio driver sometimes timing out with a false positive and fixed some issues with Serial output on Renesas R-Car porter board.

Tomeu Vizoso changed driver_match_device() to return errors and in case of -EPROBE_DEFER queue the device for deferred probing, he also provided two fixes to Rockchip DRM driver as part of his work on making intel-gpu-tools work on other platforms.

Following is a list of all patches submitted by Collabora for this kernel release:

Andrew Shadura (1):

Daniel Stone (1):

Emilio López (4):

Enric Balletbo i Serra (3):

Gustavo Padovan (17):

Martyn Welch (2):

Peter Senna Tschudin (4):

Sjoerd Simons (6):

Tomeu Vizoso (4):

May 17, 2016 06:15 PM

May 12, 2016

Matthew Garrett: Convenience, security and freedom - can we pick all three?

Moxie, the lead developer of the Signal secure communication application, recently blogged on the tradeoffs between providing a supportable federated service and providing a compelling application that gains significant adoption. There's a set of perfectly reasonable arguments around that that I don't want to rehash - regardless of feelings on the benefits of federation in general, there's certainly an increase in engineering cost in providing a stable intra-server protocol that still allows for addition of new features, and the person leading a project gets to make the decision about whether that's a valid tradeoff.

One voiced complaint about Signal on Android is the fact that it depends on the Google Play Services. These are a collection of proprietary functions for integrating with Google-provided services, and Signal depends on them to provide a good out of band notification protocol to allow Signal to be notified when new messages arrive, even if the phone is otherwise in a power saving state. At the time this decision was made, there were no terribly good alternatives for Android. Even now, nobody's really demonstrated a free implementation that supports several million clients and has no negative impact on battery life, so if your aim is to write a secure messaging client that will be adopted by as many people is possible, keeping this dependency is entirely rational.

On the other hand, there are users for whom the decision not to install a Google root of trust on their phone is also entirely rational. I have no especially good reason to believe that Google will ever want to do something inappropriate with my phone or data, but it's certainly possible that they'll be compelled to do so against their will. The set of people who will ever actually face this problem is probably small, but it's probably also the set of people who benefit most from Signal in the first place.

(Even ignoring the dependency on Play Services, people may not find the official client sufficient - it's very difficult to write a single piece of software that satisfies all users, whether that be down to accessibility requirements, OS support or whatever. Slack may be great, but there's still people who choose to use Hipchat)

This shouldn't be a problem. Signal is free software and anybody is free to modify it in any way they want to fit their needs, and as long as they don't break the protocol code in the process it'll carry on working with the existing Signal servers and allow communication with people who run the official client. Unfortunately, Moxie has indicated that he is not happy with forked versions of Signal using the official servers. Since Signal doesn't support federation, that means that users of forked versions will be unable to communicate with users of the official client.

This is awkward. Signal is deservedly popular. It provides strong security without being significantly more complicated than a traditional SMS client. In my social circle there's massively more users of Signal than any other security app. If I transition to a fork of Signal, I'm no longer able to securely communicate with them unless they also install the fork. If the aim is to make secure communication ubiquitous, that's kind of a problem.

Right now the choices I have for communicating with people I know are either convenient and secure but require non-free code (Signal), convenient and free but insecure (SMS) or secure and free but horribly inconvenient (gpg). Is there really no way for us to work as a community to develop something that's all three?

comment count unavailable comments

May 12, 2016 02:40 PM

May 11, 2016

Daniel Vetter: Neat drm/i915 Stuff for 4.7

The 4.6 release is almost out of the door, it's time to look at what's in store for 4.7.
Let's first look at the epic saga called atomic support. In 4.7 the atomic watermark update support for Ironlake through Broadwell from Matt Roper, Ville Syrjälä and others finally landed. This took about 3 attempts to get merged because there's lots of small little corner cases that caused regressions each time around, but it's finally done. And it's an absolutely key piece for atomic support, since Intel hardware does not support atomic updates of the watermark settings for the display fetch fifos. And if those values are wrong tearings and other ugly things will result. We still need corresponding support for other platforms, but this is a really big step. But that's not the only atomic work: Maarten Lankhorst made the hardware state checker atomic, and there's been tons of smaller things all over to move the driver towards the shiny new.

Another big feature on the display side is color management, implemented by Lionel Landwerlin, and then fixes to make it fully atomic from Maarten. Color management aims for more accurate reproduction of a well definied color space on panels, using a de-gamma table, then a color matrix, and finally a gamma table.

For platform enabling the big thing is support for DSI panels on Broxton from Jani Nikula and Ramalingam C. One fallout from this effort is the cleaned up VBT parsing code, done by Jani. There's now a clean split between parsing the different VBT versions on all the various platforms, now neatly consolidated, and using that information in different places within the driver. Ville also hooked up upscaling/panel fitting for DSI panels on all platforms.

Looking more at driver internals Ander Conselvan de Oliviera and Ville refactored the entire display PLL code on all platforms, with the goal to reuse it in the DP detection code for upfront link training. This is needed to detect the link configuration in certain situations like USB type C connectors. Shubhangi Shrivastava reworked the DP detection code itself, again to prep for these features. Still on pure display topics Ville fixed lots of underrun issues to appease our CI on lots of platforms. Together with the atomic watermark updates this should shut up one of the largest sources of noise in our test results.

Moving on to power management work the big thing is lots of small fixes for the runtime PM support all over the place from Imre Deak and Ville, with a big focus on the Broxton platform. And while we talk features affecting the entire driver: Imre added fault injection to the driver load paths so that we can start to exercise all that code in an automated way.

Finally looking at the render/GEM side of the driver the short summary is that Tvrtko Ursulin and Chris Wilson worked the code all over the place: A cleanup up and tuned forcewake handling code from Tvrtko, fixes for more userptr corner cases from Chris, a new notifier to handle vmap exhaustion and assorted polish in the related shrinker code, cleaned up and fixed handling of gpu reset corner cases, fixes for context related hard hangs on Sandybridge and Ironlake, large-scale renaming of parameters and structures to realign old code with the newish execlist hardware mode, the list goes on. And finally a rather big piece, and one which causes some trouble, is all the work to speed up the execlist code, with a big focusing on reducing interrupt handling overhead. This was done by moving the expensive parts of execlist interrupt handling into a tasklet. Unfortunately that uncovered some bugs in our interrupt handling on Braswell, so Ville jumped in and fixed it all up, plus of course removed some cruft and applied some nice polish.

Other work in the GT are are gpu hang fixes for Skylake GT3 and GT4 configurations from Mika Kuoppala. Mika also provided patches to improve the edram handling on those same chips. Alex Dai and Dave Gordon kept working on making GuC ready for prime time, but not yet there. And Peter Antoine improved the MOCS support to work on all engines.

And of course there's been tons of smaller improvements, bugfixes, cleanups and refactorings all over the place, as usual.

May 11, 2016 03:02 PM

Michael Kerrisk (manpages): man-pages-4.06 is released

I've released man-pages-4.06. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

This release resulted from patches, bug reports, reviews, and comments from aroubd 20 contributors. The release includes changes to just over 40 man pages. Among which the more significant changes in man-pages-4.06 are the following:

May 11, 2016 06:46 AM

May 06, 2016

Pete Zaitcev: Dropbox lifts the kimono

Dropbox posted somewhat of a whitepaper about their exabyte storage system, which exceeds the largest Swift cluster by about 2 orders of magnitude. Here's a couple of fun quotes:

The Block Index is a giant sharded MySQL cluster, fronted by an RPC service layer, plus a lot of tooling for database operations and reliability. We’d originally planned on building a dedicated key-value store for this purpose but MySQL turned out to be more than capable.

Kinda like SQLite in Swift.

Cells are self-contained logical storage clusters that store around 50PB of raw data.

And they have dozens of those. Their cell has a master node BTW. Kinda like Ceph's PG, but unlike Swift.

RTWT

May 06, 2016 08:38 PM

May 05, 2016

LPC 2016: Tracing Microconference Accepted into 2016 Linux Plumbers Conference

After taking a break in 2015, Tracing is back at Plumbers this year! Tracing is heavily used throughout the Linux ecosystem, and provides an essential method for extracting information about the underlying code that is running on the system. Although tracing is simple in concept, effective usage and implementation can be quite involved.

Topics proposed for this year’s event include new features in the BPF compiler collection, perf, and ftrace; visualization frameworks; large-scale tracing and distributed debugging; always-on analytics and monitoring; do-it-yourself tracing tools; and, last but not least, a kernel-tracing wishlist.

We hope to see you there!

May 05, 2016 10:21 PM

May 04, 2016

Paul E. Mc Kenney: Tracing Microconference Accepted into 2016 Linux Plumbers Conference

After taking a break in 2015, Tracing is back at Plumbers this year! Tracing is heavily used throughout the Linux ecosystem, and provides an essential method for extracting information about the underlying code that is running on the system. Although tracing is simple in concept, effective usage and implementation can be quite involved.

Topics proposed for this year's event include new features in the BPF compiler collection, perf, and ftrace; visualization frameworks; large-scale tracing and distributed debugging; always-on analytics and monitoring; do-it-yourself tracing tools; and, last but not least, a kernel-tracing wishlist.

We hope to see you there!

May 04, 2016 03:18 PM

April 28, 2016

Pete Zaitcev: OpenStack Swift Proxy-FS by SwiftStack

SwiftStack's Joe Arnold and John Dickinson chose the Austin Summit and a low-key, #vBrownBag venue, to come out of closet with PROXY-FS (also spelled as ProxyFS), a tightly integrated addition to OpenStack Swift, which provides a POSIX-ish filesystem access to a Swift cluster.

Proxy-FS is basically a peer to a less known feature of Ceph Rados Gateway that permits accessing it over NFS. Both of them are fundamentally different from e.g. Swift-on-file in that the data is kept in Swift or Ceph, instead of a general filesystem.

The object layout is natural in that it takes advantage of SLO by creating a log-structured, manifested object. This way the in-place updates are handled, including appends. Yes, you can create a manifest with a billion of 1-byte objects just by invoking write(2). So, don't do that.

In response to my question, Joe promised to open the source, although we don't know when.

Another question dealt with the performance expectations. The small I/O performance of Proxy-FS is not going to be great in comparison to a traditional NFS filer. One of its key features is relative transparency: there is no cache involved and every application request goes straight to Swift. This helps to adhere to the principle of the least surprise, as well as achieve scalability for which Swift is famous. There is no need for any upcalls/cross-calls from the Swift Proxy into Proxy-FS that invalidate the cache, because there's no cache. But it has to be understood that Proxy-FS, as well as NFS mode in RGW, are not intended to compete with Netapp.

Not directly, anyway. But what they could do is to disrupt, in Christensen's sense. His disruption examples were defined as technologies that are markedly inferior to incumbents, as well as dramatically cheaper. Swift and Ceph are both: the filesystem performance sucks balls and the price per terabyte is 1/10th of NetApp (this statement was not evaluated by the Food and Drug Administration). If new applications come about that make use of these properties... You know the script.

April 28, 2016 04:42 PM

Daniel Vetter: X.Org Foundation Election Results

Two questions were up for voting, 4 seats on the Board of Directors and approval of the amended By-Laws to join SPI.

Congratulations to our reelected and new board members Egbert Eich, Alex Deucher, Keith Packard and Bryce Harrington. Thanks a lot to Lucas Stach for running. And also big thanks to our outgoing board member Matt Dew, who stepped down for personal reasons.

On the bylaw changes and merging with SPI, 61 out of 65 active members voted, with 54 voting yes, 4 no and 3 abstained. Which means we're well past the 2/3rd quorum for bylaw changes, and everything's green now to proceed with the plan to join SPI!

April 28, 2016 06:37 AM

April 27, 2016

James Bottomley: Unprivileged Build Containers

A while ago, a goal I set myself was to be able to maintain my build and test environments for architecture emulation containers without having to do any of the tasks as root and without creating any suid binaries to do this.  One of the big problems here is that distributions get annoyed (and don’t run correctly) if root doesn’t own most of the files … for instance the installers all check to see that the file got installed with the correct ownership and permissions and fail if they don’t.  Debian has an interesting mechanism, called fakeroot, to get around this using a preload library intercepting the chmod and chown system calls, but it’s getting a bit hackish to try to extend this to work permanently for an emulation container.

The correct way to do this is with user namespaces so the rest of this post will show you how.  Before we get into how to use them, lets begin with the theory of how user namespaces actually work.

Theory of User Namespaces

A user namespace is the single namespace that can be created by an unprivileged user.  Their job is to map a set of interior (inside the user namespace) uids, gids and projids1 to a set of exterior (outside the user namespace).

The way this works is that the root user namespace simply has a 1:1 identity mapping of all 2^32 identifiers, meaning it fully covers the space.  However, any new user namespace only need remap a subset of these.  Any id that is not mapped into the user namespace becomes inaccessible to that namespace.  This doesn’t mean completely inaccessible, it just means any resource owned or accessed by an unmapped id treats an attempted access (even from root in the namespace) as though it were completely unprivileged, so if the resource is readable by any id, it can still be read even in a user namespace where its owning id is unmapped.

User namespaces can also be nested but the nested namespace can only map ids that exist in its parent, so you can only reduce but not expand the id space by nesting.  The way the nested mapping works is that it remaps through all the parent namespaces, so what appears on the resource is still the original exterior ids.

User Namespaces also come with an owner (the uid/gid of the process that created the container).  The reason for this is that this owner is allowed to execute setuid/setgid to any id mapped into the namespace, so the owning uid/gid pair is the effective “root” of the container.  Note that this setuid/setgid property works on entry to the namespace even if uid 0 is not mapped inside the namespace, but won’t survive once the first setuid/setgid is issued.

The final piece of the puzzle is that every other namespace also has an owning user namespace, so while I cannot create a mount namespace as unprivileged user jejb, I can as remapped root inside my user namespace

jejb@jarvis:~> unshare --mount
unshare: unshare failed: Operation not permitted
jejb@jarvis:~> nsenter --user=/tmp/userns
root@jarvis:~# unshare --mount
root@jarvis:~#

And once created, I can always enter this mount namespace provided I’m also in my user namespace.

Setting up Unprivileged Namespaces

Any system user can actually create a user namespace.  However a non-root (meaning not uid zero in the parent namespace) user cannot remap any exterior id except their own.  This means that, because a build container needs a range of ids, it’s not possible to set up the intial remapped namespace without the help of root.  However, once that is done, the user can pretty much do every other operation2

The way remap ranges are set up is via the uid_map, gid_map and projid_map files sitting inside the /proc/<pid> directory.  These files may only be written to once and never updated3

As an example, to set up a build container, I need a remapping for every id that would be created during installation.  Traditionally for Linux, these are ids 0-999.  I want to remap them down to something unprivileged, say 100,000 so my line entry for this is

0 100000 1000

However, I also want an identity mapping for my own id (currently I’m at uid 1000), so I can still use my home directory from within the build container.  This also means I can create the roots for the containers within my home directory.  Finally, the nobody user and nobody,nogroup groups also need to be mapped, so the final uid map entries look like

0 100000 1000
1000 1000 1
65534 101001 1

For the groups, it’s even more complex because on openSUSE, I’m a member of the users group (gid 100) which sits in the middle of the privileged 0-999 group range, so the gid_map entry I use is

0 100000 100
100 100 1
101 100100 899
65533 101000 2

Which is almost up to the kernel imposed limit of five separate lines.

Finally, here’s how to set this up and create a binding for the user namespace.  As myself (I’m uid 1000 user name jejb) I do

jejb@jarvis:~> unshare --user
nobody@jarvis:~> echo $$
20211
nobody@jarvis:~>

Note that I become nobody inside the container because currently the map files are unwritten so there are no mapped ids at all.  Now as root, I have to write the mapping files and bind the entry file to the namespace somewhere

jarvis:/home/jejb # echo 1|awk '{print "0 100000 1000\n1000 1000 1\n65534 101001 1"}' > /proc/20211/uid_map
jarvis:/home/jejb # echo 1|awk '{print "0 100000 100\n100 100 1\n101 100100 899\n65533 101000 2"}' > /proc/20211/gid_map
jarvis:/home/jejb # touch /tmp/userns
jarvis:/home/jejb # mount --bind /proc/20211/ns/user /tmp/userns

Now I can exit my user namespace because it’s permanently bound and the next time I enter it I become root inside the container (although with uid 100000 outside)

jejb@jarvis:~> nsenter --user=/tmp/userns
root@jarvis:~# id
uid=0(root) gid=0(root) groups=0(root)
root@jarvis:~# su - jejb
jejb@jarvis:~> id
uid=1000(jejb) gid=100(users) groups=100(users)

Giving me a user namespace with sufficient mapped ids to create a build container.

Unprivileged Architecture Emulation Containers

Essentially, I can use the user namespace constructed above to bootstrap and enter the entire build container and its mount namespace with one proviso that I have to have a pre-created devices directory because I don’t possess the mknod capability as myself, so my container root also doesn’t possess it.  The way I get around this is to create the initial dev directory as root and then change the ownership to 100000.100000 (my unprivileged ids)

jejb@jarvis:~/containers/debian-amd64/dev> ls -l
total 0
lrwxrwxrwx 1 100000 100000 13 Feb 20 09:45 fd -> /proc/self/fd/
crw-rw-rw- 1 100000 100000 1, 7 Feb 20 09:45 full
crw-rw-rw- 1 100000 100000 1, 3 Feb 20 09:45 null
lrwxrwxrwx 1 100000 100000 8 Feb 20 09:45 ptmx -> pts/ptmx
drwxr-xr-x 2 100000 100000 6 Feb 20 09:45 pts/
crw-rw-rw- 1 100000 100000 1, 8 Feb 20 09:45 random
drwxr-xr-x 2 100000 100000 6 Feb 20 09:45 shm/
lrwxrwxrwx 1 100000 100000 15 Feb 20 09:45 stderr -> /proc/self/fd/2
lrwxrwxrwx 1 100000 100000 15 Feb 20 09:45 stdin -> /proc/self/fd/0
lrwxrwxrwx 1 100000 100000 15 Feb 20 09:45 stdout -> /proc/self/fd/1
crw-rw-rw- 1 100000 100000 5, 0 Feb 20 09:45 tty
crw-rw-rw- 1 100000 100000 1, 9 Feb 20 09:45 urandom
crw-rw-rw- 1 100000 100000 1, 5 Feb 20 09:45 zero

This seems to be sufficient of a dev skeleton to function with.  For completeness sake, I placed my bound user namespace into /run/build-container/userns and following the original architecture container emulation post, with modified susebootstrap and build-container scripts.  The net result is that as myself I can now enter and administer and update the build and test architecture emulation container with no suid required

jejb@jarvis:~> nsenter --user=/run/build-container/userns --mount=/run/build-container/ppc64
root@jarvis:/# id
uid=0(root) gid=0(root) groups=0(root)
root@jarvis:/# uname -m
ppc64
root@jarvis:/# su - jejb
jejb@jarvis:~> id
uid=1000(jejb) gid=100(users) groups=100(users)

The only final wrinkle is that root has to set up the user namespace on every boot, but that’s only because there’s no currently defined operating system way of doing this.

April 27, 2016 06:52 PM

April 26, 2016

LPC 2016: Checkpoint-Restore Microconference Accepted into 2016 Linux Plumbers Conference

This year will feature a four-fold deeper dive into checkpoint-restore technology, thanks to participation by people from a number of additional related projects! These are the OpenMPI message-passing library, Berkeley Lab Checkpoint/Restart (BLCR), and Distributed MultiThreaded CheckPointing (DMTCP) (not to be confused with TCP/IP), in addition to the Checkpoint/Restore in Userspace group that has participated in prior years.

Docker integration remains a hot topic, as is post-copy live migration, as well as testing/validation. As you might guess from the inclusion of people from BLCR and OpenMPI, checkpoint-restore for distributed workloads (rather than just single systems) is an area of interest.

Please join us for a timely and important discussion!

April 26, 2016 02:43 PM

April 25, 2016

Paul E. Mc Kenney: Checkpoint-Restore Microconference Accepted into 2016 Linux Plumbers Conference

This year will feature a four-fold deeper dive into checkpoint-restore technology, thanks to participation by people from a number of additional related projects! These are the OpenMPI message-passing library, Berkeley Lab Checkpoint/Restart (BLCR), and Distributed MultiThreaded CheckPointing (DMTCP) (not to be confused with TCP/IP), in addition to the Checkpoint/Restore in Userspace group that has participated in prior years.

Docker integration remains a hot topic, as is post-copy live migration, as well as testing/validation. As you might guess from the inclusion of people from BLCR and OpenMPI, checkpoint-restore for distributed workloads (rather than just single systems) is an area of interest.

Please join us for a timely and important discussion!

April 25, 2016 05:28 PM

April 24, 2016

LPC 2016: Testing & Fuzzing Microconference Accepted into 2016 Linux Plumbers Conference

Testing, fuzzing, and other diagnostics have made the Linux ecosystem much more robust than in the past, but there are still embarrassing bugs. Furthermore, million-year bugs will be happening many times per day across Linux’s huge installed base, so there is clearly need for even more aggressive validation.

The Testing and Fuzzing Microconference aims to significantly increase the aggression level of Linux-kernel validation, with discussions on tools and test suites including kselftest, syzkaller, trinity, mutation testing, and the 0day Test Robot. The effectiveness of these tools will be attested to by any of their victims, but we must further raise our game as the installed base of Linux continues to increase.

One additional way of raising the level of testing aggression is to document the various ABIs in machine-readable format, thus lowering the barrier to entry for new projects. Who knows? Perhaps Linux testing will be driven by artificial-intelligence techniques!

Join us for an important and spirited discussion!

April 24, 2016 07:43 PM

Daniel Vetter: Should the X.org Foundation join SPI? Vote Now!



In case you missed it: Please vote now on https://members.x.org/login.php!

April 24, 2016 09:14 AM

April 22, 2016

LPC 2016: Device Tree Microconference Accepted into 2016 Linux Plumbers Conference

Device-tree discussions are probably not quite as spirited as in the past, however, device tree is an active area. In particular, significant issues remain.

This microconference will cover the updated device-tree specification, debugging, bindings validation, and core-kernel code directions. In addition, there will be discussion of what does (and does not) go into the device tree, along with the device-creation and driver binding ordering swamp.

Join us for an important and spirited discussion!

April 22, 2016 01:33 PM

Matthew Garrett: Circumventing Ubuntu Snap confinement

Ubuntu 16.04 was released today, with one of the highlights being the new Snap package format. Snaps are intended to make it easier to distribute applications for Ubuntu - they include their dependencies rather than relying on the archive, they can be updated on a schedule that's separate from the distribution itself and they're confined by a strong security policy that makes it impossible for an app to steal your data.

At least, that's what Canonical assert. It's true in a sense - if you're using Snap packages on Mir (ie, Ubuntu mobile) then there's a genuine improvement in security. But if you're using X11 (ie, Ubuntu desktop) it's horribly, awfully misleading. Any Snap package you install is completely capable of copying all your private data to wherever it wants with very little difficulty.

The problem here is the X11 windowing system. X has no real concept of different levels of application trust. Any application can register to receive keystrokes from any other application. Any application can inject fake key events into the input stream. An application that is otherwise confined by strong security policies can simply type into another window. An application that has no access to any of your private data can wait until your session is idle, open an unconfined terminal and then use curl to send your data to a remote site. As long as Ubuntu desktop still uses X11, the Snap format provides you with very little meaningful security. Mir and Wayland both fix this, which is why Wayland is a prerequisite for the sandboxed xdg-app design.

I've produced a quick proof of concept of this. Grab XEvilTeddy from git, install Snapcraft (it's in 16.04), snapcraft snap, sudo snap install xevilteddy*.snap, /snap/bin/xevilteddy.xteddy . An adorable teddy bear! How cute. Now open Firefox and start typing, then check back in your terminal window. Oh no! All my secrets. Open another terminal window and give it focus. Oh no! An injected command that could instead have been a curl session that uploaded your private SSH keys to somewhere that's not going to respect your privacy.

The Snap format provides a lot of underlying technology that is a great step towards being able to protect systems against untrustworthy third-party applications, and once Ubuntu shifts to using Mir by default it'll be much better than the status quo. But right now the protections it provides are easily circumvented, and it's disingenuous to claim that it currently gives desktop users any real security.

comment count unavailable comments

April 22, 2016 01:51 AM

April 20, 2016

Paul E. Mc Kenney: Testing & Fuzzing Microconference Accepted into 2016 Linux Plumbers Conference

Testing, fuzzing, and other diagnostics have made the Linux ecosystem much more robust than in the past, but there are still embarrassing bugs. Furthermore, million-year bugs will be happening many times per day across Linux's huge installed base, so there is clearly need for even more aggressive validation.

The Testing and Fuzzing Microconference aims to significantly increase the aggression level of Linux-kernel validation, with discussions on tools and test suites including kselftest, syzkaller, trinity, mutation testing, and the 0day Test Robot. The effectiveness of these tools will be attested to by any of their victims, but we must further raise our game as the installed base of Linux continues to increase.

One additional way of raising the level of testing aggression is to document the various ABIs in machine-readable format, thus lowering the barrier to entry for new projects. Who knows? Perhaps Linux testing will be driven by artificial-intelligence techniques!

Join us for an important and spirited discussion!

April 20, 2016 04:43 PM

Daniel Vetter: X.org Foundation Election - Vote Now!

It's election season in X.org land, and it matters: Besides new board seats we're also voting on bylaw changes and whether to join SPI or not.

Personally, and as the secretary of the board I'm very much in favour of joining SPI. It will allow us to offload all the boring bits of running a foundation, and those are also all the bits we tend to struggle with. And that would give the board more time to do things that actually matter and help the community. And all that for a really reasonable price - running our own legal entity isn't free, and not really worth it for our small budget mostly consisting of travel sponsoring and the occasional internship.

And bylaw changes need a qualified supermajority of all members, every vote counts and not voting essentially means voting no. Hence please vote, and please vote even when you don't want to join - this is our second attempt and I'd really like to see a clear verdict from our members, one way or the other.

Thanks.

Voting closes by  Apr 26 23:59 UTC, but please don't cut it short, it's a computer that decides when it's over ...

April 20, 2016 07:24 AM

April 18, 2016

Paul E. Mc Kenney: Device Tree Microconference Accepted into 2016 Linux Plumbers Conference

Device-tree discussions are probably not quite as spirited as in the past, however, device tree is an active area. In particular, significant issues remain.

This microconference will cover the updated device-tree specification, debugging, bindings validation, and core-kernel code directions. In addition, there will be discussion of what does (and does not) go into the device tree, along with the device-creation and driver binding ordering swamp.

Join us for an important and spirited discussion!

April 18, 2016 03:21 PM

Matthew Garrett: One more attempt at SATA power management

Around a year ago I wrote some patches in an attempt to improve power management on Haswell and Broadwell systems by configuring Serial ATA power management appropriately. I got a couple of reports of them triggering SATA errors for some users, couldn't reproduce them myself and so didn't have a lot of confidence in them. Time passed.

I've been working on power management stuff again this week, so it seemed like a good opportunity to revisit these. I've made a few changes and pushed a couple of trees - one against master and one against 4.5.

First, these probably only have relevance to users of mobile Intel parts in the U or S range (/proc/cpuinfo will tell you - you're looking for a four-digit number that starts with 4 (Haswell), 5 (Broadwell) or 6 (Skylake) and ends with U or S), and won't do anything unless you have SATA drives (including PCI-based SATA). To test them, first disable anything like TLP that might alter your SATA link power management policy. Then check powertop - you should only be getting to PC3 at best. Build a kernel with these patches and boot it. /sys/class/scsi_host/*/link_power_management_policy should read "firmware". Check powertop and see whether you're getting into deeper PC states. Now run your system for a while and check the kernel log for any SATA errors that you didn't see before.

Let me know if you see SATA errors and are willing to help debug this, and leave a comment if you don't see any improvement in PC states.

comment count unavailable comments

April 18, 2016 02:15 AM

April 17, 2016

LPC 2016: PCI Microconference Accepted into 2016 Linux Plumbers Conference

Given that PCI was introduced more than two decades ago and that PCI Express was introduced more than ten years ago, one might think that the Linux plumbing already did everything possible to support PCI.

One would be quite wrong.

One issue with current PCI support is that resource allocation is handled on a per-architecture basis, leading to duplicate code, and, worse yet, duplicate bugs. This microconference will therefore look into possible consolidation of this code.

Another issue is integration of PCI’s message-signaled interrupts (MSI) into the kernel’s core interrupt code. Whilst the device tree bindings and relative core code for MSI management have just been integrated, legacy code needs more work, including architecture-specific code and PCI legacy host controller drivers. In addition, attention should be paid to the ACPI bindings, to the related ACPI kernel core code, and to the MSI passthrough usage in virtualized environments.

Advances in Virtualization technologies, System I/O devices and their memory management through IOMMU components are driving features in the PCI Express specifications (Address Translation Services (ATS) and Page Request Interface (PRI)). This microconference will also foster debate on the best way to integrate these features in the kernel in a seamless way for different architectures.

Of course, no discussion of PCI would be complete without considering firmware issues, hardware quirks, power management, and the interplay between device tree and ACPI.

Join us for an important new-to-Plumbers PCI discussion!

April 17, 2016 04:01 PM

April 15, 2016

Pavel Machek: Python is a nice... trap

Python is a nice... language

Easy to program in. No need to recompile, you can hack in as you would in shell. Fun.
Lets see if I can predict the weather. Oh yes, lets hack something in python. Hey, it works. But it is slow on a PC and so slow on a phone that future is already past when you predict is.
Then you try to improve the speed. That's very very bad idea. You can't optimize Python, but you can spend hours trying. You make a mental note to never ever use Python for computation again.
But Python should be fine for simple Gtk project, communicating over Dbus, right?
For bigger projects, Python has some support. Modules / object orientation really helps there. Lack of compilation really hurts there, but this is fun language, right? Allowing you to write nice and clean code.
Refactoring is hard, because variables are not declared.
Python is a nice... trap
But then your project grows larger, and you realize it takes 30 seconds to start up. Because many module need gtk, and importing gtk takes time. Heck... compiling _and_ running C based hello world is still faster then running Python based one.
Ok, C++ was tempting. Gtk Hello world takes 55 seconds to compile. My g++ does not support "auto", so the application starts with Glib::RefPtr<Gtk::Application> app = Gtk::Application::create(argc, argv, "org.gtkmm.example"). I think I'll stick with C.

Oh, and I already used my main SIM in N900/Debian... and I really used N900/Debian. I needed to tether a network, but not provide wifi hotspot. Done.

April 15, 2016 11:17 AM

Matthew Garrett: David MacKay

The first time I was paid to do software development came as something of a surprise to me. I was working as a sysadmin in a computational physics research group when a friend asked me if I'd be willing to talk to her PhD supervisor. I had nothing better to do, so said yes. And that was how I started the evening having dinner with David MacKay, and ended the evening better fed, a little drunker and having agreed in principle to be paid to write free software.

I'd been hired to work on Dasher, an information-efficient text entry system. It had been developed by one of David's students as a practical demonstration of arithmetic encoding after David had realised that presenting a visualisation of an effective compression algorithm allowed you to compose text without having to enter as much information into the system. At first this was merely a neat toy, but it soon became clear that the benefits of Dasher had a great deal of overlap with good accessibility software. It required much less precision of input, it made it easy to correct mistakes (you merely had to reverse direction in order to start zooming back out of the text you had entered) and it worked with a variety of input technologies from mice to eye tracking to breathing. My job was to take this codebase and turn it into a project that would be interesting to external developers.

In the year I worked with David, we turned Dasher from a research project into a well-integrated component of Gnome, improved its support for Windows, accepted code from an external contributor who ported it to OS X (using an OpenGL canvas!) and wrote ports for a range of handheld devices. We added code that allowed Dasher to directly control the UI of other applications, making it possible for people to drive word processors without having to leave Dasher. We taught Dasher to speak. We strove to avoid the mistakes present in so many other pieces of accessibility software, such as configuration that could only be managed by an (expensive!) external consultant. And we visited Dasher users and learned how they used it and what more they needed, then went back home and did what we could to provide that.

Working on Dasher was an incredible opportunity. I was involved in the development of exciting code. I spoke on it at multiple conferences. I became part of the Gnome community. I visited the USA for the first time. I entered people's homes and taught them how to use Dasher and experienced their joy as they realised that they could now communicate up to an order of magnitude more quickly. I wrote software that had a meaningful impact on the lives of other people.

Working with David was certainly not easy. Our weekly design meetings were, charitably, intense. He had an astonishing number of ideas, and my job was to figure out how to implement them while (a) not making the application overly complicated and (b) convincing David that it still did everything he wanted. One memorable meeting involved me gradually arguing him down from wanting five new checkboxes to agreeing that there were only two combinations that actually made sense (and hence a single checkbox) - and then admitting that this was broadly equivalent to an existing UI element, so we could just change the behaviour of that slightly without adding anything. I took the opportunity to delete an additional menu item in the process.

I was already aware of the importance of free software in terms of developers, but working with David made it clear to me how important it was to users as well. A community formed around Dasher, helping us improve it and allowing us to develop support for new use cases that made the difference between someone being able to type at two words per minute and being able to manage twenty. David saw that this collaborative development would be vital to creating something bigger than his original ideas, and it succeeded in ways he couldn't have hoped for.

I spent a year in the group and then went back to biology. David went on to channel his strong feelings about social responsibility into issues such as sustainable energy, writing a freely available book on the topic. He served as chief adviser to the UK Department of Energy and Climate Change for five years. And earlier this year he was awarded a knighthood for his services to scientific outreach.

David died yesterday. It's unlikely that I'll ever come close to what he accomplished, but he provided me with much of the inspiration to try to do so anyway. The world is already a less fascinating place without him.

comment count unavailable comments

April 15, 2016 06:26 AM

April 13, 2016

Matthew Garrett: Skylake's power management under Linux is dreadful and you shouldn't buy one until it's fixed

(Edit to add: this issue is restricted to the mobile SKUs. Desktop parts have very different power management behaviour)

Linux 4.5 seems to have got Intel's Skylake platform (ie, 6th-generation Core CPUs) to the point where graphics work pretty reliably, which is great progress (4.4 tended to lose all my windows every so often, especially over suspend/resume). I'm even running Wayland happily. Unfortunately one of the reasons I have a laptop is that I want to be able to do things like use it on battery, and power consumption's an important part of that. Skylake continues the trend from Haswell of moving to an SoC-type model where clock and power domains are shared between components that were previously entirely independent, and so you can't enter deep power saving states unless multiple components all have the correct power management configuration. On Haswell/Broadwell this manifested in the form of Serial ATA link power management being involved in preventing the package from going into deep power saving states - setting that up correctly resulted in a reduction in full-system power consumption of about 40%[1].

I've now got a Skylake platform with a nice shiny NVMe device, so Serial ATA policy isn't relevant (the platform doesn't even expose a SATA controller). The deepest power saving state I can get into is PC3, despite Skylake supporting PC8 - so I'm probably consuming about 40% more power than I should be. And nobody seems to know what needs to be done to fix this. I've found no public documentation on the power management dependencies on Skylake. Turning on everything in Powertop doesn't improve anything. My battery life is pretty poor and the system is pretty warm.

The best thing about this is the following statement from page 64 of the 6th Generation Intel ® Processor Datasheet for U-Platforms:

Caution: Long term reliability cannot be assured unless all the Low-Power Idle States are enabled.

which is pretty concerning. Without support for states deeper than PC3, Linux is running in a configuration that Intel imply may trigger premature failure. That's obviously not good. Until this situation is improved, you probably shouldn't buy any Skylake systems if you're planning on running Linux.

[1] These patches never went upstream. Someone reported that they resulted in their SSD throwing errors and I couldn't find anybody with deeper levels of SATA experience who was interested in working on the problem. Intel's AHCI drivers for Windows do the right thing, but I couldn't find anybody at Intel who could get any information from their Windows driver team.

comment count unavailable comments

April 13, 2016 10:16 PM

Paul E. Mc Kenney: PCI Microconference Accepted into 2016 Linux Plumbers Conference

Given that PCI was introduced more than two decades ago and that PCI Express was introduced more than ten years ago, one might think that the Linux plumbing already did everything possible to support PCI.

One would be quite wrong.

One issue with current PCI support is that resource allocation is handled on a per-architecture basis, leading to duplicate code, and, worse yet, duplicate bugs. This microconference will therefore look into possible consolidation of this code.

Another issue is integration of PCI's message-signaled interrupts (MSI) into the kernel's core interrupt code. Whilst the device tree bindings and relative core code for MSI management have just been integrated, legacy code needs more work, including architecture-specific code and PCI legacy host controller drivers. In addition, attention should be paid to the ACPI bindings, to the related ACPI kernel core code, and to the MSI passthrough usage in virtualized environments.

Advances in Virtualization technologies, System I/O devices and their memory management through IOMMU components are driving features in the PCI Express specifications (Address Translation Services (ATS) and Page Request Interface (PRI)). This microconference will also foster debate on the best way to integrate these features in the kernel in a seamless way for different architectures.

Of course, no discussion of PCI would be complete without considering firmware issues, hardware quirks, power management, and the interplay between device tree and ACPI.

Join us for an important new-to-Plumbers PCI discussion!

April 13, 2016 08:25 PM

April 12, 2016

LPC 2016: Call for Refereed-Track Proposals

We are pleased to announce the Call for Refereed-Track Proposals for the 2016 edition of the Linux Plumbers Conference, which will held be in Santa Fe, NM, USA on November 2-4 in conjunction with Linux Kernel Summit.

Refereed track presentations are 50 minutes in length and should focus on a specific aspect of the “plumbing” in the Linux system. Examples of Linux plumbing include core kernel subsystems, core libraries, windowing systems, management tools, device support, media creation/playback, and so on. The best presentations are not about finished work, but rather problems, proposals, or proof-of-concept solutions that require face-to-face discussions and debate.

Given that Plumbers is not colocated with LinuxCon this year, we are spreading the refereed-track talks over all three days. This provides a change of pace and also provides a conflict-free schedule for the refereed-track talks. (Yes, this does result in more conflicts between the referred-track talks and the Microconferences, but there never is a truly free lunch.)

Linux Plumbers Conference Program Committee members will be reviewing all submitted sessions. High-quality submisssion that cannot be accepted due to the limited number of slots will be forwarded to the Microconference leads for further consideration.

To submit a refereed track talk proposal follow the instructions at this website.

Submissions are due on or before Thursday 1 September, 2016 at 11:59PM Pacific Time. Since this is after the closure of early registration, speakers may register before this date and we’ll refund the registration for any selected presentation’s speaker, but for only one speaker per presentation.

Finally, we are still accepting Microconference submissions here.
 
Dates:

Submissions close: September 1, 2016
Speakers notified: September 22, 2016
Slides due: November 1, 2016

April 12, 2016 02:40 PM

LPC 2016: Live Kernel Patching Microconference Accepted into 2016 Linux Plumbers Conference

Live kernel patching was accepted into the Linux kernel in v4.0 in February 2015, so we can declare the 2014 LPC Live Kernel Patching Microconference to have been a roaring success! However, as was noted at the time, this is just the beginning of the real work. In short, the v4.0 work makes live kernel patching possible, but more work is required to make it more reliable and more routine.

Additional issues include stacktrace reliability, patch-safety criteria for kernel threads, thread consistency models, porting to non-x86 architectures, handling of loadable modules, compiler optimizations, userspace tooling, patching of data, automated regression testing, and patch-creation guidelines.

Join us for an important and spirited discussion!

April 12, 2016 01:01 AM

April 11, 2016

Paul E. Mc Kenney: 2016 Linux Plumbers Conference Call for Refereed-Track Proposals



 

 

Dates:

Submissions close: September 1, 2016
Speakers notified: September 22, 2016
Slides due: November 1, 2016

We are pleased to announce the Call for Refereed-Track Proposals for the 2016 edition of the Linux Plumbers Conference, which will held be in Santa Fe, NM, USA on November 2-4 in conjunction with Linux Kernel Summit.

Refereed track presentations are 50 minutes in length and should focus on a specific aspect of the "plumbing" in the Linux system. Examples of Linux plumbing include core kernel subsystems, core libraries, windowing systems, management tools, device support, media creation/playback, and so on. The best presentations are not about finished work, but rather problems, proposals, or proof-of-concept solutions that require face-to-face discussions and debate.

Given that Plumbers is not colocated with LinuxCon this year, we are spreading the refereed-track talks over all three days. This provides a change of pace and also provides a conflict-free schedule for the refereed-track talks. (Yes, this does result in more conflicts between the referred-track talks and the Microconferences, but there never is a truly free lunch.)

Linux Plumbers Conference Program Committee members will be reviewing all submitted sessions. High-quality submisssion that cannot be accepted due to the limited number of slots will be forwarded to the Microconference leads for further consideration.

To submit a refereed track talk proposal follow the instructions at this website.

Submissions are due on or before Thursday 1 September, 2016 at 11:59PM Pacific Time. Since this is after the closure of early registration, speakers may register before this date and we'll refund the registration for any selected presentation's speaker, but for only one speaker per presentation.

Finally, we are still accepting Microconference submissions here.

April 11, 2016 07:17 PM

Pavel Machek: pypy: suprisingly good

Using python for weather forecasting was a mistake... but it came with some surprises. It looks like pypy is as fast as gcc -O0 on simple correlation computation: https://gitlab.com/tui/tui/blob/master/nowcast/pypybench.py (and pypybench.c). gcc -O3 is twice as fast, and plain python2 is 20x slower (or 2000% slower). Python3 is 28x. slower. That's better than I expected for the JIT technology.

April 11, 2016 09:51 AM

Pavel Machek: Finally... power management on Nokia N900

After long long fight, it seems power management on Nokia N900 works for me for the first time. N900 is very picky about its configuration (you select lockdep, you lose video; you select something else 50mA power consumption... not good). That was the last major piece... I hope. I should have usable phone soon.

April 11, 2016 09:45 AM

Matthew Garrett: Making it easier to deploy TPMTOTP on non-EFI systems

I've been working on TPMTOTP a little this weekend. I merged a pull request that adds command-line argument handling, which includes the ability to choose the set of PCRs you want to seal to without rebuilding the tools, and also lets you print the base32 encoding of the secret rather than the qr code so you can import it into a wider range of devices. More importantly it also adds support for setting the expected PCR values on the command line rather than reading them out of the TPM, so you can now re-seal the secret against new values before rebooting.

I also wrote some new code myself. TPMTOTP is designed to be usable in the initramfs, allowing you to validate system state before typing in your passphrase. Unfortunately the initramfs itself is one of the things that's measured. So, you end up with something of a chicken and egg problem - TPMTOTP needs access to the secret, and the obvious thing to do is to put the secret in the initramfs. But the secret is sealed against the hash of the initramfs, and so you can't generate the secret until after the initramfs. Modify the initramfs to insert the secret and you change the hash, so the secret is no longer released. Boo.

On EFI systems you can handle this by sticking the secret in an EFI variable (there's some special-casing in the code to deal with the additional metadata on the front of things you read out of efivarfs). But that's not terribly useful if you're not on an EFI system. Thankfully, there's a way around this. TPMs have a small quantity of nvram built into them, so we can stick the secret there. If you pass the -n argument to sealdata, that'll happen. The unseal apps will attempt to pull the secret out of nvram before falling back to looking for a file, so things should just magically work.

I think it's pretty feature complete now, other than TPM2 support? That's on my list.

comment count unavailable comments

April 11, 2016 05:59 AM

April 08, 2016

Rusty Russell: Bitcoin Generic Address Format Proposal

I’ve been implementing segregated witness support for c-lightning; it’s interesting that there’s no address format for the new form of addresses.  There’s a segregated-witness-inside-p2sh which uses the existing p2sh format, but if you want raw segregated witness (which is simply a “0” followed by a 20-byte or 32-byte hash), the only proposal is BIP142 which has been deferred.

If we’re going to have a new address format, I’d like to make the case for shifting away from bitcoin’s base58 (eg. 1At1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2):

  1. base58 is not trivial to parse.  I used the bignum library to do it, though you can open-code it as bitcoin-core does.
  2. base58 addresses are variable-length.  That makes webforms and software mildly harder, but also eliminates a simple sanity check.
  3. base58 addresses are hard to read over the phone.  Greg Maxwell points out that the upper and lower case mix is particularly annoying.
  4. The 4-byte SHA check does not guarantee to catch the most common form of errors; transposed or single incorrect letters, though it’s pretty good (1 in 4 billion chance of random errors passing).
  5. At around 34 letters, it’s fairly compact (36 for the BIP141 P2WPKH).

This is my proposal for a generic replacement (thanks to CodeShark for generalizing my previous proposal) which covers all possible future address types (as well as being usable for current ones):

  1. Prefix for type, followed by colon.  Currently “btc:” or “testnet:“.
  2. The full scriptPubkey using base 32 encoding as per http://philzimmermann.com/docs/human-oriented-base-32-encoding.txt.
  3. At least 30 bits for crc64-ecma, up to a multiple of 5 to reach a letter boundary.  This covers the prefix (as ascii), plus the scriptPubKey.
  4. The final letter is the Damm algorithm check digit of the entire previous string, using this 32-way quasigroup. This protects against single-letter errors as well as single transpositions.

These addresses look like btc:ybndrfg8ejkmcpqxot1uwisza345h769ybndrrfg (41 digits for a P2WPKH) or btc:yybndrfg8ejkmcpqxot1uwisza345h769ybndrfg8ejkmcpqxot1uwisza34 (60 digits for a P2WSH) (note: neither of these has the correct CRC or check letter, I just made them up).  A classic P2PKH would be 45 digits, like btc:ybndrfg8ejkmcpqxot1uwisza345h769wiszybndrrfg, and a P2SH would be 42 digits.

While manually copying addresses is something which should be avoided, it does happen, and the cost of making them robust against common typographic errors is small.  The CRC is a good idea even for machine-based systems: it will let through less than 1 in a billion mistakes.  Distinguishing which blockchain is a nice catchall for mistakes, too.

We can, of course, bikeshed this forever, but I wanted to anchor the discussion with something I consider fairly sane.

April 08, 2016 01:50 AM

April 07, 2016

Andi Kleen: Overview of Last Branch Records (LBRs) on Intel CPUs

I wrote a two part article on theory and practice of Last Branch
Records using Linux perf. They were published on LWN.

This includes the background (what is a last branch record and why
branch sampling), what you can use it for in perf, like hot path
profiling, frame pointerless callgraphs, or automatic micro benchmarks,
and also other uses like continuous compiler feedback.

The articles are now freely available:

Part 1: Introduction of Last Branch Records
Part 2: Advanced uses of Last Branch Records

April 07, 2016 02:26 PM

April 06, 2016

Pete Zaitcev: Amateur contributors to OpenStack

John was venting about our complicated contribution process in OpenStack and threw this off-hand remark:

I'm sure the fact that nearly 100% of @openstack contributors are paid to be so is completely unrelated. #eyeroll

While I share his frustration, one thing he may be missing is that OpenStack is generally useless to anyone who does not have thousands of computers dedicated to it. This is a significant barrier of entry for hobbyists, baked straight into the nature of OpenStack.

Exceptions that we see are basically people building little pseudo-clusters out of a dozen of VMs. They do it with an aim of advancing their careers.

April 06, 2016 04:44 PM

April 05, 2016

Matthew Garrett: There's more than one way to exploit the commons

There's a piece of software called XScreenSaver. It attempts to fill two somewhat disparate roles:

XScreenSaver does an excellent job of the second of these[2] and is pretty good at the first, which is to say that it only suffers from a disastrous security flaw once very few years and as such is certainly not appreciably worse than any other piece of software.

Debian ships an operating system that prides itself on stability. The Debian definition of stability is a very specific one - rather than referring to how often the software crashes or misbehaves, it refers to how often the software changes behaviour. Debian is very reluctant to upgrade software that is part of a stable release, to the extent that developers will attempt to backport individual security fixes to the version they shipped rather than upgrading to a release that contains all those security fixes but also adds a new feature. The argument here is that the new release may also introduce new bugs, and Debian's users desire stability (in the "things don't change" sense) more than new features. Backporting security fixes keeps them safe without compromising the reason they're running Debian in the first place.

This all makes plenty of sense at a theoretical level, but reality is sometimes less convenient. The first problem is that security bugs are typically also, well, bugs. They may make your software crash or misbehave in annoying but apparently harmless ways. And when you fix that bug you've also fixed a security bug, but the ability to determine whether a bug is a security bug or not is one that involves deep magic and a fanatical devotion to the cause so given the choice between maybe asking for a CVE and dealing with embargoes and all that crap when perhaps you've actually only fixed a bug that makes the letter "E" appear in places it shouldn't and not one that allows the complete destruction of your intergalactic invasion fleet means people will tend to err on the side of "Eh fuckit" and go drinking instead. So new versions of software will often fix security vulnerabilities without there being any indication that they do so[3], and running old versions probably means you have a bunch of security issues that nobody will ever do anything about.

But that's broadly a technical problem and one we can apply various metrics to, and if somebody wanted to spend enough time performing careful analysis of software we could have actual numbers to figure out whether the better security approach is to upgrade or to backport fixes. Conversations become boring once we introduce too many numbers, so let's ignore that problem and go onto the second, which is far more handwavy and social and so significantly more interesting.

The second problem is that upstream developers remain associated with the software shipped by Debian. Even though Debian includes a tool for reporting bugs against packages included in Debian, some users will ignore that and go straight to the upstream developers. Those upstream developers then have to spend at least 15 or so seconds telling the user that the bug they're seeing has been fixed for some time, and then figure out how to explain that no sorry they can't make Debian include a fixed version because that's not how things work. Worst case, the stable release of Debian ends up including a bug that makes software just basically not work at all and everybody who uses it assumes that the upstream author is brutally incompetent, and they end up quitting the software industry and I don't know running a nightclub or something.

From the Debian side of things, the straightforward solution is to make it more obvious that users should file bugs with Debian and not bother the upstream authors. This doesn't solve the problem of damaged reputation, and nor does it entirely solve the problem of users contacting upstream developers. If a bug is filed with Debian and doesn't get fixed in a timely manner, it's hardly surprising that users will end up going upstream. The Debian bugs list for XScreenSaver does not make terribly attractive reading.

So, coming back to the title for this entry. The most obvious failure of the commons is where a basically malicious actor consumes while giving nothing back, but if an actor with good intentions ends up consuming more than they contribute that may still be a problem. An upstream author releases a piece of software under a free license. Debian distributes this to users. Debian's policies result in the upstream author having to do more work. What does the upstream author get out of this exchange? In an ideal world, plenty. The author's software is made available to more people. A larger set of developers is willing to work on making improvements to the software. In a less ideal world, rather less. The author has to deal with bug mail about already fixed bugs. The author's reputation may be harmed by user exposure to said fixed bugs. The author may get less in the way of useful bug fixes or features because people are running old versions rather than fixing new ones. If the balance tips towards the latter, the author's decision to release their software under a free license has made their life more difficult.

Most discussions about Debian's policies entirely ignore the latter scenario, focusing more on the fact that the author chose to release their software under a free license to begin with. If the author is unwilling to handle the consequences of that, goes the argument, why did they do it in the first place? The unfortunate logical conclusion to that argument is that the author realises that they made a huge mistake and never does so again, and woo uh oops.

The irony here is that one of Debian's foundational documents, the Debian Free Software Guidelines, makes allowances for this. Section 4 allows for distribution of software in Debian even if the author insists that modified versions[4] are renamed. This allows for an author to make a choice - allow themselves to be associated with the Debian version of their work and increase (a) their userbase and (b) their support load, or try to distinguish what Debian ship from their identity. But that document was ratified in 1997 and people haven't really spent much time since then thinking about why it says what it does, and so this tradeoff is rarely considered.

Free software doesn't benefit from distributions antagonising their upstreams, even if said upstream is a cranky nightclub owner. Debian's users are Debian's highest priority, but those users are going to suffer if developers decide that not using free licenses improves their quality of life. Kneejerk reactions around specific instances aren't helpful, but now is probably a good time to start thinking about what value Debian bring to its upstream authors and how that can be increased. Failing to do so doesn't serve users, Debian itself or the free software community as a whole.

[1] The X server has no fundamental concept of a screen lock. This is implemented by an application asking that the X server send all keyboard and mouse input to it rather than to any other application, and then that application creating a window that fills the screen. Due to some hilarious design decisions, opening a pop-up menu in an application prevents any other application from being able to grab input and so it is impossible for the screensaver to activate if you open a menu and then walk away from your computer. This is merely the most obvious problem - there are others that are more subtle and more infuriating. The only fix in this case is to nuke the site from orbit.

[2] There's screenshots here. My favourites are the one that emulate the electrical characteristics of an old CRT in order to present a more realistic depiction of the output of an Apple 2 and the one that includes a complete 6502 emulator.

[3] And obviously new versions of software will often also introduce new security vulnerabilities without there being any indication that they do so, because who would ever put that in their changelog. But the less ethically challenged members of the security community are more likely to be looking at new versions of software than ones released three years ago, so you're probably still tending towards winning overall

[4] There's a perfectly reasonable argument that all packages distributed by Debian are modified in some way

comment count unavailable comments

April 05, 2016 11:53 PM

Matthew Garrett: TPMs, event logs, fine-grained measurements and avoiding fragility in remote-attestation

Trusted Platform Modules are fairly unintelligent devices. They can do some crypto, but they don't have any ability to directly monitor the state of the system they're attached to. This is worked around by having each stage of the boot process "measure" state into registers (Platform Configuration Registers, or PCRs) in the TPM by taking the SHA1 of the next boot component and performing an extend operation. Extend works like this:

New PCR value = SHA1(current value||new hash)

ie, the TPM takes the current contents of the PCR (a 20-byte register), concatenates the new SHA1 to the end of that in order to obtain a 40-byte value, takes the SHA1 of this 40-byte value to obtain a 20-byte hash and sets the PCR value to this. This has a couple of interesting properties:

But how do we know what those operations were? We control the bootloader and the kernel and we know what extend operations they performed, so that much is easy. But the firmware itself will have performed some number of operations (the firmware itself is measured, as is the firmware configuration, and certain aspects of the boot process that aren't in our control may also be measured) and we may not be able to reconstruct those from scratch.

Thankfully we have more than just the final PCR data. The firmware provides an interface to log each extend operation, and you can read the event log in /sys/kernel/security/tpm0/binary_bios_measurements. You can pull information out of that log and use it to reconstruct the writes the firmware made. Merge those with the writes you performed and you should be able to reconstruct the final TPM state. Hurrah!

The problem is that a lot of what you want to measure into the TPM may vary between machines or change in response to configuration changes or system updates. If you measure every module that grub loads, and if grub changes the order that it loads modules in, you also need to update your calculations of the end result. Thankfully there's a way around this - rather than making policy decisions based on the final TPM value, just use the final TPM value to ensure that the log is valid. If you extract each hash value from the log and simulate an extend operation, you should end up with the same value as is present in the TPM. If so, you know that the log is valid. At that point you can examine individual log entries without having to care about the order that they occurred in, which makes writing your policy significantly easier.

But there's another source of fragility. Imagine that you're measuring every command executed by grub (as is the case in the CoreOS grub). You want to ensure that no inappropriate commands have been run (such as ones that would allow you to modify the loaded kernel after it's been measured), but you also want to permit certain variations - for instance, you might have a primary root filesystem and a fallback root filesystem, and you're ok with either being passed as a kernel argument. One approach would be to write two lines of policy, but there's an even more flexible approach. If the bootloader logs the entire command into the event log, when replaying the log we can verify that the event description hashes to the value that was passed to the TPM. If it does, rather than testing against an explicit hash value, we can examine the string itself. If the event description matches a regular expression provided by the policy then we're good.

This approach makes it possible to write TPM policies that are resistant to changes in ordering and permit fine-grained definition of acceptable values, and which can cleanly separate out local policy, generated policy values and values that are provided by the firmware. The split between machine-specific policy and OS policy allows for the static machine-specific policy to be merged with OS-provided policy, making remote attestation viable even over automated system upgrades.

We've integrated an implementation of this kind of policy into the TPM support code we'd like to integrate into Kubernetes, and CoreOS will soon be generating known-good hashes at image build time. The combination of these means that people using Distributed Trusted Computing under Tectonic will be able to validate the state of their systems with nothing more than a minimal machine-specific policy description.

The support code for all of this should also start making it into other distributions in the near future (the grub code is already in Fedora 24), so with luck we can define a cross-distribution policy format and make it straightforward to handle this in a consistent way even in hetrogenous operating system environments. Remote attestation is a powerful tool for ensuring that your systems are in a valid state, but the difficulty of policy management has been a significant factor in making it difficult for people to deploy in their data centres. Making it easier for people to shield themselves against low-level boot attacks is a big step forward in improving the security of distributed workloads and makes bare-metal hosting a much more viable proposition.

comment count unavailable comments

April 05, 2016 04:08 PM

April 01, 2016

Rusty Russell: BIP9: versionbits In a Nutshell

Hi, I was one of the authors/bikeshedders of BIP9, which Pieter Wuille recently refined (and implemented) into its final form.  The bitcoin core plan is to use BIP9 for activations from now on, so let’s look at how it works!

Some background:

So, let’s look at BIP68 & 112 (Sequence locks and OP_CHECKSEQUENCEVERIFY) which are being activated together:

There are also two alerts in the bitcoin core implementation:

Now, when could the OP_CSV soft forks activate? bitcoin-core will only start setting the bit in the first period after the start date, so somewhere between 1st and 15th of May[1], then will take another period to lock-in (even if 95% of miners are already upgraded), then another period to activate.  So early June would be the earliest possible date, but we’ll get two weeks notice for sure.

The Old Algorithm

For historical purposes, I’ll describe how the old soft-fork code worked.  It used version as a simple counter, eg. 3 or above meant BIP66, 4 or above meant BIP65 support.  Every block, it examined the last 1000 blocks to see if more than 75% had the new version.  If so, then the new softfork rules were enforced on new version blocks: old version blocks would still be accepted, and use the old rules.  If more than 95% had the new version, old version blocks would be rejected outright.

I remember Gregory Maxwell and other core devs stayed up late several nights because BIP66 was almost activated, but not quite.  And as a miner there was no guarantee on how long before you had to upgrade: one smaller miner kept producing invalid blocks for weeks after the BIP66 soft fork.  Now you get two weeks’ notice (probably more if you’re watching the network).

Finally, this change allows for miners to reject a particular soft fork without rejecting them all.  If we’re going to see more contentious or competing proposals in the future, this kind of plumbing allows it.

Hope that answers all your questions!


 

[1] It would be legal for an implementation to start setting it on the very first block past the start date, though it’s easier to only think about version bits once every two weeks as bitcoin-core does.

April 01, 2016 01:28 AM

March 30, 2016

Pete Zaitcev: SwiftStack versus Swift

Erik Pounds posted an article on SwiftStack official blog where it presented a somewhat corporatist view of Swift and Ceph that comes down to this:

They are both productized by commercial companies so all enterprises can utilize them... Ceph via RedHat and Swift via SwiftStack.

This view is extremely reductionist along a couple of avenues.

First, it tries to sweep all of Swift under SwiftStack umbrella, whereas in reality it derives a lot of strength from not being controlled by SwiftStack. But way to assuage the fears of monoentity control by employing the PTL, guys. Fortunately, in the real and more complex world, Red Hat pays me to work on Swift, as well as offer Swift as a product, and I do not find that my needs are in any way sabotaged by PTL. Certainly our product focus differs; Red Hat's lifetime management offering, OSPd, manages OpenStack first and Swift as much as it's a part of OpenStack, whereas SwiftStack offer a Swift-specific product. Still it's not like Swift equals SwiftStack. I think RackSpace continue to operate the largest Swift cluster in the world.

Second, Erik somehow neglects to notice that Ceph provides a Swift compatibility through a component known as Rados Gateway. It is an option, you know, although obviously it can never be a better Swift than Swift itself, or better Amazon S3 than S3 itself.

March 30, 2016 07:36 PM

March 29, 2016

James Morris: Linux Security Summit 2016 – CFP Announced!

The 2016 Linux Security Summit (LSS) will be held in Toronto, Canada, on 25th and 26th, co-located with LinuxCon North America.  See the full announcement.

The Call for Participation (CFP) is now open, and submissions must be made by June 10th.  As with recent years, the committee is looking for refereed presentations and discussion topics.

This year, there will be a $100 registration fee for LSS, and you do not need to be registered for LinuxCon to attend LSS .

There’s also now an event twitter feed, for updates and announcements.

If you’ve been doing any interesting development, or deployment, of Linux security systems, please consider submitting a proposal!

March 29, 2016 11:08 AM

March 28, 2016

Paul E. Mc Kenney: Live Kernel Patching Microconference Accepted into 2016 Linux Plumbers Conference

Live kernel patching was accepted into the v4.0 Linux kernel in February 2015, so we can declare the 2014 LPC Live Kernel Patching Microconference to have been a roaring success! However, as was noted at the time, this is just the beginning of the real work. In short, the v4.0 work makes live kernel patching possible, but more work is required to make it more reliable and more routine.

Additional issues include stacktrace reliability, patch-safety criteria for kernel threads, thread consistency models, porting to non-x86 architectures, handling of loadable modules, compiler optimizations, userspace tooling, patching of data, automated regression testing, and patch-creation guidelines.

Join us for an important and spirited discussion!

March 28, 2016 04:52 PM

March 25, 2016

LPC 2016: Containers Microconference Accepted into 2016 Linux Plumbers Conference

The level of Containers excitement has increased even further this year, with much interplay between Docker, Kubernetes, Rkt, CoreOS, Mesos, LXC, LXD, OpenVZ, systemd, and much else besides. This excitement has led to some interesting new use cases, including even the use of containers on Android.

Some of these use cases in turn require some interesting new changes to the Linux plumbing, including mounts in unprivileged containers, improvements to cgroups resource management, ever-present security concerns, and interoperability between various sets of tools.

Please join us for an important discussion on an important aspect of cloud computing!

March 25, 2016 05:51 PM

March 22, 2016

Paul E. Mc Kenney: Containers Microconference Accepted into 2016 Linux Plumbers Conference

The level of Containers excitement has increased even further this year, with much interplay between Docker, Kubernetes, Rkt, CoreOS, Mesos, LXC, LXD, OpenVZ, systemd, and much else besides. This excitement has led to some interesting new use cases, including even the use of containers on Android.

Some of these use cases in turn require some interesting new changes to the Linux plumbing, including mounts in unprivileged containers, improvements to cgroups resource management, ever-present security concerns, and interoperability between various sets of tools.

Please join us for an important discussion on an important aspect of cloud computing!

March 22, 2016 05:56 PM

March 16, 2016

Gustavo F. Padovan: Collabora contributions to Linux Kernel 4.5

Linux Kernel 4.5 was released earlier this week, and once again Collabora engineers played a role in its development. In addition to their current projects, seven Collabora engineers contributed a total of 33 patches to the new Kernel.

As part of its continued committment to further increase ts participation to the Linux Kernel, Collabora is looking to expand its team of core software engineers. If you’d like to learn more, follow this link.

Here are some highlights of Collabora’s participation in Kernel 4.5:

Daniel Stone improved i915 runtime WARN() messages and fixed an important issue in the component subsystem when component_add() fails. Danilo Cesar made the DRM Docbook ready for Markdown text.

Gustavo Padovan improved the pm_runtime management on the drm/exynos driver and started work on de-staging the Android Sync Framework. On Rockchip, Sjoerd Simons enabled IR receiver to RK3288 Radxa Rock 2 Square, added multi_v7_defconfig for Rockchip audio and enabled RK3288 SPDIF clocks to change their parent. On the net side, Sjoerd added a patch to turn carrier off on phy attach to avoid unknown states and another patch to add ethernet0 alias for the RK3288 to help u-boot find this device-node.

During his brief time with us at Collabora, Heiko Stübner added the dts file for the veyron-brain board, a shutdown callback to platform variant dwc2 devices for a special clock handling to avoid getting stuck on the reboot/poweroff process and multi_v7_defconfig support to Rockchip’s io-domain driver, crypto module and rk808 clkout module. He also enabled support for veyron minnie touchscreen, adjusted temperature limits on veyron-speedy and fixed the edp-24m clock to be associated to the internal 24MHz oscillator all the time.

Martyn Welch added a driver for the Zodiac Aerospace RAVE Watchdog Processor, while Tomeu Vizoso added a device_is_bound() helper function and setter for dev.pm_domain that comes with extra checkings. Tomeu also added a patch to allow USB devices to remain runtime-suspended when sleeping and another patch to optimize sleep by going direct_complete if driver has no prepare and PM callbacks. Lastly, Tomeu also fixed a freq issue on Tegra devfreq_dev_profile.target callback.

Following is a list of all patches submitted by Collabora for this kernel release:

Daniel Stone (3):

Danilo Cesar Lemes de Paula (1):

Gustavo Padovan (9):

Heiko Stübner (8):

Martyn Welch (2):

Sjoerd Simons (5):

Tomeu Vizoso (5):

March 16, 2016 06:23 PM

March 15, 2016

Michael Kerrisk (manpages): man-pages-4.05 is released

I've released man-pages-4.05. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

This release resulted from patches, bug reports, reviews, and comments from more than 70 contributors. The release includes changes to more than 400 man pages. Among which the more significant changes in man-pages-4.05 are the following:

March 15, 2016 08:17 AM

March 14, 2016

Eric Sandeen: Return of the low-power server: a 10W, 6.5 terabyte, commodity x86 Linux box

5 years ago, I wrote about the build of the server hosting sandeen.net, which ran using only 18W of power.  After 5 years, it was time for an upgrade, and I’m pleased to say that I put together a much more capable system which now runs at only 10W!  This system handles external email and webserving, as well as handling media serving, energy datalogging, and backup storage for household purposes.  It’s the server which dished up this blog for you today.

sandeen.net wattage, as measured by a Kill-a-Watt meter [amzn]

If you run a server 24/7 at home, that always-on power consumption can really add up.  If you have an old beast running at 250W, that’s using about 2MWh of power per year, and will cost you over $200/year in electricity at $0.10/kWh.  A 10W build uses only 4% of that!  Here’s how I put it together:

I chose this motherboard because it has 4 SATA ports; the pairs of drives are in MD RAID1 mirror configurations using XFS for all filesystems, and the large 3T media drives are spun down most of the time; the system does get over 10W when they’re spinning.  And I have to say, the Samsung drives seem really nice; they are warranted for 150 terabytes written, or 10 years, whichever comes first.

I took most of the suggestions from the powertop tool, and I run this script at startup to put most devices into power-saving mode; this saves about 2W from the boot-up state. When boot-up power consumption is 12W, 2W is a significant fraction!  ;)

#!/bin/bash

# Set spindown time, see 
# https://sandeen.net/wordpress/computers/spinning-down-a-wd20ears/
hdparm -S 3 /dev/sdc
hdparm -S 3 /dev/sdd

# Idle these disks right now.
hdparm -y /dev/sdc
hdparm -y /dev/sdd

x86_energy_perf_policy -v powersave

# Stuff from powertop:
echo 'min_power' > '/sys/class/scsi_host/host0/link_power_management_policy';
echo 'min_power' > '/sys/class/scsi_host/host1/link_power_management_policy';
echo 'min_power' > '/sys/class/scsi_host/host2/link_power_management_policy';
echo 'min_power' > '/sys/class/scsi_host/host3/link_power_management_policy';
echo '0' > '/proc/sys/kernel/nmi_watchdog';
# I'm not willing to mess with data integrity...
# echo '1500' > '/proc/sys/vm/dirty_writeback_centisecs';
# Put most pci devices into low power mode
echo 'auto' > '/sys/bus/pci/devices/0000:00:1c.0/power/control';
echo 'auto' > '/sys/bus/pci/devices/0000:00:00.0/power/control';
echo 'auto' > '/sys/bus/pci/devices/0000:00:13.0/power/control';
echo 'auto' > '/sys/bus/pci/devices/0000:00:1a.0/power/control';
echo 'auto' > '/sys/bus/pci/devices/0000:00:14.0/power/control';
echo 'auto' > '/sys/bus/pci/devices/0000:00:1f.0/power/control';
echo 'auto' > '/sys/bus/pci/devices/0000:00:1c.1/power/control';
echo 'auto' > '/sys/bus/pci/devices/0000:00:1f.3/power/control';
echo 'auto' > '/sys/bus/pci/devices/0000:01:00.0/power/control';
# realtek chip doesn't like being in low power, see
# https://lkml.org/lkml/2016/2/24/88
# echo 'auto' > '/sys/bus/pci/devices/0000:02:00.0/power/control';
echo 'auto' > '/sys/bus/pci/devices/0000:00:02.0/power/control';

The old server took the network down to 100mbps, but that just feels too pokey most days, so I left it at full gigabit speed.

I’ve been very happy with this setup; it runs quiet, cool, fast, and cheap, while using less power than an old-fashioned night-light.  Jetway says the board will have a lifecycle through Q1 2019, so hopefully anyone reading this post well into the future can still pick one up if desired.

If you’re curious, here’s dmesg, lspci, and cpuinfo from the box.  Note that the cpu can even do virtualization!

March 14, 2016 01:00 PM

March 12, 2016

Matthew Garrett: I stayed in a hotel with Android lightswitches and it was just as bad as you'd imagine

I'm in London for Kubecon right now, and the hotel I'm staying at has decided that light switches are unfashionable and replaced them with a series of Android tablets.

A tablet displaying the text UK_bathroom isn't responding. Do you want to close it?
One was embedded in the wall, but the two next to the bed had convenient looking ethernet cables plugged into the wall. So.

I managed to borrow a couple of USB ethernet adapters, set up a transparent bridge (brctl addbr br0; brctl addif br0 enp0s20f0u1; brctl addif br0 enp0s20f0u2; ifconfig br0 up) and then stuck my laptop between the tablet and the wall. tcpdump -i br0 showed traffic, and wireshark revealed that it was Modbus over TCP. Modbus is a pretty trivial protocol, and notably has no authentication whatsoever. tcpdump showed that traffic was being sent to 172.16.207.14, and pymodbus let me start controlling my lights, turning the TV on and off and even making my curtains open and close. What fun!

And then I noticed something. My room number is 714. The IP address I was communicating with was 172.16.207.14. They wouldn't, would they?

I mean yes obviously they would.

It's basically as bad as it could be - once I'd figured out the gateway, I could access the control systems on every floor and query other rooms to figure out whether the lights were on or not, which strongly implies that I could control them as well. Jesus Molina talked about doing this kind of thing a couple of years ago, so it's not some kind of one-off - instead, hotels are happily deploying systems with no meaningful security, and the outcome of sending a constant stream of "Set room lights to full" and "Open curtain" commands at 3AM seems fairly predictable.

We're doomed.

(edited: this previously claimed I could only access systems on my own floor, but it turns out that each floor is a separate broadcast domain and I just needed to set a gateway to access the others)

(further edit: I'm deliberately not naming the hotel. They were receptive to my feedback and promised to do something about the issue.)

comment count unavailable comments

March 12, 2016 12:57 PM

March 10, 2016

Daniel Vetter: Neat drm/i915 stuff for 4.6

The 4.5 release is close, it's time to look at what's in store for the next kernel's merge window in the Intel graphics driver.

Headline features for sure are that FBC and PSR are enabled by default. And this time around I'm really hopeful that it will stick, since Paulo&Rodrigo have done a stellar job hunting down all the corner cases, writing testcases for them all and in general making sure we have a really solid foundation for display power saving features. There's still some oddball cornercases, which means it's not yet enabled everywhere and on all platforms, but like I said: Looking really good, and the culmination of over 1 year of effort to get the code infrastructure fixed up and solid.

Another project ongoing is atomic display support, with again lots of work from Maarten and Matt and others to move things forward. Specifically Maarten adapted the load detect logic to atomic and removed a pile of legacy structures no longer needed in preparation of next steps. Matt continued to work on atomic display fifo watermark updates. Another area that has seen a lot of work in the background is runtime PM. Mika, Imre and Ville have massively improved the debugging infrastructure we have to track down rare bugs in our power status tracking code. They also merged lots of fixes in this area. Unfortunately we're not yet at a point where we can enable overall runtime PM for the device by default.

More on the feature side is pixel clock limit checks for all outputs and platforms from Mika Kahola. This is related to the work to also enable dynamic display clock scaling on Gen9, but that part is still being worked on. Ville worked a lot on how offsets and alignment are handled for display planes, all in preparation to support rotated multi-planar formats like NV12. Again a feature where a lot of hard work is required to make the final patch to enable it all look really simple.

On the plain hardware and platform enabling side Jani implemented support for version 3 of the VBT DSI descriptions, which should extend DSI panel support to all current hardware. Which includes the Surface 3.

Finally on the GEM side there have been mostly small fixes and imrpovements under the hood. Tvrtko decoupled the internal engine representation from the userspace ABI defines. He also restructed the CS irq handler code, and started to fix up some locking issues in execlist. Chris tracked down some coherency issues in the execlist interrupt handler. Dave Gordon finally started to somewhat untangle the execlist initialization logic and some of the confusion in how all the different software structures connect.

One real feature work by Alex Dai though was enabled ADS for GuC, which is some means to hand additional metadata to the GuC firmware scheduler. But since GuC based command submission isn't enabled yet, this doesn't have a direct impact.

And of course there have been bugfixes all over the place, as usual.

March 10, 2016 10:27 AM

Pavel Machek: liferea .. allows tracking. Too much tracking for my taste.

I noticed that every time I click a header in liferea, on http://servis.idnes.cz/rss.aspx?c=zpravodaj feed, network shows usage. But there's nothing displayed...
I turned off network, and three broken images... So it turns out liferea is downloading three invisible images to assist idnes in tracking.
Can I somehow tell liferea not to download anything just because I clicked the headline?

March 10, 2016 09:01 AM

March 07, 2016

Pete Zaitcev: The "fast-post" is merged into Swift

I'm just back from a hackathon at HPE site in Bristol, England, where we made a final look to so-called "fast-post" patch and merged it in. It was developed by Alistair Coles and basically made POST work as everyone expected it to work, at last.

In the original Swift, it was found that when you do a POST to an object, in the presence of failures it was possible to end with some nodes having old data but new (posted) attributes. The bad part was that the replication mechanism could not do anything to recoincile the inconsistency, and then your GET returns varying data forever depending on what node you hit. It occured when new timestamp from POST attached itself to old data (and other equivalent scenarios).

This is some of a fundamental issue with using a timestamp based replication in Swift. Greg and Chuck knew about it all along, and their solution was known as "POST to PUT". They made Swift Proxy to fetch the object, then update its attributes for the POST, then do essentially a PUT. This way timestamps, data, and attributes are always consistent, as they are in the initial PUT. If this POST-to-PUT thing occurs across a failure, replication uses timestamps to restore consistently correctly.

The problem with that, POST-to-PUT is slow, as well as deceptive. Users think they issue a lightweight POST, but actually they prompt a massive data move inside the cluster if the object is big.

Alasdair's insight was that the root of the problem was not that timestamps were no good as a basic mechanism, but that the "fast" POST broke them by assigning new timestamps to old data (or old attributes, metadata, Content-Type). As long as each indepentently settable thing had its own timestamp, there was no problem. In Swift, we have 3 of those: object data, object metadata, and Content-Type (don't ask). So, store 3 timestamps with each object and presto!

The actual patch employs an additional trick by not changing the container DB schema. Instead, it encodes 3 timestamps into 1 field where the timestamp used to live. This way a smooth migration is possible in a cluster where old async pendings still float, for example. It looks a little kludgy at first, but I convinced myself that it made sense under the circumstances.

P.S. The fast-post is not a default, even now. Needs Container Sync updated to be compatible. I think Eran was going to look into that.

March 07, 2016 07:51 PM

March 04, 2016

Pavel Machek: blinking cursor of death

Yesterday, my / filesystem on Debian on Nokia N900 went full. Ok. So I removed some cached dpkgs. I believe phone still worked today morning.

Now, at the point where it should start X, screen goes dark, and then blinking cursor reapears. Normally, I'd hit ctrl-alt-f1, and debug
stuff from there, but N900 has no alt, and no f1. I'll image filesystem now. I can take a look at logs and fix stuff on filesystem,
but can't do other experiments easily. Debian 7.9. If you have some ideas what log to look at, let me know.

Window manager warning: CurrentTime used to choose focus window; focus window may not be correct.
Window manager warning: Got a request to focus the no_focus_window with a timestamp of 0.  This shoul\
dn't happen!
[1456818626,000,xklavier.c:xkl_engine_start_listen/]    The backend does not require manual layout ma\
nagement - but it is provided by the application
** Message: applet now removed from the notification area
polkit-mate-authentication-agent-1: Fatal IO error 11 (Resource temporarily unavailable) on X server \
:0.
g_dbus_connection_real_closed: Remote peer vanished with error: Underlying GIOStream returned 0 bytes\
on an async read (g-io-error-quark, 0). Exiting.
The application 'mate-panel' lost its connection to the display :0;
most likely the X server was shut down or you killed/destroyed
the application.
(nm-applet:3386): Gdk-WARNING **: nm-applet: Fatal IO error 0 (Success) on X server :0.

This is second time it happened, so it would be good to solve, OTOH just restoring from backup might be easier.

March 04, 2016 10:53 PM

February 28, 2016

Andy Grover: I’m Part of SFConservancy’s GPL Compliance Project for Linux

I believe GPL enforcement in general, and specifically around the Linux kernel, is a good thing. Because of this, I am one of the Linux copyright holders who has signed an agreement for the Software Freedom Conservancy to enforce the GPL on my behalf. I’m also a financial supporter of Conservancy.

That I hold any copyright at all to the work I’ve done is actually somewhat surprising. Usually one of the documents you sign when beginning employment at a company is agreeing that the company has ownership of work you do for them, and even things you create off-hours. Fair enough. But, my current employer does not do this. Its agreement lets its employees retain individual copyright to their work for open source development projects. I haven’t checked, but I suspect the agreement I signed with my less-steeped-in-F/OSS previous employers did not.

It’s something to ask about when considering accepting a new position.

I consider myself an idealist, but not a zealot. For projects I’ve started, I’ve used MIT, AGPLv3, MPLv2, GPLv2, GPLv3, LGPLv2, and Apache 2.0 licenses, based on the individual circumstances.

I believe if you’re going to build upon someone else’s work, it’s only fair to honor the rules they have set — the license — that allow it. Everyone isn’t doing that now with Linux, and are using the ambiguity over what is a “derived work” as a fig leaf. Enforcing the GPL is the only way to ensure the intent of the license is honored. We’ll also hopefully eventually gain some clarity on exactly what constitutes a derived work from the courts.

Please consider becoming a Conservancy Supporter.

February 28, 2016 02:29 AM

February 26, 2016

Pete Zaitcev: Ceph needs a build revolution

I've been poking at this Ceph thing since June, or 9 months now, and I feel like I'm overdue for a good rant. Today's topic is, Ceph's build system is absolutely insufferable. It's unbelievably fragile, and people break it every week at the most. Then it takes a week to fix, or even worse: it only breaks for me, but not for others, and then it stays broken, because I cannot possibly wade into the swamp this deep to fix it.

Here's today's post-10.0.3 trunk:

[zaitcev@lembas ceph-tip]$ sh autogen.sh 
..................
[zaitcev@lembas ceph-tip]$ ./configure --prefix=$(pwd)/build --with-radosgw
..................
[zaitcev@lembas ceph-tip]$ make -j3
..................
  CXX      common/PluginRegistry.lo
  CXXLD    libcommon_crc.la
ar: `u' modifier ignored since `D' is the default (see `U')
ar: common/.libs/libcommon_crc_la-crc32c_intel_fast_asm.o: No such file or direc
tory
Makefile:13137: recipe for target 'libcommon_crc.la' failed
[zaitcev@lembas ceph-tip]$ find . -name '*crc32c_intel*'
./src/common/.deps/libcommon_crc_la-crc32c_intel_fast_asm.Plo
./src/common/.deps/libcommon_crc_la-crc32c_intel_fast_zero_asm.Plo
./src/common/.deps/libcommon_crc_la-crc32c_intel_baseline.Plo
./src/common/.deps/libcommon_crc_la-crc32c_intel_fast.Plo
./src/common/.libs/libcommon_crc_la-crc32c_intel_baseline.o
./src/common/.libs/libcommon_crc_la-crc32c_intel_fast.o
./src/common/crc32c_intel_baseline.c
./src/common/crc32c_intel_baseline.h
./src/common/crc32c_intel_fast.c
./src/common/crc32c_intel_fast.h
./src/common/crc32c_intel_fast_asm.S
./src/common/crc32c_intel_fast_zero_asm.S
./src/common/libcommon_crc_la-crc32c_intel_fast.lo
./src/common/libcommon_crc_la-crc32c_intel_fast_asm.lo
./src/common/libcommon_crc_la-crc32c_intel_baseline.o
./src/common/libcommon_crc_la-crc32c_intel_fast.o
./src/common/libcommon_crc_la-crc32c_intel_fast_zero_asm.lo
./src/common/libcommon_crc_la-crc32c_intel_baseline.lo
[zaitcev@lembas ceph-tip]$ 

Sometimes these things fix themselves after a fresh clone/autogen.sh/configure/make. But doing so all the time is prohibited by how long Ceph builds. Literally it takes many hours (depending if you use autotools or Cmake, and how parallel your build is). I bought a 4-core laptop with 16 GB and SSD just for that. A $1,200 later, I only have to wait 4 hours. Yay, I can build Ceph 2 times in 1 day.

The situation is completely insane, and it remained so for the months I spent working on this. The worst is that I don't understand how people even deal with this without killing themselves. If you look at the pull requests, obviously a large number of developers manage to build this thing somehow... unless all of them post untested patches all the time.

UPDATE: Waiting a bit and a fresh clone made the build to complete, but then:

..................
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/q/zaitcev/ceph/ceph-tip/selinux'
[zaitcev@lembas ceph-tip]$ echo $?
0
[zaitcev@lembas ceph-tip]$ ./src/vstart.sh -n -d -r -i 192.168.132.2
ls: cannot access compressor/*/: No such file or directory
** going verbose **
./src/vstart.sh: line 374: ./init-ceph: No such file or directory
[zaitcev@lembas ceph-tip]$

We are about to freeze Jewel with this codebase.

February 26, 2016 11:05 PM

February 25, 2016

Pete Zaitcev: git rebase - proceed with caution

In the age of Github, we're not supposed to do the "git merge" anymore, but a "git rebase" instead. Everyone knows that. However, the rebase has its quirks. One stepped on goes like this:

  1. Have 2 patches: A and B, on top of a tree T. Submit A upstream.
  2. Upstream merges A.
  3. Do "git rebase" in T. You'd think A would disappear and you get to keep B on top of T (or T'=T+A).
  4. But instead, git finds some conflicts and throws you a 3-way merge node, which contains your A, upstream A, and some small, unrelated conflict. It has an empty commit message, obviously.
  5. Resolve the merge, "git commit -a", "git rebase --continue", and you end with next commit being empty.
  6. If at this point you think that it's the former A and do "git rebase --skip", then — hold onto your chair — B is thrown into a conflict as well, and its commit message is attached to a follow-on empty commit too, just like A was. If you skip that one, you lose the commit message forever. There's no reset you could do at that point.

Well, if you pushed on a branch, you could go back to Github and salvage the message from there.

Anyhow, rebase takes a certain care. You can't assume that it always works, or assume that you could return to the previous state of repository. In this sense it's fundamentally different from the merge, where you can always do "git reset", no matter how much you screwed up.

February 25, 2016 08:09 PM

Matthew Garrett: I bought some awful light bulbs so you don't have to

I maintain an application for bridging various non-Hue lighting systems to something that looks enough like a Hue that an Amazon Echo will still control them. One thing I hadn't really worked on was colour support, so I picked up some cheap bulbs and a bridge. The kit is badged as an iSuper iRainbow001, and it's terrible.

Things seemed promising enough at first, although the bulbs were alarmingly heavy (there's a significant chunk of heatsink built into them, which seems to get a lot warmer than I'd expect from something that claims a 7W power consumption). The app was a bit clunky, but eh - I wasn't planning on using it for long. I pressed the button on the bridge, launched the app and could control the bulbs. The first thing I noticed was that they had a separate "white" and "colour" mode. White mode was pretty bright, but colour mode massively less so - presumably the white LEDs are entirely independent of the RGB ones, and much higher intensity. Still, potentially useful as mood lighting.

Anyway. Next step was to start playing with the protocol, which meant finding the device on my network. I checked anything that had picked up a DHCP lease recently and nmapped them. The OS detection reported Linux, which wasn't hugely surprising - there was no GPL notice or source code included with the box, but I'm way past the point of shock at that. It also reported that there was a telnet daemon running. I connected and got a login prompt. And then I typed admin as the username and admin as the password and got a root prompt. So, there's that. The copy of Busybox included even came with tftp, so it was easy to get copies of tcpdump and strace on there to see what was up.

So. Next step. Protocol sniffing. I wanted to know how discovery worked, so reset the device to factory and watched what happens. The app on my phone sent out a discovery packet on UDP port 18602 which looked like this:

INLAN:CLIP:23.21.212.48:CLPT:22345:MAC:02:00:00:00:00:00

The CLIP and CLPT fields refer to the cloud server that allows for management when you're not on the local network. The mac field contains an utterly fake address. If you send out a discovery packet and your mac hasn't been registered with the device, you get a denial back. If your mac has been (and by "mac" here I mean "utterly fake mac that's the same for all devices"), you get back a response including the device serial number and IP address. And if you just leave out the mac field entirely, you get back a response no matter whether your address is registered or not. So, that's a start. Once you've registered one copy of the app with the device, anything can communicate with it by just using the same fake mac in the discovery packets.

Onwards. The command format turns out to be simple. They start ##, are followed by two ascii digits encoding a command, four ascii digits containing a bulb id, two ascii digits containing the number of following bytes and then the command data (in ascii). An example is:

##05010002ff

which decodes as command 5 (set white intensity) on bulb 1 with two bytes of data following, each of which is an f. This translates as "Set bulb 1 to full white intensity". I worked out the rest pretty quickly - command 03 sets the RGB colour of the bulb, 0A asks the bridge to search for new bulbs, 0B tells you which bulbs are available and their capabilities and 0E gives you the MAC addresses of the bulbs(‽). 0C crashes the server process, and 06 spews a bunch of garbage back at you and potentially crashes the bulb in a hilarious way that involves it flickering at about 15Hz. It turns out that 06 is actually the "Rename bulb" command, and if you send it less data than it's expecting something goes hilariously wrong in string parsing and everything is awful.

Ok. Easy enough, but not hugely exciting. What about the remote protocol? This turns out to involve sending a login packet and then a wrapped command packet. The login has some length data, a header of "MQIsdp", a long bunch of ascii-encoded hex, a username and a password.

The username is w13733 and the password is gbigbi01. These are hardcoded in the app. The ascii-encoded hex can be replaced with 0s and the login succeeds anyway.

Ok. What about the wrapping on the command? The login never indicated which device we wanted to talk to, so presumably there's some sort of protection going on here oh wait. The command packet is a length, the serial number of the bridge and then a normal command packet. As long as you know the serial number of the device (which it tells you in response to a discovery packet, even if you're unauthenticated), you can use the cloud service to send arbitrary commands to the device (including the one that crashes the service). One of which involves the device then doing some kind of string parsing that doesn't appear to work that well. What could possibly go wrong?

Ok, so that seemed to be the entirety of the network protocol. Anything else to do? Some poking around on the bridge revealed (a) that it had an active wireless device and (b) a running DHCP server. They wouldn't, would they?

Yes. Yes, they would.

The devices all have a hardcoded SSID of "iRainbow", although they don't broadcast it. There's no security - anybody can associate. It'll then hand out an IP address. It's running telnetd on that interface as well. You can bounce through there to the owner's internal network.

So, in summary: it's a device that infringes my copyright, gives you root access in response to trivial credentials, has access control that depends entirely on nobody ever looking at the packets, is sufficiently poorly implemented that you can crash both it and the bulbs, has a cloud access protocol that has no security whatsoever and also acts as an easy mechanism for people to circumvent your network security. This may be the single worst device I've ever bought.

I called the manufacturer and they insisted that the device was designed in 2012, was no longer manufactured or supported, that they had no source code to give me and there would be no security fixes. The vendor wants me to pay for shipping back to them and reserves the right to deduct the cost of the original "free" shipping from the refund. Everything is awful, which is why I just ordered four more random bulbs to play with over the weekend.

comment count unavailable comments

February 25, 2016 07:54 AM