Kernel Planet

June 29, 2015

Paul E. Mc Kenney: In Practice, What is the C Language, Really?

The official definition of the C Language is the standard, but the standard doesn't actually compile any programs. One can argue that the actual implementations are the real definition of the C Language, although further thought along this line usually results in a much greater appreciation of the benefits of having standards. Nevertheless, the implementations usually win any conflicts with the standard, at least in the short term.

Another interesting source of definitions is the opinions of the developers who actually write C. And both the standards bodies and the various implementations do take these opinions into account at least some of the time. Differences of opinion within the standards bodies are sometimes settled by surveying existing usage, and implementations sometimes provide facilities outside the standard based on user requests. For example, relatively few compiler warnings are actually mandated by the standard.

Although one can argue that the standard is the end-all and be-all definition of the C Language, the fact remains that if none of the implementers provide a facility called out by the standard, the implementers win. Similarly, if nobody uses a facility that is called out by the standard, the users win—even if that facility is provided by each and every implementation. Of course, things get more interesting if the users want something not guaranteed by the standard.

Therefore, it is worth knowing what users expect, even if only to adjust their expectations, as John Regehr has done for number of topics, perhaps most notably signed integer overflow. Some researchers have been taking a more proactive stance, with one example being Peter Sewell's group from the University of Cambridge. This group has put together a survey on padding bytes, pointer arithmetic, and unions. This survey is quite realistic, with “that would be crazy” being a valid answer to a number of the questions.

So, if you think you know a thing or two about C's handling of padding bytes, pointer arithmetic, and unions, take the survey!

June 29, 2015 04:46 PM

June 26, 2015

Daniel Vetter: Neat drm/i915 stuff for 4.2

The 4.1 kernel release is still a few weeks off and hence a bit early to talk about 4.2. But the drm subsystem feature cut-off already passed and I'm going on vacation for 2 weeks, so here we go.

First things first: No, i915 does not yet support atomic modesets. But a lot of progress has been made again towards enabling it. As I explained last time around the trouble is that the intel driver has grown its own almost-atomic modeset infrastructure over the past few years. And now we need to convert that to the slightly different proper atomic support infrastructure merged into the drm core, which means lots and lots of small changes all over the driver. A big part merged in this release is the removal of the ->new_config pointer by Ander, Matt & Maarten. This was the old i915-specific pointer to the staged new configuration. Removing it required switching all the CRTC code over to handling the staged configuration stored in the struct drm_atomic_state to be compatible with the atomic core. Unfortunately we still need to do the same for encoder/connector states and for plane states, so there's still lots of shuffling pending for 4.2.

There has also been other feature work going on on the modeset side: Ville cleaned&fixed up the CDCLK support in anticipation of implementing dynamic display clock frequency scaling. Unfortunately that part of his patches hasn't landed yet. Ville has also merged patches to fix up some details in the CPT modeset sequence, maybe this will finally fix the remaining "DP port stuck" issues we still seem to have.

Looking at newer platforms the interesting bit is rotation support for SKL from Sonika and Tvrtko. Compared to older platforms skl now also supports 90° and 270° rotation in the scanout engines, but only when the framebuffer uses a special tiling layout (which have been enabled in 4.0). A related feature is support for plane/CRTC scalers on SKL, provided by Chandra. Skylake has also gained support for the new low-power display states DC5/6. For Broxton basic enabling has landed, but there's nothing too interesting yet besides piles of small adjustments all over. This is because Broxton and Skylake have a common display block (similar to how the render block for atom chips was already shared since Baytrail) and hence share a lot of the infrastructure code. Unfortunately neither of these platforms has yet left the preliminary hardware support label for the i915 driver.

There's also a few minor features in the display code worth mentioning: DP compliance testing infrastructure from Todd Previte - DP compliance test devices have a special DP AUX sidechannel protocol for requesting certain test procedures and hence need a bit of driver support. Most of this will be in userspace though, with the kernel just forward requests and handing back results. Mika Kahola has optimized the DP link training, the kernel will now first try to use the current values (either from a previous modeset or set up by the firmware). PSR has also seen some more work, unfortunately it's still not yet enabled by default. And finally there's been lots of cleanups and improvements under the hood all over, as usual.

A big feature is the dynamic pagetable allocation for gen8+ from Michel Thierry and Ben Widawsky. This will greatly reduce the overhead of PPGTT and is a requirement for 48bit address space support - with that big a VM preallocating all the pagetables is just not possible any more. The gen7 cmd parser is now finally fixed up and enabled by default (thanks to Rebecca Palmer for one crucial fix), which means finally some newer GL extensions can be used without adding kernel hacks. And Chris Wilson has fine-tuned the cmd parser with a big pile of patches to reduce the overhead. And Chris has tuned the RPS boost code more, it should now no longer erratically boost the GPU's clock when it's inappropriate. He has also written a lot of patches to reduce the overhead of execlist command submission, and some of those patches have been merged into this release.

Finally two pieces of prep work: A few patches from John Harrison to prepare for removing the outstanding lazy request. We've added this years ago as a cheap way out of a memory and ringbuffer space preallocation issue and ever since then paid the price for this with added complexity leaking all over the GEM code. Unfortunately the actual removal is still pending. And then Joonas Lahtinen has implemented partial GTT mmap support. This is needed for virtual enviroments like XenGT where the GTT is cut up between the different guests and hence badly fragmented. The merged code only supports linear views and still needs support for fenced buffer objects to be actually useful.

June 26, 2015 08:58 AM

June 25, 2015

Rusty Russell: Hashing Speed: SHA256 vs Murmur3

So I did some IBLT research (as posted to bitcoin-dev ) and I lazily used SHA256 to create both the temporary 48-bit txids, and from them to create a 16-bit index offset.  Each node has to produce these for every bitcoin transaction ID it knows about (ie. its entire mempool), which is normally less than 10,000 transactions, but we’d better plan for 1M given the coming blopockalypse.

For txid48, we hash an 8 byte seed with the 32-byte txid; I ignored the 8 byte seed for the moment, and measured various implementations of SHA256 hashing 32 bytes on on my Intel Core i3-5010U CPU @ 2.10GHz laptop (though note we’d be hashing 8 extra bytes for IBLT): (implementation in CCAN)

  1. Bitcoin’s SHA256: 527.7+/-0.9 nsec
  2. Optimizing the block ending on bitcoin’s SHA256: 500.4+/-0.66 nsec
  3. Intel’s asm rorx: 314.1+/-0.3 nsec
  4. Intel’s asm SSE4 337.5+/-0.5 nsec
  5. Intel’s asm RORx-x8ms 458.6+/-2.2 nsec
  6. Intel’s asm AVX 336.1+/-0.3 nsec

So, if you have 1M transactions in your mempool, expect it to take about 0.62 seconds of hashing to calculate the IBLT.  This is too slow (though it’s fairly trivially parallelizable).  However, we just need a universal hash, not a cryptographic one, so I benchmarked murmur3_x64_128:

  1. Murmur3-128: 23 nsec

That’s more like 0.046 seconds of hashing, which seems like enough of a win to add a new hash to the mix.

June 25, 2015 07:51 AM

June 24, 2015

Matthew Garrett: Python for remote reconfiguration of server firmware

One project I've worked on at Nebula is a Python module for remote configuration of server hardware. You can find it here, but there's a few caveats:

  1. It's not hugely well tested on a wide range of hardware
  2. The interface is not yet guaranteed to be stable
  3. You'll also need this module if you want to deal with IBM (well, Lenovo now) servers
  4. The IBM support is based on reverse engineering rather than documentation, so who really knows how good it is

There's documentation in the README, and I'm sorry for the API being kind of awful (it suffers rather heavily from me writing Python while knowing basically no Python). Still, it ought to work. I'm interested in hearing from anybody with problems, anybody who's interested in getting it on Pypi and anybody who's willing to add support for new HP systems.

(Edited to update URL after Nebula went out of business and stopped paying for github)

comment count unavailable comments

June 24, 2015 11:56 PM

June 19, 2015

Rusty Russell: Mining on a Home DSL connection: latency for 1MB and 8MB blocks

I like data.  So when Patrick Strateman handed me a hacky patch for a new testnet with a 100MB block limit, I went to get some.  I added 7 digital ocean nodes, another hacky patch to prevent sendrawtransaction from broadcasting, and a quick utility to create massive chains of transactions/

My home DSL connection is 11Mbit down, and 1Mbit up; that’s the fastest I can get here.  I was CPU mining on my laptop for this test, while running tcpdump to capture network traffic for analysis.  I didn’t measure the time taken to process the blocks on the receiving nodes, just the first propagation step.

1 Megabyte Block

Naively, it should take about 10 seconds to send a 1MB block up my DSL line from first packet to last.  Here’s what actually happens, in seconds for each node:

  1. 66.8
  2. 70.4
  3. 71.8
  4. 71.9
  5. 73.8
  6. 75.1
  7. 75.9
  8. 76.4

The packet dump shows they’re all pretty much sprayed out simultaneously (bitcoind may do the writes in order, but the network stack interleaves them pretty well).  That’s why it’s 67 seconds at best before the first node receives my block (a bit longer, since that’s when the packet left my laptop).

8 Megabyte Block

I increased my block size, and one node dropped out, so this isn’t quite the same, but the times to send to each node are about 8 times worse, as expected:

  1. 501.7
  2. 524.1
  3. 536.9
  4. 537.6
  5. 538.6
  6. 544.4
  7. 546.7

Conclusion

Using the rough formula of 1-exp(-t/600), I would expect orphan rates of 10.5% generating 1MB blocks, and 56.6% with 8MB blocks; that’s a huge cut in expected profits.

Workarounds

Fixes

 

June 19, 2015 02:37 AM

June 16, 2015

Michael Kerrisk (manpages): Linux/UNIX System Programming course scheduled for September 2015, Munich

I've scheduled a further 5-day Linux/UNIX System Programming course to take place in Munich, Germany, for the week of 14-18 September 2015.

The course is intended for programmers developing system-level, embedded, or network applications for Linux and UNIX systems, or programmers porting such applications from other operating systems (e.g., Windows) to Linux or UNIX. The course is based on my book, The Linux Programming Interface (TLPI), and covers topics such as low-level file I/O; signals and timers; creating processes and executing programs; POSIX threads programming; interprocess communication (pipes, FIFOs, message queues, semaphores, shared memory), and network programming (sockets).
     
The course has a lecture+lab format, and devotes substantial time to working on some carefully chosen programming exercises that put the "theory" into practice. Students receive printed and electronic copies of TLPI, along with a 600-page course book that includes all slides presented in the course. A reading knowledge of C is assumed; no previous system programming experience is needed.

Some useful links for anyone interested in the course:

Questions about the course? Email me via training@man7.org.

June 16, 2015 02:24 PM

June 11, 2015

Pete Zaitcev: Rich and comments

Rich Jones posted an article about being banned by Boing-Boing, supposedly for bringing attention to their use of affiliate links (the practice that Gamergate groups criticized as well — and scored a regulatory win against). Meanwhile, all my comments at Rich's blog are blackholed, which is quite ironic. Generally, I am not into this "blog comment" thing. Ani-nouto never had any comments and is doing great that way. But some people like comments, so I leave them as necessary.

June 11, 2015 05:38 PM

June 10, 2015

LPC 2015: General Registration is now Closed

We’re pleased to announce that thanks to overwhelming support, interest in Linux Plumbers Conference has exceeded expectations.  The downside is that the conference is now officially full. Originally we were going to post a warning today, but a sudden surge of corporate registrations caught us off guard, so we had to close immediately.

If you still haven’t registered but would like to participate, contact us. We are running a waiting list on a first come first serve basis but with priority given to people who have accepted microconference topics. You could also try to use one of the sponsors tickets if your employer can provide one to you.

We look forward to seeing you in Seattle for a memorable conference.

June 10, 2015 02:14 AM

June 09, 2015

LPC 2015: Graphics Microconference Accepted into 2015 Linux Plumbers Conference

Although the Year of the Linux Desktop has yet to arrive, a surprising number of Linux users nevertheless need graphics support. This is because there have been a number of years of the Linux smartphone, the Linux television, the Linux digital sign/display/billboard, the Linux automobile, and more. This microconference will cover a number of topics including atomic modesetting in KMS, buffer allocation, verified-secure graphics pipelines, fencing and synchronisation, Wayland, and more.

For more information on this important topic, see the wiki page.

June 09, 2015 03:55 PM

June 05, 2015

James Morris: Hiring Subsystem Maintainers

The regular LWN kernel development stats have been posted here for version 4.1 (if you really don’t have a subscription, email me for a free link).  In this, Jon Corbet notes:

over 60% of the changes going into this kernel passed through the hands of developers working for just five companies. This concentration reflects a simple fact: while many companies are willing to support developers working on specific tasks, the number of companies supporting subsystem maintainers is far smaller. Subsystem maintainership is also, increasingly, not a job for volunteer developers..

As most folks reading this would know, I lead the mainline Linux Kernel team at Oracle.  We do have several people on the team who work in leadership roles in the kernel community (myself included), and what I’d like to make clear is that we are actively looking to support more such folk.

If you’re a subsystem maintainer (or acting in a comparable leadership role), please always feel free to contact me directly via email to discuss employment possibilities.  You can also contact Oracle kernel folk who may be presenting or attending Linux conferences.

June 05, 2015 05:34 AM

June 04, 2015

LPC 2015: Thermal Microconference Accepted into 2015 Linux Plumbers Conference

In stark contrast with decades past, thermal issues in computer systems now means much more than fans and heat sinks, and this microconference looks at some of the things that are now handled in software. The topics include the thermal framework, handling of temperature sensors, and different approaches to handling overtemperature conditions, ranging up to and including closed-loop control. Of course, software that is not tested can be assumed not to work, and the best way to ensure that testing happens when needed is to automate it, so automated testing of thermal subsystems is also on the agenda. Coordination with userspace is useful in order to determine how best to shed computational load, as is coordination among multiple cooling devices.

For more information on this important new-to-Plumbers topic, see the wiki page.

June 04, 2015 04:04 PM

June 03, 2015

LPC 2015: File and Storage Systems Microconference Accepted into 2015 Linux Plumbers Conference

Despite having been in production use for many decades, file and storage systems are very active areas, and have retained the ability to provide many new-technology surprises. This year’s edition of the File and Storage Systems Microconference will look at improved error reporting, filesystem-level encryption in traditional filesystems, SMR drives, online fsck, persistent memory, smart block-layer support in traditional filesystems, better interoperability of and support for NFS and Samba, userspace filesystem innovations, and much else besides.

For more information on this topic, see the wiki page.

June 03, 2015 04:33 PM

Rusty Russell: What Transactions Get Crowded Out If Blocks Fill?

What happens if bitcoin blocks fill?  Miners choose transactions with the highest fees, so low fee transactions get left behind.  Let’s look at what makes up blocks today, to try to figure out which transactions will get “crowded out” at various thresholds.

Some assumptions need to be made here: we can’t automatically tell the difference between me taking a $1000 output and paying you 1c, and me paying you $999.99 and sending myself the 1c change.  So my first attempt was very conservative: only look at transactions with two or more outputs which were under the given thresholds (I used a nice round $200 / BTC price throughout, for simplicity).

(Note: I used bitcoin-iterate to pull out transaction data, and rebuild blocks without certain transactions; you can reproduce the csv files in the blocksize-stats directory if you want).

Paying More Than 1 Person Under $1 (< 500000 Satoshi)

Here’s the result (against the current blocksize):

Sending 2 Or More Sub-$1 Outputs

Let’s zoom in to the interesting part, first, since there’s very little difference before 220,000 (February 2013).  You can see that only about 18% of transactions are sending less than $1 and getting less than $1 in change:

Since March 2013…

Paying Anyone Under 1c, 10c, $1

The above graph doesn’t capture the case where I have $100 and send you 1c.   If we eliminate any transaction which has any output less than various thresholds, we’ll catch that. The downside is that we capture the “sending myself tiny change” case, but I’d expect that to be rarer:

Blocksizes Without Small Output Transactions

This eliminates far more transactions.  We can see only 2.5% of the block size is taken by transactions with 1c outputs (the dark red line following the block “current blocks” line), but the green line shows about 20% of the block used for 10c transactions.  And about 45% of the block is transactions moving $1 or less.

Interpretation: Hard Landing Unlikely, But Microtransactions Lose

If the block size doesn’t increase (or doesn’t increase in time): we’ll see transactions get slower, and fees become the significant factor in whether your transaction gets processed quickly.  People will change behaviour: I’m not going to spend 20c to send you 50c!

Because block finding is highly variable and many miners are capping blocks at 750k, we see backlogs at times already; these bursts will happen with increasing frequency from now on.  This will put pressure on Satoshdice and similar services, who will be highly incentivized to use StrawPay or roll their own channel mechanism for off-blockchain microtransactions.

I’d like to know what timescale this happens on, but the graph shows that we grow (and occasionally shrink) in bursts.  A logarithmic graph prepared by Peter R of bitcointalk.org suggests that we hit 1M mid-2016 or so; expect fee pressure to bend that graph downwards soon.

The bad news is that even if fees hit (say) 25c and that prevents all the sub-$1 transactions, we only double our capacity, giving us perhaps another 18 months. (At that point miners are earning $1000 from transaction fees as well as $5000 (@ $200/BTC) from block reward, which is nice for them I guess.)

My Best Guess: Larger Blocks Desirable Within 2 Years, Needed by 3

Personally I think 5c is a reasonable transaction fee, but I’d prefer not to see it until we have decentralized off-chain alternatives.  I’d be pretty uncomfortable with a 25c fee unless the Lightning Network was so ubiquitous that I only needed to pay it twice a year.  Higher than that would have me reaching for my credit card to charge my Lightning Network account :)

Disclaimer: I Work For BlockStream, on Lightning Networks

Lightning Networks are a marathon, not a sprint.  The development timeframes in my head are even vaguer than the guesses above.  I hope it’s part of the eventual answer, but it’s not the bandaid we’re looking for.  I wish it were different, but we’re going to need other things in the mean time.

I hope this provided useful facts, whatever your opinions.

June 03, 2015 03:57 AM

Rusty Russell: Current Blocksize, by graphs.

I used bitcoin-iterate and gnumeric to render the current bitcoin blocksizes, and here are the results.

My First Graph: A Moment of Panic

This is block sizes up to yesterday; I’ve asked gnumeric to derive an exponential trend line from the data (in black; the red one is linear)

Woah! We hit 1M blocks in a month! PAAAANIC!

That trend line hits 1000000 at block 363845.5, which we’d expect in about 32 days time!  This is what is freaking out so many denizens of the Bitcoin Subreddit. I also just saw a similar inaccurate [correction: misleading] graph reshared by Mike Hearn on G+ :(

But Wait A Minute

That trend line says we’re on 800k blocks today, and we’re clearly not.  Let’s add a 6 hour moving average:

Oh, we’re only halfway there….

In fact, if we cluster into 36 blocks (ie. 6 hours worth), we can see how misleading the terrible exponential fit is:

What! We’re already over 1M blocks?? Maths, you lied to me!

Clearer Graphs: 1 week Moving Average

Actual Weekly Running Average Blocksize

So, not time to panic just yet, though we’re clearly growing, and in unpredictable bursts.

June 03, 2015 02:34 AM

June 02, 2015

LPC 2015: Boot, Init, and Config Microconference Accepted into 2015 Linux Plumbers Conference

The combination of security issues, kernel tinification, and a renewed concern about fast boot has intensified focus on system boot, initialization, and configuration, so much so that there is now a Linux Plumbers Conference Microconference focused on these topics.

In addition to secure boot and minimizing size and bloat, this microconference will delve into a number of topics related to boot speed. These topics include tuning systemd for embedded systems, optimizing and/or delaying memory initialization, deferring initcall-based initialization, introducing parallelism and multicore earlier in boot, speeding up early-boot I/O, pre-loading known configurations, speeding up installation, and of course improved timing and tracing analysis earlier in system startup. In short, the fast-boot work has definitely moved into the sub-second realm. A final bonus topic is better configuring for cloud- and container-based workloads.

For more information on this topic, see the wiki page.

June 02, 2015 03:58 PM

June 01, 2015

LPC 2015: Persistent Memory Microconference Accepted into 2015 Linux Plumbers Conference

The topic of persistent memory is back to the future for those of us old enough to have used core memory, but today’s persistent memory boasts densities, speeds, latencies, and capacities that are well beyond the scope even of science fiction back in the core-memory era.

However, with extreme densities, speeds, latencies, and capacities come interesting technical challenges. This microconference will therefore cover the “struct page” problem, performance hotspots in both kernel and userspace I/O fastpaths, managing access mechanisms such as DAX, providing atomic sector updates, and more.

For more information on this topic, see the wiki page.

We hope to see you there!

June 01, 2015 07:28 PM

Rusty Russell: Block size: rate of internet speed growth since 2008?

I’ve been trying not to follow the Great Blocksize Debate raging on reddit.  However, the lack of any concrete numbers has kind of irked me, so let me add one for now.

If we assume bandwidth is the main problem with running nodes, let’s look at average connection growth rates since 2008.  Google lead me to NetMetrics (who seem to charge), and Akamai’s State Of The Internet (who don’t).  So I used the latter, of course:

Akamai’s Average Connection Speed Chart Q4/07 to Q4/14

I tried to pick a range of countries, and here are the results:

Country % Growth Over 7 years Per Annum
Australia 348 19.5%
Brazil 349 19.5%
China 481 25.2%
Philippines 258 14.5%
UK 333 18.8%
US 304 17.2%

 

Countries which had best bandwidth grew about 17% a year, so I think that’s the best model for future growth patterns (China is now where the US was 7 years ago, for example).

If bandwidth is the main centralization concern, you’ll want block growth below 15%. That implies we could jump the cap to 3MB next year, and 15% thereafter. Or if you’re less conservative, 3.5MB next year, and 17% there after.

June 01, 2015 01:20 AM

May 30, 2015

James Bottomley: DNSSEC, DANE and the failure of X.509

As a few people have noticed, I’m a bit of an internet control freak: In an age of central “cloud based” services, I run pretty much my own everything (blog, mail server, DNS, OpenID, web page etc.).  That doesn’t make me anti-cloud; I just believe in federation instead of centralisation.  In particular, I believe in owning my own content and obeying my own rules rather than those of $BIGCLOUDPROVIDER.

In the modern world, this is perfectly possible: I rent a co-lo site and I have a DNS delegation so I can run and tune my own services how I wish.  I need a secure web server for a few things (OpenID, an email portal, secure log in for my blog etc) but I’ve always used a self-signed certificate.  Why?  well having to buy one from a self appointed X.509 root of trust always really annoyed me.  Firstly because they do very little for the money; secondly because it means effectively giving my security to some self appointed entity; and thirdly, as all the compromises and misuse attests, the X.509 root of trust model is fundamentally broken.

In the ordinary course of events, none of this would affect me.  However, recently, curl, which is used as the basis of most OpenID implementations took to verifying X.509 certificate chains, meaning that OpenID simply stopped working for ID providers with self signed certificates and at a stroke I was locked out of quite a few internet sites.  The only solution is to give up on OpenID or swallow pride and get a chained X.509 certificate.  Fortunately startssl will issue free certificates and the Linux Foundation is also getting into the game, so the first objection is overcome but not the other two.

So, what’s the answer?  As a supporter of cloud federation, I really like the monkeysphere approach which links ssl certificate verification directly to the user’s personal pgp web of trust.  Unfortunately, that also means that monkeysphere suffers from all the usual web of trust problems, the biggest being that it’s pretty much inaccessible to non-techies who just don’t understand why they should invest time in building up their own trust contacts.  That’s not to say that the web of trust can’t be made accessible in a simple fashion to everyone and indeed google is working on a project along these lines; however, today the reality is that today it isn’t.

Enter DANE.  At is most basic, DANE is a protocol that links certificate verification to the DNS.  It means that because anyone who owns a domain must have a DNS entry somewhere and the ability to modify it, they can directly link their certificate verification to this ability.  To my mind, this represents a nice compromise between making the system simple for end users and the full federation of the web of trust.  The implementation of DANE relies on DNSSEC (which is a royal pain to set up and get right, but I’ll do another blog post about that).  This means that effectively DANE has the same operational model as X.509, because DNSSEC is a hierarchically rooted trust model.  It also means that the delegation record to your domain is managed by your registrar and could be compromised if your registrar is.  However, as long as you trust the DNSSEC root and your registrar, the ability to generate trusted certificates for your domain is delegated to you.  So how is this different from X509?  Surely abusive registrars could cause similar problems as abusive or negligent X.509 roots?  That’s true, but an abusive registrar can only affect their own domain and delegates, they can’t compromise everyone else (unlike X.509), so to give an example of recent origin: the Chinese registrar could falsify the .cn domain, but wouldn’t be able to issue false certificates for the .com one.  The other reason for hope is that DNSSEC is at the root of the scheme to protect the DNS infrastructure itself from attack.  This makes the functioning and administration of DNSSEC a critical task for ICANN itself, so it’s a fair bet to assume that any abuse by a registrar won’t just result in a light slap on the wrist and a vague threat to delist their certificate in some browsers, but will have ICANN threatening to revoke their accreditation and with it, their entire business model as a domain registrar.

In many ways, the foregoing directly links the business model of the registrars to making DNSSEC work correctly for you.  In theory, the same is true of the X.509 CA roots of trust, of course, but there’s no one sitting at the top making sure they behave (and the fabric of the internet isn’t dependent on securing this behaviour, even if there were).

Details of DANE

In spite of the statements above, DANE is designed to complement X.509 as well as replace it.  Dane has four separate certificate verification styles, two of which integrate with X.509 and solve specific threats in its model (the actual solution is called pinning, a way of protecting yourself from the proliferation of X509 CAs all of whom could issue certificates for your site):

Mode 3 is most commonly used to specify an exact certificate outside of the X.509 chain.  Mode 2 can be useful, but the site must have access to an external certificate store (using the DNS CERT records) or the TLSA record must specify the full certificate for it to work.

Who Supports DANE?

This is the big problem:  For certificates distributed via DANE to be useful, there must be support for them in browsers.  For Mozilla, there is the DANE validator extension but in spite of several attempts, nothing actually built into the browser certificate verifier itself.  The most complete patch set is from the DNSSEC people and there’s also a Mozilla inspired one about how they will add it one day but right at the moment it isn’t a priority.  The Chromium browser has had a similar attitude.  The conspiracy theorists are quick to point out that this is because the browser companies derive considerable revenue from the CA system, which is in itself a multi-billion dollar industry and thus there’s active lobbying against anything that would dilute the power, and hence perceived value, of the CA roots.  There is some evidence for this position in that Google recognises that certificate pinning, which DANE supports, can protect against recent fake google certificate attacks, but, while supporting DNSSEC (at least for validation, the google DNS doesn’t secure itself via DNSSEC), they steadfastly ignore DANE certificate pinning and instead have a private arrangement with Mozilla.

I learned long ago: never to ascribe to malice (or conspiracy) what can be easily explained by incompetence (or political problems).  In this case, the world was split long ago into using openssl for security (in spite of the problematic licence) or using nss (the Netscape Security Services).  Mozilla, of course, uses the latter but every implementation of DANE for mozilla (including the patches in the bugzilla) use openssl.   I actually have an experimental build of mozilla with DANE, but incorporating the two separate SSL systems is a real pain.  I think it’s safe to say that until someone comes up with a nss based DANE verifier, the DANE patches for mozilla still aren’t yet up to the starting blocks, and thus conspiracy allegations are somewhat premature.  Unfortunately, the same explanation applies to chromium: for better or worse, it’s currently using nss for certificate validation as well.

May 30, 2015 07:10 PM

May 29, 2015

Pete Zaitcev: Cool hardware in Vancouver

There wasn't much, but more than in Atlanta. The most "pro" looking kit was presented by NEC: basically a bladeserver, but the "blades" are SBCs, each of them accompained by a dedicated drive card. I can see downsides of this design, but very cute.

Unfortunately, they only offer CPU cards based on Atom. No ARM or anything.

The only other interesting booth belonged to StackVelocity, a subsidiary of JB Circuits that does custom design.

I'm sorry to say, their wares looked decidedly pedestrian, which is to be expected: their sales point is low cost, and stuff of that nature underpins the modern datacenter. One curious thing, however, is the variety of flash cards they offer. Basically Fusion-IO on budget. One was particularly tricky by having 2 layers. At first I even thought it could have flash chips mounted sideways, but nope, the science of low-cost computing is not there yet.

P.S. NEC also sell the same chassis with CPU cards instead of drive cards under the index "DX1000".

May 29, 2015 06:38 PM

Pete Zaitcev: Semi-hard numbers from Rackspace

Previously in hard numbers: China, Wikimedia, Amazon S3. Rackspace previously reported in creiht's preso 18 months ago. This time, scotty went public at the Vancouver (Liberty) summit with the following:

> 50 billion objects
> 100 PB data (sanitized number, but way higher than 85 PB)
= 6 global clusters
3:1 PUT:GET ratio
10k+ requests/second

The number of objects is roughly 40 times less than in Amazon S3.

May 29, 2015 06:11 PM

LPC 2015: Performance and Scalability Microconference Accepted into 2015 Linux Plumbers Conference

Core counts keep rising, and that means that the Linux kernel continues to encounter interesting performance and scalability issues. Which is not a bad thing, since it has been well over ten years since the “free lunch” of exponential CPU-clock frequency increases came to an abrupt end. This microconference will therefore look at futex scaling, address-space scaling, improvements to queued spinlocks, additional lockless algorithms, userspace per-CPU critical sections, and much else besides.

For more information on this topic, please see the wiki page.

May 29, 2015 03:09 AM

May 27, 2015

LPC 2015: Networking Microconference Accepted into 2015 Linux Plumbers Conference

In the past, this has been called the Network Virtualization Microconference, but this year’s instance is branching out to include IPv6 and Security.

Network-virtualization topics include networking for multi-tenant container clusters, reducing network-namespace load on the system, intelligent processing at the edge of the data center, programmable datapath (as in fun with eBPF, OVS, nft, and much else besides), hardware support, and protocol development.

IPv6 topics include performance (can it catch up to IPv4?), consolidation of common IPv4/IPv6 functionality, solving the IPv6 datacenter addressing problem, and providing network virtualization without encapsulation using logical IPv6 overlays.

Security topics include scalable networking security policies for containers, securing applications in multi-host data centers, encryption of overlays, and hardware support.

It appears that coordinated work in all three of these areas is required to make good progess. For more information, see the wiki page.

May 27, 2015 05:29 PM

Matthew Garrett: This is not the UEFI backdoor you are looking for

This is currently the top story on the Linux subreddit. It links to this Tweet which demonstrates using a System Management Mode backdoor to perform privilege escalation under Linux. This is not a story.

But first, some background. System Management Mode (SMM) is a feature in most x86 processors since the 386SL back in 1990. It allows for certain events to cause the CPU to stop executing the OS, jump to an area of hidden RAM and execute code there instead, and then hand off back to the OS without the OS knowing what just happened. This allows you to do things like hardware emulation (SMM is used to make USB keyboards look like PS/2 keyboards before the OS loads a USB driver), fan control (SMM will run even if the OS has crashed and lets you avoid the cost of an additional chip to turn the fan on and off) or even more complicated power management (some server vendors use SMM to read performance counters in the CPU and adjust the memory and CPU clocks without the OS interfering).

In summary, SMM is a way to run a bunch of non-free code that probably does a worse job than your OS does in most cases, but is occasionally helpful (it's how your laptop prevents random userspace from overwriting your firmware, for instance). And since the RAM that contains the SMM code is hidden from the OS, there's no way to audit what it does. Unsurprisingly, it's an interesting vector to insert malware into - you could configure it so that a process can trigger SMM and then have the resulting SMM code find that process's credentials structure and change it so it's running as root.

And that's what Dmytro has done - he's written code that sits in that hidden area of RAM and can be triggered to modify the state of the running OS. But he's modified his own firmware in order to do that, which isn't something that's possible without finding an existing vulnerability in either the OS or (or more recently, and) the firmware. It's an excellent demonstration that what we knew to be theoretically possible is practically possible, but it's not evidence of such a backdoor being widely deployed.

What would that evidence look like? It's more difficult to analyse binary code than source, but it would still be possible to trace firmware to observe everything that's dropped into the SMM RAM area and pull it apart. Sufficiently subtle backdoors would still be hard to find, but enough effort would probably uncover them. A PC motherboard vendor managed to leave the source code to their firmware on an open FTP server and copies leaked into the wild - if there's a ubiquitous backdoor, we'd expect to see it there.

But still, the fact that system firmware is mostly entirely closed is still a problem in engendering trust - the means to inspect large quantities binary code for vulnerabilities is still beyond the vast majority of skilled developers, let alone the average user. Free firmware such as Coreboot gets part way to solving this but still doesn't solve the case of the pre-flashed firmware being backdoored and then installing the backdoor into any new firmware you flash.

This specific case may be based on a misunderstanding of Dmytro's work, but figuring out ways to make it easier for users to trust that their firmware is tamper free is going to be increasingly important over the next few years. I have some ideas in that area and I hope to have them working in the near future.

comment count unavailable comments

May 27, 2015 06:38 AM

May 26, 2015

LPC 2015: Extending the Early Bird Rate through June 5th 2015

We have decided to extend the Early Bird rate through June 5th 2015, to match LinuxCon North America’s dates and to allow more time for people to register, after authors notifications and schedule announcements. The regular rate will now start on June 6th.

May 26, 2015 07:32 PM

May 20, 2015

Pavel Machek: Alcatel Pixi 3.5

Available in Czech Republic, too, 98 grams, and pretty cheap. On my Nokia n900, GSM parts died, and hacking cellphone you are using is a bad idea... So... what about Pixi? Underpowered hardware, but still more powerful than n900. Does firefox os support wifi tethering by default? Is it reasonably easy to hack? (I guess "apt-get install python would be too much to ask, but..) Other candidates are Jolla/Sailfish and Ubuntu Phone.

May 20, 2015 05:03 PM

May 19, 2015

Paul E. Mc Kenney: Dagstuhl Seminar: Compositional Verification Methods for Next-Generation Concurrency

Some time ago, I figured out that there are more than a billion instances of the Linux kernel in use, and this in turn led to the realization that a million-year RCU bug is happening about three times a day across the installed base. This realization has caused me to focus more heavily on RCU validation, which has uncovered a number of interesting bugs. I have also dabbled a bit in formal verification, which has not yet found a bug. However, formal verification might be getting there, and might some day be a useful addition to RCU's regression testing. I was therefore quite happy to be invited to this Dagstuhl Seminar. In what follows, I summarize a few of the presentation. See here for the rest of the presentations.

Viktor Vafeiadis presented his analysis of the C11 memory model, including some “interesting” consequences of data races, where a data race is defined as a situation involving multiple concurrent accesses to a non-atomic variable, at least one of which is a write. One such consequence involves a theoretically desirable “strengthening” property. For example, this property would mean that multiplexing two threads onto a single underlying thread would not introduce new behaviors. However, with C11, the undefined-behavior consequences of data races can actually cause new behaviors to appear with fewer threads, for example, see Slide 7. This suggests the option of doing away with the undefined behavior, which is exactly the option that LLVM has taken. However, this approach requires some care, as can be seen on Slide 19. Nevertheless, this approach seems promising. One important takeaway from this talk is that if you are worried about weak ordering, you need to pay careful attention to reining in the compiler's optimizations. If you are unconvinced, take a look at this! Jean Pichon-Pharabod, Kyndylan Nienhuis, and Mike Dodds presented on other aspects of the C11 memory model.

Martin T. Vechev apparently felt that the C11 memory model was too tame, and therefore focused on event-driven applications, specifically javascript running on Android. This presentation included some entertaining concurrency bugs and their effects on the browser's display. Martin also discussed formalizing javascript's memory model.

Hongjin Liang showed that ticket locks can provide starvation freedom given a minimally fair scheduler. This provides a proof point for Björn B. Brandenburg's dissertation, which analyzed the larger question of real-time response from lock-based code. It should also provide a helpful corrective to people who still believe that non-blocking synchronization is required.

Joseph Tassarotti presented a formal proof of the quiescent-state based reclamation (QSBR) variant of userspace RCU. In contrast to previous proofs, this proof did not rely on sequential consistency, but instead leveraged a release-acquire memory model. It is of course good to see researchers focusing their tools on RCU! That said, when a researcher asked me privately whether I felt that the proof incorporated realistic assumptions, I of course could not resist saying that since they didn't find any bugs, the assumptions clearly must have been unrealistic.

My first presentation covered what would be needed for me to be able to use formal verification as part of Linux-kernel RCU's regression testing. As shown on slide 34, these are:


  1. Either automatic translation or no translation required. After all, if I attempt to manually translate Linux-kernel RCU to some special-purpose language every release, human error will make its presence known.
  2. Correctly handle environment, including the memory model, which in turn includes compiler optimizations.
  3. Reasonable CPU and memory overhead. If these overheads are excessive, RCU is better served by simple stress testing.
  4. Map to source code lines containing the bug. After all, I already know that there are bugs—I need to know where they are.
  5. Modest input outside of source code under test. The sad fact is that a full specification of RCU would be at least as large as the implementation, and also at least as buggy.
  6. Find relevant bugs. To see why this is important, imagine that some tool finds 100 different million-year bugs and I fix them all. Because roughly one of six fixes introduces a bug, and because that bug is likely to reproduce in far less than a million years, this process has likely greatly reduced the robustness of the Linux kernel.


I was not surprised to get some “frank and honest” feedback, but I was quite surprised (but not at all displeased) to learn that some of the feedback was of the form “we want to see more C code.” After some discussion, I provided just that.

May 19, 2015 07:30 PM

May 18, 2015

LPC 2015: Extending the Earlybird deadline to 29 May

Somewhere along the way, the deadline for notifications to Authors of the Shared LinuxCon/Plumbers track got pushed out by a week to 25 May.  In the light of that, we’re extending the deadline for Earlybird registration to Friday 29 May to allow anyone who doesn’t get a talk accepted but who still wishes to attend Plumbers to take advantage of the Earlybird registration rate.

May 18, 2015 06:37 AM

May 11, 2015

Pavel Machek: More SSD fun

http://www.techspot.com/article/997-samsung-ssd-read-performance-degradation/

This scheds some light on how tricky multi-level NAND drives are.

May 11, 2015 01:17 PM

Pavel Machek: SSD temperature sensitivity.

http://www.ibtimes.co.uk/ssds-lose-data-if-left-without-power-just-7-days-1500402

If you store SSDs at higher temperature than operating, bad things will happen... Like failure in less than a week. "Enterprise" SSDs are more sensitive to this (I always thought that "enterprise" is code word for "expensive", but apprently it has other implications, too).

Oh and that N900 modem problems... it seems it was not a battery. Moving SIM card to different phone to track it down...

May 11, 2015 08:56 AM

May 09, 2015

Pavel Machek: Good use for old, 80GB 3.5" hard drive

If it does not work, open it and try to repair it.

If it works, and you are tired of killing working drives...

tar czv /data/$1 | aespipe > /mnt/$1.tgz.aes

...fill your harddrive with data you'd like to keep, and bury it in the woods on a moonless night.

On a unrelated note... it seems Nokia N900 does not have as many capacitors at it should have. If you battery is too old, it will be still good enough to power most functions, but not the GSM/SIM card parts, resulting in network errors, no calls possible, etc. Problem mysteriously goes away with newer battery...

May 09, 2015 05:47 PM

May 08, 2015

Daniel Vetter: GFX Kernel Upstreaming Requirements

Upstreaming requirements for the DRM subsystem are a bit special since Dave Airlie requires a full-blown open-source implementation as a demonstration vehicle for any new interfaces added. I've figured it's better to clear this up once instead of dealing with the fallout from surprises and made a few slides for a training session. Dave reviewed and acked them, hence this should be the up-to-date rules - the old mails and blogs back from when some ARM SoC vendors tried to push drm drivers for blob userspace to upstream are a bit outdated.

Any here's the slides for my gfx kernel upstreaming requirements training.

May 08, 2015 01:21 PM

James Morris: Linux Security Summit 2015 CFP

The CFP for the 2015 Linux Security Summit (LSS) is now open: see here.

Proposals are due by June 5th, and accepted speaker notifications will go out by June 12th.

LSS 2015 will be held over 20-21 August, in Seattle, WA, USA.

Last year’s event went really well, and we’ll follow a similar format over two days again this year.  We’re co-located again with LinuxCon, and a host of other events including Linux Plumbers, CloudOpen, KVM Forum, and ContainerCon.  We’ve been upgraded to an LF managed event this year, which means we’ll get food.

All LSS attendees, including speakers, must be registered attendees of LinuxCon.   The first round of early registration ends May 29th.

We’d like to cast our net as wide as possible in terms of presentations, so please share this info with anyone you know who’s been doing interesting Linux security development or implementation work recently.

May 08, 2015 11:01 AM

May 07, 2015

Michael Kerrisk (manpages): man-pages-4.00 is released

Version numbers for the current man-pages release had been getting uncomfortably high, so that I'd been thinking about bumping to a new major version for a while, and now that the Linux kernel has just done that, it seems an opportune moment do likewise. So, here we have it: man-pages-4.00, my 166th man-pages release.

The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

This release resulted from patches, bug reports,and  comments from over 50 contributors. As well as a large number of minor fixes to around 90 man pages, the more significant changes in man-pages-4.00 include the following:

May 07, 2015 10:53 AM

May 06, 2015

Pete Zaitcev: How Mitchell Baker made me to divorce

Well, nearly did. Deleting history in Firefox 37 is very slow and the UI locks up while you do that. "Very slow" means an operation that takes 13 minutes (not exaggerating - it's reproducible). The UI lock-up means a non-dismissable context menu floating over everything; Firefox itself being, of course, entirely unresponsive. See the screencap.

The screencap is from Linux where I confirmed the problem, but the story started on Windows, where my wife tried to tidy up a bit. So, when Firefox locked up, she killed it, and repeated the process a few times. And what else would you do? We are not talking about hanging up for seconds - it literally was many minutes. Firefox did not pop a dialog with "Please wait, deleting 108,534 objects with separate SQLite transactions", a progress gauge, and a "Cancel" button. Instead, it pretended to lock up.

Interestingly enough, remember when Firefox had a default to keep the history for a week? This mode is gone now - FF keeps the history potentially forever. Instead, it offers a technical limit: 108,534 entries are saved in the "Places" database at the most, in order to prevent SQLite from eating all your storage. Now I understand why my brown "visited" links never go back to blue anymore.

The problem is, there's no alternative. I tried to use Midori as my main browser for a month or two in early 2014, but it was a horrible crash city. I had no choice but to give up and go back to Firefox and its case of Featuritis Obesum.

May 06, 2015 08:10 PM

May 05, 2015

Dave Jones: Thoughts on a feedback loop for Trinity.

With the success that afl has been having on fuzzing userspace, I’ve been revisiting an idea that Andi Kleen gave me years ago for trinity, which was pretty much the same thing but for kernel space. I.e., a genetic algorithm that rates how successful the last fuzz attempt was, and makes a decision on whether to mutate that last run, or do something completely new.

It’s something I’ve struggled to get my head around for a few years. The mutation part would be fairly easy. We would need to store the parameters from the last run, and extrapolate out a set of ->mutate functions from the existing ->sanitize functions that currently generate arguments.

The difficult part is the “how successful” measurement. Typically, we don’t really get anything useful back from a syscall other than “we didn’t crash”, which isn’t particularly useful in this case. What we really want is “did we execute code that we’ve not previously tested”. I’ve done some experiments with code coverage in the past. Explorations of the GCOV feature in the kernel didn’t really get very far however for a few reasons (primarily that it really slowed things down too much, and also I was looking into this last summer, when the initial cracks were showing that I was going to be leaving Red Hat, so my time investment for starting large new projecs was limited).

After recent discussions at work surrounding code coverage, I got thinking about this stuff again, and trying to come up with workable alternatives. I started wondering if I could use the x86 performance counters for this. Basically counting the number of instructions executed between system call enter/exit. The example code that Vince Weaver wrote for perf_event_open looked like a good starting point. I compiled it and ran it a few times.

$ ./a.out 
Measuring instruction count for this printf
Used 3212 instructions
$ ./a.out 
Measuring instruction count for this printf
Used 3214 instructions

Ok, so there’s some loss of precision there, but we can mask off the bottom few bits. A collision isn’t the end of the world for what we’re using this for. That’s just measuring userspace however. What happens if we tell it to measure the kernel, and measure say.. getpid().

$ ./a.out 
Used 9283 instructions
$ ./a.out 
Used 9367 instructions

Ok, that’s a lot more precision we’ve lost. What the hell.
Given how much time he’s spent on this stuff, I emailed Vince, and asked if he had insight as to why the counters weren’t deterministic across different runs. He had actually written a paper on the subject. Turns out we’re also getting event counts here for page faults, hardware interrupts, timers, etc.
x86 counters lack the ability to say “only generate events if RIP is within this range” or anything similar, so it doesn’t look like this is going to be particularly useful.

That’s kind of where I’ve stopped with this for now. I don’t have a huge amount of time to work on this, but had hoped that I could hack up something basic using the perf counters, but it looks like even if it’s possible, it’s going to be a fair bit more work than I had anticipated.

update:
It occurred to me after posting this that measuring instructions isn’t going to work regardless of the amount of precision the counters offer. Consider a syscall that operates on vma’s for example. Over the lifetime of a process, the number of executed instructions of a call to such a syscall will vary even with the same input parameters, as the lengths of various linked lists that have to be walked will change. Number of instructions, or number of branches taken/untaken etc just isn’t a good match for this idea. Approximating “have we been here before” isn’t really achievable with this approach afaics, so I’m starting to think something like the initial gcov idea is the only way this could be done.

Thoughts on a feedback loop for Trinity. is a post from: codemonkey.org.uk

May 05, 2015 05:41 PM

LPC 2015: Deadline for Refereed Talks Is May 12

The deadline for submission of the refereed talks is now Tuesday, May 12, 2105. The Authors Notification date has been moved to May 26th. Get your proposals in! See details on the Participate page.

May 05, 2015 05:38 PM

May 04, 2015

Dave Jones: kernel code coverage brain dump.

Someone at work recently asked me about code coverage tooling for the kernel. I played with this a little last year. At the time I was trying to figure out just how much of certain syscalls trinity was exercising. I ended up being a little disappointed at the level of post-processing tools to deal with the information presented, and added some things to my TODO list to find some time to hack up something, which quickly bubbled its way to the bottom.

As I did a write-up based on past experiences with this stuff, I figured I’d share.

gcov/gprof
requires kernel built with
CONFIG_GCOV_KERNEL=y
GCOV_PROFILE_ALL=y
GCOV_FORMAT_AUTODETECT=y
Note: Setting GCOV_PROFILE_ALL incurs some performance penalty, so any resulting kernel built with this option should _never_ be used for any kind of performance tests.
I can’t exaggerate this enough, it’s miserably slow. Disk operations that took minutes for me now took hours. As example:

Before:

# time dd if=/dev/zero of=output bs=1M count=500
500+0 records in
500+0 records out
524288000 bytes (524 MB) copied, 0.409712 s, 1.3 GB/s
0.00user 0.40system 0:00.41elapsed 99%CPU (0avgtext+0avgdata 2980maxresident)k
136inputs+1024000outputs (1major+340minor)pagefaults 0swaps

After:

# time dd if=/dev/zero of=output bs=1M count=500
500+0 records in
500+0 records out
524288000 bytes (524 MB) copied, 6.17212 s, 84.9 MB/s
0.00user 7.17system 0:07.22elapsed 99%CPU (0avgtext+0avgdata 2940maxresident)k
0inputs+1024000outputs (0major+338minor)pagefaults 0swaps

From 41 seconds, to over 7 minutes. Ugh.

If we *didn’t* set GCOV_PROFILE_ALL, we’d have to recompile just the files we cared about with the relevant gcc profiling switches. It’s kind of a pain.

For all this to work, gcov expects to see a source tree, with:

After booting the kernel, a subtree appears in sysfs at /sys/kernel/debug/gcov/
These directories mirror the kernel source tree, but instead of source files, now contain files that can be fed to the gcov tool. There will be a .gcda file, and a .gcno symlink back to the source tree (with complete path). Ie, /sys/kernel/debug/mm for example contains (among others..)

-rw------- 1 root root 0 Mar 24 11:46 readahead.gcda
lrwxrwxrwx 1 root root 0 Mar 24 11:46 readahead.gcno -> /home/davej/build/linux-dj/mm/readahead.gcno

It is likely the symlink will be broken on the test machine, because the path doesn’t exist, unless you nfs mount the source code from the built kernel for eg.

I hacked up the script below, which may or may not be useful for anyone else (honestly, it’s way easier to just use nfs).
Run it from within a kernel source tree, and it will populate the source tree with the relevant gcda files, and generate the .gcov output file.

  
#!/bin/sh
# gen-gcov-data.sh
obj=$(echo "$1" | sed 's/\.c/\.o/')
if [ ! -f $obj ]; then
  exit
fi

pwd=$(pwd)
dirname=$(dirname $1)
gcovfn=$(echo "$(basename $1)" | sed 's/\.c/\.gcda/')
if [ -f /sys/kernel/debug/gcov$pwd/$dirname/$gcovfn ]; then
  cp /sys/kernel/debug/gcov$pwd/$dirname/$gcovfn $dirname
  gcov -f -r -o $1 $obj
 
  if [ -f $(basename $1).gcov ]; then
    mv $(basename $1).gcov $dirname
  fi
else
  echo "no gcov data for /sys/kernel/debug/gcov$pwd/$dirname/$gcovfn"
fi

Take that script, and run it like so..

$ cd kernel-source-tree
$ find . -type f -name "*.c" -exec gen-gcov-data.sh "{}" \;

Running for eg, gen-gcov-data.sh mm/mmap.c will cause gcov to spit out a mmap.c.gcov file (in the current directory) that has coverage information that looks like..

 
   135684:  269:static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
        -:  270:{
   135684:  271:        struct vm_area_struct *next = vma->vm_next;
        -:  272:
   135684:  273:        might_sleep();
   135686:  274:        if (vma->vm_ops && vma->vm_ops->close)
     5080:  275:                vma->vm_ops->close(vma);
   135686:  276:        if (vma->vm_file)
    90302:  277:                fput(vma->vm_file);
        -:  278:        mpol_put(vma_policy(vma));
   135686:  279:        kmem_cache_free(vm_area_cachep, vma);
   135686:  280:        return next;
        -:  281:}

The numbers on the left being the number of times that line of code was executed.
Lines beginning with ‘-‘ have no coverage information for whatever reason.
If a branch is not taken, it gets prefixed with ‘#####’, like so..

 
  4815374:  391:                if (vma->vm_start < pend) {
    #####:  392:                        pr_emerg("vm_start %lx < pend %lx\n",
        -:  393:                                  vma->vm_start, pend);
        -:  394:                        bug = 1;
        -:  395:                }

There are some cases that need a little more digging to explain. eg:

    88105:  237:static void __remove_shared_vm_struct(struct vm_area_struct *vma,
        -:  238:                struct file *file, struct address_space *mapping)
        -:  239:{
    88105:  240:        if (vma->vm_flags & VM_DENYWRITE)
    15108:  241:                atomic_inc(&file_inode(file)->i_writecount);
    88105:  242:        if (vma->vm_flags & VM_SHARED)
        -:  243:                mapping_unmap_writable(mapping);
        -:  244:
        -:  245:        flush_dcache_mmap_lock(mapping);
    88105:  246:        vma_interval_tree_remove(vma, &mapping->i_mmap);
        -:  247:        flush_dcache_mmap_unlock(mapping);
    88104:  248:}

In this example, lines 245 & 247 have no hitcount, even though there’s no way they could have been skipped.
If we look at the definition of flush_dcache_mmap_(un)lock, we see..
#define flush_dcache_mmap_lock(mapping) do { } while (0)
So the compiler never emitted any code, and hence, it gets treated the same way as the blank lines.

There is a /sys/kernel/debug/gcov/reset file that can be written to to reset the counters before each test if desired.

Additional thoughts

kernel code coverage brain dump. is a post from: codemonkey.org.uk

May 04, 2015 02:54 PM

May 03, 2015

LPC 2015: Device Tree Tools, Validation, and Troubleshooting Microconference Accepted into 2015 Linux Plumbers Conference

There have been more than a few spirited discussions on the topic of device trees (described here) over the past few years, and we can probably expect a few more at this year’s Device Tree microconference. The main focus is on programs, scripts, techniques, and core support to help create correct device trees, validate existing device trees, and support troubleshooting of incorrect device trees, drivers, and subsystems. Within that area of focus, topics span the range from inspection to verification/validation to bindings to documentation. This microconference will also examine the impact of overlays, including boot-time and runtime updates to device trees.

May 03, 2015 07:27 AM

May 01, 2015

Dave Jones: Trinity socket improvements

I’ve been wanting to get back to working on the networking related code in trinity for a long time. I recently carved out some time in the evenings to make a start on some of the lower hanging fruit.

Something that bugged me a while is that we create a bunch of sockets on startup, and then when we call for eg, setsockopt() on that socket, the socket options we pass have more chance of not being the correct protocol for the protocol the socket was created for. This isn’t always a bad thing; for eg, one of the oldest kernel bugs trinity found was found by setting TCP options on a non-TCP socket. But doing this the majority of the time is wasteful, as we’ll just get -EINVAL most the time.

We actually have the necessary information in trinity to know what kind of socket we were dealing with in a socketinfo struct.

struct socket_triplet {
        unsigned int family;
        unsigned int type;
        unsigned int protocol;
};

struct socketinfo {
        struct socket_triplet triplet;
        int fd; 
};

We just had it at the wrong level of abstraction. setsockopt only ever saw a file descriptor. We could have searched through the fd arrays looking for the socketinfo that matched, but that seems like a lame solution. So I changed the various networking syscalls to take a ARG_SOCKETINFO instead of an ARG_FD. As a side-effect, we actually pass sockets to those syscalls more than say, a perf fd, or an epoll fd, or ..

There is still a small chance we pass some crazy fd, just to cover the crazy cases, though those cases don’t tend to trip things up much any more.

After passing down the triplet, it was a simple case of annotating the structures containing the various setsockopt function pointers to indicate which family they belonged to. AF_INET was the only complication, which needed special casing due to the multiple protocols for which we have setsockopt() functions. Creation of a second table, using the protocol instead of the family was enough for the matching code.

There are still a ton of improvements I want to make to this code, but it’s going to take a while, so it’s good when some mostly trivial changes like the above come together quickly.

Trinity socket improvements is a post from: codemonkey.org.uk

May 01, 2015 04:10 PM

April 30, 2015

Rusty Russell: Some bitcoin mempool data: first look

Previously I discussed the use of IBLTs (on the pettycoin blog).  Kalle and I got some interesting, but slightly different results; before I revisited them I wanted some real data to play with.

Finally, a few weeks ago I ran 4 nodes for a week, logging incoming transactions and the contents of the mempools when we saw a block.  This gives us some data to chew on when tuning any fast block sync mechanism; here’s my first impressions looking a the data (which is available on github).

These graphs are my first look; in blue is the number of txs in the block, and in purple stacked on top is the number of txs which were left in the mempool after we took those away.

The good news is that all four sites are very similar; there’s small variance across these nodes (three are in Digital Ocean data centres and one is behind two NATs and a wireless network at my local coworking space).

The bad news is that there are spikes of very large mempools around block 352,800; a series of 731kb blocks which I’m guessing is some kind of soft limit for some mining software [EDIT: 750k is the default soft block limit; reported in 1024-byte quantities as blockchain.info does, this is 732k.  Thanks sipa!].  Our ability to handle this case will depend very much on heuristics for guessing which transactions are likely candidates to be in the block at all (I’m hoping it’s as simple as first-seen transactions are most likely, but I haven’t tested yet).

Transactions in Mempool and in Blocks: Australia (poor connection)

Transactions in Mempool and in Blocks: Singapore

Transactions in Mempool and in Blocks: San Francisco

Transactions in Mempool and in Blocks: San Francisco (using Relay Network)

April 30, 2015 12:26 PM

April 29, 2015

Matthew Garrett: Reducing power consumption on Haswell and Broadwell systems

Edit to add: These patches on their own won't enable this functionality, they just give us a better set of options. Once they're merged we can look at changing the defaults so people get the benefit of this out of the box.

Haswell and Broadwell (Intel's previous and current generations of x86) both introduced a range of new power saving states that promised significant improvements in battery life. Unfortunately, the typical experience on Linux was an increase in power consumption. The reasons why are kind of complicated and distinctly unfortunate, and I'm at something of a loss as to why none of the companies who get paid to care about this kind of thing seemed to actually be caring until I got a Broadwell and looked unhappy, but here we are so let's make things better.

Recent Intel mobile parts have the Platform Controller Hub (Intel's term for the Southbridge, the chipset component responsible for most system i/o like SATA and USB) integrated onto the same package as the CPU. This makes it easier to implement aggressive power saving - the CPU package already has a bunch of hardware for turning various clock and power domains on and off, and these can be shared between the CPU, the GPU and the PCH. But that also introduces additional constraints, since if any component within a power management domain is active then the entire domain has to be enabled. We've pretty much been ignoring that.

The tldr is that Haswell and Broadwell are only able to get into deeper package power saving states if several different components are in their own power saving states. If the CPU is active, you'll stay in a higher-power state. If the GPU is active, you'll stay in a higher-power state. And if the PCH is active, you'll stay in a higher-power state. The last one is the killer here. Having a SATA link in a full-power state is sufficient to keep the PCH active, and that constrains the deepest package power savings state you can enter.

SATA power management on Linux is in a kind of odd state. We support it, but we don't enable it by default. In fact, right now we even remove any existing SATA power management configuration that the firmware has initialised. Distributions don't enable it by default because there are horror stories about some combinations of disk and controller and power management configuration resulting in corruption and data loss and apparently nobody had time to investigate the problem.

I did some digging and it turns out that our approach isn't entirely inconsistent with the industry. The default behaviour on Windows is pretty much the same as ours. But vendors don't tend to ship with the Windows AHCI driver, they replace it with the Intel Rapid Storage Technology driver - and it turns out that that has a default-on policy. But to make things even more awkwad, the policy implemented by Intel doesn't match any of the policies that Linux provides.

In an attempt to address this, I've written some patches. The aim here is to provide two new policies. The first simply inherits whichever configuration the firmware has provided, on the assumption that the system vendor probably didn't configure their system to corrupt data out of the box[1]. The second implements the policy that Intel use in IRST. With luck we'll be able to use the firmware settings by default and switch to the IRST settings on Intel mobile devices.

This change alone drops my idle power consumption from around 8.5W to about 5W. One reason we'd pretty much ignored this in the past was that SATA power management simply wasn't that big a win. Even at its most aggressive, we'd struggle to see 0.5W of saving. But on these new parts, the SATA link state is the difference between going to PC2 and going to PC7, and the difference between those states is a large part of the CPU package being powered up.

But this isn't the full story. There's still work to be done on other components, especially the GPU. Keeping the link between the GPU and an internal display panel active is both a power suck and requires additional chipset components to be powered up. Embedded Displayport 1.3 introduced a new feature called Panel Self-Refresh that permits the GPU and the screen to negotiate dropping the link, leaving it up to the screen to maintain its contents. There's patches to enable this on Intel systems, but it's still not turned on by default. Doing so increases the amount of time spent in PC7 and brings corresponding improvements to battery life.

This trend is likely to continue. As systems become more integrated we're going to have to pay more attention to the interdependencies in order to obtain the best possible power consumption, and that means that distribution vendors are going to have to spend some time figuring out what these dependencies are and what the appropriate default policy is for their users. Intel's done the work to add kernel support for most of these features, but they're not the ones shipping it to end-users. Let's figure out how to make this right out of the box.

[1] This is not necessarily a good assumption, but hey, let's see

comment count unavailable comments

April 29, 2015 07:37 AM

James Morris: SPARC Processor Documentation Online

For folks who don’t follow my twitter or plus accounts, there’s a bunch of SPARC processor documentation here:

http://www.oracle.com/technetwork/server-storage/sun-sparc-enterprise/documentation/sparc-processor-2516655.html

This is up to T4 & M5 and also now includes legacy systems back to Ultra-SPARC I.  Thanks to all who worked on getting these published.

April 29, 2015 01:34 AM

April 23, 2015

Paul E. Mc Kenney: Verification Challenge 5: Uses of RCU

This is another self-directed verification challenge, this time to validate uses of RCU instead of validating the RCU implementations as in earlier posts. As you can see from Verification Challenge 4, the logic expression corresponding even to the simplest Linux-kernel RCU implementation is quite large, weighing in at tens of thousands of variables and hundreds of thousands of clauses. It is therefore worthwhile to look into the possibility of a trivial model of RCU that could be used for verification.

Because logic expressions do not care about cache locality, memory contention, energy efficiency, CPU hotplug, and a host of other complications that a Linux-kernel implementation must deal with, we can start with extreme simplicity. For example:

 1 static int rcu_read_nesting_global;
 2 
 3 static void rcu_read_lock(void)
 4 {
 5   (void)__sync_fetch_and_add(&rcu_read_nesting_global, 2);
 6 }
 7 
 8 static void rcu_read_unlock(void)
 9 {
10   (void)__sync_fetch_and_add(&rcu_read_nesting_global, -2);
11 }
12 
13 static inline void assert_no_rcu_read_lock(void)
14 {
15   BUG_ON(rcu_read_nesting_global >= 2);
16 }
17 
18 static void synchronize_rcu(void)
19 {
20   if (__sync_fetch_and_xor(&rcu_read_nesting_global, 1) < 2)
21     return;
22   SET_NOASSERT();
23   return;
24 }


The idea is to reject any execution in which synchronize_rcu() does not wait for all readers to be done. As before, SET_ASSERT() sets a variable that suppresses all future assertions.

Please note that this model of RCU has some shortcomings:


  1. There is no diagnosis of rcu_read_lock()/rcu_read_unlock() misnesting. (A later version of the model provides limited diagnosis, but under #ifdef CBMC_PROVE_RCU.)
  2. The heavyweight operations in rcu_read_lock() and rcu_read_unlock() result in artificial ordering constraints. Even in TSO systems such as x86 or s390, a store in a prior RCU read-side critical section might be reordered with loads in later critical sections, but this model will act as if such reordering was prohibited.
  3. Although synchronize_rcu() is permitted to complete once all pre-existing readers are done, in this model it will instead wait until a point in time at which there are absolutely no readers, whether pre-existing or new. Therefore, this model's idea of an RCU grace period is even heavier weight than in real life.


Nevertheless, this approach will allow us to find at least some RCU-usage bugs, and it fits in well with cbmc's default fully-ordered settings. For example, we can use it to verify a variant of the simple litmus test used previously:

 1 int r_x;
 2 int r_y;
 3 
 4 int x;
 5 int y;
 6 
 7 void *thread_reader(void *arg)
 8 {
 9   rcu_read_lock();
10   r_x = x;
11 #ifdef FORCE_FAILURE_READER
12   rcu_read_unlock();
13   rcu_read_lock();
14 #endif
15   r_y = y;
16   rcu_read_unlock();
17   return NULL;
18 }
19 
20 void *thread_update(void *arg)
21 {
22   x = 1;
23 #ifndef FORCE_FAILURE_GP
24   synchronize_rcu();
25 #endif
26   y = 1;
27   return NULL;
28 }
29 
30 int main(int argc, char *argv[])
31 {
32   pthread_t tr;
33 
34   if (pthread_create(&tr, NULL, thread_reader, NULL))
35     abort();
36   (void)thread_update(NULL);
37   if (pthread_join(tr, NULL))
38     abort();
39 
40   BUG_ON(r_y != 0 && r_x != 1);
41   return 0;
42 }


This model has only 3,032 variables and 8,844 clauses, more than an order of magnitude smaller than for the Tiny RCU verification. Verification takes about half a second, which is almost two orders of magnitude faster than the 30-second verification time for Tiny RCU. In addition, the model successfully flags several injected errors. We have therefore succeeded in producing a simpler and faster model approximating RCU, and that can handle multi-threaded litmus tests.

A natural next step would be to move to litmus tests involving linked lists. Unfortunately, there appear to be problems with cbmc's handling of pointers in multithreaded situations. On the other hand, cbmc's multithreaded support is quite new, so hopefully there will be fixes for these problems in the near future. After fixes appear, I will give the linked-list litmus tests another try.

In the meantime, the full source code for these models may be found here.

April 23, 2015 05:49 PM

James Bottomley: Getting your old Sync Server to work with New Firefox

Much has been written about Mozilla trying to force people to use their new sync service.  If, like me, you run your own sync server for Firefox, you’ve mostly been ignoring this because there’s still no real way of running your own sync server for the new service (and if you simply keep upgrading, Firefox keeps working with your old server).

However, recently I had cause to want to connect my old sync server to a new installation of firefox without just copying over all the config files (one of the config settings broke google docs and I couldn’t figure out which one it was, so I figured I’d just blow the entire config away and restore from sync).  Long ago Mozilla disabled the ability to connect newer Firefoxes to an old sync server, so this is an exposé of how to do it.  I did actually search the internet for this one, but no-one else seems to have figured it out (or if they have, they’re not known to the search engines).

There are two config files you need to update get new Firefox to connect to sync (note, I did this with Firefox 37; I’ve not tested it with a different version, but I’m pretty sure it will work).  The first is that you need to put your sync key and weave user login into logins.json.  Since the password and user are encypted in this file, the easiest way is to use a password manager extension, like Saved Password Editor add on.  Then you need two new password entries of type “Annotated” under the host chrome://weave.  For each, your username is your weave username.  For the first, you’re going to add your weave password under the annotation “Mozilla Services Password”.  For the second, add the Firefox  key with all the dashes removed as the password under the annotation “Mozilla Services Encryption Passphrase”.  If you’ve got all this right, password manager will show this (my username is jejb):

tmpNext you’re going to close firefox and manually edit the prefs.js file.  To sync completely from scratch, this just needs three entries, so firstly strip out every preference that begins ‘services.sync.’ and then add three new lines

user_pref("services.sync.account", "<my account>");
user_pref("services.sync.serverURL", "<my weave URL>");
user_pref("services.sync.username", "<my weave user name>");

For most people, the account and weave user name are the same.  Now start Firefox and it should just sync on its own.  To check that you got this right, go to the Sync tab of preferences and you should see something like this

tmp

And that’s it.  You’re all done.

April 23, 2015 02:06 AM

April 21, 2015

Eric Sandeen: Buy Green Power this Earth Day

Clicking map goes to a DOE webpage, then please come back!

Happy Earth Day!  If you’ve had enough of the same tired suggestions to recycle more or turn off your lights when you leave the room, and feel like maybe that’s just not cutting it these days, I humbly propose that you can do more.  Right now.  Easy, and cheap.

You can run your home on renewable energy right now.

While it’s super fun to have solar on your roof, that’s not the only way to put more clean energy on the grid.  In most parts of the US, you can choose to buy enough renewable energy to cover part, or all, of your electricity bill.  In some utility markets, the program is through your utility; in others with deregulated utility markets, you can choose a new provider which produces clean energy.  The DOE has more information about these programs here.  The map above is shaded based on the number of options available in each state – click your state to see the DOE page which lists your local options, and choose one today!  Really.  I worked hard on that map, make me proud!

How it works in Minnesota…

Here in Minnesota (and other states as well), Xcel offers Windsource, a program which funds contracts for additional wind energy on the grid.  In 2012, Xcel announced that it had sold its billionth kWh of Windsource energy – that’s one billion kWh generated from wind which would not have been generated without subscriber participation.  In 2013, 33,000 Minnesota homes and 264 businesses participated in Windsource.  It’s surprisingly cheap; while there is an extra cost for the program, all fuel cost charges are removed because, after all, the actual wind is free:

windsource_billAs you can see from my January bill above, my net cost was only half a cent per kWh after the fuel charge was removed.

If you’re an Xcel customer and want to sign up for Windsource, you can do it right here, right now.  You can choose all or part of your bill, but I’d go big, and do your whole bill.  It’s cheap, and it feels good.

… and everywhere else

Other utilities and other states have similar programs; click your state in the map above to get details, links, and pricing information for your local options.

Let me know!

Did this post motivate you to sign up for clean power?  Have you already signed up for renewable energy?  What has been your experience with these utility programs?  Let me know in the comments below, and hey – good job doing something more relevant on Earth Day!

April 21, 2015 06:01 AM

April 20, 2015

Daniel Vetter: Neat drm/i915 stuff for 4.1

With Linux kernel v3.20^W v4.0 already out there door my overview of what's in 4.1 for drm/i915 is way overdue.

First looking at the modeset side of the driver the big overall thing is all the work to convert i915 to atomic. In this release there's code from Ander Conselvan de Oliveira to have a struct drm_atomic_state allocated for all the legacy modeset code paths in the driver. With that we can switch the internals to start using atomic state objects and gradually convert over everything to atomic on the modeset side. Matt Roper on the other hand was busy to prepare the plane code to land the atomic watermark update code. Damien has reworked the initial plane configuration code used for fastboot, which also needs to be adapted to the atomic world.


For more specific feature work there's the DRRS (dynamic refresh rate switching) from Sonika, Vandana and more people, which is now enabled where supported. The idea is to reduce the refresh rate of the panel to save power when nothing changes on the screen. And Paulo Zanoni has provided patches to improve the FBC code, hopefully we can enable that by default soon too. Under the hood Ville has refactored the DP link rate computation and the sprite color key handling, both to prepare for future work and platform enabling. Intermediate link rate support for eDP 1.4 from Sonika built on top of this. Imre Deak has also reworked the Baytrail/Braswell DPLL code to prepare for Broxton.

Speaking of platforms, Skyleigh has gained runtime PM support from Damien, and RPS (render turbo and sleep states) from Akash. Another SKL exclusive is support for scanout of Y-tiled buffers and scanning out buffers rotated by 90°/270° (instead of just normal and rotated by 180°) from Tvrtko and Damien. Well the rotation support didn't quite land yet, but Tvrtko's support for the special pagetable binding needed for that feature in the form of rotated GGTT views. Finally Nick Hoath and Damien also submitted a lot of workaround patches for SKL.

Moving on to Braswell/Cherryview there have been tons of fixes to the DPLL and watermark code from Vijay and Ville, and BSW left the preliminary hw support. And also for the SoC platforms Chris Wilson has supplied a pile of patches to tune the rps code and bring it more in-line with the big core platforms.

On the GT side the big ongoing work is dyanmic pagetable allocations Michel Thierry based upon patches from Ben Widawsky. With per-process address spaces and even more so with the big address spaces gen8+ supports it would be wasteful if not impossible to allocate pagetables for the entire address space upfront. But changing the code to handle another possible memory allocation failure point needed a lot of work. Most of that has landed now, but the benefits of enabling bigger address spaces haven't made it into 4.1.

Another big work is XenGT client-side support fromYu Zhang and team. This is paravirtualization to allow virtual machines to tap into the render engines without requiring exclusive access, but also with a lot less overhead than full virtual hardware like vmware or virgil would provide. The host-side code is also submitted already, but needs a bit more work still to integrate cleanly into the driver.

And of course there's been lots of other smaller work all over, as usual. Internal documentation for the shrinker, more dead UMS code removed, the vblank interrupt code cleaned up and more.

April 20, 2015 03:57 AM

April 19, 2015

Michael Kerrisk (manpages): man-pages-3.83 is released

I've released man-pages-3.83. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

This release resulted from patches, bug reports,and  comments from 30 contributors. As well as a large number of minor fixes to more than 70 man pages, the more significant changes in man-pages-3.83 include the following:




April 19, 2015 02:55 PM

April 18, 2015

James Bottomley: Squirrelmail and imaps

Somewhere along the way squirrelmail stopped working with my dovecot imap server, which runs only on the secure port (imaps).  I only ever use webmail as a last resort, so the problem may be left over from years ago.  The problem is that I’m getting a connect failure but an error code of zero and no error message.  This is what it actually shows

Error connecting to IMAP server "localhost:993".Server error: (0)

Which is very helpful.  Everything else works with imaps on this system, so why not squirrelmail?

The answer, it seems, is buried deep inside php.  Long ago, when php first started using openssl, it pretty much did no peer verification.  Nowadays it does.  I know I ran into this a long time ago, so the self signed certificate my version of dovecot is using is present in the /etc/ssl/certs directory where php looks for authoritative certificates.  Digging into the sources of squirrelmail, it turns out this php statement (with the variables substituted) is the failing one

$imap_stream = @fsockopen('tls://localhost', 993, $errno, $errstr, 15);

It’s failing because $imap_stream is empty, but, as squirrelmail claims, it’s actually failing with a zero error code.  After several hours of casting about with the fairly useless php documentation, it turns out that php has an interactive mode where it will actually give you all the errors.  executing this

echo 'fsockopen("tls://localhost",993,$errno,$errmsg,15);'|php -a

Finally tells me what’s wrong

Interactive mode enabled

PHP Warning: fsockopen(): Peer certificate CN=`bedivere.hansenpartnership.com' did not match expected CN=`localhost' in php shell code on line 1
PHP Warning: fsockopen(): Failed to enable crypto in php shell code on line 1
PHP Warning: fsockopen(): unable to connect to tls://localhost:993 (Unknown error) in php shell code on line 1

So that’s it: php has tightened up the certificate verification not only to validate the certificate itself, but also to check that the CN matches the requested service.  In this case, because I’m connecting over the loopback device (localhost) instead of the internet to the DNS name, that CN check has failed and lead to the results I’m seeing.  Simply fixing squirrelmail to connect to imaps over the fully qualified hostname instead of localhost gets everything working again.

April 18, 2015 12:06 AM

April 14, 2015

Dave Jones: the more things change.. 4.0


$ ping gelk
PING gelk.kernelslacker.org (192.168.42.30) 56(84) bytes of data.
WARNING: kernel is not very fresh, upgrade is recommended.
...
$ uname -r
4.0.0

Remember that one time the kernel versioning changed and nothing in userspace broke ? Me either.

Why people insist on trying to think they can get this stuff right is beyond me.

YOU’RE PING. WHY DO YOU EVEN CARE WHAT KERNEL VERSION IS RUNNING.

update: this was already fixed, almost exactly a year ago in the ping git tree. The (now removed) commentary kind of explains why they cared. Sigh.

the more things change.. 4.0 is a post from: codemonkey.org.uk

April 14, 2015 03:01 PM

April 08, 2015

Rusty Russell: Lightning Networks Part IV: Summary

This is the fourth part of my series of posts explaining the bitcoin Lightning Networks 0.5 draft paper.  See Part I, Part II and Part III.

The key revelation of the paper is that we can have a network of arbitrarily complicated transactions, such that they aren’t on the blockchain (and thus are fast, cheap and extremely scalable), but at every point are ready to be dropped onto the blockchain for resolution if there’s a problem.  This is genuinely revolutionary.

It also vindicates Satoshi’s insistence on the generality of the Bitcoin scripting system.  And though it’s long been suggested that bitcoin would become a clearing system on which genuine microtransactions would be layered, it was unclear that we were so close to having such a system in bitcoin already.

Note that the scheme requires some solution to malleability to allow chains of transactions to be built (this is a common theme, so likely to be mitigated in a future soft fork), but Gregory Maxwell points out that it also wants selective malleability, so transactions can be replaced without invalidating the HTLCs which are spending their outputs.  Thus it proposes new signature flags, which will require active debate, analysis and another soft fork.

There is much more to discover in the paper itself: recommendations for lightning network routing, the node charging model, a risk summary, the specifics of the softfork changes, and more.

I’ll leave you with a brief list of requirements to make Lightning Networks a reality:

  1. A soft-fork is required, to protect against malleability and to allow new signature modes.
  2. A new peer-to-peer protocol needs to be designed for the lightning network, including routing.
  3. Blame and rating systems are needed for lightning network nodes.  You don’t have to trust them, but it sucks if they go down as your money is probably stuck until the timeout.
  4. More refinements (eg. relative OP_CHECKLOCKTIMEVERIFY) to simplify and tighten timeout times.
  5. Wallets need to learn to use this, with UI handling of things like timeouts and fallbacks to the bitcoin network (sorry, your transaction failed, you’ll get your money back in N days).
  6. You need to be online every 40 days to check that an old HTLC hasn’t leaked, which will require some alternate solution for occasional users (shut down channel, have some third party, etc).
  7. A server implementation needs to be written.

That’s a lot of work!  But it’s all simply engineering from here, just as bitcoin was once the paper was released.  I look forward to seeing it happen (and I’m confident it will).

April 08, 2015 03:59 AM

April 06, 2015

Rusty Russell: Lightning Networks Part III: Channeling Contracts

This is the third part of my series of posts explaining the bitcoin Lightning Networks 0.5 draft paper.

In Part I I described how a Poon-Dryja channel uses a single in-blockchain transaction to create off-blockchain transactions which can be safely updated by either party (as long as both agree), with fallback to publishing the latest versions to the blockchain if something goes wrong.

In Part II I described how Hashed Timelocked Contracts allow you to safely make one payment conditional upon another, so payments can be routed across untrusted parties using a series of transactions with decrementing timeout values.

Now we’ll join the two together: encapsulate Hashed Timelocked Contracts inside a channel, so they don’t have to be placed in the blockchain (unless something goes wrong).

Revision: Why Poon-Dryja Channels Work

Here’s half of a channel setup between me and you where I’m paying you 1c: (there’s always a mirror setup between you and me, so it’s symmetrical)

Half a channel: we will invalidate transaction 1 (in favour of a new transaction 2) to send funds.

The system works because after we agree on a new transaction (eg. to pay you another 1c), you revoke this by handing me your private keys to unlock that 1c output.  Now if you ever released Transaction 1, I can spend both the outputs.  If we want to add a new output to Transaction 1, we need to be able to make it similarly stealable.

Adding a 1c HTLC Output To Transaction 1 In The Channel

I’m going to send you 1c now via a HTLC (which means you’ll only get it if the riddle is answered; if it times out, I get the 1c back).  So we replace transaction 1 with transaction 2, which has three outputs: $9.98 to me, 1c to you, and 1c to the HTLC: (once we agree on the new transactions, we invalidate transaction 1 as detailed in Part I)

Our Channel With an Output for an HTLC

Note that you supply another separate signature (sig3) for this output, so you can reveal that private key later without giving away any other output.

We modify our previous HTLC design so you revealing the sig3 would allow me to steal this output. We do this the same way we did for that 1c going to you: send the output via a timelocked mutually signed transaction.  But there are two transaction paths in an HTLC: the got-the-riddle path and the timeout path, so we need to insert those timelocked mutually signed transactions in both of them.  First let’s append a 1 day delay to the timeout path:

Timeout path of HTLC, with locktime so it can be stolen once you give me your sig3.

Similarly, we need to append a timelocked transaction on the “got the riddle solution” path, which now needs my signature as well (otherwise you could create a replacement transaction and bypass the timelocked transaction):

Full HTLC: If you reveal Transaction 2 after we agree it’s been revoked, and I have your sig3 private key, I can spend that output before you can, down either the settlement or timeout paths.

Remember The Other Side?

Poon-Dryja channels are symmetrical, so the full version has a matching HTLC on the other side (except with my temporary keys, so you can catch me out if I use a revoked transaction).  Here’s the full diagram, just to be complete:

A complete lightning network channel with an HTLC, containing a glorious 13 transactions.

Closing The HTLC

When an HTLC is completed, we just update transaction 2, and don’t include the HTLC output.  The funds either get added to your output (R value revealed before timeout) or my output (timeout).

Note that we can have an arbitrary number of independent HTLCs in progress at once, and open and/or close as many in each transaction update as both parties agree to.

Keys, Keys Everywhere!

Each output for a revocable transaction needs to use a separate address, so we can hand the private key to the other party.  We use two disposable keys for each HTLC[1], and every new HTLC will change one of the other outputs (either mine, if I’m paying you, or yours if you’re paying me), so that needs a new key too.  That’s 3 keys, doubled for the symmetry, to give 6 keys per HTLC.

Adam Back pointed out that we can actually implement this scheme without the private key handover, and instead sign a transaction for the other side which gives them the money immediately.  This would permit more key reuse, but means we’d have to store these transactions somewhere on the off chance we needed them.

Storing just the keys is smaller, but more importantly, Section 6.2 of the paper describes using BIP 32 key hierarchies so the disposable keys are derived: after a while, you only need to store one key for all the keys the other side has given you.  This is vastly more efficient than storing a transaction for every HTLC, and indicates the scale (thousands of HTLCs per second) that the authors are thinking.

Next: Conclusion

My next post will be a TL;DR summary, and some more references to the implementation details and possibilities provided by the paper.

 


[1] The new sighash types are fairly loose, and thus allow you to attach a transaction to a different parent if it uses the same output addresses.  I think we could re-use the same keys in both paths if we ensure that the order of keys required is reversed for one, but we’d still need 4 keys, so it seems a bit too tricky.

April 06, 2015 11:21 AM

April 03, 2015

LPC 2015: LLVM Microconference Accepted into 2015 Linux Plumbers Conference

This microconference will cover all things LLVM related to Linux. Discussions will range from progress in compiling the Linux kernel (and changes in clang/LLVM) to support of clang in yocto, and even to compiling an entire distro with clang (while also using the “musl” replacement for glibc and uclibc). The topics will also include LLVM being used for bug hunting and for the extended Berkeley Packet Filter (eBPF). More topics and speakers will of course be added. If you have an LLVM-related discussion you would like to lead up, please contact Behan Webster <behanw@converseincode.com>.

April 03, 2015 02:23 PM

April 02, 2015

Daniel Vetter: Community Code of Conduct for intel-gfx

[This is a cross-post from the mail I just sent out to intel-gfx.]

Code of conducts seem to be in the news a bit recently, and I realized that I've never really documented how we run things. It's different from the kernel's overall CodeOfConflict and also differs from the official Intel/OTC one in small details about handling issues. And for completeness there's also the Xorg Foundation event policy. Anyway, I think this is worth clarifying and here it goes.

It's simple: Be respectful, open and excellent to each another.

Which doesn't mean we want to sacrifice quality to be nice. Striving for technical excellence very much doesn't exclude being excellent to someone else, and in our experience it tends to go hand in hand.

Unfortunately things go south occasionally. So if you feel threatened, personally abused or otherwise uncomfortable, even and especially when you didn't participate in a discussion yourself, then please raise this in private with the drm/i915 maintainers (currently Daniel Vetter and Jani Nikula, see MAINTAINERS for contact information). And the "in private" part is important: Humans screw up, disciplining minor fumbles by tarnishing someones google-able track record forever is out of proportion.

Still there are some teeth to this code of conduct:

1. First time around minor issues will be raised in private.

2. On repeat cases a public reply in the discussion will enforce that respectful behavior is expected.

3. We'll ban people who don't get it.

And severe cases will be escalated much quicker.

This applies to all community communication channels (irc, mailing list and bugzilla). And as mentioned this really just is a public clarification of the rules already in place - you can't see that though since we never had to go further than step 1.

Let's keep it at that.

And in case you have a problem with an individual drm/i915 maintainer and don't want to raise it with the other one there's the Xorg BoD, linux foundation TAB and the drm upstream maintainer Dave Airlie.

April 02, 2015 07:53 AM

April 01, 2015

Rusty Russell: Lightning Networks Part II: Hashed Timelock Contracts (HTLCs)

In Part I, we demonstrated Poon-Dryja channels; a generalized channel structure which used revocable transactions to ensure that old transactions wouldn’t be reused.

A channel from me<->you would allow me to efficiently send you 1c, but that doesn’t scale since it takes at least one on-blockchain transaction to set up each channel. The solution to this is to route funds via intermediaries;  in this example we’ll use the fictitious “MtBox”.

If I already have a channel with MtBox’s Payment Node, and so do you, that lets me reliably send 1c to MtBox without (usually) needing the blockchain, and it lets MtBox send you 1c with similar efficiency.

But it doesn’t give me a way to force them to send it to you; I have to trust them.  We can do better.

Bonding Unrelated Transactions using Riddles

For simplicity, let’s ignore channels for the moment.  Here’s the “trust MtBox” solution:

I send you 1c via MtBox; simplest possible version, using two independent transactions. I trust MtBox to generate its transaction after I send it mine.

What if we could bond these transactions together somehow, so that when you spend the output from the MtBox transaction, that automatically allows MtBox to spend the output from my transaction?

Here’s one way. You send me a riddle question to which nobody else knows the answer: eg. “What’s brown and sticky?”.  I then promise MtBox the 1c if they answer that riddle correctly, and tell MtBox that you know.

MtBox doesn’t know the answer, so it turns around and promises to pay you 1c if you answer “What’s brown and sticky?”. When you answer “A stick”, MtBox can pay you 1c knowing that it can collect the 1c off me.

The bitcoin blockchain is really good at riddles; in particular “what value hashes to this one?” is easy to express in the scripting language. So you pick a random secret value R, then hash it to get H, then send me H.  My transaction’s 1c output requires MtBox’s signature, and a value which hashes to H (ie. R).  MtBox adds the same requirement to its transaction output, so if you spend it, it can get its money back from me:

Two Independent Transactions, Connected by A Hash Riddle.

Handling Failure Using Timeouts

This example is too simplistic; when MtBox’s PHP script stops processing transactions, I won’t be able to get my 1c back if I’ve already published my transaction.  So we use a familiar trick from Part I, a timeout transaction which after (say) 2 days, returns the funds to me.  This output needs both my and MtBox’s signatures, and MtBox supplies me with the refund transaction containing the timeout:

Hash Riddle Transaction, With Timeout

MtBox similarly needs a timeout in case you disappear.  And it needs to make sure it gets the answer to the riddle from you within that 2 days, otherwise I might use my timeout transaction and it can’t get its money back.  To give plenty of margin, it uses a 1 day timeout:

MtBox Needs Your Riddle Answer Before It Can Answer Mine

Chaining Together

It’s fairly clear to see that longer paths are possible, using the same “timelocked” transactions.  The paper uses 1 day per hop, so if you were 5 hops away (say, me <-> MtBox <-> Carol <-> David <-> Evie <-> you) I would use a 5 day timeout to MtBox, MtBox a 4 day to Carol, etc.  A routing protocol is required, but if some routing doesn’t work two nodes can always cancel by mutual agreement (by creating timeout transaction with no locktime).

The paper refers to each set of transactions as contracts, with the following terms:

The hashing and timelock properties of the transactions are what allow them to be chained across a network, hence the term Hashed Timelock Contracts.

Next: Using Channels With Hashed Timelock Contracts.

The hashed riddle construct is cute, but as detailed above every transaction would need to be published on the blockchain, which makes it pretty pointless.  So the next step is to embed them into a Poon-Dryja channel, so that (in the normal, cooperative case) they don’t need to reach the blockchain at all.

April 01, 2015 11:46 AM

LPC 2015: Android/Mobile Microconference Accepted into 2015 Linux Plumbers Conference

As with 2014 and several years prior, 2015 is the year of the Linux smartphone. There are a number of mobile/embedded environments based on the Linux kernel, the most prominent of course being Android. One consequence of this prominence is a variety of projects derived from Android Open Source Project (AOSP), which raises the question of how best to manage them, and additionally if it is possible to run a single binary image of the various software components across a variety of devices. In addition, although good progress has been made upstreaming various Android patches, there is more work to be done for ADF, KMS, and Sync, among others. Migrating from Binder to KDBus is still a challenge, as are a number of other candidates for removal from drivers/staging. There are also issues remaining with ION, cenalloc, and DMA API. Finally, power management is still in need of improvement, with per-process power management being a case in point.

So when is the year of the Linux desktop? It seems that these developers are too busy working on mobile devices to have time to ask that question!

We hope to see you there!

April 01, 2015 07:18 AM

March 31, 2015

Michael Kerrisk (manpages): man-pages-3.82 is released

I've released man-pages-3.82. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

As well as a large number of minor fixes to more than 80 man pages, the more significant changes in man-pages-3.82 include the following:

March 31, 2015 07:00 AM

March 30, 2015

Rusty Russell: Lightning Networks Part I: Revocable Transactions

I finally took a second swing at understanding the Lightning Network paper.  The promise of this work is exceptional: instant reliable transactions across the bitcoin network. But the implementation is complex and the draft paper reads like a grab bag of ideas; but it truly rewards close reading!  It doesn’t involve novel crypto, nor fancy bitcoin scripting tricks.

There are several techniques which are used in the paper, so I plan to concentrate on one per post and wrap up at the end.

Revision: Payment Channels

I open a payment channel to you for up to $10

A Payment Channel is a method for sending microtransactions to a single recipient, such as me paying you 1c a minute for internet access.  I create an opening transaction which has a $10 output, which can only be redeemed by a transaction input signed by you and me (or me alone, after a timeout, just in case you vanish).  That opening transaction goes into the blockchain, and we’re sure it’s bedded down.

I pay you 1c in the payment channel. Claim it any time!

Then I send you a signed transaction which spends that opening transaction output, and has two outputs: one for $9.99 to me, and one for 1c to you.  If you want, you could sign that transaction too, and publish it immediately to get your 1c.

Update: now I pay you 2c via the payment channel.

Then a minute later, I send you a signed transaction which spends that same opening transaction output, and has a $9.98 output for me, and a 2c output for you. Each minute, I send you another transaction, increasing the amount you get every time.

This works because:

  1.  Each transaction I send spends the same output; so only one of them can ever be included in the blockchain.
  2. I can’t publish them, since they need your signature and I don’t have it.
  3. At the end, you will presumably publish the last one, which is best for you.  You could publish an earlier one, and cheat yourself of money, but that’s not my problem.

Undoing A Promise: Revoking Transactions?

In the simple channel case above, we don’t have to revoke or cancel old transactions, as the only person who can spend them is the person who would be cheated.  This makes the payment channel one way: if the amount I was paying you ever went down, you could simply broadcast one of the older, more profitable transactions.

So if we wanted to revoke an old transaction, how would we do it?

There’s no native way in bitcoin to have a transaction which expires.  You can have a transaction which is valid after 5 days (using locktime), but you can’t have one which is valid until 5 days has passed.

So the only way to invalidate a transaction is to spend one of its inputs, and get that input-stealing transaction into the blockchain before the transaction you’re trying to invalidate.  That’s no good if we’re trying to update a transaction continuously (a-la payment channels) without most of them reaching the blockchain.

The Transaction Revocation Trick

But there’s a trick, as described in the paper.  We build our transaction as before (I sign, and you hold), which spends our opening transaction output, and has two outputs.  The first is a 9.99c output for me.  The second is a bit weird–it’s 1c, but needs two signatures to spend: mine and a temporary one of yours.  Indeed, I create and sign such a transaction which spends this output, and send it to you, but that transaction has a locktime of 1 day:

The first payment in a lightning-style channel.

Now, if you sign and publish that transaction, I can spend my $9.99 straight away, and you can publish that timelocked transaction tomorrow and get your 1c.

But what if we want to update the transaction?  We create a new transaction, with 9.98c output to me and 2c output to a transaction signed by both me and another temporary address of yours.  I create and sign a transaction which spends that 2c output, has a locktime of 1 day and has an output going to you, and send it to you.

We can revoke the old transaction: you simply give me the temporary private key you used for that transaction.  Weird, I know (and that’s why you had to generate a temporary address for it).  Now, if you were ever to sign and publish that old transaction, I can spend my $9.99 straight away, and create a transaction using your key and my key to spend your 1c.  Your transaction (1a below) which could spend that 1c output is timelocked, so I’ll definitely get my 1c transaction into the blockchain first (and the paper uses a timelock of 40 days, not 1).

Updating the payment in a lightning-style channel: you sent me your private key for sig2, so I could spend both outputs of Transaction 1 if you were to publish it.

So the effect is that the old transaction is revoked: if you were to ever sign and release it, I could steal all the money.  Neat trick, right?

A Minor Variation To Avoid Timeout Fallback

In the original payment channel, the opening transaction had a fallback clause: after some time, it is all spendable by me.  If you stop responding, I have to wait for this to kick in to get my money back.  Instead, the paper uses a pair of these “revocable” transaction structures.  The second is a mirror image of the first, in effect.

A full symmetric, bi-directional payment channel.

So the first output is $9.99 which needs your signature and a temporary signature of mine.  The second is  1c for meyou.  You sign the transaction, and I hold it.  You create and sign a transaction which has that $9.99 as input, a 1 day locktime, and send it to me.

Since both your and my “revocable” transactions spend the same output, only one can reach the blockchain.  They’re basically equivalent: if you send yours you must wait 1 day for your money.  If I send mine, I have to wait 1 day for my money.  But it means either of us can finalize the payment at any time, so the opening transaction doesn’t need a timeout clause.

Next…

Now we have a generalized transaction channel, which can spend the opening transaction in any way we both agree on, without trust or requiring on-blockchain updates (unless things break down).

The next post will discuss Hashed Timelock Contracts (HTLCs) which can be used to create chains of payments…

Notes For Pedants:

In the payment channel open I assume OP_CHECKLOCKTIMEVERIFY, which isn’t yet in bitcoin.  It’s simpler.

I ignore transaction fees as an unnecessary distraction.

We need malleability fixes, so you can’t mutate a transaction and break the ones which follow.  But I also need the ability to sign Transaction 1a without a complete Transaction 1 (since you can’t expose the signed version to me).  The paper proposes new SIGHASH types to allow this.

[EDIT 2015-03-30 22:11:59+10:30: We also need to sign the other symmetric transactions before signing the opening transaction.  If we released a completed opening transaction before having the other transactions, we might be stuck with no way to get our funds back (as we don’t have a “return all to me” timeout on the opening transaction)]

March 30, 2015 10:47 AM

March 29, 2015

LPC 2015: Development Tools Tutorial Accepted into 2015 Linux Plumbers Conference

In a departure from prior Plumbers tradition, we are pleased to announce not a Development Tools Microconference, but rather a set of Development Tools tutorials, including interactive tutorials, demos, and short presentations. Topics include Coccinelle (Julia Lawall), testing and debugging tools (Shuah Khan), issues with copying and pasting Linux kernel code (Michael Godfrey), and LLVM/clang and the Linux kernel (Behan Webster).

Given how important tools are to productivity of developers and the quality of their code, the time devoted to these tutorials promises to be time well spent!

Come and find out how to use the tools you have heard about! We hope to see you there!

March 29, 2015 07:37 PM

March 27, 2015

LPC 2015: Registration for LPC 2015 Is Now Open

The 2015 Linux Plumbers Conference organizing committee is pleased to announce that the registration for this year’s conference is now open. The conference will be taking place in Seattle, Washington, USA, August 19th through August 21st. Information on how to register can be found on the ATTEND page. Registration prices and cutoff dates are also published in the ATTEND page. As usual, contact us if you have questions.

March 27, 2015 11:04 PM

March 26, 2015

LPC 2015: Containers Microconference Accepted into 2015 Linux Plumbers Conference

Over the past year, the advent of Docker has further increased the level of Containers excitement. Additional points of Containers interest include the LXC 1.1 release (which includes CRIU checkpoint/restore, in-container systemd support, and feature-set compatibility across systemd, sysvinit, and upstart), the recently announced merger of OpenVZ and Cloud server, and progress in the kernel namespace and cgroups infrastructure.

The goal of this microconference is to get these diverse groups together to discuss the long-standing issue of container device isolation, live migration, security features such as user namespaces, and the dueling-systemd challenges stemming from running systemd both within containers and on the host OS (see “The Grumpy Editor’s guide to surviving the systemd debate” for other systemd-related topics).

Please join us for a timely and important discussion!

March 26, 2015 11:10 AM