Kernel Planet

June 20, 2013

Dave Jones: Daily log June 19th 2013

Continued work tracking down the mysterious soft-lockup case I seem to be triggering with ease. Thought at one point I’d nailed it after building a kernel with a recent RCU patch reverted. But it just took longer to hit.
Set up a second system with the same hardware just to rule out some kind of weird hardware bug. Reproduced the bug on that instantly.
Upside: Now I have two test machines. Downside: Still no closer to figuring out what the hell is causing the bug.

Tomorrow involves git-bisect.

Daily log June 19th 2013 is a post from: codemonkey.org.uk

June 20, 2013 04:36 AM

Matt Domsch: MirrorManager 1.4 now in production in Fedora Infrastructure

After nearly 3 years in on-again/off-again development, MirrorManager 1.4 is now live in the Fedora Infrastructure, happily serving mirrorlists to yum, and directing Fedora users to their favorite ISOs – just in time for the Fedora 19 freeze.

Kudos go out to Kevin Fenzi, Seth Vidal, Stephen Smoogen, Toshio Kuratomi, Pierre-Yves Chibon, Patrick Uiterwijk, Adrian Reber, and Johan Cwiklinski for their assistance in making this happen.  Special thanks to Seth for moving the mirrorlist-serving processes to their own servers where they can’t harm other FI applications, and to Smooge, Kevin and Patrick, who gave up a lot of their Father’s Day weekend (both days and nights) to help find and fix latent bugs uncovered in production.

What does this bring the average Fedora user?  Not a lot…  More stability – fewer failures with yum retrieving the mirror lists, not that there were many, but it was nonzero.  A list of public mirrors where the versions are sorted in numerical order.

What does this bring to a Fedora mirror administrator?  A few new tricks:

What does this bring Fedora Infrastructure (or anyone else running MirrorManager)?

MM 1.4 is a good step forward, and hopefully I’ve laid the groundwork to make it easier to improve in the future.  I’m excited that more of the Fedora Infrastructure team has learned (the hard way) the internals of MM, so I’ll have additional help going forward too.

June 20, 2013 01:39 AM

June 19, 2013

Matthew Garrett: Mir, the Canonical CLA and skewing the playing field

Mir is Canonical's equivalent to Wayland - a display server, responsible for getting application pixmaps onto a screen. It's intended to scale from mobile devices to the desktop, and as such is expected to turn up in Ubuntu Phone before too long[1]. There's already plenty of discussion about whether the technical differences between Wayland and Mir are sufficient to justify Canonical going their own way, so I'm not planning on talking about that.

Like many Canonical-led projects, Mir is under GPLv3 - a strong copyleft license. There's a couple of aspects of GPLv3 that are intended to protect users from being unable to make use of the rights that the license grants them. The first is that if GPLv3 code is shipped as part of a user product, it must be possible for the user to replace that GPLv3 code. That's a problem if your device is intended to be locked down enough that it can only run vendor code. The second is that it grants an explicit patent license to downstream recipients, permitting them to make use of those patents in derivative works.

One of the consequences of these obligations is that companies whose business models depend on either selling locked-down devices or licensing patents tend to be fairly reluctant to ship GPLv3 software. In effect, this is GPLv3 acting entirely as intended - unless you're willing to guarantee that a user can exercise the freedoms defined by the free software definition, you don't get to ship GPLv3 material. Some companies have decided that shipping GPLv3 code would be more expensive than either improving existing code under a more liberal license or writing new code from scratch. Android's a pretty great example of this - it contains no GPLv3 code, and even GPLv2 code (outside the kernel) is kept to a minimum.

Which, given Canonical's focus on pushing Ubuntu into GPLv3-hostile markets, makes the choice of GPLv3 an odd one. This isn't a problem as long as they're the sole copyright holder, because the copyright holder is obviously free to ship their code under as many licenses as they want. But Canonical still aim to foster community involvement, and ideally that includes accepting external contributions to their code. If Canonical simply accepted those contributions under GPLv3 then they'd no longer have the right to relicense the entire codebase, so any contributions are only accepted if the contributor has signed a Contributor License Agreement.

Canonical's CLA is pretty simple. In essence, it grants Canonical the right to use, modify and distribute your code, and it grants Canonical a patent license under any patents you own that may cover the code in question. But, most importantly, it grants Canonical the right to relicense your contribution under their choice of license. This means that, despite not being the sole copyright holder, Canonical are free to relicense your code under a proprietary license.

Given Canonical's market goals, this makes sense. They can relicense Mir (and any other GPLv3 projects they own) under licenses that keep their hardware partners happy, and they can ship in the phone market. Everyone's a winner.

Except, if Canonical want to ship proprietary versions, why not just license Mir under a license that permits that in the first place? This is where the asymmetry comes in. The Android userland is released under a permissive license that allows anyone to take Google's code, modify it as they wish and ship it on whatever hardware they want. I could legally start a company that provided customised versions of Android to phone vendors without them having any GPLv3 concerns. I won't be able to do that with Ubuntu Phone.

I'm a fan of GPLv3. I think the provisions it contains to support user freedom are important. I hate the growing trend of using free software to build devices that are, effectively, impossible for the end user to modify. If Canonical were releasing software under GPLv3 because of a commitment to free software then that would be an amazing thing. But it's pretty much impossible to square the CLA's requirement that contributors grant Canonical the right to ship under a proprietary license with a commitment to free software. Instead you end up with a situation that looks awfully like Canonical wanting to squash competition by making it impossible for anyone else to sell modified versions of Canonical's software in the same market.

Canonical aren't doing anything illegal or immoral here. They're free to run their projects in any way they choose. But retaining the right to produce proprietary versions of external contributions without granting equivalent reciprocal rights isn't consistent with caring about free software or contributing to the wider Linux community, especially if it means you get to exclude those external contributors from the market you're selling their code into.

(Edit to add: a friend in the contracting industry points out that it also prevents vendors who won't ship GPLv3 from using external contractors to work on Mir - they have to go to Canonical, because only Canonical can relicense contributions under a proprietary license.)

[1] Right now Ubuntu Phone is using Surfaceflinger, the Android display server, but that's apparently just an interim solution.

June 19, 2013 10:50 PM

Dave Jones: Daily log June 18th 2013

Lots of hitting already-found bugs today (sync() lockups, btrfs lockdep report, btrfs warnings).
Things are bad enough I’m having to avoid running certain things during testing.
Spent some time looking through various upstream trees to ascertain just how much stuff is fixed, and pending merge. Results inconclusive, but I get the feeling a lot of people are waiting for 3.10 before merging, which is annoying.

Rebuilt some older test machines to get some 32-bit in the mix. Hit a bug straight away. Again, one that has already been reported a month ago. Nngh.

Started paving the way for merging some of my linked-list debugging improvements, starting with some cleanups to get rid of __list_for_each, now that it’s redundant. (It’s identical to the regular list_for_each() now that the prefetching has been removed).

Also started prepping some of the other debug patches I’ve been sitting on for a while.

Daily log June 18th 2013 is a post from: codemonkey.org.uk

June 19, 2013 06:07 AM

June 18, 2013

Eric Sandeen: Amazon cancels Minnesota affiliates

[Note, if you're looking for an alternative, you might try VigLink; I'm giving that a shot now to see how it goes.]

Well, it happened.  First and foremost, I’ve always tried to make my blog interesting to readers interested in technology & energy, and in the process I’ve sometimes linked out to relevant products on Amazon, to make me a little beer money.  I’ve tried not to be too annoying or gratuitous about it, but it did help a little to offset the ISP charges etc.  But today I got this email:

We are writing from the Amazon Associates Program to notify you that your Associates account will be closed and your Amazon Services LLC Associates Program Operating Agreement will be terminated effective June 30, 2013. This is a direct result of the unconstitutional Minnesota state tax collection legislation passed by the state legislature and signed by Governor Dayton on May 23, 2013, with an effective date of July 1, 2013. As a result, we will no longer pay any advertising fees for customers referred to an Amazon Site after June 30 nor will we accept new applications for the Associates Program from Minnesota residents.

As near as I can tell, Amazon has neatly evaded the law, which added:

(b) A retailer is presumed to have a solicitor in this state if it enters into an agreement with a resident under which the resident, for a commission or other substantially similar consideration, directly or indirectly refers potential customers, whether by a link on an Internet Web site, or otherwise, to the seller.

So: Chuck out all the affiliates, collect no tax, done and done.  The state is no better off, and the bloggers in the state are worse off.  This is exactly what has happened in other states, so it should come as no surprise to our esteemed legislators.  I get it that states are hurting from dropping sales tax from brick and mortar stores and are looking for solutions, but it should have been obvious to anyone paying attention that this law would have very little effect when it’s this simple for places like Amazon to avoid it.

I was tempted to purge all links to Amazon from the blog – why send my good readers there for free?  ;)  But going forward, I guess I’ll try VigLink, which is sort of an affiliate of affiliates, and seems immune from this kind of thing, at least for now.  It looks trivial to switch over to w/o needing to go fix up any existing articles.  Hopefully it won’t make me look too craven; I’ll fine tune it as we go along.

June 18, 2013 01:52 PM

Dave Jones: Daily log June 17th 2013

Busy day. Hit a whole bunch of new bugs.

Daily log June 17th 2013 is a post from: codemonkey.org.uk

June 18, 2013 04:27 AM

June 14, 2013

Dave Jones: Daily log June 14th 2013

Looked over my 3.10-rc5 outstanding issues. Some more patches got merged, so things are starting to look better, as long as I don’t find any new problems next week.

Daily log June 14th 2013 is a post from: codemonkey.org.uk

June 14, 2013 08:10 PM

Dave Jones: Weekly Fedora kernel bug statistics – June 14th 2013

                            17    18    19  rawhide  (total)
Open:                      244   424   134       72    (874)
Opened since 2013-06-07      2    25    12        7     (46)
Closed since 2013-06-07     10     9     3        1     (23)
Changed since 2013-06-07    16    39    32       13    (100)

Weekly Fedora kernel bug statistics – June 14th 2013 is a post from: codemonkey.org.uk

June 14, 2013 04:06 PM

June 13, 2013

Matt Domsch: Fedora Project Board Town Hall Thursday 1900 UTC

I have the pleasure of moderating the Fedora Project Board Town Hall today, 1900 UTC, having served on the board for five years previously.  Held on IRC, these Town Halls give project members a chance to ask questions directly of the five Board candidates, so that you can make a more informed decision when casting your vote.  I hope you can join us.

June 13, 2013 03:38 PM

Dave Jones: Daily log June 12th 2013

Daily log June 12th 2013 is a post from: codemonkey.org.uk

June 13, 2013 04:58 AM

June 12, 2013

Dave Jones: Daily log June 11th 2013

Daily log June 11th 2013 is a post from: codemonkey.org.uk

June 12, 2013 04:03 AM

June 11, 2013

Andy Grover: New screencast: 10 New Features in LIO and targetcli

I posted a new screencast that talks about ten ease-of-use features that are new in Fedora 18.

10 New Features in LIO and targetcli

  1. Easier storage->ACL setup
  2. Name shows up as LUN model name
  3. Tags for initiator aliases and grouping
  4. ‘info’ command
  5. IPv6 portal support
  6. WWNs normalized
  7. Only show HW fabrics that are present
  8. 10 previous configs saved
  9. More info in summary
  10. iSER support
  11. Better sorting

June 11, 2013 04:38 PM

Dave Jones: Daily log June 10th 2013

My 3.10-rc5 outstanding issues.

Daily log June 10th 2013 is a post from: codemonkey.org.uk

June 11, 2013 04:12 AM

Dave Jones: git commit statistics.

Linus posted a git one-liner this morning that ended up intriguing me.

git log --pretty=%aD --author=davej@redhat.com | cut -c1-3 | sort | uniq -c | sort -n

The output looks like this..

     37 Sat
     42 Sun
     50 Fri
     73 Wed
     78 Mon
     79 Thu
    112 Tue

What I found interesting, is that on every single git repo I ran that command on, Tuesday was my most productive day.
As much as I hate Mondays, I think the real reason there is that I treat Mondays as ‘catch up’ day for the most part. As can be clearly seen, I don’t really work on weekends any more, so Mondays tend to be dealing with a backlog of email/git commits, bringing kernels on test machines up to date, and paging in context from whatever I was working on the prior week.
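For anyone who would rather do the tally outside the shell, here is a rough Python sketch of the same pipeline. It assumes the input is the output of `git log --pretty=%aD`, where every line starts with a three-letter weekday abbreviation. The function name `commits_per_weekday` and the sample dates are made up for illustration.

```python
from collections import Counter


def commits_per_weekday(log_lines):
    """Tally commits by weekday from `git log --pretty=%aD` output.

    Each line is an RFC 2822 style date such as
    'Tue, 11 Jun 2013 01:17:00 -0400'. The first three characters are
    the weekday, which is what `cut -c1-3` keeps in the shell version.
    """
    counts = Counter(line[:3] for line in log_lines if line.strip())
    # Sort ascending by count, like the trailing `sort -n`.
    return sorted(counts.items(), key=lambda item: item[1])


# Hypothetical sample input, not real commit data.
sample = [
    "Tue, 11 Jun 2013 01:17:00 -0400",
    "Tue, 4 Jun 2013 09:30:12 -0400",
    "Mon, 10 Jun 2013 16:02:45 -0400",
]

for day, count in commits_per_weekday(sample):
    print(f"{count:7d} {day}")
```

To run it against a real repository, feed it the lines of the `git log` output instead of `sample`.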

git commit statistics. is a post from: codemonkey.org.uk

June 11, 2013 01:17 AM

June 10, 2013

Pavel Machek: Dear William Hague

If you are not a criminal, please publish your PIN, bank account password, email/facebook passwords and complete medical history. You don't have anything to hide, do you?

June 10, 2013 09:00 AM

June 08, 2013

Eric Sandeen: Running the numbers on Minnesota’s solar mandate

At the end of the 2013 legislative session in Minnesota, legislators passed an omnibus energy bill which included, among other things, a requirement that investor-owned utilities in Minnesota (Read: Xcel Energy) must generate 1.5% of their electricity from solar by 2020.  There were a lot of other things in there as a result of the sausage law-making process for the solar mandate, including some that I’m not very fond of, but the bottom line of encouraging more solar development is a good thing in my book.  (Also, it was signed into law on my birthday!)

1.5% doesn’t sound like a whole lot, but what does it really mean in terms of physical solar PV deployments?  Numbers have been tossed around that this will require 450MW of new capacity in the next 7 years.

Assuming the 450MW number is correct, and picking 250W panels as a common panel size today, that’s 450,000,000 / 250 = 1,800,000 or 1.8 million panels installed by 2020.  That’s about 700 panels installed every day for 7 years.

If commodity sized (65x39cm) panels are used, that’s about 112 acres of panels (if they were laid out flat and edge to edge, which of course they aren’t) ;)  That’s roughly equivalent to 112 US football fields.
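The back-of-the-envelope numbers above can be redone in a few lines. This is only a sketch using the figures as quoted in the text (450MW, 250W panels, the panel dimensions as written, 7 years); the acre conversion factor and the variable names are my own additions.

```python
CAPACITY_W = 450_000_000       # 450MW of new capacity, as quoted
PANEL_W = 250                  # assumed per-panel rating from the text
YEARS = 7
PANEL_AREA_M2 = 0.65 * 0.39    # panel dimensions exactly as quoted above
M2_PER_ACRE = 4046.86          # conversion factor (not from the post)

panels = CAPACITY_W // PANEL_W
panels_per_day = panels / (YEARS * 365)
acres = panels * PANEL_AREA_M2 / M2_PER_ACRE

print(f"panels needed:  {panels:,}")
print(f"panels per day: {panels_per_day:.0f}")
print(f"area in acres:  {acres:.1f}")
```

Changing any one input (a different panel wattage, say) shows quickly how sensitive the totals are to the assumptions.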

Is this possible?  Sure.  Austria installed 230MW in 2012 alone.  New Jersey installed 415MW in 2012.  And Minnesota gave itself 7 years to accomplish this goal.

Is 450MW the right number?  According to the NREL PVWatts calculator for Minneapolis, 450MW of optimally situated, fixed solar PV could be expected to generate 578,512 MWh of solar energy in the course of a year.

According to the EIA energy data browser, all utilities (including co-ops etc) in Minnesota generated 42,586,000 MWh in 2012.  578,512MWh is about 1.4% of that number.  Xcel is by far the largest generator, so if we take out the smaller co-ops etc, 450MW does seem like a reasonable ballpark number.
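As a sanity check, the share of statewide generation and the capacity factor implied by the PVWatts estimate can be computed directly. A minimal sketch using only the figures quoted above; the 8,760 hours-per-year constant and the variable names are my additions.

```python
PV_MWH = 578_512           # PVWatts estimate for 450MW in Minneapolis
STATE_MWH = 42_586_000     # EIA figure for all Minnesota utilities, 2012
CAPACITY_MW = 450
HOURS_PER_YEAR = 8760      # non-leap year

# Fraction of statewide generation the solar output would represent.
share = PV_MWH / STATE_MWH

# Capacity factor: actual output over output at full rating all year.
capacity_factor = PV_MWH / (CAPACITY_MW * HOURS_PER_YEAR)

print(f"share of state generation: {share:.2%}")
print(f"implied capacity factor:   {capacity_factor:.1%}")
```

The share is measured against all utilities, so against the investor-owned utilities alone it would be somewhat higher, which is the point of the ballpark argument above.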

There are already large companies ready to jump at this.  Geronimo Energy has submitted a proposal to provide up to 100MW of capacity at up to 31 sites ranging from 2 to 10MW.  I honestly hope this isn’t the predominant mode of development.  We have an awful lot of flat roofs which would be well suited – for example, Ikea put 1MW on their Minnesota store last year.  100 acres or so isn’t all that much land, but I’d still rather see this go up on the built environment before we start using farmland & green space.

I’m excited to see how this works going forward.  Will my friends in the small-scale solar installation business stay busy?  Will SolarCity come to town?  Will companies like Geronimo make up the bulk of this with giant installations?  Will it reduce the need for new gas peaker plants?  Time will tell, but it’s an exciting time for solar in Minnesota, for sure.

June 08, 2013 10:23 PM

Dave Jones: Daily log June 7th 2013

My 3.10-rc4 outstanding issues.

Daily log June 7th 2013 is a post from: codemonkey.org.uk

June 08, 2013 04:07 AM

June 07, 2013

Dave Jones: Weekly Fedora kernel bug statistics – June 07 2013

                            17    18    19  rawhide  (total)
Open:                      250   417   126       66    (859)
Opened since 2013-05-31      0    26    17        1     (44)
Closed since 2013-05-31     16    11     5        2     (34)
Changed since 2013-05-31    41    41    26       11    (119)

Weekly Fedora kernel bug statistics – June 07 2013 is a post from: codemonkey.org.uk

June 07, 2013 05:54 PM

Dave Jones: Daily log June 6th 2013

Daily log June 6th 2013 is a post from: codemonkey.org.uk

June 07, 2013 05:33 AM

June 06, 2013

Dave Jones: daily log June 5th 2013

daily log June 5th 2013 is a post from: codemonkey.org.uk

June 06, 2013 02:36 PM

June 05, 2013

Dave Jones: daily log June 4th 2013

daily log June 4th 2013 is a post from: codemonkey.org.uk

June 05, 2013 03:40 AM

Harald Welte: Attending HITCON and COSCUP in Taipei

It is my pleasure to attend the HITCON 2013 and COSCUP 2013 conferences in July/August this year. They are both in Taipei. HITCON is a hacker/security event, while COSCUP is a pure Free/Open Source Software conference.

At both events I will be speaking about the growing list of GSM related tools that are available these days, like OpenBSC, OsmocomBB, SIMtrace, OsmoSGSN, OsmoBTS, OsmoSDR, etc. As they are both FOSS projects and useful in a security context, this fits well within the scope of both events.

Given that I'm going to be back in Taiwan, I'm looking very much forward to meeting old friends and former colleagues from my Openmoko days in Taipei. God, do I miss those days. While terribly stressful, they still are the most exciting days of my career so far.

And yes, I'm also going to use the opportunity for a continuation of my motorbike riding in this beautiful country.

June 05, 2013 02:00 AM

June 04, 2013

Dave Jones: daily log June 3rd 2013

daily log June 3rd 2013 is a post from: codemonkey.org.uk

June 04, 2013 03:58 AM

June 03, 2013

Andy Grover: Fedora for short-lifespan server instances

I read Máirín Duffy’s coverage of the Fedora Board’s userbase discussion. Really interesting. I wanted to add my take.

tl;dr: Puppet/Chef make Fedora’s short support period much less of an issue.

The OS is a building block

I’ve been watching a lot of videos on DevOps lately. Several close friends of mine are sysadmins and I’ve been learning a lot from them about the transformation that their profession is undergoing. From this year’s ChefConf, Adam Jacob’s keynote and the talk by Sascha Bates really impressed on me the big change in how admins should view machines — They’re not permanent, or even semi-permanent. They are ephemeral snowflakes that may live a year, or just an hour, so don’t get too attached.

Part of why admins like VMs is the isolation they provide between different services. I used to run mail, DNS, and httpd from a single machine. Everything was mostly separate but not quite everything. They had separate userids but everyone’s config was in /etc, even touching the same files, sometimes. A full disk affected everybody. /var/log/messages didn’t split up their logging cleanly (by default, anyway.) It all just was built assuming there would be an admin at a command shell who could use their brain to resolve the conflicts to make everything play nice on a single OS image.

One service per instance

Admins adopted VMs for isolation, increased density, and better per-service resource allocation, but then ran into other problems. The setup that they did by hand once per new-hardware now was once per instance. (Editing /etc/sudoers for the fiftieth time gets old.) The tools then evolved further, until today one may keep no persistent state in an instance. Now, an instance is kickstarted into existence and configured automatically for the one job it will ever do. The sysadmin’s job isn’t to herd boxen any more, it’s to build ‘em, run ‘em, and then reap ‘em.

All the OS mechanisms for co-existing server processes, they’re now either obsolete or vestigial to some degree. What is important is the malleability of the OS to assume all the tasks it may be asked to – rather like a stem cell needs to be able to become a nerve or muscle, but never needs to be both.

Never upgrade, just redeploy

Let’s come back to the odd fact that Fedora is both a precursor to RHEL, and yet almost never used in production as a server OS. I think this is going to change. In a world where instances are deployed constantly, instances are born and die but the herd lives on. Once everyone has their infrastructure encoded into a configuration management system, Fedora’s short release cycle becomes much less of a burden. If I have service foo deployed on a Fedora X instance, I will never be upgrading that instance. Instead I’ll be provisioning a new Fedora X+1 instance to run the foo service, start it, and throw the old instance in the proverbial bitbucket once the new one works.

Cheap and easy virt and config management gives admins what they’ve always wanted — stability when they want it (run a LT support distro image, or for the VM host) or the latest stuff for their fast-moving business-oriented instances, by running a fast-update or rolling-release distro.

What Fedora should do

We’re already working on some of these to some degree — I think we should try to do even more to ensure Fedora is useful for the fast-update instance role.

First, Fedora needs to be able to be small. Nobody’s going to read the manpages on a throwaway instance, nobody’s even going to run vi. Image size matters when multiplied for each instance. Can we get by without /usr/share/doc/* and its thousands of copies of the GPL text? Fedora seems pretty good but there must be more we can do.

Second, we need to ensure Fedora supports the packages people are really using these days. Latest Ruby. Latest OpenStack. Vagrant. Django. Chef. Puppet. All the weird JS stuff that’s popular now on GitHub. :-) Continue to improve packaging tools so it’s easier for new contributors to do their first package, as well as for long-time packagers to maintain more packages. And not just package for contribution to Fedora, but for admins to package for solely internal distribution. Like Sascha Bates stresses in her talk, packaging is a huge benefit to automation, but it does require effort. It can be easier.

Finally, I think we need to continue to look at how easy it is to configure and manage an instance of the OS, and tailor it more for automated configuration. I believe the key to this is adding programmatic interfaces where they are lacking. See my “All Plumbing needs an API” talk. Since we’re probably being configured by another piece of code rather than a person at the shell, we need clear, unambiguous programmatic interfaces with good error handling. Chef should not be calling cmdline tools and checking error codes, there should be a Ruby configuration library that natively controls the whatever-it-is directly! We want configuring Fedora to be fast, straightforward, and reliable.

Conclusion: Stable+fast-update is better than stable+self-built

Practically the whole history of Linux distros has been the conflict between stability and new features. With virtualization, one still must make this choice, but at a much finer granularity than before. If you’re going to re-instance within 6 months anyways, why manually build your latest-Ruby and whatnot to support your app on top of a stable distro image? Maybe just use Fedora for those.

June 03, 2013 04:00 PM

Matthew Garrett: Dealing with UEFI non-volatile memory quirks

Since I wrote this, we've made some worthwhile progress on avoiding damaging Samsung hardware. The first is that the samsung-laptop driver appeared to be causing the firmware to attempt to write to an area of memory that was marked in the chipset, triggering a Machine Check Exception. That was what generated the pstore output that caused the problem originally. The driver now refuses to load if EFI is enabled, which avoids the problem. It's not ideal, since it's currently the only mechanism we have for certain functionality on Samsung laptops, but there you go.

The second problem was that avoiding crashing on boot didn't actually fix the problem in any fundamental way. Even with pstore disabled, it was possible for userspace to fill the nvram and trigger the same problem. Our first approach to this was to prevent any writes to nvram if the UEFI QueryVariableInfo() call reported that more than 50% of the nvram storage space would be used. That was safe, but led to another issue. The nvram storage area is typically implemented as part of the same flash chip as the firmware. Flash isn't arbitrarily accessible - changing the contents of a block typically involves rewriting the entire block. It's impractical to rewrite the entire nvram area on every write, so what actually happens is that deleting variables just results in them being marked as inactive but doesn't actually free up the space. The firmware can later perform some sort of garbage collection to free it up.

This caused us problems, since inactive space that hasn't been garbage collected yet isn't actually available, and as a result firmware implementations tend to count it as used. Say you had 64KB of nvram and wrote 32KB of variables. We'd then refuse to write any more because you'd drop below 50%. So you delete 16KB of the variables you've created and try again. Unfortunately, the firmware still thinks that there's 32KB in use and Linux would still refuse.
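The 64KB example can be played out in a toy model. This is purely an illustration of the accounting problem, not kernel or firmware code: the `ToyNvram` class and `may_write` helper are invented here, and real firmware behaviour varies.

```python
KB = 1024


class ToyNvram:
    """Toy model: deleted variables stay counted as used until collected."""

    def __init__(self, total):
        self.total = total
        self.active = {}     # name -> size of live variables
        self.inactive = 0    # bytes held by deleted, uncollected variables

    def reported_used(self):
        # What the firmware reports: live plus not-yet-collected space.
        return sum(self.active.values()) + self.inactive

    def delete(self, name):
        # Deletion only marks the space inactive; nothing is freed.
        self.inactive += self.active.pop(name)


def may_write(nvram, size, limit=0.5):
    """The first policy: refuse a write that would push usage past 50%."""
    return nvram.reported_used() + size <= nvram.total * limit


nvram = ToyNvram(64 * KB)
nvram.active = {"a": 16 * KB, "b": 16 * KB}   # 32KB of variables written

print(may_write(nvram, 1 * KB))       # False: already at the 50% mark
nvram.delete("b")                     # 16KB deleted...
print(nvram.reported_used() // KB)    # ...but 32KB is still reported as used
print(may_write(nvram, 1 * KB))       # False again
```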

If you were lucky, rebooting would trigger a garbage collection run. If you weren't, it wouldn't. Problematic. Our next approach was to try to account for the space actually actively used by the variables, rather than relying on what the firmware told us via QueryVariableInfo(). This seems simple enough - just add up the size of all the variables and subtract that from the overall size to determine how much of the "used" space is actually just old inactive variables that can be ignored. However, there are still some problems there. The first is that each variable has some additional overhead associated with it, and the size of that overhead varies depending on the system vendor. We had to make a conservative guess, which could cause problems if systems had large numbers of small variables. The second is that the only variables the kernel can see are those that are flagged as runtime-visible. There may also be a significant quantity of nvram used to store variables that are only visible in boot services code. We could work around this by adding up sizes while we're still in boot services code, but on some systems calling QueryVariableInfo() before ExitBootServices() results in later calls to GetNextVariableName() jumping to invalid addresses and crashing the kernel. Not a great approach.
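To see why the overhead guess matters, here is a small sketch of the "add up the variables" estimate. The function names and the overhead values are hypothetical; the real per-variable overhead is vendor-specific and not known to the kernel, and boot-services-only variables are not counted at all.

```python
def estimate_active_bytes(variable_sizes, overhead_guess):
    """Sum the visible variable sizes plus a guessed overhead for each."""
    return sum(size + overhead_guess for size in variable_sizes)


def estimate_available(total_bytes, variable_sizes, overhead_guess):
    """Estimated free space if inactive (deleted) variables are ignored.

    Only runtime-visible variables can be passed in, so anything stored
    for boot services alone is missing from the estimate.
    """
    return total_bytes - estimate_active_bytes(variable_sizes, overhead_guess)


TOTAL = 64 * 1024
many_small = [8] * 200     # hypothetical: 200 variables of 8 bytes each

# The same data looks very different depending on the overhead guess.
print(estimate_available(TOTAL, many_small, overhead_guess=0))
print(estimate_available(TOTAL, many_small, overhead_guess=64))
```

With many small variables the guessed overhead dominates the payload, so a wrong guess moves the estimate by far more than the data itself.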

Meanwhile, Samsung got back to us and let us know that their systems didn't require more than 5KB of nvram space to be available, which meant we could get rid of the 50% value and replace it with 5KB. The hope was that any system that booted with only 5KB of space available in nvram would trigger a garbage collection run. Unfortunately, it turned out that that wasn't true - some systems will only trigger garbage collection if the OS actually makes an attempt to write a variable that won't otherwise fit.

Hence this patch. The new approach is to ask the firmware how much space is available. If the size of the new variable would reduce this to less than 5K, we attempt to create a variable bigger than the remaining space. This should cause the firmware to realise that it's out of room and either (depending on implementation) perform a garbage collection run at runtime or set a flag that will cause the system to perform garbage collection on the next reboot. We then call QueryVariableInfo() again to see whether a garbage collection run actually happened, and if so check whether we now have enough space. If so, we go ahead and write the variable. If not, we tell userspace that there's not enough space.
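The control flow described above can be sketched against a toy firmware model. This is not the patch itself: `ToyFirmware`, `write_variable` and the `collects_at_runtime` switch are all invented for illustration, and `remaining()` merely stands in for the figure that QueryVariableInfo() reports.

```python
KB = 1024
MIN_FREE = 5 * KB    # the 5KB reserve mentioned above


class ToyFirmware:
    """Toy firmware: stale (deleted) space counts as used until collected."""

    def __init__(self, total, live, stale, collects_at_runtime):
        self.total = total
        self.live = live
        self.stale = stale
        self.collects_at_runtime = collects_at_runtime

    def remaining(self):
        # Stand-in for the remaining-space value from QueryVariableInfo().
        return self.total - self.live - self.stale

    def set_variable(self, size):
        # An oversized write fails, but may prompt garbage collection.
        if size > self.remaining():
            if self.collects_at_runtime:
                self.stale = 0
            return False
        self.live += size
        return True


def write_variable(fw, size):
    if fw.remaining() - size < MIN_FREE:
        # Ask for more than can fit, to nudge the firmware into collecting.
        fw.set_variable(fw.remaining() + 1)
        # Query again: did a garbage collection run actually happen?
        if fw.remaining() - size < MIN_FREE:
            return False    # tell userspace there is not enough space
    return fw.set_variable(size)


collecting = ToyFirmware(64 * KB, 16 * KB, 44 * KB, collects_at_runtime=True)
deferred = ToyFirmware(64 * KB, 16 * KB, 44 * KB, collects_at_runtime=False)
print(write_variable(collecting, 1 * KB))   # True: space was reclaimed
print(write_variable(deferred, 1 * KB))     # False: collection waits for reboot
```

In the toy model the oversized write always fails, so there is nothing to clean up afterwards; firmware that only sets a "collect on next reboot" flag is modelled by the `deferred` case.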

This seems to work in all the situations I've tested, and it should avoid ending up in a situation where a Samsung can end up bricked. However, it's firmware, so who knows whether it's going to break things for someone else.

June 03, 2013 03:25 PM

Harald Welte: Rest In Peace, Atul Chitnis

Today, very sad news has reached me: Atul Chitnis has passed away. Most people outside of India will most likely not recognize the name: He was instrumental in pioneering the BBS community in India, and was the founder and leader of the Linux Bangalore and later FOSS.in conferences, held annually in Bangalore.

I myself first met Atul about ten years ago, and had the honor of being invited to speak at many of the conferences he was involved in. Besides that professional connection, we became friends. The warmth and affection with which I was accepted by him and his family during my many trips to Bangalore is without comparison. I was treated and accepted like a family member, despite just being this random free software hacker from Germany who is always way too busy to return the amount of kindness.

Despite the 17 year age difference, there was a connection between the two of us. Not just the mutual respect for each other's work, but something else. It might have been partially due to his German roots. It might have been the similarities in our journey through technology. We both started out in the BBS community with analog modems, we both started to write DOS software in the past, before turning to Linux. We both became heavily involved in mobile technology around the same time: He during his work at Geodesic, I working for Openmoko. Only in recent years his indulgence in Apple products was slightly irritating ;)

Only five weeks ago I had visited Atul. Given the state of his health, it was clear that this might very well be the last time that we meet each other. I'm sad that this now actually turned out to become the truth. It would have been great to meet again at the end of the year (the typical FOSS.in schedule).

My heartfelt condolences to his family. Particularly to his wonderful wife Shubha, his daughter Anjali, his mother and brother. [who I'm only not calling by their name in this post as they deserve some privacy and their identities are not listed on Atul's Wikipedia page].

Atul was 51 years old. Way too young to die. Yet, he has managed to create a legacy that will extend long beyond his life. He profoundly influenced generations of technology enthusiasts in India and beyond.

June 03, 2013 02:00 AM

June 01, 2013

Pavel Machek: Scary copy&paste issues

This is scary. I wonder if it is worth filing security bugs against web browsers?

June 01, 2013 01:12 PM

May 31, 2013

Dave Jones: daily log May 31st 2013

Amusing URL of the day.

My 3.10-rc3 outstanding issues.

daily log May 31st 2013 is a post from: codemonkey.org.uk

May 31, 2013 10:11 PM

Dave Jones: Monthly Fedora kernel bug statistics – May 2013

                           17   18   19  rawhide  (total)
Open:                     267  404  110       65    (846)
Opened since 2013-05-01    21  103   43       14    (181)
Closed since 2013-05-01    28   62   53       25    (168)
Changed since 2013-05-01   51  155   54       22    (282)

Quite a few CVE’s this month. Made some progress through the btrfs backlog.
The 3.9 rebase revealed some new “Can’t boot” cases, which looks like yet another UEFI bios bug.

Monthly Fedora kernel bug statistics – May 2013 is a post from: codemonkey.org.uk

May 31, 2013 05:39 PM

Dave Jones: Weekly Fedora kernel bug statistics – May 31st 2013

                           17   18   19  rawhide  (total)
Open:                     267  404  110       65    (846)
Opened since 2013-05-24     7   36   14        1     (58)
Closed since 2013-05-24    14   11    8        4     (37)
Changed since 2013-05-24   16   63   24        4    (107)

Weekly Fedora kernel bug statistics – May 31st 2013 is a post from: codemonkey.org.uk

May 31, 2013 05:32 PM

Dave Jones: daily log May 30th 2013

daily log May 30th 2013 is a post from: codemonkey.org.uk

May 31, 2013 04:43 AM

May 30, 2013

Dave Jones: daily log May 29th 2013

Up early. Went into office. On arrival, found that laptop wouldn’t boot (now I remember why this was the ‘backup’ laptop).
Spent an hour trying to coax it to boot before giving up and attempting a reinstall. Which failed.
Did some bugzilla triage on an ipad that I happened to have with me. (Surprisingly not a horrible way to do this, first time for everything). Noticed we seem to be getting more filesystem/vfs related bugs than we used to. Going to need to scrounge up some more disks/controllers soon I think.

Started feeling ‘not right’ around lunchtime. Skipped lunch. Headaches and nausea mid afternoon.
Not the most productive of days.

daily log May 29th 2013 is a post from: codemonkey.org.uk

May 30, 2013 12:59 AM

May 29, 2013

Dave Jones: daily log May 28th 2013

daily log May 28th 2013 is a post from: codemonkey.org.uk

May 29, 2013 03:28 AM

May 28, 2013

Matthew Garrett: Secure Boot isn't the only problem facing Linux on Windows 8 hardware

There's now no shortage of Linux distributions that support Secure Boot out of the box, so that's a mostly solved problem. But even if your distribution supports it entirely you still need to boot your install media in the first place.

Hardware initialisation is a slightly odd thing. There's no specification that describes the state ancillary hardware has to be in after firmware→OS handover, so the OS effectively has to reinitialise it again. This means that certain bits of hardware end up being initialised twice, and that's slow in some cases. The most obvious is probably USB, which has various timeouts as you wait for hardware to settle. Full USB support in the firmware probably adds a couple of seconds to boot time, and it's arguably wasted because the OS then has to do the same thing (but, thankfully, can at least do other things at the same time). So, looking for USB boot media takes time, and since the overwhelmingly common case is that users don't want to boot off USB, it's time that's almost always wasted.

One of the requirements for Windows 8 certified hardware is that it must complete firmware initialisation within a specific amount of time, something that Microsoft refer to as "Fast Boot". Meeting these requirements effectively makes it impossible to initialise USB, and it's likely that certain other things will also be skipped. If you've got a USB keyboard then this obviously means that your keyboard won't work until the OS starts, but even i8042 setup takes time and so some laptops with traditional PS/2-style keyboards may not set it up. That means the system will ignore the keyboard no matter how much you hammer it at boot, and the firmware will boot whichever OS it finds.

For a newly purchased device, that's going to be Windows 8. It's not too much of a problem with a fully installed Windows 8, since you can hold down shift while clicking the reboot icon and get a menu that lets you reboot into the firmware menu. Windows sets a flag in a UEFI variable and reboots the system, the firmware sees that flag and does full hardware initialisation and then drops you into the setup environment. It takes slightly longer to get into the firmware, but that's countered by the time you save every time you don't want to get into the firmware on boot.

So what's the problem? Well, the Windows 8 setup environment doesn't offer that reboot icon. Turn on a brand new Windows 8 system and you have two choices - agree to the Windows 8 license, or power the machine off. The only way to get into the firmware menu is to either agree to the Windows 8 license or to disassemble the machine enough that you can unplug the hard drive[1] and force the system to fall back to offering the boot menu.

I understand the commercial considerations that result in it ranging from being difficult to impossible to buy new hardware without Windows pre-installed, but up until now it was still straightforward to install an alternative OS without agreeing to the Windows license. Now, installing alternative operating systems on many new systems will require you to give up certain rights even if you want nothing other than to reach the system firmware menu.

I'm firmly of the opinion that there are benefits to Secure Boot. I'm also in favour of setups like Fast Boot. But I don't believe that anyone should be forced to agree to a EULA purely in order to be able to boot their own choice of OS on a system that they've already purchased.

[1] Which is a significant and probably warranty-voiding exercise on many systems, and that's assuming that it's not an SSD soldered to the motherboard…


May 28, 2013 09:41 PM

May 26, 2013

Daniel Vetter: i915/GEM Crashcourse: Overview

Now that the entire series is done I've figured a small overview would be in order.

Part 1 talks about the different address spaces that a i915 GEM buffer object can reside in and where and how the respective page tables are set up. Then it also covers different buffer layouts as far as they're a concern for the kernel, namely how tiling, swizzling and fencing works.

Part 2 covers all the different bits and pieces required to submit work to the gpu and keep track of the gpu's progress: Command submission, relocation handling, command retiring and synchronization are the topics.

Part 3 looks at some of the details of the memory management implemented in the i915.ko driver. Specifically we look at how we handle running out of GTT space and what happens when we're generally short on memory.

Finally part 4 discusses coherency and caches and how to most efficiently transfer between the gpu coherency domains and the cpu coherency domain under different circumstances.

Happy reading!
Update: There's now also a new article with a few questions and answers about some details in the i915 gem code.

May 26, 2013 02:42 PM

Daniel Vetter: i915/GEM Q&A

So apparently people do indeed read my i915/GEM crashcourse and a bunch of follow-up questions popped up in private mails. Since I'm a lazy bastard I've cleaned up some of the common questions&answers to be able to easily point at them. And hopefully they also help someone else to clarify things a bit.

Question: What’s the significance of i915_gem_sw_finish_ioctl ? It seems to flush cpu caches, but only conditional on obj->pin_count != 0. Why does it not unconditionally flush the cpu caches like e.g. when we move an unsnooped/not LLC-cached object into a gpu domain?

Answer: i915_gem_sw_finish_ioctl is only used to flush out cpu rendering to the display (and in current userspace it's not used at all). obj->pin_count != 0 is used as a proxy for "this is a scanout buffer". Obviously more intelligent userspace should know whether it is doing cpu rendering to a displayed buffer or not and force the expensive clflushing with e.g. the set_domain ioctl only when really required. But the sw_finish ioctl is called from the libdrm cpu mmap unmap function, which does not have this knowledge at hand, hence the check in the kernel. Furthermore for efficient integration of cpu rendering into the gpu render pipeline we want to use snoopable objects even on non-LLC platforms which means that this ioctl shouldn't really be used any more for new code.

Question: So the cpu can only access a GEM object through the GTT when it's in the mappable part of the GTT, i.e. when gtt_offset + size <= gtt_mappable_end. But the i915_gem_object_set_to_gtt_domain function does not check whether this condition is satisfied and simply goes ahead with the domain change. Why is that done so, even though the cpu won't be able to access the buffer object at its current place?

Answer: The GTT domain is purely about coherency, i.e. a buffer object is in the GTT domain if reads/writes through the GTT would see the correct values. The other big domain is the cpu domain, i.e. the data (when accessed directly in the physical memory location, not going through the GTT) is coherent with cpu caches. Shifting between these two domains requires flushing/invalidating cpu caches.

Note that on recent kernels that doesn't even mean that there's a global GTT mapping allocated for that buffer object: This is used to optimize away the redundant cache flushing when moving an object around, e.g. when moving it into the mappable range to serve a cpu access page fault. In the future this will be even more common once we have proper per-process GTT address spaces. Then an object could be fully coherent with the GTT domain, read by the gpu through a PPGTT mapping, but not have an offset allocated for it in the global GTT at all.

The mappable GTT address range on the other hand is a different concept and simply means the object has a GTT mapping visible to the cpu (on gpus without PPGTT the global GTT can be up to 2g, but only 256m are usually visible in the pci bar). Note that a GEM object can be mappable but (at the same time) be in the cpu domain. This happens when userspace writes to the buffer object through the cpu mappings.

Question: How does the i915_gem_fault function handle a page fault when it itself is invoked through a page fault in the i915 GEM kernel code? Suppose, for example, that the fault_in_pages_readable function is called, which dereferences a user pointer - won't that cause issues with deadlocks?

Answer: Yes, this can happen and we need to be careful that we cannot possibly deadlock with our own pagefault handlers. And it's not just theoretical, it happens in the wild when a GL client tries to use a pointer obtained from one of the texture mapping functions (which can use a GTT memory mapping internally) to upload data (which could use the pwrite GEM ioctl).

These potential deadlocks are resolved by instructing the linux memory subsystem to not serve pagefaults when accessing userspace memory but instead fail it. Then our code can release any resources and locks required by our own page fault handler and retry the operation in a slowpath. Often this requires that we copy the data into a (unfaultable) temporary buffer in the kernel's memory space. These atomic sections are often implicit, but we have a few places where we need to explicitly disable the page fault handler with pagefault_disable/enable() calls.

Question: Is obj->fenced_gpu_access ever set on modern platforms - it seems not? Or could this cause a stall waiting for the gpu when all fences are in use and we need a fence to handle a GTT page fault?

Answer: No, this is only ever set on Gen2/3 devices. Those gpus use the same GTT fences used on all platforms for detiling cpu access also for gpu access, at least for some gpu rendering functions. So this is irrelevant on modern platforms and can't lead to a stall in the pagefault handler when accessing an otherwise idle buffer object.

Question: What is this wedged stuff - there's lots of references to it in the i915 GEM code?

Answer: This is part of the gpu hang detection and reset handling code, which I didn't really cover in my crashcourse. It is set when we've detected a hang but failed to reset the gpu. It will cause all subsequent command submission from userspace to fail with -EIO, which is used by userspace as a signal to fall back to software rendering. The i915 hang detection and reset code has been (and still is) under pretty active development and is nowadays a rather complex piece of code. I plan to cover it more in-depth hopefully soon.

Question: In the use_cpu_reloc function, why is the obj->cache_level != I915_CACHE_NONE condition used?

Answer: That's just crazy optimization - it's always faster to write relocations through cpu maps if LLC caching is enabled. But without caching it's faster to use global GTT access - but then only if we have the mappable mapping already set up. Note that pwrite ioctl code has similar tricks.

May 26, 2013 02:40 PM

May 25, 2013

Eric Sandeen: Enphase Solar Microinverter Clipping Analysis

I wanted to look at how much the “clipping” behavior of power-limited solar microinverters affected my annual energy production.  The TL;DR version is: at worst, only about 0.6% loss due to clipping.  For more, read on.

A photovoltaic inverter is a device which converts the DC energy from the panel into AC energy for the grid; it also manages optimum power point tracking.  Traditionally this was done with a big central inverter for all panels combined; recently companies such as Enphase Energy have started making microinverters, which are per-panel devices.  One advantage of these devices is that each panel operates independently so that if one panel is shaded, damaged, or dirty, it doesn’t affect the rest of the array.

I have 11 230W Solar PV panels on my roof with an Enphase M190 microinverter on each.  These are nominally 190W devices, though in practice they have a maximum output of 199W.  (Note, these are 3 year old models; Enphase now has microinverters with more capacity).  The fact that the panels can produce more than the microinverter can handle might seem like an issue; indeed on a cool, clear day we can see the effect:

[Figure: clipping]

So from the graph above it’s clear that I am losing a little energy production during that clipping.  What would normally be a smooth curve is flattened out at the top as I hit the 11x199W = 2189W limit.  (A few factors affect whether this clipping happens; obviously we need a clear day, but optimal sun angle and, perhaps more than anything, panel temperature affect it greatly).

I wanted to try to quantify this a bit – how much am I really losing from this behavior?

One cool thing about the Enphase units is that they report data every 5 minutes, and this data can be queried via a standard data API.  So I pulled down the past 365 days worth of data to see how often I was clipping.  I grabbed 5-minute data files for each day, and looked for when clipping seems to start, by looking at watt output around the clipping point, and how many 5-minute entries there were for each wattage:

$ egrep -w "21[789][0-9]" *.json | awk -F : '{print $8}' | sort | uniq -c
   9 2170,"enwh"
  10 2171,"enwh"
  15 2172,"enwh"
  11 2173,"enwh"
  17 2174,"enwh"
  17 2175,"enwh"
  11 2176,"enwh"
  20 2177,"enwh"
  24 2178,"enwh"
  14 2179,"enwh"
  17 2180,"enwh"
  21 2181,"enwh"
  32 2182,"enwh"
  51 2183,"enwh"
 107 2184,"enwh" <-- actual clipping start?
 166 2185,"enwh"
 134 2186,"enwh"
 119 2187,"enwh"
  97 2188,"enwh"
  62 2189,"enwh" <-- nominal clipping, 11x199
  21 2190,"enwh"
   2 2191,"enwh"

There’s a pretty big jump at 2184W, so I went with that as a definition of “when clipping starts” vs. the nominal 2189W.  Adding up the occurrences of clipping, I got 708 5-minute intervals of clipping out of the last 365 days.  That’s about 59 hours.

So how much energy is that?  My panels can nominally make 11x230W = 2530W of output, so 2530-2184 = 346W lost, at most, during clipping.  That’s actually too high; not every instance above is clipping, and not every interval would have been making the maximum output.  So we’ll take that as a high estimate.

346W x 59 hours is 20,414 watt-hours, or about 20kWh.  At around $0.10/kWh, that’s about $2.00 of lost value.  Over the same year period, my array made 3,356 kWh, so 20kWh lost is about 0.6% of that.  I would hope that the microinverters made up at least that much by virtue of keeping the array going over the winter when some panels were snow-covered, etc.  Remember, many of my assumptions above make this a high estimate.
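
The arithmetic above can be re-run as a small Python sketch. The interval counts are copied from the uniq -c output earlier in the post (2184W and up); the variable names are only illustrative, and the result is the same deliberately high estimate, since it assumes every clipped interval would otherwise have produced the full 2530W.

```python
# Number of 5-minute intervals seen at each wattage, copied from the
# uniq -c output in the post, starting at the chosen clipping point.
counts = {
    2184: 107, 2185: 166, 2186: 134, 2187: 119,
    2188: 97, 2189: 62, 2190: 21, 2191: 2,
}

CLIP_START_W = 2184       # "when clipping starts", as chosen in the post
PANEL_PEAK_W = 11 * 230   # nominal panel output: 2530W
ANNUAL_KWH = 3356         # array production over the same year
PRICE_PER_KWH = 0.10      # approximate price used in the post

# Total clipped intervals and the time they cover.
intervals = sum(n for watts, n in counts.items() if watts >= CLIP_START_W)
hours = intervals * 5 / 60

# Worst case: every clipped interval loses the full gap to nominal output.
max_loss_w = PANEL_PEAK_W - CLIP_START_W
lost_kwh = max_loss_w * hours / 1000
lost_pct = 100 * lost_kwh / ANNUAL_KWH
lost_value = lost_kwh * PRICE_PER_KWH

print(f"{intervals} intervals, {hours:.0f} hours of clipping")
print(f"at most {lost_kwh:.1f} kWh lost ({lost_pct:.1f}% of annual output)")
print(f"worth about ${lost_value:.2f}")
```

Changing CLIP_START_W to the nominal 2189 shows how sensitive the estimate is to where clipping is assumed to begin.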

One other interesting datapoint is to see when this clipping occurred.  By month, here’s how it looks:

[Figure: clipping_by_month]

March was far and away the highest; this is probably due in large part to the cooler temperatures, which make the panels more efficient.

May 25, 2013 09:08 PM

May 24, 2013

Dave Jones: daily log May 24th 2013

and now: three day weekend. \o/

daily log May 24th 2013 is a post from: codemonkey.org.uk

May 24, 2013 11:20 PM

Andy Grover: Using qla2xxx with LIO on Fedora

In addition to turning your Fedora 18 box into an iSCSI target, LIO also supports other SCSI transport layers (‘fabrics’), such as Fibre Channel, with the qla2xxx fabric.

The most crucial bit is to verify that the qla2xxx driver has initiator mode disabled — it should be operating in target mode only. You can check this with:

cat /sys/module/qla2xxx/parameters/qlini_mode

It should say ‘disabled’. If it doesn’t, create a file called /usr/lib/modprobe.d/qla2xxx.conf and put:

options qla2xxx qlini_mode=disabled

in it. Then, run ‘dracut -f’ to rebuild your initrd, and reboot.

Some of you may be wondering: why /usr/lib/modprobe.d instead of /etc/modprobe.d ? This is because qla2xxx is likely loaded from the kernel’s initial ramdisk (initrd), and dracut, the initrd building tool, omits “host-specific” settings in /etc/modprobe.d. While you’re mucking around, also make sure the firmware package for your qla device, such as ql2200-firmware or similar, is installed.
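
The check described above could be scripted roughly as follows. This is a minimal sketch, not part of LIO or targetcli: the function name is made up for illustration, and the only thing taken from the post is the sysfs path and the expected value ‘disabled’.

```python
from pathlib import Path

# sysfs path from the post; only present once the qla2xxx module is loaded.
QLINI_MODE_PATH = Path("/sys/module/qla2xxx/parameters/qlini_mode")


def initiator_mode_disabled(path=QLINI_MODE_PATH):
    """Return True if qlini_mode reads 'disabled', False if it reads
    anything else, and None if the parameter file cannot be read
    (for example because the module is not loaded)."""
    try:
        value = Path(path).read_text().strip()
    except OSError:
        return None
    return value == "disabled"


state = initiator_mode_disabled()
if state is None:
    print("qla2xxx does not appear to be loaded")
elif state:
    print("initiator mode is disabled; target mode only")
else:
    print("initiator mode is still enabled; add the modprobe.d option, "
          "run dracut -f and reboot")
```

The path is a parameter so the same function can be pointed at a test file instead of the live sysfs entry.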

targetcli won’t let you create a qla2xxx fabric if qlini_mode is wrong. Once it lets you create the qla fabric, you can add luns to it and grant access permissions to acls exactly in the same manner as the other LIO fabrics.

May 24, 2013 04:29 PM

Dave Jones: Weekly Fedora kernel bug statistics – May 24th 2013

                           17   18   19  rawhide  (total)
Open:                     276  380  104       66    (826)
Opened since 2013-05-17     4   32    6        3     (45)
Closed since 2013-05-17     3   11    4        3     (21)
Changed since 2013-05-17   16   48   14        5     (83)

Weekly Fedora kernel bug statistics – May 24th 2013 is a post from: codemonkey.org.uk

May 24, 2013 04:10 PM

Dave Jones: daily log May 23rd 2013.

My 3.10-rc2 outstanding issues:

Puzzling website of the day: pain registers.

daily log May 23rd 2013. is a post from: codemonkey.org.uk

May 24, 2013 04:25 AM

May 23, 2013

Dave Jones: daily log May 22nd 2013.

Going to try and continue yesterday’s daily log format for a while.

Spent so much of the day bisecting/building/rebooting that I didn’t write much new code today. Ho-hum.

daily log May 22nd 2013. is a post from: codemonkey.org.uk

May 23, 2013 03:51 AM

May 22, 2013

Michael Kerrisk (manpages): More man pages now rendered at man7.org

As detailed in this blog post, I've expanded the set of man pages rendered in HTML at http://man7.org/linux/man-pages/ to include pages in addition to those provided by the man-pages project. This change has several purposes. One main purpose is to provide up-to-date and regularly updated HTML renderings of these man pages. (Most online man page renderings are out-of-date to some extent--in some cases, extremely out of date.) The other main purpose is to provide information on where to report bugs in each man page. To this end, each HTML rendering includes a COLOPHON that describes the origin of the page, notes the date when it was extracted, and provides information on where to report bugs in the page. (The man-pages project has already done this since December 2007, with the result that many more man page bugs are nowadays reported.)

Currently, man pages from nearly 40 projects are rendered, raising the number of pages rendered at man7.org from around 950 to around 1750. The projects that I have so far included have a bias that matches my interests: man-pages, projects related to low-level C and system programming (e.g., the ACL and extended attribute libraries), toolchain projects (e.g., gcc, gdb, Git, coreutils, binutils, util-linux), and other relevant tools (kmod, strace, ltrace, procps, expect) and tools relevant to manual pages (e.g., groff, man-db). The full list of projects and the corresponding man pages that are rendered can be found in the man pages by project index. I'm open to adding further projects to the rendered set, if they seem relevant. If you think there is a project that should be added, take a look at this blog post.

May 22, 2013 01:18 PM

Dave Jones: a day in the life..

Got back from vacation today (since last Thursday). Here’s how I spent the day.

a day in the life.. is a post from: codemonkey.org.uk

May 22, 2013 03:36 AM

May 17, 2013

Greg Kroah-Hartman: Updated history of the 2.6.16-stable kernel

A few years ago, I gave a history of the 2.6.32 stable kernel, and mentioned the previous stable kernels as well. I'd like to apologize for not acknowledging the work of Adrian Bunk in maintaining the 2.6.16 stable kernel for 2 years after I gave up on it, allowing it to be used by many people for a very long time.

I've updated the previous post with this information in it at the bottom, for the archives. Again, many apologies, I never meant to ignore the work of this developer.

May 17, 2013 04:34 PM

May 16, 2013

Pete Zaitcev: Joe Arnold on Software-defined Storage

At Havana summit they were giving away a paper version of Joe Arnold's "Software Defined Storage with OpenStack Swift". Very useful book for anyone dealing with Swift, I would be glad to pay the cover price of $25. But even more interestingly than tips on care and feeding of Swift, Joe opens the whole book thus:

[...] a de-coupled management system so customers could achieve (1) amazing flexibility in terms of how (and where) they deployed their storage, (2) control of their data without being locked-in to a vendor and (3) private storage at public cloud prices.

These features are the essence of Software Defined Storage (SDS), a new term the meaning of which is being defined. [...] Key aspects of SDS are scalability, adaptability, and the ability to use most any hardware. Through this de-coupling, operators can now make choices on how their storage is scaled and managed and how users can store and access data — all driven programmatically for the entire storage tier, regardless of where the storage resources are deployed.

Parts of the above prompt questions. Firstly, what good is de-coupling in respect to lock-in? SwiftStack effectively locks in by owning the de-coupled management. Sure, you own your data and could, in theory, manage your Swift with another management plane... I do not expect anyone crazy enough to try switching by anything less than standing up a new cluster. In any case, that part is not important, IMHO. The important part is programmatic control.

The phrase "SDS" jumps off "Software-Defined Networking". When SDN came into OpenStack, I was quite skeptical about it. It seemed too much like vendor-driven marketing bullshit. However, as users deployed the Project Formerly Known as OpenStack Quantum, it became clear that SDN answers their needs. The chief need was the ability to shape networks programmatically, overlaid on top of the physical networking plant, in service of the VMs.

Before SDN, when all this cloud thing came about, practitioners also struggled with the definition of it, and in particular the difference from the plain old datacenter virtualization. The difference is the programmatic control throughout. RHEV (now oVirt) eventually grew an API, which blurred the lines. But in OpenStack it was the main feature from the start. So you can manage everything and anything programmatically, including, for example, running on bare hardware. One can say that cloud is "Software-Defined Computing".

So, how does this programmatic thing apply to Swift? Joe had interesting insights cunningly hidden in the book, like these:

In an SDS system, reliability is the responsibility of the software, not the hardware. Replication and data integrity tactics are used to ensure that data does not become corrupt and that lost data is recovered.

[...]

A crucial function of an SDS system is to orchestrate capacity — storage, networking, routing & services — for entire cluster.

Swift covers the first part well already. The second is missing, or "de-coupled".

For galactic fairness, he also wrote things that seem wrong-headed to me:

There is no application sharding or managing volumes which can drive operational knowledge and complexity into applications because the SDS system is one cohesive system. Users do not need to ask for or know 'which storage pool' should be used because there is only one namespace.

The problem with hiding the pools outside of namespace is that they become invisible to the programmatic control as well, and such control is essential to the very definition of SDS. Someone at Amazon made a brilliant decision to make buckets a unit of replication in S3, so they can be linked to a region. In effect this hides the complexity but exposes knowledge that an application needs. Thus, any S3 client can do what Joe considers SDS, but without any de-coupling, through the namespace and inside the API (or it can choose not to do it and just use a default region, for simplicity).

Joe's employees are hard at work implementing the vision as he outlined it, using the concept of regions that are internal to Swift cluster. The problem for everyone else, however, is how the programmatic control of that stack is exclusive to SwiftStack (with some useful things leaking into Swift, such as changeable replica count).

So, in the end, today Swift offers a solid foundation and parts of an SDS system, but the orchestration is "de-coupled" away elsewhere. Seems like a clear challenge to OpenStack to (re-)create the missing pieces.

P.S. I'd love to see the missing parts inside the Swift API and even namespace, although we have a problem here. Our Accounts and Containers are not guaranteed to live anywhere specifically or even on the same nodes. Changing that would be a step that I prefer. But Joe prefers to give up on plugging programmatic orchestration into the Swift API and just "de-couple" the heck out of it. John, our benevolent PTL, seems to toe that line. Maybe they are right.

P.P.S. The deal with the programmatic orchestration is something that "unified" storage projects have to address too. E.g. in GlusterFS a program can issue mkdir(2). Is this programmatic control? No, not enough. Okay, they have glusterfsd nowadays, I can create volfiles in there, is that SDS? That is getting closer!

May 16, 2013 04:19 PM

Dave Jones: PSA: OCZ Vector SSD firmware.

My bad luck with hardware continues.

At the beginning of this year, I bought an SSD for my laptop.

I previously wrote about the need to update smartmontools, which should now be updated everywhere. One thing I was not aware of at the time however, was that there’s a firmware update available. Had I known this, I would have applied it, because as soon as I hit the “400GB of lifetime writes” counter (coincidence?), it lost the ability to write to any block. It won’t even respond to secure erase commands.

The failure is exacerbated by the fact that the disk contains journalling filesystems in need of recovery. So if anything tries to mount them, it tries to write to the disk, and then falls off the bus requiring a power cycle to even see the disk again. The recovery tools provided by OCZ apparently try to mount every partition it finds during boot up (derp).

So now it’s on its way back to OCZ for reflashing/replacement. Lesson learned.

If you have one of these, and hdparm -I shows you have firmware 1.03, you might want to update it to 2.0. There are flashing tools on ocz’s site (in the form of bootable linux images, using an insane desktop that looks like what hacker movies in the 1990s looked like). There’s no guarantee that the new firmware actually fixes whatever problem I’ve hit, due to the lack of changelogs, but given it was the first thing they asked me to try, I’m going to say there’s a strong possibility it’s a known bug.
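
For anyone checking several machines, the firmware check can be sketched in Python. This is an illustration under assumptions: it relies on the "Firmware Revision:" line that hdparm -I prints in its identification section, the function name is made up, and the sample text below is fabricated for the example rather than captured from a real drive.

```python
import re


def firmware_revision(hdparm_output):
    """Extract the firmware revision string from `hdparm -I` output.

    Returns None if no 'Firmware Revision:' line is found."""
    match = re.search(r"^\s*Firmware Revision:\s*(\S+)",
                      hdparm_output, re.MULTILINE)
    return match.group(1) if match else None


# Fabricated sample resembling the start of `hdparm -I /dev/sdX` output.
sample = """\
ATA device, with non-removable media
\tModel Number:       OCZ-VECTOR
\tFirmware Revision:  1.03
"""

revision = firmware_revision(sample)
if revision == "1.03":
    print("firmware 1.03 found; consider updating to 2.0")
else:
    print(f"firmware revision: {revision}")
```

In real use the text would come from running hdparm -I as root against the actual device instead of the sample string.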

PSA: OCZ Vector SSD firmware. is a post from: codemonkey.org.uk

May 16, 2013 04:16 PM

Dave Jones: CVE-2013-2094. Another day, another fuzzed bug.

Last month Tommi found a kernel bug in perf_swevent_init using trinity, and posted a fix upstream. This apparently turned out to be a local root. Someone released an exploit for it this week. (interesting dissection of the exploit by spender here).

The code to fuzz perf_event_open was added to Trinity in November 2011. Yet for some reason, we only started to hit this recently. The sanitise routine for this syscall is still pretty basic, even after I added a little more to it yesterday. There’s probably more fruit on that branch somewhere.

There’s a date in the exploit code that claims it was written shortly after the affected code was merged upstream in 2010. Assuming that’s true, it’s taken way too long to find this. Trinity should have found this a lot sooner.

CVE-2013-2094. Another day, another fuzzed bug. is a post from: codemonkey.org.uk

May 16, 2013 03:01 PM

May 15, 2013

Dave Jones: 3.10rc1 testing status

3.10rc1 came out a few days ago. At 12,000 changesets, lwn calls it the busiest such ever. Statements like that usually make me nervous. But things are generally in pretty good shape. Much better than 3.9rc1 was.

and that’s been about it.

Generally feeling pretty solid. Fedora 19 is still going to ship with 3.9, but we’ll likely have a 3.10.x update on day of release.

3.10rc1 testing status is a post from: codemonkey.org.uk

May 15, 2013 02:40 PM

Eric Sandeen: Distributed solar variability

One of the common arguments against solar as an energy source is that it’s just too variable.  You can never count on it when you need it.  What if clouds roll in and out? [1]

One counter-argument might be – well, you never know when anyone will turn on their AC, either, at least not minute-by-minute.  The grid is a balancing act; unpredictable, random loads have the same effect as unpredictable, random generators.

To which one might then counter yes, but there are so many AC units out there, they average out, more or less, turning on and off at random times and smoothing things out in aggregate.

To which the solar advocate might reply OK, then with enough solar the peaks and valleys of generation should cancel out too, as clouds move out of one area into another.  Does this seem likely to work out in practice?

To find out, I grabbed 5 minute data from about 40 Enphase systems in the twin cities on a highly variable, sporadically cloudy day.  Because we don’t yet have a whole lot of solar here, and I didn’t want the one or two large commercial systems in the group to swamp the smaller residential systems, first I normalized them all to a % of their max output.  (This might be cheating a little, but with a lot more systems randomly distributed in size and geography, the swamping-out effect should be minimized.)  Here’s what just 4 of those systems look like; each is indeed pretty messy and unpredictable at the 5-minute range:

[Figure: solar_junk]

Then I averaged all of the systems.  Here’s what the average looks like, compared to one of the individual systems:

[image: solar_smoothing]

It appears that things certainly do smooth out when we look at geographically distributed systems.  If I were a grid operator, I might feel a lot better about that.
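The normalize-then-average step can be sketched in a few lines of Python. This is only an illustration on synthetic data, not the Enphase readings used for the plots: the system sizes, the uniform random "cloud" model, and every function name are assumptions made up for the example. In particular it assumes clouds hit each system independently, which real weather only approximates.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the synthetic data is repeatable

N_SYSTEMS = 40
N_SAMPLES = 120  # ten hours of 5-minute readings


def synthetic_system(capacity_w):
    """Fake 5-minute output in watts; each reading is independently clouded."""
    return [capacity_w * rng.uniform(0.2, 1.0) for _ in range(N_SAMPLES)]


def normalize(series):
    """Scale a series to a fraction of its own maximum output."""
    peak = max(series)
    return [value / peak for value in series]


def step_variability(series):
    """Mean absolute change between consecutive 5-minute readings."""
    return mean(abs(b - a) for a, b in zip(series, series[1:]))


# Hypothetical mix of residential sizes plus a few large commercial systems.
capacities = [rng.choice([3_000, 5_000, 8_000, 100_000]) for _ in range(N_SYSTEMS)]
systems = [normalize(synthetic_system(c)) for c in capacities]

# Average across systems at each time step.
fleet_average = [mean(readings) for readings in zip(*systems)]

single = step_variability(systems[0])
fleet = step_variability(fleet_average)
print(f"one system: {single:.3f}  fleet average: {fleet:.3f}")
```

Because every series is normalized before averaging, a 100 kW system counts for exactly as much as a 3 kW one, which is the "cheating a little" mentioned above.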

The caveats might be that this is a very wide geographic range – I grabbed systems from all of the Twin Cities and suburbs.  And that’s probably larger than the various sub-grids within the cities; what the variability within those subgrids is, or how this solar variability affects them, I’m not sure.  And of course my initial normalization of all systems to the same size could be argued with.

There have been much more rigorous papers and presentations written on this as well, see for example “Quantifying PV Power Output Variability” by Thomas E. Hoff and Richard Perez in 1999, and “Implications of Wide-Area Geographic Diversity for Short-Term Variability of Solar Power” by Andrew Mills and Ryan Wiser at LBNL in 2010.  But with the advent of 5-minute monitoring from systems like Enphase, I wonder if even better results could be found from this wealth of data.

[1]  I’ll submit that a sporadically cloudy day is more trouble to a grid operator than a generally cloudy day.   We often know if a day will be cloudy well ahead of time, and that doesn’t yield the minute-to-minute variations of a sporadically cloudy day.  The grid is better, I think, at responding to these longer-term variations.

May 15, 2013 02:39 AM

May 14, 2013

Pavel Machek: Slides from openmobility talk

...are here. Recording was going on, but I'm not sure if it is online somewhere...

May 14, 2013 12:54 PM

May 13, 2013

James Morris: Slides from my Security Subsystem Overview at LinuxCon Japan 2012

Whoops. Looks like I forgot to post my slides from last year’s LinuxCon Japan talk on the Linux kernel security subsystem.

Here they are:

http://namei.org/presentations/kernel-security-state-linuxconjp-2012b.pdf

I’ll be giving an update at the upcoming LinuxCon Japan in Tokyo in a couple of weeks.

May 13, 2013 11:14 AM

May 10, 2013

Dave Jones: Weekly Fedora kernel bug statistics – May 10th 2013

                           17   18   19  rawhide  Total
Open:                     273  348  117       73  (811)
Opened since 2013-05-03     7   17    8        6   (38)
Closed since 2013-05-03     7   20   14        4   (45)
Changed since 2013-05-03   20   41   20        7   (88)

Nothing terribly exciting in this week’s new bugs. Backlog continues to slowly get beaten down. Next week should see a rebase to 3.9 for F18.

Weekly Fedora kernel bug statistics – May 10th 2013 is a post from: codemonkey.org.uk

May 10, 2013 04:18 PM

May 09, 2013

Pete Zaitcev: Viva la testing revolution

This is not something to brag about, but apparently I managed to program computers for about 30 years without writing unit tests. Today it's rectified by adding a test to one of my projects voluntarily. I encountered the goodness of build-time testing when working on Jeff Garzik's Project Hail. And of course, OpenStack, including Swift, had them since forever. Those weren't my projects, however.

May 09, 2013 08:12 PM

Pavel Machek: From squeeze to wheezy and back, and how not to backup your / filesystem

I'm still running squeeze on my X60... and I decided that with wheezy becoming "stable", it was a good idea to upgrade. Before I started, I did back up my root filesystem (fortunately), with

cp -a --one-file-system / somewhere


The upgrade was a bit of a fight (like aptitude trying to take hours of CPU time), but eventually I succeeded... only to realize that the system no longer boots into the GUI and (worse) that gnome2 is gone. I'm not a great fan of gnome3; definitely not on the X60, anyway. Its animations feel excessive even when the system is unloaded, and if there's some background load it quickly becomes unusable. I googled a bit, and it did not look like going back to gnome2 would be exactly easy.

So I restored from the backup. First, chromium refused to run because the new version broke the config files. I restored those from backup. But next... strangely my self-compiled 3.9 kernel stopped working. The stock debian kernel kept running, but my own kernel ran init, and then rsyslogd broke the boot.

Can you guess what went wrong?

pc jvgu bar svyr flfgrz bcgvba vf ernyyl onq vqrn; vg jvyy abg pbcl rirelguvat sebz lbhe / svyrflfgrz, va cnegvphyne vg jvyy abg pbcl /qri, orpnhfr gurer'f gz csf zbhagrq bire vg. Bhpu.

May 09, 2013 12:20 PM

May 07, 2013

Matthew Garrett: A short introduction to TPMs

I've been working on TPMs lately. It turns out that they're moderately awful, but what's significantly more awful is basically all the existing documentation. So here's some of what I've learned, presented in the hope that it saves someone else some amount of misery.

What is a TPM?

TPMs are devices that adhere to the Trusted Computing Group's Trusted Platform Module specification. They're typically microcontrollers[1] with a small amount of flash, and attached via either i2c (on embedded devices) or LPC[2] (on PCs). While designed for performing cryptographic tasks, TPMs are not cryptographic accelerators - in almost all situations, carrying out any TPM operations on the CPU instead would be massively faster[3]. So why use a TPM at all?

Keeping secrets with a TPM

TPMs can encrypt and decrypt things. They're not terribly fast at doing so, but they have one significant benefit over doing it on the CPU - they can do it with keys that are tied to the TPM. All TPMs have something called a Storage Root Key (or SRK) that's generated when the TPM is initially configured. You can ask the TPM to generate a new keypair, and it'll do so, encrypt it with the SRK (or another key descended from the SRK) and hand it back to you. Other than the SRK (and another key called the Endorsement Key, which we'll get back to later), these keys aren't actually kept on the TPM - the running OS stores them on disk. If the OS wants to encrypt or decrypt something, it loads the key into the TPM and asks it to perform the desired operation. The TPM decrypts the key and then goes to work on the data. For small quantities of data, the secret can even be stored in the TPM's nvram rather than on disk.

All of this means that the keys are tied to a system, which is great for security. An attacker can't obtain the decrypted keys, even if they have a keylogger and full access to your filesystem. If I encrypt my laptop's drive and then encrypt the decryption key with the TPM, stealing my drive won't help even if you have my passphrase - any other TPM simply doesn't have the keys necessary to give you access.

That's fine for keys which are system specific, but what about keys that I might want to use on multiple systems, or keys that I want to carry on using when I need to replace my hardware? Keys can optionally be flagged as migratable, which makes it possible to export them from the TPM and import them to another TPM. This seems like it defeats most of the benefits, but there's a couple of features that improve security here. The first is that you need the TPM ownership password, which is something that's set during initial TPM setup and then not usually used afterwards. An attacker would need to obtain this somehow. The other is that you can set limits on migration when you initially import the key. In this scenario the TPM will only be willing to export the key by encrypting it with a pre-configured public key. If the private half is kept offline, an attacker is still unable to obtain a decrypted copy of the key.

So I just replace the OS with one that steals the secret, right?

Say my root filesystem is encrypted with a secret that's stored on the TPM. An attacker can replace my kernel with one that grabs that secret once the TPM's released it. How can I avoid that?

TPMs have a series of Platform Configuration Registers (PCRs) that are used to record system state. These all start off programmed to zero, but applications can extend them at runtime by writing a sha1 hash into them. The new hash is concatenated to the existing PCR value and another sha1 calculated, and then this value is stored in the PCR. The firmware hashes itself and various option ROMs and adds those values to some PCRs, and then grabs the bootloader and hashes that. The bootloader then hashes its configuration and the files it reads before executing them.

This chain of trust means that you can verify that no prior system component has been modified. If an attacker modifies the bootloader then the firmware will calculate a different hash value, and there's no way for the attacker to force that back to the original value. Changing the kernel or the initrd will result in the same problem. Other than replacing the very low level firmware code that controls the root of trust, there's no way an attacker can replace any fundamental system components without changing the hash values.
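The extend operation is small enough to sketch directly. The following is a rough Python model of it, not TPM driver code: the function names and the component byte strings are made up for illustration, and a real boot chain would hash actual firmware and bootloader images.

```python
import hashlib

PCR_SIZE = 20  # length of a SHA-1 digest in bytes


def extend(pcr: bytes, measured: bytes) -> bytes:
    """Return the PCR value after extending it with the hash of `measured`."""
    digest = hashlib.sha1(measured).digest()
    # The new hash is concatenated to the old PCR value and hashed again.
    return hashlib.sha1(pcr + digest).digest()


def measure_boot(components) -> bytes:
    """Extend a single PCR, starting from zero, with each component in turn."""
    pcr = bytes(PCR_SIZE)  # PCRs start off as all zeroes
    for blob in components:
        pcr = extend(pcr, blob)
    return pcr


# Hypothetical component contents, stand-ins for real images.
good = measure_boot([b"firmware", b"bootloader", b"kernel"])
evil = measure_boot([b"firmware", b"evil bootloader", b"kernel"])

print(good.hex())
print(evil.hex())
print("match" if good == evil else "PCR values differ")
```

Since each new value depends on the previous one, a tampered component changes the final value, and getting back to the original would mean finding a SHA-1 input that produces a chosen output.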

TPMs support using these hash values to decide whether or not to perform a decryption operation. If an attacker replaces the initrd, the PCRs won't match and the TPM will simply refuse to hand over the secret. You can actually see this in use on Windows devices using Bitlocker - if you do anything that would change the PCR state (like booting into recovery mode), the TPM won't hand over the key and Bitlocker has to prompt for a recovery key. Choosing which PCRs to care about is something of a balancing act. Firmware configuration is typically hashed into PCR 1, so changing any firmware configuration options will change it. If PCR 1 is listed as one of the values that must match in order to release the secret, changing any firmware options will prevent the secret from being released. That's probably overkill. On the other hand, PCR 0 will normally contain the firmware hash itself. Including this means that the user will need to recover after updating their firmware, but failing to include it means that an attacker can subvert the system by replacing the firmware.
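The PCR 0 versus PCR 1 tradeoff can be illustrated with a toy model in Python. Everything here is an assumption for the sake of the example: `ToyTPM`, `boot` and the blob layout are invented names, the secret sits in the blob unencrypted (a real TPM encrypts it with a key that never leaves the chip), and the PCR count of 24 is simply what the model uses.

```python
import hashlib


class ToyTPM:
    """Toy model of PCR-bound sealing; not a real TPM interface."""

    def __init__(self, n_pcrs=24):
        self.pcrs = [bytes(20) for _ in range(n_pcrs)]

    def extend(self, index, data):
        digest = hashlib.sha1(data).digest()
        self.pcrs[index] = hashlib.sha1(self.pcrs[index] + digest).digest()

    def seal(self, secret, pcr_indices):
        # Record the current values of the PCRs the secret is bound to.
        expected = {i: self.pcrs[i] for i in pcr_indices}
        return {"secret": secret, "expected": expected}

    def unseal(self, blob):
        for index, value in blob["expected"].items():
            if self.pcrs[index] != value:
                raise PermissionError(f"PCR {index} does not match")
        return blob["secret"]


def boot(firmware, config):
    """Simulate a boot: firmware goes into PCR 0, its configuration into PCR 1."""
    tpm = ToyTPM()
    tpm.extend(0, firmware)
    tpm.extend(1, config)
    return tpm


# Seal against PCR 0 only, so firmware settings may change freely.
blob = boot(b"firmware v1", b"boot order: disk").seal(b"disk key", [0])

# Changing a firmware option alters PCR 1, which the blob ignores.
print(boot(b"firmware v1", b"boot order: usb").unseal(blob))

# Replacing the firmware alters PCR 0, so the secret is withheld.
try:
    boot(b"firmware v2", b"boot order: disk").unseal(blob)
except PermissionError as err:
    print("refused:", err)
```

Passing `[0, 1]` to `seal` instead would make the changed boot order fail as well, which is the "probably overkill" case.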

What about using TPMs for DRM?

In theory you could populate TPMs with DRM keys for media playback, and seal them such that the hardware wouldn't hand them over. In practice this is probably too easily subverted or too user-hostile - changing default boot order in your firmware would result in validation failing, and permitting that would allow fairly straightforward subverted boot processes. You really need a finer grained policy management approach, and that's something that the TPM itself can't support.

This is where Remote Attestation comes in. Rather than keep any secrets on the local TPM, the TPM can assert to a remote site that the system is in a specific state. The remote site can then make a policy determination based on multiple factors and decide whether or not to hand over session decryption keys. The idea here is fairly straightforward. The remote site sends a nonce and a list of PCRs. The TPM generates a blob with the requested PCR values, sticks the nonce on, encrypts it and sends it back to the remote site. The remote site verifies that the reply was encrypted with an actual TPM key, makes sure that the nonce matches and then makes a policy determination based on the PCR state.
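The nonce exchange can be sketched as a small simulation in Python. One loud assumption: an HMAC over a shared key stands in for the TPM's key operation purely so the example runs with the standard library. A real TPM uses an asymmetric key that the remote site never holds, and the names `tpm_quote`, `verify_quote` and `TPM_KEY` are invented for this sketch.

```python
import hashlib
import hmac
import os

# ASSUMPTION: a shared HMAC key stands in for the TPM's asymmetric key.
TPM_KEY = os.urandom(32)


def tpm_quote(pcrs, requested, nonce):
    """TPM side: bundle the requested PCR values with the nonce and tag them."""
    values = b"".join(pcrs[i] for i in requested)
    tag = hmac.new(TPM_KEY, values + nonce, hashlib.sha1).digest()
    return values, tag


def verify_quote(values, tag, nonce, expected_values):
    """Remote side: check the tag and nonce, then apply the PCR policy."""
    good_tag = hmac.new(TPM_KEY, values + nonce, hashlib.sha1).digest()
    if not hmac.compare_digest(tag, good_tag):
        return False  # wrong key, tampered values, or a replayed nonce
    return values == expected_values


# Hypothetical PCR contents for a known-good system.
pcrs = [hashlib.sha1(b"firmware").digest(), hashlib.sha1(b"bootloader").digest()]
known_good = pcrs[0] + pcrs[1]

nonce = os.urandom(16)  # the remote site picks a fresh nonce per request
values, tag = tpm_quote(pcrs, [0, 1], nonce)
print("fresh quote accepted:", verify_quote(values, tag, nonce, known_good))

# Replaying the old quote against a new nonce fails.
print("replay accepted:", verify_quote(values, tag, os.urandom(16), known_good))
```

The nonce is what stops a subverted system from recording a quote taken in a good state and replaying it later.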

But hold on. How does the remote site know that the reply was encrypted with an actual TPM? When TPMs are built, they have something called an Endorsement Key (EK) flashed into them. The idea is that the only way to have a valid EK is to have a TPM, and that the TPM will never release this key to anything else. There's a couple of problems here. The first is that proving you have a valid EK to a remote site involves having a chain of trust between the EK and some globally trusted third party. Most TPMs don't have this - the only ones I know of that do are recent Infineon and STMicro parts. The second is that TPMs only have a single EK, and so any site performing remote attestation can cross-correlate you with any other site. That's a pretty significant privacy concern.

There's a theoretical solution to the privacy issue. TPMs never actually sign PCR quotes with the EK. Instead, TPMs can generate something called an Attestation Identity Key (AIK) and sign it with the EK. The OS can then provide this to a site called a PrivacyCA, which verifies that the AIK is signed by a real EK (and hence a real TPM). When a third party site requests remote attestation, the TPM signs the PCRs with the AIK and the third party site asks the PrivacyCA whether the AIK is real. You can have as many AIKs as you want, so you can provide each service with a different AIK.

As long as the PrivacyCA only keeps track of whether an AIK is valid and not which EK it was signed with, this avoids the privacy concerns - nobody would be able to tell that multiple AIKs came from the same TPM. On the other hand, it makes any PrivacyCA a pretty attractive target. Compromising one would not only allow you to fake up any remote attestation requests, it would let you violate user privacy expectations by seeing that (say) the TPM being used to attest to HolyScriptureVideos.com was also being used to attest to DegradingPornographyInvolvingAnimals.com.

Perhaps unsurprisingly (given the associated liability concerns), there are no public and trusted PrivacyCAs yet, and even if there were, (a) many computers are still being sold without TPMs and (b) even those with TPMs often don't have the EK certificate that would be required to make remote attestation possible. So while remote attestation could theoretically be used to impose DRM in a way that would require you to be running a specific OS, practical concerns make it pretty difficult for anyone to deploy that at any point in the near future.

Is this just limited to early OS components?

Nope. The Linux kernel has support for measuring each binary run or each module loaded and extending PCRs accordingly. This makes it possible to ensure that the running binaries haven't been modified on disk. There's not a lot of distribution infrastructure for setting this up, but in theory a distribution could deploy an entirely signed userspace and allow the user to opt into only executing correctly signed binaries. Things get more interesting when you add interpreted scripts to the mix, so there's still plenty of work to do there.

So what can I actually use a TPM for?

Drive encryption is probably the best example (Bitlocker does it on Windows, and there's a LUKS-based implementation for Linux here) - while in theory you could do things like use your TPM as a factor in two-factor authentication or tie your GPG key to it, there's not a lot of existing infrastructure for handling all of that. For the majority of people, the most useful feature of the TPM is probably the random number generator. rngd has support for pulling numbers out of it and stashing them in /dev/random, and it's probably worth doing that unless you have an Ivy Bridge or other CPU with an RNG.

Things get more interesting in more niche cases. Corporations can bind VPN keys to corporate machines, making it possible to impose varying security policies. Intel use the TPM as part of their anti-theft technology on education-oriented devices like the Classmate. And in the cloud, projects like Trusted Computing Pools use remote attestation to verify that compute nodes are in a known good state before scheduling jobs on them.

Is there a threat to freedom?

At the moment, probably not. The lack of any workable general purpose remote attestation makes it difficult for anyone to impose TPM-based restrictions on users, and any local code is obviously under the user's control - got a program that wants to read the PCR state before letting you do something? LD_PRELOAD something that gives it the desired response, or hack it so it ignores failure. It's just far too easy to circumvent.

Summary?

TPMs are useful for some very domain-specific applications, drive encryption and random number generation. The current state of technology doesn't make them useful for practical limitations of end-user freedom.

[1] Ranging from 8-bit things that are better suited to driving washing machines, up to full ARM cores
[2] "Low Pin Count", basically ISA without the slots.
[3] Loading a key and decrypting a 5 byte payload takes 1.5 seconds on my laptop's TPM.


May 07, 2013 05:18 PM

May 06, 2013

James Morris: Linux Security Summit 2013 (New Orleans) – Call for Participation

The CFP for the 2013 Linux Security Summit has been announced.

The summit will be held across the 19th and 20th of September in New Orleans, co-located again with LinuxCon and Linux Plumbers. Note that presenters and attendees at LSS must be registered as LinuxCon attendees.

We’ll be following a similar format to last year, with a day of refereed presentations, followed by subsystem updates and break-out sessions on the second day. We’ll probably finish up around lunchtime on the Friday for people needing to head home that day, but check the final schedule for details once it’s published.

The CFP is open until 14th June, with speaker notifications to be posted by 21st June.

If you’ve been doing cool and interesting work in Linux security, be sure to submit a proposal!

May 06, 2013 09:59 AM

May 04, 2013

Andi Kleen: TSX profiling

I published a quick overview on how to do TSX profiling with Linux perf: Intel TSX profiling with Linux perf

This is a technical overview that assumes some prior knowledge of profiling. I apologize for the cumbersome title.

May 04, 2013 02:53 AM

May 03, 2013

Dave Jones: Weekly Fedora kernel bug statistics – May 03 2013

                           17   18   19  rawhide  Total
Open:                     270  345  126       70  (811)
Opened since 2013-04-26     4   24    9        6   (43)
Closed since 2013-04-26    12   28    8        9   (57)
Changed since 2013-04-26   15   52   18        7   (92)

Weekly Fedora kernel bug statistics – May 03 2013 is a post from: codemonkey.org.uk

May 03, 2013 04:10 PM

April 30, 2013

Michael Kerrisk (manpages): man-pages-3.51 is released

I've released man-pages-3.51. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

This is a relatively small release that has various fixes across a number of pages. Among the more notable changes in man-pages-3.51 are the following:

April 30, 2013 07:31 AM