Kernel Planet

March 08, 2010

Pavel Machek: Commercial open source

...sponsored by Microsoft, tomorrow at 18h. I decided to take a look, so there will be some fun ;-). [And I guess I can always run away when it gets too bad.]

March 08, 2010 09:09 PM

Dave Miller: strlen(), oh strlen()...

I've been going through the glibc sparc optimized assembler routines to see if anything can be improved. And I took a stab at seeing if strlen() could be made faster. Find first zero byte in string, pretty simple right?

The first thing we have to discuss is the infamous trick coined by Alan Mycroft, way back in 1987. It lets you check for the presence of a zero byte in a word in 3 instructions. There are 2 magic constants:

#define MAGIC1		0x80808080
#define MAGIC2		0x01010101
If you're checking 64 bits at a time, simply expand the above magic values to 64 bits.

Then, given a word the check becomes:

	if ((val - MAGIC2) & ~val & MAGIC1)
		goto found_zero_byte_in_word;
Essentially we're subtracting MAGIC2 to induce underflow in each byte that has the value zero in it. Such underflows cause the top bit (0x80) of that byte to get set. Then we want to see if the top bit is set after subtraction in any byte where it wasn't set before the subtraction.
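As a minimal sketch, the test can be wrapped in two small C helpers, one per word size. The names MAGIC1_32, MAGIC2_64, has_zero_byte32 and so on are made up for illustration; only the constants and the expression come from the text above.

```c
#include <stdint.h>

#define MAGIC1_32 0x80808080u
#define MAGIC2_32 0x01010101u
#define MAGIC1_64 0x8080808080808080ull
#define MAGIC2_64 0x0101010101010101ull

/* Nonzero if at least one byte of val is 0x00 (32-bit word). */
static int has_zero_byte32(uint32_t val)
{
	return ((val - MAGIC2_32) & ~val & MAGIC1_32) != 0;
}

/* Same test with the magic values expanded to 64 bits. */
static int has_zero_byte64(uint64_t val)
{
	return ((val - MAGIC2_64) & ~val & MAGIC1_64) != 0;
}
```

Bytes that already have the top bit set, such as 0x80 or 0xff, do not trigger the test, because the "& ~val" term masks them out.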

To get the most parallelization on multi-issue cpus, we want to compute this using something like:

	tmp1 = val - MAGIC2;
	tmp2 = ~val & MAGIC1;
	if (tmp1 & tmp2)
		goto found_zero_byte_in_word;
to reduce the number of dependencies such that the computation of tmp1 and tmp2 can occur in the same cpu cycle.

Then there is all the trouble of getting the source buffer aligned so we can do the fast loop comparing a word at a time. The most direct implementation is to read a byte at a time, checking for zero, until the buffer address is properly aligned. This is also the slowest implementation.

The powerpc code in glibc has a better idea. If dereferencing a non-word-aligned byte at address 'x' is valid, so is reading the word at 'x & ~3' (or 'x & ~7' on 64-bit). This is because page protection occurs on page boundaries, and x and 'x & ~3' are on the same page.

The only thing left to attend to is to make sure we don't match the alignment pad bytes with zero. This is solved by computing a mask of 1's and writing those 1's into the word we read before we do the Mycroft computation above. In C it looks something like:

	orig_ptr = ptr;
	align = (unsigned long) ptr & 3;
	mask = ~0U >> (align * 8);	/* unsigned: -1 >> n could shift in sign bits */
	ptr = (void *) ((unsigned long) ptr & ~3UL);
	val = *ptr;
	val |= ~mask;
	if ((val - MAGIC2) & ~val & MAGIC1)
		goto found_zero_byte_in_word;
At which point we can fall into the main loop.
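The pad-byte masking can be tried in isolation. The sketch below assumes the word is passed in as a big-endian load would see it (first byte in memory in the top 8 bits), which matches sparc and powerpc. The function name first_word_has_zero is hypothetical, and the actual load from the rounded-down address is left out.

```c
#include <stdint.h>

#define MAGIC1 0x80808080u
#define MAGIC2 0x01010101u

/*
 * word:  the aligned 32-bit word as a big-endian load would see it
 *        (first byte in memory in the top 8 bits).
 * align: number of pad bytes in front of the string start (0..3).
 */
static int first_word_has_zero(uint32_t word, unsigned int align)
{
	/* 1's over the bytes that really belong to the string */
	uint32_t mask = 0xffffffffu >> (align * 8);

	word |= ~mask;	/* force pad bytes to 0xff so they cannot match */
	return ((word - MAGIC2) & ~word & MAGIC1) != 0;
}
```

With align = 1, a zero in the first (pad) byte is ignored, while a zero in any of the three remaining bytes is still detected.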

Once we find the word containing a zero byte, we have to iteratively look for where it is in order to compute the return value. How to schedule this is not trivial, and it's especially cumbersome on 64-bit (where we have to potentially check 8 bytes as opposed to 4).

Anyways, let's analyze the 64-bit Sparc implementation I'm hacking on at the moment. I'm targeting UltraSPARC-III and Niagara2 for performance analysis. Simply speaking, UltraSPARC-III can dual-issue integer operations, and Niagara2 is single issue and predicts all branches not taken (basically this means: minimize use of branches).

davem_strlen:
	mov	%o0, %o1
	andn	%o0, 0x7, %o0

	ldx	[%o0], %o5
	and	%o1, 0x7, %g1
	mov	-1, %g5
Save away the original string pointer in %o1. At the end we'll compute the return value as "%o0 - %o1". Align the buffer pointer and load a word as quickly as possible. We load the first word early so that we can hide the memory latency behind all of the constant and mask formation we need to do before we can make the Mycroft test.

%g5 holds the initial part of the mask computation (-1, which gets expanded fully to 64-bits by this move instruction) and %g1 will have the shift factor.

	sethi	%hi(0x01010101), %o2
	sll	%g1, 3, %g1

	or	%o2, %lo(0x01010101), %o2
	srlx	%g5, %g1, %o3

	sllx	%o2, 32, %g1
	sethi	%hi(0x00ff0000), %g5
%o2 is going to hold the "0x01" expanded to 64-bits subtraction magic value. %o3 will first hold the initial word mask, and then it will hold the "0x80" magic constant. We can compute the two 64-bit magic constants into registers in 5 instructions.

We can start with either of the two constants; we choose the "0x01" here because we'll need it first. This is loaded first using "sethi", "or". This gives us the lower 32-bits of the constant, then we shift up a copy by 32-bits, then or that into the lower 32-bit copy to compute the final value. "0x80" is "0x01" shifted left by 7 bits so a simple shift is all we need to load the other 64-bit constant.
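The same construction can be written out in C as a sanity check on the arithmetic. This is only a sketch; make_magic_01 and make_magic_80 are illustrative names, and the comments map each step to the instruction that performs it in the assembler.

```c
#include <stdint.h>

/* Build the 64-bit "0x01" constant the way sethi/or/sllx/or does. */
static uint64_t make_magic_01(void)
{
	uint64_t lo = 0x01010101u;	/* sethi + or: lower 32 bits */
	uint64_t hi = lo << 32;		/* sllx: a copy shifted up by 32 */

	return lo | hi;			/* or: merge the two halves */
}

/* The "0x80" constant is the "0x01" constant shifted left by 7. */
static uint64_t make_magic_80(void)
{
	return make_magic_01() << 7;	/* sllx by 7 */
}
```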

The "0x00ff0000" constant will be used while searching for the zero byte in the final word.

Next, we mask the initial word and fall through into the main loop.

	orn	%o5, %o3, %o5
	or	%o2, %g1, %o2

	sllx	%o2, 7, %o3
Mask in the pad bits using the mask computed in %o3. Finish computation of the 64-bit MAGIC2 (the "0x01" constant) into %o2, and finally put MAGIC1 (the "0x80" constant) into %o3. We're ready for the main loop:
10:	add	%o0, 8, %o0

	andn	%o3, %o5, %g1
	sub	%o5, %o2, %g2

	andcc	%g1, %g2, %g0
	be,a,pt	%xcc, 10b
	 ldx	[%o0], %o5
This is a real pain to schedule because there are many dependencies. But the "andn", "sub", "andcc" sequence is the Mycroft test, and those first two instructions can execute in one clock cycle on UltraSPARC-III. The ",a" annul bit on the branch means that we only execute the load in the branch delay slot if the branch is taken.

Now we have the code that searches for where exactly the zero byte is in the final word.

	srlx	%o5, 32, %g1
	sub	%o0, 8, %o0
We over advanced the buffer pointer in the main loop, so correct that by subtracting 8. Prepare a copy of the upper 32-bits of the word into %g1.
	andn	%o3, %g1, %o4
	sub	%g1, %o2, %g2

	add	%o0, 4, %g3
	andcc	%o4, %g2, %g0

	movne	%icc, %g1, %o5
	move	%icc, %g3, %o0
This is divide and conquer. Instead of doing 8 byte compares, we first see if the upper 32-bits have the zero byte. We essentially redo the Mycroft test on the upper 32-bits of the word.

If the upper 32-bits have the zero byte, we use %g1 for the comparisons. Otherwise we retain %o5 for the subsequent comparisons and advance the buffer pointer by 4 bytes. This is what the final two conditional move instructions are doing. Note that these conditional moves use '%icc', the 32-bit condition codes.

The astute reader may wonder why we can't just use the upper 32-bits of the Mycroft computation we made in the main loop. This doesn't work because the underflows can carry and cause false positives in upper bytes of the word. For example, consider a value where bits 39 down to 24 have hex value "0x0100". The Mycroft computation will yield "0x8080" for those two bytes, because the borrow out of the zero byte also underflows the "0x01" byte above it. The real zero byte is the lower one, not the upper one. So we can't merely use the upper 32-bits of the already computed 64-bit Mycroft mask, we have to recompute it over 32-bits by hand.
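The false positive is easy to reproduce. Below is a small sketch with two hypothetical helpers, mycroft_mask64 and mycroft_mask32, that return the per-byte mask instead of a yes/no answer.

```c
#include <stdint.h>

/* Per-byte Mycroft mask: 0x80 marks each byte the test flags. */
static uint64_t mycroft_mask64(uint64_t val)
{
	return (val - 0x0101010101010101ull) & ~val & 0x8080808080808080ull;
}

/* The same mask recomputed over a 32-bit half only. */
static uint32_t mycroft_mask32(uint32_t val)
{
	return (val - 0x01010101u) & ~val & 0x80808080u;
}
```

Take the word 0x4141410100414141, whose only zero byte sits in bits 31 down to 24. mycroft_mask64() returns 0x0000008080000000, so the 0x01 byte in the upper half is flagged as well. Recomputing over the upper half alone, mycroft_mask32(0x41414101) returns 0, which is the right answer.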

Now we're left with 32-bits to check for a zero byte. We make extensive use of conditional moves to avoid branches:

	mov	3, %g2
	srlx	%o5, 8, %g1

	andcc	%g1, 0xff, %g0
	move	%icc, 2, %g2

	andcc	%o5, %g5, %g0
	srlx	%o5, 24, %o5
	move	%icc, 1, %g2

	andcc	%o5, 0xff, %g0
	move	%icc, 0, %g2

	add	%o0, %g2, %o0
We check starting at the low byte and work up to the highest byte, because the highest byte, if zero, takes priority. We add the offset of the zero byte to the buffer pointer.
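In C, the same search can be sketched with conditional expressions standing in for the conditional moves. The helper name zero_byte_offset32 is made up, and whether a compiler really emits branch-free code for it is not guaranteed. As in the assembler, the word must already be known to contain a zero byte.

```c
#include <stdint.h>

/*
 * Offset of the first zero byte in w, where offset 0 is the first byte
 * in memory on a big-endian machine (bits 31..24).
 * Precondition: w is already known to contain a zero byte.
 */
static unsigned int zero_byte_offset32(uint32_t w)
{
	unsigned int off = 3;			/* default: lowest byte */

	off = ((w >> 8) & 0xff) ? off : 2;	/* bits 15..8 */
	off = (w & 0x00ff0000u) ? off : 1;	/* bits 23..16 */
	off = ((w >> 24) & 0xff) ? off : 0;	/* bits 31..24 win if zero */
	return off;
}
```

Later checks overwrite earlier ones, which is how the highest zero byte ends up taking priority.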

Finally:

	retl
	 sub	%o0, %o1, %o0
We compute the length and return from the routine.

Many many moons ago, in 1998, Jakub Jelinek and his friend Jan Vondrak wrote the routines we use now on sparc. And frankly it's very hard to beat that code especially on multi-issue processors.

The powerpc trick to align the initial word helps us beat the existing code for all the unaligned cases. But for the aligned case the existing code holds a slight edge.

So now I've been trimming cycles as much as possible in the new code trying to reach the state where the aligned case executes at least as fast as the existing code. I'll check this work into glibc once I accomplish that.

The Mycroft trick extends to other libc string routines. For example, for 'memchr' you replicate the search character into all bytes of a word (let's call it 'xor_mask'), and in the inner loop you adjust each word by using:

	val ^= xor_mask;
Then use the Mycroft test as in strlen(). Another complication with memchr, however, is the need to check the given length bounds.

This can be done in one instruction by putting the far bounds into your base pointer register (called '%top_of_buffer' below), then using offsets starting at "0 - total_len" (referred to as '%negative_len' below).

Then your inner loop can do something like:

	ldx	[%top_of_buffer + %negative_len], %o5
	addcc	%negative_len, 8, %negative_len
	bcs	%xcc, len_exceeded
	 ...
We exit the loop when adding 8 bytes to the negative len produces a carry, that is, when the offset crosses zero.
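Both memchr ideas, the xor_mask and the negative offset counting up towards zero, can be combined in a portable C sketch. This is not the glibc code: sketch_memchr is a hypothetical name, it works on 32-bit words, it uses memcpy() for the loads so that alignment does not matter, and it assumes len fits in ptrdiff_t.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAGIC1 0x80808080u
#define MAGIC2 0x01010101u

/* Hypothetical helper: memchr built on xor_mask and a negative offset. */
static const void *sketch_memchr(const void *buf, int c, size_t len)
{
	const unsigned char *top = (const unsigned char *)buf + len;
	ptrdiff_t off = -(ptrdiff_t)len;	/* "0 - total_len" */
	uint32_t xor_mask = (uint32_t)(unsigned char)c * MAGIC2;

	/* Word loop: runs while a full 4-byte word is still in bounds. */
	while (off <= -4) {
		uint32_t val;

		memcpy(&val, top + off, sizeof(val));	/* alignment-safe load */
		val ^= xor_mask;	/* matching bytes become zero */
		if ((val - MAGIC2) & ~val & MAGIC1)
			break;		/* the match is inside this word */
		off += 4;
	}

	/* Byte loop: pins down the match, or handles the short tail. */
	for (; off < 0; off++)
		if (top[off] == (unsigned char)c)
			return top + off;
	return NULL;
}
```

The byte loop at the end keeps the sketch independent of endianness. The assembler version would instead locate the byte inside the word, as strlen() does.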

If you're interested in this kind of topic, bit twiddling tricks and whatnot, you absolutely have to own a copy of "Hacker's Delight" by Henry S. Warren, Jr.

March 08, 2010 05:09 PM

March 07, 2010

Pavel Machek: Riding for years, and still does not know how to stop

Yep, that's me; and yes, I know what the cue to slow down the horse is -- lean back and use both reins. And yes, you can stop the horse by doing "slow down" three times...

But that's not a way to stop the horse. If you are going full gallop and need to stop, you want full stop now cue, not three slow down cues.

Now, I knew some horses that were actually very good at stopping, and yes, there's huge difference between stop now and slow down to full stop. Cue those horses were trained to was "whoa"...

So I tried teaching that cue to young stallion here, and it does not really work. Or rather... it works a bit too well.

I know many horses where "whoa" means slow down so I sometimes utter it when I want to just slow down... and then the horse comes abruptly to full stop. What is worse, many other words trigger same response -- I guess they are too similar for stallion's ears.

There must be some reasonable cue, that is impossible to mistake for the horse, and unlikely to be given accidentally by the rider... unintended full stop is almost "and now climb back to the horse" event... but what is it? For now I know "whoa" is neither :-(.

(And for the record, I probably could teach horse to do full stop on something completely crazy -- like hand touching his tail -- he's learning almost too quick.)

March 07, 2010 07:50 PM

March 05, 2010

Harald Welte: OsmocomBB now performing location updating procedure against GSM cell

I haven't had much time for blogging recently, too much exciting work going on at OsmocomBB:

There are still many limitations, but this is a major milestone in the project: We have working bi-directional communication from the phone to the network!

The limitations include:

However, most of those are more or less simple "we know what needs to be done, it's just a matter of getting it done" kind of tasks. There are no big unknowns involved, and particularly no further reverse-engineering of the hardware is required.

Also, the existence of a stable bi-directional communications channel between the network and the phone means that anyone interested in working on the higher layers can now actually do so. Completing and testing layer2 as well as RR/MM/CC on layer3 is a major task in itself, and it definitely requires the lower layers to be there.

The other good part is that development of layer2 and layer3 can happen entirely on the host PC, where debugging is much easier and there's no need for cross-compilation and we can use all the usual debugging options (gdb, valgrind, ...)

I'm now almost heading off for holidays (starting March 10), so don't expect any major progress from me anytime soon. I hope other interested developers will be able to take it from here and fill in some missing gaps until I get back.

March 05, 2010 01:00 AM

March 04, 2010

Kernel Podcast: Updates coming!

Folks,

Sorry for the delay. I should have updates out before the end of the week. Thanks. Remember, this is a spare time project and takes a lot of effort to do properly.

Jon.

March 04, 2010 08:50 AM

Evgeniy Polyakov: Elliptics changes

They are quite dramatic, but still very small - I committed search protocol changes. Now a node stores transactions with IDs greater than or equal to the node's ID (it stored smaller or equal IDs previously), which is incompatible with current node searching, but allows maintaining human-readable and logical (for humans) ID generation.

So, when a node has ID, say, 0100..., it will host data transactions which start from 01 (it's the highest byte). It is much more convenient to configure nodes with this in mind than to calculate what is less than 01, namely FF... IDs.
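A minimal sketch of what such a lookup could look like, under loud assumptions: IDs are shortened to 32 bits as a stand-in for the real, much longer IDs, the node list is already sorted, and keys smaller than every node ID wrap around to the last node. owner_of is a hypothetical helper, not the elliptics API.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * ids: node IDs sorted in ascending order, n >= 1.
 * Returns the index of the node with the largest ID <= key.
 * Keys below every node ID wrap around to the last node (assumption).
 */
static size_t owner_of(const uint32_t *ids, size_t n, uint32_t key)
{
	size_t owner = n - 1;	/* wrap-around default */
	size_t i;

	for (i = 0; i < n; i++) {
		if (ids[i] > key)
			break;
		owner = i;
	}
	return owner;
}
```

With nodes at 0x01000000, 0x80000000 and 0xC0000000, every key whose highest byte is 01 lands on the first node, which is the human-readable property described above.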

I also committed initial metadata support, but neither low level IO backend supports that yet, and I will leave only the Tokyo Cabinet DB and file backends; BerkeleyDB support will be dropped because of its slowness. It is still in a development stage, since there is no clear vision on where this functionality should live - client or server.
I.e. it is possible that client will tell that it wants to insert metadata X into given object, and server will read/modify/write metadata blob itself, or it is possible that client will download whole metadata blob, update it locally and then write it back to server, which will replace old one with the new data. Likely I will use the former case, since it simplifies client development, which should be a higher priority than server simplification.

We also found an interesting bug or feature of the storage - in some cases it is not possible to remove object, it will be recovered from the dead. Let's say we have two object copies and one node was turned off. Automatic recovery (not present yet though) will create another copy from the first one on alive nodes. Subsequent object removal will kill both copies on running nodes. When the turned off node goes online again, the automatic recovery tool will resurrect the removed object from the copy present on this node.

To date it is all a pure theory, since there is no separate metadata in the storage, thus no automatic recovery (admin should run special tool with properly crafted log file currently) and it does not remove objects from the storage. But still, described problem will hit us badly when we will actively use it.
And while there is no merge implemented either (it is kind of being materialized in my mind while we talk), solution will involve new history entry creation instead of actual data removal. Thus transaction log will contain a note that given object was removed. In case of network split and parallel object removal and update in different parts (which can not contact each other during this event) of the storage, this will also allow to implement correct and complete transaction history log by synchronization daemon.

Thus object will never be deleted from the storage, and instead its history will be updated to store a note about its status. File system checker will be extended to support a mode, when it will actually remove objects from the storage after they were marked (and resolved during merge with other logs if needed) as deleted after some timeout, which should be big enough to eliminate such ghost nodes appearance.
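The removal-as-history-entry idea can be sketched as follows. Everything here is hypothetical and only illustrates the scheme described above: the struct layout, the use of a plain number as the ordering key, and the helper names is_removed and can_purge are assumptions, not the elliptics format.

```c
#include <stddef.h>
#include <stdint.h>

enum op { OP_WRITE, OP_REMOVE };

struct history_entry {
	uint64_t stamp;		/* ordering key, e.g. a timestamp (assumption) */
	enum op  op;
};

/* Object state is decided by the newest entry in the merged log. */
static int is_removed(const struct history_entry *log, size_t n)
{
	const struct history_entry *newest = NULL;
	size_t i;

	for (i = 0; i < n; i++)
		if (!newest || log[i].stamp >= newest->stamp)
			newest = &log[i];
	return newest && newest->op == OP_REMOVE;
}

/* The checker may purge only once the removal note is older than timeout. */
static int can_purge(const struct history_entry *log, size_t n,
		     uint64_t now, uint64_t timeout)
{
	uint64_t removed_at = 0;
	size_t i;

	if (!is_removed(log, n))
		return 0;
	for (i = 0; i < n; i++)
		if (log[i].op == OP_REMOVE && log[i].stamp > removed_at)
			removed_at = log[i].stamp;
	return now >= removed_at && now - removed_at >= timeout;
}
```

A node that comes back with a stale copy carries only the old write entry, so merging its log with the one holding the removal note still yields "removed" and the object is not resurrected.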

And the last but not least discussed issue concerns storage size and related limitations. Let's say that we reached our current storage capacity and want to add several another machines, which will add 50% of the current volume. We want to spread data equally between all nodes, thus we will need to update every node's ID to shift it a little, so that new nodes entered addressing ring and formed a fair ID distribution. Amount of transaction copies in this case is quite large - more than a half of all data will have to be transferred over the network, which will take a while.
Also, when we add new empty node into the storage, it will kind of hide data it is supposed to host (according to ID distribution) until it is copied to the new node from the neighbour. Thus there should be a policy, which will forbid simultaneous update of all servers, since there is a possibility that suddenly all added nodes hide all copies of some objects. It will be recovered of course, but it will take some time, which in some cases is not appropriate.

One of the solutions for the described storage size issue is different storage policy. We can implement multiple virtual datacenters, where each new virtual datacenter corresponds to newly added set of machines. In this case we will extend write application so that it could 'touch' old hash functions (and thus old virtual datacenters) first to determine whether it can store data there and move to the new machines if there is no space in the old ones. Reading can issue a parallel lookup to all virtual datacenters asking for given object ID.

This scheme has latency limitations as well as network traffic growing with new virtual datacenters involved, but it can be a good decision for smaller setups though.

Virtual datacenters (or configurable hash/transformation functions used to generate transaction ID) becomes one of the most flexible 'tools' to implement different storage setups.

Stay tuned, there will be more news soon!

March 04, 2010 12:08 AM

March 01, 2010

Paul E. Mc Kenney: Parallel Programming: Administrators as Architects

Although IT professionals should take care to avoid engineering envy, it is often useful to learn from the experiences of other engineering disciplines. In this posting, I will compare and contrast construction of a building to implementation of a large software project.

Leaving aside financial engineering, building construction starts with an architect, who lays out the general shape and look of the building. A structural engineer creates a detailed design, with an eye towards ensuring that the building will remain standing despite the best efforts of wind, gravity, and plate tectonics. A construction engineer works out the details of the construction process — for example, it is good if the building can support itself while being built as opposed to doing so only when completed. Other engineering specialties may be required as well, for example, HVAC (heating, ventilating, and air conditioning).

Once the building is built, different skills are needed, including operating engineers, maintenance personnel, and janitors.

A very similar sequence of events can play out for a large software application. Software architects (for better or worse) lay out the general shape of the project, developers design and code it, and others ensure that it is built, tested, and safely ensconced in some source-code management system.

However, once the application is completed, it is likely that its care and feeding will be taken over by application, database, and system administrators. The architects and developers will switch to other projects (possibly version N+1 of this same application), and perhaps even retire or otherwise move on. Of course, if the application runs at multiple sites, there might well be a separate set of administrators for each site. But for simplicity, let's assume that this application runs at only one site.

Now suppose that it is necessary to parallelize this application.

This is tantamount to major structural change to the building, such as adding several new floors. A structural change of this nature is clearly not a job that you would normally entrust to operating engineers, maintenance personnel, or janitors.

But what else can you do if the original architects and developers are gone?

March 01, 2010 01:43 AM

Harald Welte: Looking for documentation on sunplus SPMA100B

In the Motorola/Compal C155 phone supported by OsmocomBB, we have found a ringtone melody chip called SPMA100B from sunplus.

As strange as it might seem, this is the only part used in the phone for which we have not been able to find any kind of programming information. So if you know anything about how to program this part from software (register map, programming manual, ...) please let me know!

And no, we don't need electrical/mechanical data sheets, thanks :)

March 01, 2010 01:00 AM

February 28, 2010

Pavel Machek: Visiting old mine

Visited an old mine today... Actually for an orienteering run. And seen some pretty impressive tech...

Mine is actually from 1890 or so, and it was running up to 1997 or so. They had some wonderful hacks -- like steam engine, still powering the elevator up to 1997, but running on compressed gas.

And because they did not use the computers to control the elevator, they had to use two-operators, and blackbox type device recording elevator speeds and communication over single rope. Speedometer used mercury. Impressive.

But... on the other hand they kept things simple. Steel pipe was used for communications 500 meters underground. Single part. In 1980, they'd probably use two analog phones and a battery, about 10 parts total. Today, we'd probably use two computers, running VOIP over ethernet, for about 1000 million parts total. Is not progress wonderful?

February 28, 2010 09:30 PM

Evgeniy Polyakov: Two days of snow

I used to hate skiing - I wasted 3 years in running ski section in university, while I could play football or, let's say, chess. Well, there was no chess section, but whatever else it could be more interesting than ski.

And this year I opened myself alpine ski. I did it about 15-20 years ago previously when was in school, and it was simple small plastic skis. Technology made a significant progress since then and I got ability to test real skis.

That's what I did this and previous weekends - two days in Stepanovo ski resort. It was essentially the first time I tried big slope (not that big compared to real resorts in Europe of course, just about a kilometer or less and 100 meters drop) and real snow. And it was fucking incredible - it is fast, it is long enough to feel the speed and ground, it is quite different - there are multiple traces and a lot of small roads from main trace, where one can ride over hummocks and small ski jumps.

I bought myself all equipment except skis itself - want to touch different things first, but I believe I will get my own next time. With the proper equipment it is not cold, warm or wet, it is just ubercool. Getting that I basically have no technique, I open lots of cases for myself all the time. And I believe that I have some progress, maybe not that good, but very pleasant for myself.

I tried long blue trace previously, but today I started a red one. And it was fucking beautiful - so fast and so strong. No boring places and long waits, just pure pleasure of speed and control. On this trace I found myself moving noticeably more technically than on a simpler trace.

I started to sit lower, put legs closer and change ski edges using mass center and not ass or legs, pipe changing arcs became shorter and with longer radius, which increased speed compared to plain skiing.

Of course it was not always perfect, and frankly I believe it looked like crap and was a real crap from good technique point of view, but it was very pleasant for me, and that's what matters. I want to get another hour or so with good teacher, who will tell me where main problems are, since I can not see how I made a slope. Sometimes I flew over the trace couple of meters and then landed in 'different positions' usually already without skis moving on my body another dozen of meters. But I like it too - it shows complex cases and sharpens instincts.

Currently I believe there are no somewhat big parts of my body, which do not try to scream and ache. Especially shin bones (hard to move or stay long enough) and various leg muscles, but it is not a problem - I will be fresh again in a day, and hundred or so of "The Glenrothes" and couple of hours playing piano and trumpet will quickly help me. So plan is to make another turn next weekend or preferably move to ski resort couple times.

Fucking incredible. Just love it!

February 28, 2010 06:43 PM

Rik van Riel: Thank you, PSNH crews

Due to the big storm Thursday night, we spent two days without power. After freezing ourselves on Friday, we decided to spend Saturday at a friend's place (thank you Aris, Chris and Sarah). While checking on our house, there were always crews at work trying to clear up the fallen trees, reopen roads and reconnect power and communications lines. A big thank you goes out to the power and telco crews who are working around the clock to clear up the mess and reconnect New England.

February 28, 2010 05:24 PM

February 27, 2010

Michael Kerrisk (manpages): man-pages-3.24 is released

I've uploaded man-pages-3.24 into the release directory (or view the online pages). The most notable changes in man-pages-3.24 are the following:

February 27, 2010 05:00 PM

February 26, 2010

Matthew Garrett: Nook update (again)

Barnes and Noble released the nook source code last week. This includes the code to busybox, uboot and their kernel. Unfortunately, the uboot and kernel code both appear to be missing swathes of code found statically linked in the binaries that they're distributing. License compliance is hard, let's flail wildly.

February 26, 2010 06:31 PM

Dave Airlie: GPU switching update

Okay I've been busy elsewhere but dragged myself back to try and finish this for upstream

v10 of the patch is up
http://people.freedesktop.org/~airlied/vgaswitcheroo/0001-vga_switcheroo-initial-implementation-v10.patch

changes are mainly that mjg59 was right about keeping ugly things in the drivers.

adding ATRM support to get the ROMs on ATI hybrid for the discrete card was actually a pain with the previous code design,
so I moved lots of it around again, and now the discrete ROM can be retrieved via the ATRM method.

I've tested it on the W500 and it works as well as before, which means still the 3rd or 4th switch fails and locks the machine up,
I need to debug this further.

The refactored code should hopefully make it easier to fill in the nvidia/nvidia and intel/nvidia blanks for mjg59.

Update 1: v11 is now up
http://people.freedesktop.org/~airlied/vgaswitcheroo/0001-vga_switcheroo-initial-implementation-v11.patch
It should fix the failure to switch to IGD the 2nd time hopefully.

Update 2: v13 is now up, it blindly implements nvidia DSM changing, but I've no idea if it works. Hopefully someone can test it and give me some feedback. It's nearly all guesswork from work mjg59 did.

February 26, 2010 05:04 AM

February 25, 2010

Pete Zaitcev: Is OpenSolaris dead?

Chris asks where OpenSolaris is headed. My reaction: nobody cares anymore. FreeBSD established itself as the alternative to Linux, and that leaves Solaris with no niche. So, whatever. It is much more important what is going to happen to OpenOffice and MySQL. Also, Sun carried a pretty large assortment of lesser projects, such as Lustre.

February 25, 2010 11:13 PM

February 24, 2010

Matthew Garrett: You know it's a bad day when:

ld gives you "Can not allocate memory".

(turned out to be a corrupt object file)

February 24, 2010 07:21 PM

Linus Torvalds: Turst me, I know what I'm doing...


I'm probably moving my office to be above the garage.

In preparation for that, I did the whole "get CAT6 networking to the new location" thing, which has involved re-acquainting myself with our crawlspace. Spending my days crawling around, hoping I'm not going to encounter any dead mice (or live ones, for that matter).

I obviously already had cable going to various locations in the house, but the way that had happened, I'd done them one at a time, and my current office ended up being the hub for it all. And since I really wasn't going to re-route all the cables and make the new office be another hub of chaos, and I certainly wasn't going to leave the hub in what will become a kids bedroom, the above is the result.

Beautiful it ain't. It's a real media center enclosure, but the networking hubs that are meant for those things are overpriced and generally just pitiful 4-port 100Mbps switches with dubious firewall capabilities, so I'm just installing my own. And some day, I'll actually add the screws that hold the boxes where they are supposed to go, rather than just sitting in a pile on top of each other at the bottom of the box.

I haven't had the energy to fix the telephone wiring. As you can see, I now have the header for getting that particular mess sorted out too, but I'm not the person who created that particular "rat king" of cabling under our house in the first place. So I'm not feeling the need quite acutely enough to spend another few hours crawling around straightening out all that wiring. Same goes for TV cabling. You can kind of tell what part of the house wiring I actually care about...

February 24, 2010 10:38 AM

February 23, 2010

Pavel Machek: Symbian

FLOSS weekly podcast has interview with someone from Symbian foundation. Interesting point is, that even Symbian people acknowledge Android as good, but will try to attack it from below, by using less power and running on smaller device. They even have a blog.

What they do not have is working system on real hardware... which is quite interesting. They claim to be using qemu and beagleboard, citing lack of drivers and claiming no open devices exist. I guess someone should show them OpenMoko or HTC Dream (ADP1). Plus they do have their own c++ dialect, with proprietary compiler and

Ouch, and what they do have is design by committee. Actually design by 4 committees :-(.

Anyway, it is great to see more opensource competition in cellphones; and I hope it does not mean death of Maemo platform.

February 23, 2010 09:20 PM

Pavel Machek: do androids dream of electric sheep?

Ok, so I got paper version of Blade Runner... and I enjoyed it, even though I expected a bit more.

But now... my android seems to be downloading electric sheep at 100MB/night rate. And yes, it continues during the day, too, and would probably do more if I had better connection than GPRS.

No, rebooting the phone did not help. According to 'spare parts', component responsible for the traffic is 'media'... which is alias for 'download manager' and pretty much opaque. So I tried plain old tcpdump, to find that it is talking to 1e100.net; I have custom rom but it was still trying to download updates.

Solution is "easy": disable background data. Unfortunately, it also disables market and gtalk. Is there better solution?

Oh and it is now clear. Androids do not dream of electric sheep, they dream of digital donuts.

February 23, 2010 05:16 PM

February 22, 2010

Pete Zaitcev: F12, BIND, and stable releases

Ran "yum update" today on F12 and the rewrite of BIND configuration produced a fail-to-start again. Only instead of a blatant syntax error with unbalanced braces like when DNSSEC was first enabled, they merely referred to a non-existing file (/etc/pki/dnssec-keys//named.dnssec.keys). BTW, I looked everywhere, it's not a part of any package we ship in Fedora. What a facepalm, in the middle of stable release too. You know, the anti-Rawhide people always bring it up how Rawhide is "not guaranteed" to work. Well, is F12 "guaranteed"?

For about four recent releases it became noticeable that Fedora folks put a lot of effort into the QA and polish, but once release is out of the door, controls are relaxed and all sorts of dubious code flows freely in the guise of "security" updates. The S-word is some kind of a magic key that trumps any basic quality. The net result is going to be people installing releases and then never updating, once they catch up on what's happening. What's worse, once this folk wisdom gets established, it cannot be easily reversed even if updates become quality checked.

February 22, 2010 08:41 PM

David Woodhouse: 22 Feb 2010

My God, I've been vaguely aware of the HTML5 video train wreck but I hadn't realised just how much of a fucking abortion the rest of the HTML5 'standard' is.

I had the misfortune to read the section on character encodings over the weekend, and it almost made me lose my lunch.

Not only does it codify the crappy and unreliable practice of applying heuristics to guess character encodings, it also requires that a user agent deliberately ignore the explicitly specified character set in some cases — for example, text explicitly labelled as US-ASCII or ISO8859-1 MUST be rendered as if it were Windows-1252!

It justifies this idiocy, which it admits is a 'willful violation', on the basis that it aids compatibility with legacy content. By which of course it means "broken content", since this was never actually necessary for anyone who published content correctly even with older versions of HTML.

But that doesn't make any sense — surely legacy content won't be identifying itself as HTML5? It might be reasonable to do these stupid things for legacy content, but not HTML5. The complete mess we have with charset labelling is a prime example of where the RFC1122 §1.2.2 approach of being lenient in what you accept has turned out to be massively counter-productive — if we'd simply refused to make stupid guesses about character sets in the first place, then people would have actually started getting the labelling right.

The sensible approach to take with HTML5 would just have been to say "All content which identifies itself as HTML5 MUST be in the UTF-8 character encoding. A conforming user agent MUST NOT attempt to interpret content as if it has any other encoding; any invalid UTF-8 byte sequences MUST be shown using the Unicode replacement character U+FFFD (�) or equivalent."

Or, if we really must continue to permit the legacy crap 8-bit character sets, it should have said that the content MUST be in the character set specified in the HTTP Content-Type: header or equivalent <META> tag.

Keep the stupid heuristics for legacy content by all means, but it should be forbidden to render HTML5 content in a character set other than the one it is labelled with, and all invalid characters (including the C1 control characters in ISO8859-1 which in Windows-1252 would map to extra printable characters like the Euro sign) MUST be shown as U+FFFD (�). And then the people who publish broken crap would see that they're publishing broken crap, rather than thinking it's OK because the browser they use just happens to assume the same character set as the system they're publishing from.
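To make the difference concrete, here is a small Python sketch. It is not taken from the HTML5 text: the sample bytes and the helper names are made up for illustration, and it only covers the ISO8859-1 case discussed above.

```python
# Illustrative only: the sample bytes and function names are assumptions.
raw = b"price: \x80 5"  # 0x80 is a C1 control in ISO8859-1, the Euro sign in Windows-1252

as_labelled = raw.decode("iso-8859-1")  # honour the declared charset
as_guessed = raw.decode("cp1252")       # the "willful violation": treat it as Windows-1252


def strict_utf8(data: bytes) -> str:
    """Decode as UTF-8, showing every invalid byte sequence as U+FFFD."""
    return data.decode("utf-8", errors="replace")


def render_as_labelled_latin1(data: bytes) -> str:
    """Decode as ISO8859-1 and show C1 controls (0x80-0x9F) as U+FFFD."""
    text = data.decode("iso-8859-1")
    return "".join("\ufffd" if 0x80 <= ord(c) <= 0x9F else c for c in text)
```

With the last helper, a publisher who labels Windows-1252 content as ISO8859-1 would see a replacement character instead of a Euro sign, which is the whole point.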

To me, HTML5 looks less like a standard and more like a set of broken hackish kludges to work around the fact that people out there aren't actually capable of following a standard.

February 22, 2010 12:31 PM

February 21, 2010

Pavel Machek: armored vehicle from Nokia

I knew 6230 is a good phone, and yes, it seems to come back. I lost it in a bus twice already (and good people returned it both times), lost it from bicycle and a horse back...

I went to the mountains, and estimated the trip from bus to Petraska at 4 hours (arriving at cca 23:30). But I selected shorter way over ski slope and made it under two... only to realize that I lost 6230 somewhere.

I was told I had no chance to find it; but in nice, quiet night ringing and blinking phone is rather easy to find so I disagreed, and went back for a rescue -- 6230 still had signal and was ringing somewhere in the mountains.

But I was pretty surprised when I found the 6230 -- it was 5 centimeters under the snow, getting direct hit from snow gun for about 2 hours... I only found it because of light. Battery was low, but phone is alive and continues to work.

To whoever designed 6230: thanks!

February 21, 2010 07:26 AM

February 20, 2010

Rusty Russell: Rusty’s Travels

Headed through Germany 26th through 3rd March or so, then Lithuania via Poland.  Back via Singapore on 24/25 March.

My email will be intermittent (I hope!) but if you’re around and want to grab a meal or a beer with us, ping me!

February 20, 2010 07:02 AM

Harald Welte: Restructuring OpenBSC and OsmocomBB code

I've spent the better part of the day with , renaming files/functions/include paths, Makefiles, autotools and the like.

The result of this is a new sub-project called libosmocore that gathers all the shared code between the network-side GSM implementation OpenBSC and the phone-side implementation OsmocomBB. The library is portable enough that it can run on a proper OS (like GNU/Linux) but also be cross-compiled to work on the actual phone without any OS.

On the other hand we now have a master Makefile in OsmocomBB to build libosmocore for host PC and target (phone), as well as the osmocon and layer2 host programs and the phone firmware itself.

Let's hope I can now return to writing actual code...

February 20, 2010 01:00 AM

February 19, 2010

Matthew Garrett: Pittsburgh

As I mentioned, I headed to Pittsburgh last week to give some talks at CMU and find out something about what they're doing there. Despite the dire weather that had closed the airport the day before, I had no trouble getting into town and was soon safely in a hotel room with a heater that seemed oddly enthusiastic about blasting cold air at me for ten seconds every fifteen minutes. Unfortunately, it seems that life wasn't as easy for everyone - ten minutes after I arrived, I got a phone call telling me that the city had asked CMU to cancel classes the next day.

This turned out to be much less of a problem than I'd expected - whether because of their enthusiasm to learn about ACPI or because they simply hadn't noticed the alert telling them about the cancellation, a decent body of students turned up the next morning. After a brief chat with Mark Stehlik, the assistant dean for undergraduate education, I headed off to the lecture hall. The fact that I can now just plug my laptop into a VGA cable and have my desktop automatically extend itself continues to amaze me, as does OpenOffice's seemingly unerring ability to get confused about which screen should have my content and which should be showing me the next slide. Nevertheless, facts were imparted and knowledge dropped on those assembled. I'm even reasonably sure that the contents were factually accurate, which is a shame because the most attractive part of teaching always struck me as being able to lie to students who will then happily regurgitate whatever you tell them in case it turns up on the exam. Perhaps this is why I'm safer out of academia.

Lunch offered an opportunity to visit the Red Hat sponsored lab, which was pleasingly located somewhere other than a basement. The guy on the right of the picture is Greg Kesden, the director of undergraduate laboratories in CS there - it was wonderful to get an opportunity to see the machines getting used, and students seemed genuinely appreciative of the facility.

After lunch I spent a while talking to Satya about the Internet Suspend and Resume project. This is an impressive combination of virtualisation and migration, using a Fedora-based live image to bring up an OS on arbitrary hardware before downloading a machine image and launching it. The majority of the data is pulled in on demand, meaning that initial performance can be slow but ensuring that data is only downloaded if it's needed. When the user is finished, the delta between the original image and the new one can be pushed back to the server while remaining cached on the local machine in case the image is used again.

It's an interesting approach, combining the flexibility of thin clients with the advantages of having actually useful computing power at the local end. There's a few functional awkwardnesses, such as some VMs being unhappy if images are migrated between machines with different CPU features, and it obviously benefits from having significant bandwidth. But the idea of being able to combine the convenience of a floating session with the knowledge that you can still keep copies of your data on you is an attractive one, and I'd love a future where I can move my session between my laptop and a desktop.

After that there was some time to talk to Bill Scherlis and Philip Lehman about the software engineering courses that CMU run. Part of the minor in software engineering includes a course requirement to make a meaningful contribution to an existing software project, from design through to submission and upstream acceptance. I had the opportunity to talk to a couple of the students about this and the differences they found between working with the Mozilla and Chrome communities, which I'll try to write up at some point.

Finally I gave a presentation on Fedora and some of the issues that we face in providing a useful OS when patents and recalcitrant hardware vendors do their best to thwart us. Despite the ice outside and the significantly-below-freezing temperatures, enough people turned up that sorties had to be sent out to find extra chairs. It was great to see how interested people were in learning about what we do, although it's probably the case that the free pizza did help encourage people.

After that it was an early trip back to the airport, where I found that my plane was delayed and the only "restaurant" still open was McDonalds. Even so, I left with the feeling that it had been an interesting and educational visit. Many thanks to David Eckhardt, who runs the OS course I presented to and who looked after me all day - thanks too to Joshua Wise who picked me up when David was running late due to the ground being covered with blocks of ice.

February 19, 2010 09:35 PM

Linus Torvalds: Demons? Really?

So I was in Costco waiting for a car tire rotation and check yesterday. Wasting time, I blew three bucks on a slice of pizza and a sundae, and looked around for a place to sit down and pig out. The place was packed, and it was the middle of the day.

So I sat down next to this group of people, and realized that one reason it was busy was that apparently people use the Costco foodcourt as a lunch place. Fair enough. A couple of bucks gets you a long way there.

Sitting there, I can't but help overhear that it's apparently some religious discussion going on. Ok, so it's the local God Squad having their lunch meeting, no biggie. They're apparently talking about Africa, and about life and death decisions etc - at least one of them is a missionary.

And that's when it gets strange. One of them starts to seriously talk about praying demons away, and then after the prayer has driven the demon out of the person, you have to support the person so that the demon doesn't come back. And nobody laughs at him.

Seriously? What year is it again? I'm pretty sure they didn't have Costco foodcourts in the middle ages, but maybe there was some time warping going on.

What the hell is wrong with people?

February 19, 2010 11:51 AM

Harald Welte: Announcing OsmocomBB: Free Software / Open Source GSM Baseband firmware

Last, but not least, I am proud to announce the OsmocomBB project publicly. During the last 7 weeks, a small group of skilled developers has been working on this

It has now reached a point where we can

Since this in itself is a valuable and useful milestone of the project, it was the ideal opportunity to take this project public.

There's still a lot of work to be done in many areas. Most of them are not even related to the GSM air interface. So if you're familiar with C development on an ARM7TDMI based microcontroller, know your way around I2C and SPI, are familiar with the GNU toolchain for ARM and want to help us out: Please join the baseband-devel mailing list right away!

February 19, 2010 01:00 AM

February 18, 2010

Evgeniy Polyakov: Elliptics network background fsck

Its original draft could be read previously, but I believe it became a little bit outdated, so requires some highlighting.

But first, let's clear up the status of the fsck log checker. I completed its implementation, which is now capable of supporting a consistent number of copies in the storage. It does not yet allow merging different transaction logs.

To determine object to check it uses special text log file, which among other info contains name of the object and transformation functions to work with. Each transformation function will produce unique ID, which will be checked in the storage. For example we can put there sha1 and md5 transformation functions, so we will have two IDs equal to appropriate hash of the input name (and optionally hash of the transactions content).

When some objects are not present in the storage, checker will download first existing copy and try to upload it using transformation functions corresponding to missing objects. So, if object with ID being equal to md5(name) is present and sha1(name) isn't, then checker will download all transactions stored in the existing object and upload them using sha1 transformation, thus recovering requested number of copies.
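The recovery step can be sketched in a few lines of Python. This is not the elliptics code: the storage is modelled as a plain dict from object ID to a list of transactions, and the function names are invented for the example.

```python
import hashlib


def object_ids(name, transforms=("sha1", "md5")):
    """One ID per transformation function, each a hash of the object name."""
    return {t: hashlib.new(t, name.encode()).hexdigest() for t in transforms}


def check_and_recover(storage, name, transforms=("sha1", "md5")):
    """Re-upload missing copies from the first existing one.

    `storage` is a hypothetical stand-in for the network: a dict mapping
    object ID -> list of transactions. Returns the names of the
    transformation functions whose copies had to be recreated.
    """
    ids = object_ids(name, transforms)
    present = [t for t in transforms if ids[t] in storage]
    if not present:
        return []  # no surviving copy to recover from
    source = storage[ids[present[0]]]
    recovered = []
    for t in transforms:
        if ids[t] not in storage:
            storage[ids[t]] = list(source)  # "upload" under the missing ID
            recovered.append(t)
    return recovered
```

If only the md5(name) copy exists, check_and_recover() copies its transactions under the sha1(name) ID and reports "sha1" as recovered; a second run finds nothing to do.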

Checker currently requires log file to get information from and admin to start the process.
Background fsck is supposed to eliminate both needs.

Basic idea is to store some metadata with each object, which will tell origin of the given object and how it was supposed to be stored in the elliptics network. Thus we can timely or on request parse metadata for all objects in the given node (or only part of them), create a log file and run existing checker against it.

It becomes similar to what extended attributes are in the existing filesystems. Metadata can contain information not only about what object is, but also its IO permissions or access policies, owner information and anything else we would like to have there, which will allow to implement at least basic security model for elliptics network as well as simplify POHMELFS port.

February 18, 2010 08:32 PM

Jaya Kumar: Theft of Xorg funds by Paypal

I just read on the xorg mailing list that Paypal stole USD$5k from xorg and another 5k to some Brazilian bankers. I have only used Paypal once and they gave me a USD conversion rate which was half that of legitimate banks and it was all a big drama and felt really unfair. So I really hope that Xorg is able to recover those funds. Maybe one of yous fellas is a lawyer and can help Xorg?

February 18, 2010 04:08 PM

February 17, 2010

Matthew Garrett: Gobi 2000

Anssi Hannula posted a patch to add Gobi 2000 support to qcserial and provided me with support for gobi_loader. I've added the gobi_loader code here. You'll need Anssi's kernel patch from here, and probably also my followup patch with extra IDs from here. Note that the 2000 devices need an extra firmware file (UQCN.mbn) as well as the apps.mbn and amss.mbn files.

The qcserial driver is currently broken in 2.6.32 and later. It's due to the switch to using kfifo for usb serial, but we haven't been able to work out the actual cause. I'm looking at alternative approaches.

February 17, 2010 09:56 PM

Kernel Podcast: 2010/02/14 Linux Kernel Podcast

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20100214.mp3

This podcast is brought to you by the colour blue and way too much coffee, together reminding you to check out the awesome power of the BeagleBoard Open Source hardware project at http://www.beagleboard.org/. My new Rev C. board was responsible for the delay getting this issue out…too much fun was had.

For the weekend of February 14th, 2010, I’m Jon Masters with a summary of the week’s LKML traffic.

In this issue: Linux 2.6.33-rc8, x86 bootmem, NFS, OOM, Performance Counters, Relaxation, Stack Sizes, and SysFS mutability.

Linux 2.6.33-rc8. Linus Torvalds announced the release of version 2.6.33-rc8 on Friday February 12th 2010 at 11:49 am Best Coast Time (PST), saying that he hoped it would be the last before 2.6.33 final. He added that, “A number of regressions should be fixed, and while the regression list doesn’t make me _happy_, we didn’t have the kind of nasty things that went on before -rc7 and made me worried”. This kernel includes fixes for the netfilter bugs that I discovered, as well as some KMS regression fixes. In a separate discussion thread started by John Hawley (warthog9), it was debated when kernel.org should move over to using xz (LZMA2) as a replacement for bzip2 compression (remember when bzip2 was trendy and new?). John proposed various migration options before the thread veered off into a discussion around when an eventual 3.0 Linux kernel would come, and what that would actually mean in practical terms – just an arbitrary future release? I expect that LWN will have a typically witty writeup of this discussion sometime this week.

Bootmem. Back in October last year, Ingo Molnar had stated that the kernel may not need the “bootmem” allocator on x86. At the time, he noted that there were 5 different allocators on x86, depending upon the boot stage (to say nothing of the other core allocator options): the generic allocator, the early allocator (bootmem), the very early allocator (reserve_early), the very very early allocator (early brk model), and the very very very early allocator (basically just build time allocation). By initializing the x86 page allocator earlier in the boot process, Yinghai Lu attempts to do just what Ingo had suggested, now in version 6 of his patchset.

NFS. Hirofumi Ogawa noticed (2.6.33-rc6) that recent kernels could not mount remote NFS version 3 shares, because of a userspace visible change in the kernel nfsd server. If he specified “vers=3” at mount time, all was well, but the kernel was not falling back to v3 correctly when v4 fails due to a change in error handling. Bruce Fields noted that this change was actually intentional and that the userspace tools had been updated, but decided to revert the patch that caused this change for the time being – at least until the new versions of the mount tools are much more widespread than right now. Bruce sent a patch entitled (“informingly”) “2.6.33 fix” to Linus.

OOM. David Rientjes posted a patchset re-implementing the OOM killer, in the wake of a number of discussions concerning its brokenness. It includes a complete rewrite of the badness() heuristic, which he then describes in some detail within the corresponding patch. Quoting David, “The baseline for the heuristic is a proportion of memory that each task is currently using in memory plus swap compared to the amount of ‘allowable’ memory. ‘Allowable,’ in this sense, means the system-wide resources for unconstrained oom conditions, the set of mempolicy nodes, the mems attached to current’s cpuset, or a memory controller’s limit. The proportion is given on a scale of 0 (never kill) to 1000 (always kill), roughly meaning that if a task has a badness() score of 500 that the task consumes approximately 50% of allowable memory resident in RAM or in swap space.”
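As a rough illustration of that scale, the quoted baseline boils down to something like the following Python sketch. It is not David’s patch: the function signature is an assumption, and any further adjustments the real heuristic applies are left out.

```python
def badness(rss_pages, swap_pages, allowable_pages):
    """Share of allowable memory a task uses, on a 0..1000 scale.

    Sketch of the quoted baseline only; the parameter names are
    assumptions and no other adjustments are modelled.
    """
    if allowable_pages <= 0:
        return 0
    score = (rss_pages + swap_pages) * 1000 // allowable_pages
    return max(0, min(1000, score))
```

A task holding 400 pages in RAM and 100 in swap out of 1000 allowable pages scores 500, matching the “approximately 50%” reading in the quote.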

Performance counters. Christoph Hellwig had complained that a patch had been merged back in September from Arjan van de Ven entitled “perf_core: provide a kernel-internal interface to get to performance counters”. That was intended to facilitate in-kernel use of the performance counters framework, but it was Christoph’s opinion that it had no users and should be reverted. Ingo Molnar countered that there actually were a growing number of users, now including the latest work by Don Zickus to create a generalized NMI watchdog handler.

Relax. Michael Breuer posted an interesting analysis of the implementation of the function cpu_relax on x86 systems. This function is called during spinlock spinning cycles in order to give the CPU a break (power management, etc.). Apparently, that function currently uses a nop, but both the Intel and AMD documentation recommend the PAUSE instruction instead (partly because it can be detected on recent CPUs and used to give special treatment to guest instances running under virtualization that are wasting CPU cycles when multiple vcpus are allocated and some are spinning away). Arjan van de Ven, and others too, seemed to find this odd, and Artur Skawina wondered if this might be an odd alignment issue. Nonetheless, Michael detects a noticeable performance impact in various tests between these two instructions.

Stack sizes. The kernel contains various task startup code that will create a vma region for its stack use. Existing kernels make this size determination based upon the PAGE_SIZE for the architecture, even though this really is independent of the userspace code that will use the stack, and even given existing rlimits that might see the stack theoretically larger than has been allowed by system limits. Michael Neuling sent a patch to decouple stack sizing from PAGE_SIZE and to default to basing it upon the rlimit.

SysFS. Amerigo Wang posted an RFC patch implementing “mutable sysfs files”. The basic idea is that all potentially “mutable” (that is to say, files that may be yanked out from underneath at any time a hotplug or other operation occurs) files should use a specific API to avoid warnings.

In today’s miscellaneous items: An interesting discussion started by Salman Qazi (Google) centered around a misunderstanding of the ptrace API (and eventual iteration from Oleg Nesterov that the existing API sucks), a January XFS update from Christoph Hellwig (noting new support for netlink provided quota communication, better power saving in XFS kernel threads), Mel Gorman posted version 2 (v2r12) of his “Memory Compaction” patch series that is intended to “defragment” memory by reconciling GFP_MOVABLE pages, and another one of Al Viro’s entertaining rants, this time about pohmelfs and its use of direct access to the current->fs->{root,mnt} entries.

In today’s announcements:

Git version 1.6.6.2. Junio C Hamano announced an update to the 1.6.6 series of the Git SCM tool, releasing version 1.6.6.2. This contains a few fixes.

Git version 1.7.0. Junio C Hamano also announced version 1.7.0 of the Git SCM had been released. This is the latest official version and includes a number of behavioral changes to “git push”, “git send-email”, and other commands as previously noted in this podcast. Users should read the release notes before upgrading if they want to make sure they catch all of the improvements.

Linux 2.6.32.8. Greg Kroah-Hartman, apologizing for the slight delay due to a few crashes that had been reported and a need to verify a security fix, as well as various travel plans, announced the release of 2.6.32.8. It contains a few fixes 2.6.32 users really should have on their systems.

The Linux Storage and Filesystems Summit. James Bottomley announced that the annual Linux Storage and Filesystems summit will take place concurrently with the VM summit on the two days before LinuxCon in Boston (Sunday and Monday), on the 8th and 9th of August. Interested parties can visit either the Linux Foundation website, or email agenda topics to the program committee at lsf10-pc@lists.linuxfoundation.org.

Userspace RCU 0.4.1. Mathieu Desnoyers announced the latest release of his Userspace RCU implementation (remember, patent encumbered, but with a waiver for GPL projects). Version 0.4.1 contains a compilation fix for s390.

As a followup to last weekend’s kerneloops statistics, Arjan van de Ven also posted statistics purely for the 2.6.33 at that time. In his statistics, he showed that the most popular oops was in memcpy_toiovecend (found 391 times).

The latest kernel release is 2.6.33-rc8.

Andrew Morton announced an mm-of-the-moment mmotm for 2010-02-11-21-15.

Don’t forget to read my latest blog posting on jonmasters.org for more information on using the Cyclades TS-3000 with kgdb for remote target debugging, and don’t forget to support Jason Wessel’s proposed kgdb and kdb merge for 2.6.34. You know it makes sense to get this out there widely.

That’s a summary of the week’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

February 17, 2010 01:35 PM

February 16, 2010

Valerie Aurora: Brief union mounts update


For those of you wondering, I’m still working on union mounts, just heads-down on a major rewrite to fix the hairiest problems. Right now I’m perhaps 90% of the way through rewriting the actual lookup code, the dense nutty core of union mounts. This will fix one of the most difficult problems with the current code, massive code duplication between cached, real, and restricted real (“hash”) lookups:

http://lkml.indiana.edu/hypermail/linux/kernel/0910.2/01572.html

I rewrote this to be one function with a loop centered around __lookup_hash() and it’s looking pretty good.

This rewrite is one of the hardest coding problems I’ve ever worked on, and I have a lot of respect for the original union mount authors, Jan Blunck, Bharata Rao, Miklos Szeredi, David Woodhouse, and everyone else who has ever worked on a unioning file system. Not to mention the regular VFS authors – the cost of pathname lookup is one of the most crucial elements of operating system performance and it takes a lot of work to make it go fast.

February 16, 2010 09:04 PM

Jaya Kumar: Lack of used goods markets

I'm in KL now struggling to find cheap used electronics. I needed a bunch of microsd and SD cards, didn't matter what size or make, used or new, I just wanted them really cheap (RM5 or less). So I posted on the fleamarket forums, Mudah and Cari but few sellers. It seems like most people here trash the stuff rather than selling it on. I wanted a handycam that can write either mp4 or wmv to SD or microSD for filming demos, so condition didn't matter much and image quality could be average, and was willing to put down around RM150 but no sellers for that either. Where does all the old electronics go? Into the trash? Melted down? What a pity for cheapskates like me. Even stuff like perspex boards only gets sold new which means it is all expensive. Where does all the used stuff go?

February 16, 2010 04:50 PM

Pete Zaitcev: Keyboard

My PS/2 adapter seems to have died at last... Or, actually, it still works, but it takes several reboots to have it "grab" and start working. It was getting worse gradually, perhaps a capacitor is dying somewhere or whatnot. So, I hooked up a Belkin keyboard that I obtained many years ago for some kind of USB testing, and what do you know: I worked with computers for 27 years now and this is probably the second or third worst keyboard that I ever touched (the so-called "Cuban Videoton" or "CID" - the terminal made in the Island of Cuba - was the worst, and it had a couple of good competitors, one of which hailed from Yerevan, Armenia). The problem is subtle: keys of the Belkin Scorpius 980 Plus have a random friction in them. To write it in a blog, it sounds like a ridiculously petty complaint, but it's real. Typing anything correctly is a pain, and I have to program in C on it, goddamit.

I was thinking about killing two birds with one stone by getting one with a built-in touchpad in the laptop position. My trusty old ALPS touchpad is great and all, but it developed a peculiar problem: its feet became hard with age and it slides. The common Adesco keyboards get mixed reviews and the listed sizes are contradictory or not credible. Amazon has one SolidTek type that seems like the right size and design. One problem though: $40 price. Isn't it a bit high for what seems like a rather dubious quality? It's not like I am on welfare, it's just... not an Apple or Daimler-Benz product to command a price like that.

So, yeah.

UPDATE: Peter Zijlstra pointed out the Lenovo UltraNav, which is definitely a quality unit, but it has all the (mis)features of a ThinkPad: left Ctrl and Fn are swapped, buttons that go along with the nipple offset the touchpad down, Esc is way far up. I already have a T400 and I hate all of that. Otherwise, it's perfect.

UPDATE 2010/03/01: After some consideration, I went with the IBM keyboard because of (a) quality and (b) 100% key pitch.

True, it has all the disadvantages of the Thinkpad layout, but at least to type on it is not painful. BTW, no Microsoft button.

Oh, and the ALPS touchpad is finally retired after 13 years of service without reproach. It probably is the oldest computer peripheral in the house by far, because usually I recycle ruthlessly.

February 16, 2010 06:11 AM

Rusty Russell: Followup: lrzip

Mikael noted in my previous post that Con Kolivas’s lrzip is another interesting compressor.  In fact, Con has already done a simple 64-bit enhancement of rzip for lrzip, and on our example file it gets 56M vs 55M for xz (lrzip in raw mode, followed by xz, gives 100k worse than just using lrzip: lrzip already uses lzma).

Assuming no bugs in rzip, the takeaway here is simple: rzip should not attempt to find matches within the range that the backend compressor covers (900k for bzip2 in rzip, 32k for gzip, megabytes for LZMA as used by lrzip).  The backend compressor will do a better job (as shown by similar results with lrzip when I increase the hash array size so it finds more matches: the resulting file is larger).

The rzip algorithm is good at finding matches over huge distances, and that is what it should stick to.  Huge here == size of file (rzip does not stream, for this reason).  And this implies only worrying about large matches over huge distances (the current 32 byte minimum is probably too small).  The current version of rzip uses an mmap window so it never has to seek, but this window is artificially limited to 900MB (or 60% of mem in lrzip).   If we carefully limit the number of comparisons with previous parts of the file, we may be able to reduce them to the point where we don’t confuse the readahead algorithms and thus get nice performance (fadvise may help here too) whether we are mmaped or seeking.

I like the idea that rzip should scale with the size of the file being compressed, not make assumptions about today’s memory sizes.  Though some kind of thrash detection using mincore might be necessary to avoid killing our dumb mm systems :(

February 16, 2010 01:21 AM

February 15, 2010

Pete Zaitcev: Report from the proprietary cesspool

I mostly read the article about Coverity's experience in the trenches as something I would read at The Daily WTF. Which I don't read, let alone daily: it's too far removed from my world. Still, some of that may come handy one day. Like this:

How to handle cluelessness. You cannot often argue with people who are sufficiently confused about technical matters; they think you are the one who doesn't get it. They also tend to get emotional. Arguing reliably kills sales. What to do? One trick is to try to organize a large meeting so their peers do the work for you. The more people in the room, the more likely there is someone very smart and respected and cares (about bugs and about the given code), can diagnose an error (to counter arguments it's a false positive), has been burned by a similar error, loses his/her bonus for errors, or is in another group (another potential sale).

But other than that, bah humbug. My universe is gcc (or maybe LLVM at the most). The heroic tales of fighting people who write C in StudlyCaps mean nothing to me. The only real import of the article is how Sparse needs more attention. If nothing else, Free Software developers need to counter-patent everything in Sparse so that when Coverity comes for us, we'll be ready.

February 15, 2010 07:53 PM

Rusty Russell: xz vs rzip

As the kernel archive debates replacing .bz2 files with .xz, I took a brief glance at xz. My test was to take a tarball of the linux kernel source (made from a recent git tree, but excluding the .git directory):

     linux.2.6.tar 395M

For a comparison, bzip2 -9, rzip -9 (which uses bzip2 after finding distant matches), and xz:

     linux.2.6.tar.bz2 67M
     linux.2.6.tar.rz 65M
     linux.2.6.tar.xz 55M

So, I hacked rzip with a -R option to output non-bzip’d blocks:

     linux.2.6.tar.rawrz 269M

Xz on this file simulates what would happen if rzip used xz instead of libbz2:

     linux.2.6.tar.rawrz.xz 57M

Hmm, it makes xz worse!  OK, what if we rev up the conservative rzip to use 1G of memory rather than 128M max?  And then xz that?

     linux.2.6.tar.rawrz 220M
     linux.2.6.tar.rawrz.xz 58M

It actually gets worse as rzip does more work, implying xz is finding quite long-distance matches (bzip2 won’t find matches over more than 900k).  So, rzip could only have benefit over xz on really huge files: but note that current rzip is limited on filesize to 4G so it’s a pretty small useful window.

February 15, 2010 07:56 AM

February 14, 2010

Evgeniy Polyakov: Elliptics network: 2.6.4 release

It took a while to prepare a new release of the distributed hash table storage elliptics network, but here we go. This is still a minor version bump, although the number of changes is rather large for a small update.

Likely this will be the last release in the 2.6 release cycle, since in parallel we are cooking up a completely new versioning and merge logic as well as data synchronization. Btw, this release breaks that logic to some degree, but there is a tool to fix things up. It will be automated in the next versions.

But let's dig into details and changelog:

Modulo possible bugs, the main work is concentrated on the filesystem checker. There are two problems to solve.
The first one is the absence of a transaction log made by the requested transformation function, or in plain words, the absence of a copy of the object in the storage. This happens when some node went offline and returned empty or was replaced. Or did not return at all. In this case the fsck application will check how many copies are present in the storage, automatically download one of them (the first one from the config) and upload it with the given ID.

The second issue to resolve is transaction merge. By default, elliptics network uses transactions for every update, so there is no object as such in the storage; instead, the reader downloads the transaction log, parses it and selects the transactions which cover the requested object range. This is hidden in the API of course, but it is possible to manually select the needed transactions, for example to support versioning and data snapshots. As a tasty side effect, two fully equal transactions (objects) will not use twice the space, since there are appropriate transaction reference counters.

Currently there are multiple (5) merge strategies, but practice shows that they do more harm than good, or cause misunderstanding at best. So I decided to drop them all in favour of a trivial timestamp-based merge algorithm. Of course it is possible to merge transactions based on a private algorithm, which can be called from the fsck daemon. We have a request to allow external modules to merge objects based on actual data.

This version disables content synchronization during node joining. Instead, the admin has to call the fsck application with an externally stored log of the uploaded data to check whether things are ok and fix up what was broken. This will be automated, and no external log will be required, in the next versions.

Fsck application log file should look like this:

3 0,0,0 sha1,md5 object_name

where '3' is the object creation flags: without transactions, just like those created by the FSCK frontend. This field will be removed in the next version.
'0,0,0' is a placeholder for object parsing information, meaning start,end,update_existing. Start and end are the positions of the starting and ending symbols in the object_name used to generate the ID. Zeroes mean automatic detection. Update_existing is not currently supported; in the next version, if set, it will upload the local file named object_name into the storage no matter whether its copies are already present.
sha1,md5 - transformation functions used to generate the ID from object_name. This setup uses two copies, each one created by the appropriate hash.
object_name - name of the uploaded object. Its hash (or actually the transformation of the name using the listed functions; it is allowed to be some function other than a plain hash) will be the object ID.
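
To make the line format concrete, here is a minimal C sketch of how such a line could be split into its four fields. It is not code from elliptics: the struct and function names (fsck_log_entry, parse_fsck_log_line) and the fixed buffer sizes are assumptions made purely for illustration, and it assumes object names contain no whitespace.

```c
#include <stdio.h>

/* Hypothetical container for one fsck log line; not an elliptics type. */
struct fsck_log_entry {
	int flags;             /* object creation flags, e.g. 3 */
	int start, end;        /* name range used to generate the ID; 0,0 = auto */
	int update_existing;   /* parsed here, but described as not yet supported */
	char transforms[64];   /* comma-separated list, e.g. "sha1,md5" */
	char name[256];        /* object name (assumed to contain no spaces) */
};

/* Returns 0 on success, -1 if the line does not have the expected shape. */
int parse_fsck_log_line(const char *line, struct fsck_log_entry *e)
{
	int n = sscanf(line, "%d %d,%d,%d %63s %255s",
		       &e->flags, &e->start, &e->end, &e->update_existing,
		       e->transforms, e->name);

	return n == 6 ? 0 : -1;
}
```

Feeding it the example line above should yield flags 3, a 0,0,0 parsing triple, the transformation list "sha1,md5" and the name "object_name". Splitting the transformation list on commas is left out to keep the sketch short.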

Stay tuned, work is boiling and results are very close!

February 14, 2010 05:38 PM

February 13, 2010

Valerie Aurora: Sleeping with the enemy


Jonathan Schwartz’s resignation via Twitter reminded me of a strange facet of Sun company culture: I’ve never known so many married couples working for the same company. Some of them even worked on the same project together. For the same boss. From home.

Now, the exact percentage of married couples in a company can’t be used to compare companies directly – after all, it depends heavily on things like industry, age, and local marriage laws – but it seems linked to another facet of Sun company culture: Complete, almost embarrassing disconnect from public opinion.

The post-Google standard company perks – free food, on-site exercise classes, company shuttles – make it trivial to speak only to fellow employees in daily life. If you spend all day with your co-workers, socialize only with your co-workers, and then come home and eat dinner with – you guessed it – your co-worker, you might go several years without hearing the words, “Run Solaris on my desktop? Are you f—ing kidding me?”

Schwartz’s “the financial crisis did it” explanation for Sun’s demise is a symptom of an inbred company culture in which employees at all levels voluntarily isolated themselves from the larger Silicon Valley culture. Tech journalists write incessantly about the exchange of expertise and best practice between companies as a major driver of the Bay area’s success. But you have to actually talk to your competition to do that – over a beer, or maybe a pillow.

February 13, 2010 03:22 AM

Harald Welte: In six weeks from bare hardware to receiving BCCHs

After six weeks of full-time hacking, with the help of a few friends, we have made it to receiving actual BCCH data from a GSM cell.

So what does this mean? As I have indicated publicly at the 26C3 conference: Now that we have managed to create a working GSM network-side implementation (OpenBSC) during the last year, we will proceed to do the same with the phone side.

Initially we spent quite a bit of time thinking about building our own custom hardware. But while planning for the first prototype, we realized that it would simply distract us too much from what we actually wanted to do. We don't want to take care of component sourcing, prototype generations, quality assurance in production, production testing, etc. -- All we want is to write a Free Software GSM protocol implementation for a phone.

Unfortunately (as usual in the industry), the silicon and device makers do not publish sufficient documentation about their devices to enable third-party developers to go ahead and write their own software: The never-ending problem of Free Software in many areas beyond more-or-less standardized hardware like in the PC industry.

So, if you want to write Free Software for such a device, you have two options:

  1. Reverse engineer an existing device, or
  2. Build your own hardware.

I've been involved in both approaches multiple times while looking only at the application processor (the PDA side) of mobile phones: OpenEZX and gnufiish are two more or less abandoned projects aimed at reverse engineering. Openmoko was the project that had to build its own hardware as a dependency to be fulfilled before writing software.

If you're not a company and don't want to sell anything, the reverse engineering approach looks more promising. You can piggy-back on existing hardware, and don't need to take care of sourcing/production/certification/shipping and other tedious bits.

If you are a company and want to generate revenue, then of course you want to build the hardware and ship it, as it is what you derive your profits from.

So, just to be clear on this: Neither OpenEZX, nor gnufiish nor Openmoko were ever about writing Free Software for the GSM baseband processor, i.e. the beast that exchanges messages with the actual GSM operator network. But this is what we're working on right now.

It's about time, don't you agree? After 19 years of only proprietary software on the baseband chips in billions of phones, it is more than time for bringing the shining light of Freedom into this area of computing.

To me personally, it is the holy grail of Free Software: Driving it beyond the PC, beyond operating systems and application programs. Driving it into the billions of embedded devices where everyone is stuck with proprietary software without an alternative. Everybody takes it for granted to run megabytes of proprietary object code, without any memory protection, attached to an insecure public network (GSM). Who would do that with his PC on the Internet, without a packet filter, application level gateways and a constant flow of security updates of the software? Yet billions of people do that with their phones all the time.

I hope with our work there will be a time where the people who paid for their phones will be able to actually own and control what they do. If I have paid for it, I determine what software it runs and when it sends which message or doesn't.

Oh, getting back to our work: It will be published as soon as it is sufficiently stable and fit for public consumption. You won't be able to make phone calls yet, but we'll get there at some later point this year.

February 13, 2010 01:00 AM

February 12, 2010

Paul E. McKenney: Parallel Programming: Selective Struggling

An Eminent Reader privately indicated some distaste for the non-technical nature of recent parallel programming posts. Given that many of the obstacles to successful development of parallel software are non-technical, there will be future non-technical posts, but there is no reason not to take a technical break from these issues. And so, just for you, Eminent Reader, I present this parallel programming puzzle.

This puzzle stems from some researchers’ very selective struggles with parallel algorithms. Of course, it should be no surprise that many people, researchers and developers included, will struggle quite happily with their “baby”, but will even more happily bad-mouth competing approaches, even when (or perhaps especially when) those approaches require much less struggling. And yes, some might accuse me of favoring RCU in just this manner, but this is my answer to the likes of them.

Such selective struggling seems to have given rise to an interesting urban legend within the concurrency research community, namely that allowing concurrent access to both ends of a double-ended queue is difficult when using locking.

Can you come up with a lock-based solution that permits the two ends of a double-ended queue to be manipulated concurrently?

February 12, 2010 11:33 PM

Rusty Russell: Code review: libreplace

libreplace is the SAMBA library (also used in ctdb) to provide working implementations of various standard(ish) functions on platforms where they are missing or flawed.  It was initially created in 1996 by Andrew Tridgell based on various existing replacement hacks in utils.c (see commit 3ee9d454).

The basic format of replace.h is:

    #ifndef HAVE_STRDUP
    #define strdup rep_strdup
    char *rep_strdup(const char *s);
    #endif

If configure fails to identify the given function X, rep_X is used in its place.  replace.h has some such declarations, but most have migrated to the system/ include directory which has loosely grouped functions by categories such as dir.h, select.h, time.h, etc.  This works around the “which header(s) do I include” problem as well as guaranteeing specific functions.
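
As a rough illustration of the pattern, here is a self-contained C sketch that pairs the replace.h stanza with a fallback body. The body of rep_strdup is written from scratch for this example and is only an assumption about what such a replacement might look like; the real libreplace implementation may differ. HAVE_STRDUP is deliberately left undefined to simulate a platform where configure did not find strdup.

```c
#include <stdlib.h>
#include <string.h>

/* Simulate configure not finding strdup(): HAVE_STRDUP stays undefined. */
#ifndef HAVE_STRDUP
#define strdup rep_strdup
char *rep_strdup(const char *s);
#endif

/* Illustrative fallback only; not copied from libreplace. */
char *rep_strdup(const char *s)
{
	size_t len = strlen(s) + 1;	/* include the trailing NUL */
	char *copy = malloc(len);

	if (copy != NULL)
		memcpy(copy, s, len);
	return copy;			/* NULL if malloc() failed */
}
```

Callers keep writing strdup("text") as usual; the preprocessor silently redirects the call to rep_strdup, which is why the rest of the code base does not need to know whether the platform function exists.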

Other than reading this code for a sense of Unix-like paleontology (and it’s so hard to tell when to remove any of these helpers that cleanups are rare) we can group replacements into three categories:

  1. Helper functions or definitions which are missing, eg. strdup or S_IRWXU.
  2. “Works for me” hacks for platform limitations, which make things compile but are not general, and
  3. Outright extensions, such as #define ZERO_STRUCT(x) memset((char *)&(x), 0, sizeof(x)) or Linux kernel inspired likely()
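
The third category is easy to try out in isolation. The sketch below uses the ZERO_STRUCT definition quoted above together with the usual Linux-kernel-style form of likely()/unlikely(); that form is an assumption about how libreplace spells it, and __builtin_expect is a GCC/Clang builtin rather than standard C. The struct conn_info and the function conn_info_reset are hypothetical names used only for the demonstration.

```c
#include <string.h>

/* The extension macro as quoted in the list above. */
#define ZERO_STRUCT(x) memset((char *)&(x), 0, sizeof(x))

/* Kernel-style branch hints; assumed form, needs GCC or Clang. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical struct, used only to exercise the macros. */
struct conn_info {
	int fd;
	char host[32];
};

/* Clears *ci and returns 0, or returns -1 for a NULL pointer. */
int conn_info_reset(struct conn_info *ci)
{
	if (unlikely(ci == NULL))
		return -1;
	ZERO_STRUCT(*ci);
	return 0;
}
```

Note that ZERO_STRUCT takes the object itself, not a pointer to it, so passing a pointer variable would zero only the pointer.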

Since it’s autoconf-based, it uses the standard #ifdef instead of #if (a potential source of bugs, as I’ve mentioned before).  I’ll concentrate on the insufficiently-general issues which can bite users of the library, and a few random asides.

I’m not sure Samba compiles on as many platforms as it used to; Perl is probably a better place for this kind of library to have maximum obscure-platform testing. But if I were to put this in CCAN, this would make an excellent start.

February 12, 2010 08:53 AM

February 10, 2010

Pete Zaitcev: Cloud Forum 2010

I'm "attending" the Red Hat Cloud Thing. The Deltacloud guy is presenting, Jeff Garzik is next with our own Hail. To get this working, I had to add thomson-webcast.net to Flash whitelist, otherwise the site said "No Scripting". It's about time somebody started a company streaming presos in Theora or something...

February 10, 2010 06:44 PM

Kernel Podcast: 2010/02/07 Linux Kernel Podcast

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20100207.mp3

This podcast is brought to you by the awesome power of Jason Wessel’s kgdb patches, helping those who believe in kernel debuggers find hard-to-reach kernel bugs since 2009. Kernel debuggers: the way of the future.

For the weekend of February 7th, 2010, I’m Jon Masters with a summary of the week’s LKML traffic.

In today’s issue: Linux 2.6.33-rc7, regressions, Google Summer of Code, IMA, OOM, and sys_membarrier.

Linux 2.6.33-rc7. Linus Torvalds announced the 2.6.33-rc7 release of the Linux kernel on Saturday, February 6th, 2010 at 2:44pm (14:44) Best Coast Time (PST). In his announcement, Linus remarked, “I have to admit that I wish we had way fewer regressions listed by this time, so I hereby would like to point every developer to” a link to a recent post to the linux wireless mailing list archive on gmane.org showing a copy of a recent email from Rafael J. Wysocki detailing known kernel regressions between 2.6.32 and 2.6.33-rc6 as posted originally to the LKML. He added, “But we’ve certainly fixed a few things, and it’s been a week, so here’s -rc7”. Most of the changes are in PowerPC defconfigs (default configs), but there are even more i915 updates, radeon KMS updates, and lots of other smaller bits all over the tree. Linus also wondered (in another email) whether it was worth making the .gz files any more given that bzip2 has been around more than long enough by now. Some thought the gzip files were still useful on systems without bzip2 or for some really slow systems that apparently handle gzip files more easily.

Regressions. Rafael J. Wysocki followed up to Linus’ 2.6.33-rc7 announcement (as he had also done with 2.6.33-rc6) with a list of outstanding regressions between 2.6.32 and 2.6.33-rc7. There are currently 20 “unresolved” issues in the list of regressions given. Rafael also noted that Maciej Rutecki has, “generously volunteered to work on the tracking of kernel regressions”. The work done by Rafael (and now, hopefully Maciej also) is very valuable to the community and we really do owe them our gratitude for helping out. Arjan van de Ven also posted a list of oops and warning reports on kerneloops.org from the week, including a very common ext4/quota issue in Fedora.

Google Summer of Code. Luis Rodriguez stated that, “Google has confirmed it will have a Google Summer of Code for 2010”, then mentioned that last year’s effort (4 suggested projects, of which 3 were accepted) resulted in only one success. Witold Sowa followed up saying that he didn’t know he was the only student who completed his project, but that the work to add an AP mode to NetworkManager, “with use of wpa_supplicant’s newly developed AP mode” was relatively easy to accomplish and so he had worked on other things also. Apparently, the initial GSoC work is now available in NetworkManager. Nonetheless, it sounds as if Luis is keen to see a higher than 33% success rate if any entries are accepted this year under the Linux Foundation.

IMA. Mimi Zohar replied to an email from Shi Weihua concerning a NULL pointer dereference bug in the IMA security code (ima_file_free), which Al Viro and others had previously discussed solutions for.

OOM. Lubos Lunak and David Rientjes resurrected the OOM killer discussion again after Lubos posted some analysis of various KDE processes running on his system, and wondered why the OOM killer uses VmSize rather than RSS to determine tasks that should be killed (in other words, why should it not favor tasks actually resident in memory at the time?). This discussion has been had recently, and David Rientjes explained that the kernel favors overall VmSize in its calculations so as to catch memory leakers as a preference (which are often not resident at the time). David did seem to like the suggestion of catching the child with the highest badness calculation before killing its parent, and posted an untested patch. He also suggested that the KDE process tree example was “a textbook case for using /proc/pid/oom_adj to ensure a critical task, such as kdeinit is to you, is protected from getting selected for oom kill”. Lubos replied with some very good points about how simply setting oom_adj doesn’t scale, and Balbir Singh was amongst those still favoring a switch to RSS-like accounting but with support for shared pages (for example “PSS”) eventually. Rik van Riel noted that he had no strong opinion one way or the other. David posted various patches proposing an alternative fine grained oom_adj mechanism.
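
For listeners who have not used it, the oom_adj suggestion amounts to writing a small integer into a per-task file under /proc. The following C sketch shows the general idea; the helper names oom_adj_path and set_oom_adj are made up for this example, and the details are stated as assumptions: on kernels of this era the file accepts values from -17 (never select this task) to 15, lowering the value typically needs extra privileges, and later kernels prefer a different file (oom_score_adj).

```c
#include <stdio.h>
#include <sys/types.h>

/* Builds "/proc/<pid>/oom_adj" in buf; 0 on success, -1 if it does not fit. */
int oom_adj_path(char *buf, size_t size, pid_t pid)
{
	int n = snprintf(buf, size, "/proc/%ld/oom_adj", (long)pid);

	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Writes adj to the task's oom_adj file; 0 on success, -1 on any failure. */
int set_oom_adj(pid_t pid, int adj)
{
	char path[64];
	FILE *f;
	int ok;

	if (oom_adj_path(path, sizeof(path), pid) != 0)
		return -1;
	f = fopen(path, "w");
	if (f == NULL)
		return -1;	/* no such task, no permission, or no such file */
	ok = fprintf(f, "%d\n", adj) > 0;
	if (fclose(f) != 0)
		ok = 0;		/* the write may only be rejected at flush time */
	return ok ? 0 : -1;
}
```

A session manager could call set_oom_adj(pid, -17) once for a process like kdeinit. Lubos' scaling objection is visible here too: every critical task needs its own explicit call.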

sys_membarrier. Mathieu Desnoyers posted a three part patch series implementing sys_membarrier, a new system call that can be used to “distribute the overhead of memory barriers asymmetrically”. In particular, he wants it for his urcu userspace RCU implementation (for use within the synchronize_rcu call). Sensibly, Mathieu proposes incremental additions to each architecture (even though he believes that it “should be portable to other architectures as-is”), reserving the system call numbers now, then implementing gradually.

In today’s miscellaneous items: Matti Aarnio posted to let everyone know that a recently discovered hole in the bayesian filtering system as used by the vger.kernel.org mailing list server to reduce SPAM has been plugged (it had been possible to reach the list using a specific “backend” majordomo domain), Catalin Marinas decided to simply patch the USB HCD driver that had resulted in cache coherency problems when using USB storage (and noted that a followup posting to linux-arch would call for a flush_dcache_range function), some miscellaneous rewrites of obsolete syscall handlers to use generic versions from Christoph Hellwig, a request for an opinion on merging the kFIFO rewrite in 2.6.34 from Stefani Seibold, a potential issue with the kernel implementation of LZO compression reported by Nigel Cunningham (for which he will switch back to LZF in TuxOnIce again for the moment), Stephen Rothwell wondered aloud whether Linus would really be interested in taking the percpu changes currently sitting in percpu “next”, and Mathieu Desnoyers announced he is switching email from his academic address in Montreal (where he recently completed his PhD around LTTng) to a consulting firm he is involved with at http://efficios.com.

In today’s announcements: Greg Kroah-Hartman posted review patches for the 2.6.32.8 stable series kernel.

Scott James Remnant announced the release of upstart version 0.6.5. It includes a large number of fixes, amongst which is the completion of the splitting out of libnih into its own project. There is a new /sbin/reload command for reloading upstart daemons, a restored sync() before reboot, improved documentation, and more goodies.

Junio C Hamano announced version 1.7.0.rc2 of the Git SCM, which includes a number of forthcoming behavior changes as mentioned in this podcast when discussing the rc1 release from the previous week.

Subrata Modak announced that the Linux Test Project (LTP) for January 2010 has been released. It now contains over 3000 tests. Separately, Garrett Cooper noted a rather severe bug in the top level LTP Makefile that could result in an “rm -rf /” in the wrong circumstances, suggesting that all LTP users comment out three lines from that file.

Willy Tarreau (re-)announced the release of 2.4.37.9. The previous 2.4.37.8 hadn’t actually contained the required e1000 backport with a CVE fix that had triggered the previous release. Willy noted, “I don’t know how I managed to do that because it once was OK and I could successfully build it. Well, whatever I did, the result is wrong and the issue it was supposed to fix is still present in 2.4.37.8. So here comes 2.4.37.9 with the real fix this time”.

The latest kernel release is 2.6.33-rc7.

Andrew Morton posted an mm-of-the-moment (mmotm) for 2010-02-03-20-09.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

February 10, 2010 05:05 PM

Kernel Podcast: 2010/01/31 Linux Kernel Podcast

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20100131.mp3

This podcast is brought to you by the power of Al Viro’s ima_file_free fix, saving in-progress crashed podcast recordings since February 2010, and now powering the all new 2010 2.6.33 series Linux kernel with all wheel drive.

For January 31st, 2010, I’m Jon Masters with a summary of the week’s LKML traffic.

In this week’s issue: Linux 2.6.33-rc6, ide2libata, kFIFO, lock types, netfilter connection tracking, netperf regressions, sparse, and USB storage.

Linux 2.6.33-rc6. Linus Torvalds announced Linux 2.6.33-rc6 on Friday January 29th 2010 at 2:20pm (14:20) Best Coast Time (PST), again describing it as containing “nothing earth-shattering”. About 50% of the changes were architecture updates, and 40% were drivers, with the remaining being mostly filesystem and networking updates. He called for people seeing regressions to begin making “loud noises”, since ‘things mostly should “just work”’.

ide2libata. Bartlomiej Zolnierkiewicz posted a 68 part patch series entitled “ide2libata” that does roughly what it sounds like – it facilitates a conversion of sorts such that legacy IDE driver code can use a small “translation” layer to share source with the libata codebase. It doesn’t remove IDE but it does (allegedly) make it far easier to maintain both until IDE finally does go away. Alan Cox and others weren’t convinced. Alan thought that, “it will be a nightmare for maintenance with all the includes and the like plus the ifdefs making it very hard to read the drivers and maintain them”. He saw value in the effort, but more as a means to find subtle differences between drivers, and thought IDE was “drifting” a little too much to truly be described as in “maintenance mode” at this point.

kFIFO. Stefani Seibold posted an “enhanced reimplementation of the kfifo API”, which is apparently the last in the series of RFC patches intended to rework the kFIFO implementation (to be generic) without changing the existing API. Stefani included some analysis of the impact of the patch upon text section usage and found that it wasn’t much larger, but that the “hand optimized” inline code was substantially faster than the previous implementation.

lock types. Mitake Hitoshi posted an RFC patch (mostly for the review of Peter Zijlstra) that adds lock type information to the output of lockdep, as used by tools such as perf. As he points out, “Of course, as you told, type of lock dealing with is clear for human. But it is not clear for programs like perf lock”. On a related note, Frederic Weisbecker stated that he really liked the perf lock report layout, but would love to see a tree view that “can tell you which lock is delaying another one”. He gave various examples of how this might be visualized as well as describing the benefits.

netfilter connection tracking. I discovered that one of my test systems was falling over on all recent 2.6 series kernels, when using KVM. I wasn’t alone (as I would find out later, looking at Fedora bug reports). The backtrace was variable, but typically involved some kind of IPv6 packet. After mailing the netfilter guys (“PROBLEM: reproducible crash KVM+nf_conntrack all recent 2.6”) and getting some general advice, I spent the entire weekend solid debugging the issue with the aid of Jason Wessel’s kgdb-next tree. The problem was that libvirt (the KVM server management daemon) would attempt to create a second network namespace (netns) on startup – just to see if it would be possible to also support containers – and autostart KVM guests started at that moment would crash because conntrack was missing various chunks of support code for dealing with multiple namespaces. This resulted in hash corruption, kmem caches that would get corrupted, and eventual panics.

netperf regressions in 2.6.33-rc1. Lin Ming performed a bisect analysis and determined that a “sched: Rate-limit newidle” commit had once again introduced a loopback regression (on the order of 50%) in the netperf benchmark, when run on an Intel Nehalem system. Lin assumed that this was due to a large amount of rescheduling IPI (inter-processor interrupt) traffic, as evidenced by the perf top data, and /proc/interrupts output. Others could not reproduce this issue.

sparse. Tejun Heo posted a series of percpu patches intended to instrument modular use of percpu data, for the benefit of the sparse source checker utility recognizing that such data lives in a separate data section. Tejun included various descriptions within the individual patches, which only affect building when using the sparse checking tool.

USB mass storage. Catalin Marinas posted a message (mostly aimed at Matthew Dharm) concerning cache coherency of the kernel’s USB mass storage driver. In the case of Harvard Architecture (split I/D caches) ARM processor cores, when using PIO based USB host controllers, root mounted filesystems generating a page fault will only fault the requested page into the data cache, but the USB storage driver fails to call flush_dcache_page to ensure I-cache visibility and results in incoherency between the two. Catalin asked Matthew if he might add support for explicit flushes when doing PIO rather than DMA for IO. Oliver Neukum thought that this belonged in the HCD driver rather than USB storage, due to the wide range of possible underlying layers beneath USB storage, and Matthew Dharm agreed, “Given that an HCD can choose, on the fly, if it’s using DMA or PIO, the HCD driver is the only place to reasonably put any cache-synchronization code. That said, what do other SCSI HCDs do?”.

In today’s miscellaneous items: Chinang Ma posted a comparative performance analysis between RHEL5.4 kernel 2.6.18 and upstream 2.6.33-rc4 in which he found a 0.8% OLTP performance regression, Simon Kagstrom sent a “provoke crash” mail in which he described a module to force crashes for testing, Mark Lord wondered why he was seeing a large number of “page allocation failure” messages on upgrade from 2.6.31.5 to 2.6.32.5, a continuation of previous style discussions concerning 80 character line length “limits” in the kernel, a question from Andi Kleen as to whether the PnP probe code (for PS/2 mice in this particular instance) is racy as he experiences variable probe behavior, Christoph Lameter posted version 15 of “one of these year long projects to address fundamental issues in the Linux VM”, aka “SLAB fragmentation reduction”, Alex Chiang posted a patch to increase the maximum number of Infiniband HCAs per system from 32 to 64 in a “backwards-compatible manner” (hence only raising the limit to 64), and Al Viro posted an informative message entitled “Open Intents, lookup_instantiate_filp() And All That Shit(tm)” on his plans for handling atomic file open+possible create for NFS in the grand future.

In today’s announcements: Greg Kroah-Hartman announced the release of the 2.6.32.7 kernel (having previously announced the 2.6.32.6 earlier in the week and posting a series of review patches for 2.6.32.7). He also announced the 2.6.27.45 “long term release” kernel.

Clark Williams announced the latest version 0.63 of the rt-tests package is now available. This includes various utilities used to verify and experiment with the RT patchset that Thomas Gleixner and others maintain.

Mathieu Desnoyers announced the release of version 0.4.0 of his Userspace RCU library, which includes a few “minor API changes” as previously described. urcu is available for download at http://lttng.org/urcu.

Junio C Hamano announced version 1.7.0-rc1 of the Git SCM. The forthcoming release has a number of items in the draft release notes, including some behavior changes to “git push”, “git send-email” (no deep threads by default), “git status”, “git diff”, and various other goodies.

The latest kernel release was 2.6.33-rc6.

Andrew Morton posted an mm-of-the-moment (mmotm) for 2010-01-28-01-36.

Willy Tarreau announced version 2.4.37.8 of the 2.4 series kernel. It mainly includes fixes for a recently discovered vulnerability in the e1000 network driver that could allow a carefully crafted frame to skip over filtering.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

February 10, 2010 01:15 PM

Kernel Podcast: 2010/01/24 Linux Kernel Podcast

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20100124.mp3

For the weekend of January 24th, 2010, I’m Jon Masters with a summary of the week’s LKML traffic.

Linux 2.6.33-rc5. Linus Torvalds announced the release of the 2.6.33-rc5 kernel, noting that he didn’t “think there is anything earth-shaking here”. Mostly, the only new stuff was in the i915 and (new) DVB “Mantis” driver. Rafael J. Wysocki followed up with his usual list of regressions since the release of 2.6.32, for which there were no known fixes yet in Linus’ tree. The number has fallen a little, but there were still 23 unresolved.

devtmpfs. The devtmpfs filesystem is a shared memory filesystem used to mount /dev nodes that are needed even before udev starts on modern Linux systems (or for those systems that do not use udev, to provide a minimum environment). The suggestion had been made to remove the EXPERIMENTAL flag on its configuration option and enable it by default. The latter received complaints as a change in behavior that would be visible to users, even if many of them would need to have devtmpfs enabled for the most recent Linux distributions.

Interruptions. Steven Rostedt, and Peter Zijlstra did some analysis of the kernel source tree, looking for inappropriate setting of TASK_*INTERRUPTIBLE (which should never be done explicitly, and in general one should always use the set_current_state macro). They found a fairly large number of incorrect code paths and posted a list of “examples of likely bugs”. David Daney replied, asking what kind of barrier should be implied in using set_current_state, as pertains to the visibility of this assignment by other CPUs.

IO error semantics. Nick Piggin started a thread entitled “IO error semantics”, in which he raised the ugly issue of kernel IO error handling behavior once again, as he said he had done during Andi Kleen’s posting of HWPOISON patches. Nick sought to clearly define specific anticipated behaviors in response to “read IOs”, “write IOs”, and so forth – how many retries? etc. He also made the point that write IO errors should not invalidate the data before an IO error is returned to “somebody” (fsync or synchronous write syscall).

NOIO. Rafael J. Wysocki posted an initial PM patch implementing forced GFP_NOIO during suspend operations (preventing the kernel from attempting to allocate memory by going to e.g. disk to offload some existing unused pages), this was largely in reaction to specific issues with the Nvidia closed source binary driver, but was something that had apparently been on the cards for some time. The problem with the patch was that it changed the VM according to the state of the system, rather than relying upon drivers to do the right thing in using explicit GFP_NOIO allocations during suspend and resume routines.

In the week’s miscellaneous items: Tejun Heo posted version 3 of his concurrency managed workqueue patches, Peter Anvin proposed the rapid removal of CONFIG_X86_CPU_DEBUG (since all such information is already exposed elsewhere), Jiri Kosina added “nopat” boot option documentation to Documentation/kernel-parameters, discussion continued on generalizing certain PCI functions in the wake of an intention to merge various Xilinx PCI support bits, Anfei Zhou posted a cache coherency problem with mmapped writes on ARM systems, a patch corrected priority inheritance deboosting in the RT kernel patchset to be POSIX compliant, Dimitry Golubovsky inquired as to the current state of UML (User Mode Linux, not the silly and pointless modelling technique) development, Mark Allyn posted some Restricted Access Register (Intel MID platform) patches, and Joe Perches posted a large number of floppy (yes, floppy) cleanups.

In the week’s announcements: Linux 2.6.31.12 and 2.6.32.5 (preceded by the 2.6.32.4 kernel earlier in the week) were released by Greg Kroah-Hartman. Greg stated that he no longer intended to update the .31 stable kernel short of “something really odd happening”. Greg repeated his previous assertions that the .27 kernel would live on as a “long term” stable release (but probably only for 6 more months of viability), and that the .32 kernel would also be a “long term release” because a number of distributions were apparently basing their releases on it. His efforts depend upon engineers working on those distributions to help.

Len Brown announced that the Linux Power Management Mini-Summit would be held in Boston on Monday, August 9th 2010, the day before LinuxCon 2010. For further information, refer to http://events.linuxfoundation.org/.

Mathieu Desnoyers (whose excellent PhD thesis was published recently and covered by LWN) announced an updated LTTng 0.187 for the 2.6.32.4 kernel.

Junio C Hamano announced Git 1.6.6.1 is now available from the kernel.org site at http://www.kernel.org/pub/software/scm/git/. The latest version contains fixes for issues such as “git blame” not working when a commit lacked an author name, “git count-objects” not handling packfiles larger than 4G on platforms with a 32-bit off_t, “git rebase -i” not aborting cleanly if it failed to start the user’s EDITOR, some issues with the GIT_WORK_TREE environment variable, and more besides.

Thomas Gleixner announced the release of the 2.6.31.12-rt20 RT patchset. This was a forward port to 2.6.31.12, which included a number of RCU assumption fixes, the aforementioned PI POSIX compliance fix, and so forth. Thomas acknowledged the delay in releasing a new version of the patch, but noted that various locking infrastructure changes had gone upstream (advancing the cause of mainlining various bits of RT). There will be no 2.6.32-rt; the patchset will skip directly over to 2.6.33. He also let us know about a new “housemate” of his: http://tglx.de/~tglx/housemate.png.

Sorry for the delay in getting this episode released.

February 10, 2010 08:36 AM

February 09, 2010

Paul E. McKenney: Parallel Programming: Questionable Quality Assurance

An earlier post noted that parallel programming suffers more potential failures in planning than does sequential programming due to the usual suspects: deadlocks, memory misordering, race conditions, and performance/scalability issues. This should lead us to suspect that parallel programs might need better quality assurance (Q/A) than do sequential programs. Q/A activities include validation, verification, inspection, review, and of course testing.

Traditionally, Q/A groups serve many roles:


  1. Run tests and find bugs.
  2. Break in new hires, who, strangely enough, are sometimes reluctant to irritate developers.
  3. Distract developers who are already behind schedule with pesky bugs.
  4. Act as scapegoat for schedule slips.
  5. Act as a target of complaints from developers who are tired of debugging either their new features or any bugs located by the Q/A group.

Although there are many highly effective Q/A groups in many software development organizations, it is not hard to find Q/A groups that find bugs, but that either cannot or will not get developers to pay attention to them. It is also not hard to find Q/A groups that are overridden whenever they point out problems that might cause a schedule slip. One way to avoid these problems is via enlightened management based on (for example) bug trends over time, and another way is for the Q/A organization to report high up into the organization. Of course, with this latter approach, one wonders just how often the Q/A organization can get away with yanking on the silver chain connecting to their executive sponsor.

Of course, FOSS communities have their own Q/A challenges, but the fact that the maintainers are usually responsible for the quality of their code adds a breath of fresh air to the process. Not least, their gatekeeper role enables them to vigorously enforce any design and coding guidelines that their FOSS community might have.

But what are the technical effects of parallel software on Q/A?

February 09, 2010 07:44 PM

Kernel Podcast: Updates coming!

Folks,

A couple of weeks of updates are coming, hopefully tonight. I am planning to get back into a routine here. Thanks for being patient!

Jon.

February 09, 2010 04:07 PM

February 08, 2010

Dave Airlie: whats in drm-radeon-testing?

I'll try and post these regularly when I make major additions/removals.

drm-radeon-testing is the cutting edge KMS radeon branch. It is going to be rebased, and things will be added/removed as they are worked on by developers. So you can base patches on it, but you should talk to the developer who owns the area first.

git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6.git drm-radeon-testing

I've just pushed a rebased tree now with the following:

latest i2c algo + hw i2c engine code + all fixes squashed: This adds support for hw i2c engines found on radeons and exposes them + sw i2c buses to userspace so i2c tools can use them. (agd5f)

pll algorithm reworking + quirks: cleans up the code to allow for the selection of the old pll algorithm on some hardware. (agd5f)

pm support so far: Adds all the current PM patches - just does engine reclocking so far using the power tables from the BIOS. (Zajec/agd5f)

Evergreen (Radeon HD 5xxx) support: basic KMS support for the evergreen range of devices - no irqs or accel yet. (agd5f)

radeon unlocked ioctl support (airlied)

bad CS recording (glisse)

misc cleanups/fixes - Dell/Sun server support ported from userspace hopefully.

The tree did contain Jerome's r600 CS checker but I've dropped it for now at his request as he has newer patches in testing.

February 08, 2010 11:58 PM

Valerie Aurora: Linux Storage and Filesystems Workshop


The 2010 Linux Storage and Filesystems Workshop has been announced:

http://lkml.org/lkml/2010/2/8/221

One of the things I like most about the file systems workshop is the avoidance of canned presentations:

Presentations are allowed to guide discussion, but are strongly discouraged. There will be no recording or audio bridge, however written minutes will be published as in previous years [...]

Edward Tufte would be proud.

February 08, 2010 08:59 PM

February 07, 2010

Pete Zaitcev: ncld is here, and what it is for

I sent out the "ncld" (a pun on ncurses) that I mentioned before. It is tested, and I already have tabled switched over to it. All that is left is to get Jeff to apply it.

Savings in the code size are pretty good, but more importantly it is not impenetrable spaghetti from 1968 anymore. And this is important if we want anyone to ever hack on CLD voluntarily. People actually pay attention to shit code. To quote:

Fedora uses yum, which originally was developed by Yellow Dog. I forgot who told me this (I think it was either Jeremy [Katz] or Notting [Bill Nottingham]), but the story was that they (e.g. MSW [Matthew Wilson], Notting, and Jeremy) looked at things like apt-rpm, urpm, and yum as a base for [the] next up2date. Only yum passed the test "not to puke while looking at the source". That's how it came to be.

If you tilt your head just right, CLD is somewhat similar to Zookeeper in function (or so I heard), and one time someone asked why not just use Zookeeper. Jeff answered, "Zookeeper's API is too complex". I was concerned that someone would look at the code and think we were NIH hypocrites, because CLD's API before ncld was complex too (for no good reason - it was assembly-level complexity). Well, not anymore. What Jeff actually meant, I think, was that Zookeeper's intrinsic architecture was too complex for what we want in Hail.

February 07, 2010 07:46 PM

Pete Zaitcev: Why 1e100.net?

Jon Masters posted a somewhat ambiguous twit: "Why does Google use 1e100.net?" Obviously, 1e100 is the exponent notation for a googol, but aside from that, why is it necessary to use a separate domain? It seems like a trick that is done often. Here's a short list:

Youtube uses ytimg.com. They started doing it before the acquisition by Google, and apparently it was used to host the static content. Youtube used third-party CDNs for hot videos back then, but always with youtube.com for a domain. They continue that practice, except that ytimg.com now serves other random stuff too.

Google uses 1e100.net, which seems to pop up randomly.

Facebook uses fbcdn.net. Obviously it means "Facebook CDN".

So, using a second SLD is clearly a common practice of some kind, that everyone in the business agrees is valuable. But how exactly does it work? Is it about the performance or security? Why only 2 domains and not 10 or 1000?

UPDATE: I have two friends called "Jon M." and one of them says:

It's for security. If you don't fully trust the security of your CDN, then you put it on a separate domain, so that content served from it can't access your users sessions.

Presumably it's because cookies are matched by domain. And Peter Jones adds:

It cuts down on http headers - especially cookies. If you put images on a second domain, it means /far/ fewer headers transferred, and fewer db lookups for the things /in/ them.

Those cookies!
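
To make the "cookies are matched by domain" point concrete, here is a rough sketch of the suffix check a browser performs before attaching a cookie to a request. The function name cookie_domain_matches is made up for illustration, and the check is simplified on purpose: it ignores IP-address hosts and public-suffix rules that real browsers also apply.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive comparison of the first n bytes of a and b. */
static bool equal_nocase(const char *a, const char *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return false;
    return true;
}

/*
 * Hypothetical helper: true if a cookie scoped to cookie_domain would be
 * sent with a request to host. Simplified; no IP-address or public-suffix
 * handling.
 */
bool cookie_domain_matches(const char *host, const char *cookie_domain)
{
    assert(host != NULL && cookie_domain != NULL);

    size_t hl = strlen(host);
    size_t dl = strlen(cookie_domain);

    if (dl == 0 || dl > hl)
        return false;
    /* The cookie domain must be a suffix of the host name... */
    if (!equal_nocase(host + (hl - dl), cookie_domain, dl))
        return false;
    /* ...and either the whole name, or a suffix starting at a label boundary. */
    return hl == dl || host[hl - dl - 1] == '.';
}
```

With this check, a cookie set for youtube.com goes along with requests to www.youtube.com, but never with requests to i.ytimg.com, which is exactly the separation (and the header savings) described above.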

February 07, 2010 04:31 PM

Dave Miller: STT_GNU_IFUNC

I've always wanted to work on support for STT_GNU_IFUNC symbols on sparc. This is going to solve a real problem distribution makers have faced on sparc64 for quite some time.

What is STT_GNU_IFUNC?

Well, normally a symbol is resolved by the dynamic linker based upon information in the symbol table of the objects involved. This is after taking into consideration things like symbol visibility, where it is defined, etc.

The difference with STT_GNU_IFUNC is that the resolution of the reference can be made based upon other criteria. For example, based upon the capabilities of the cpu we are executing on. The most obvious place this would be very useful is in libc, where you can pick the most optimized memcpy() implementation.

Normally the symbol table entry points to the actual symbol location, but STT_GNU_IFUNC symbols point to the location of a "resolver" function. This function returns the symbol location that should be used for references to this symbol.

So when the dynamic linker resolves a reference to a STT_GNU_IFUNC type symbol "foo", it calls the resolver function recorded in the symbol table entry, and uses the return value as the resolved address.

Simple example:

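/* Emit the resolver under the symbol name "memcpy"... */
void * memcpy_ifunc (void) __asm__ ("memcpy");
/* ...and mark that symbol as an indirect function. */
__asm__(".type memcpy, %gnu_indirect_function");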

void *
memcpy_ifunc (void)
{
  switch (cpu_type)
    {
  case cpu_A:
    return memcpy_A;
  case cpu_B:
    return memcpy_B;
  default:
    return memcpy_default;
    }
}
So, references to 'memcpy' will be resolved as determined by the logic in memcpy_ifunc().
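
For readers without an ifunc-capable toolchain, the same flow can be imitated in plain C. The sketch below is not real STT_GNU_IFUNC: a function pointer stands in for the PLT slot and a stub stands in for the dynamic linker. All names (my_copy, resolve_copy, current_cpu, the counters) are invented for illustration, and the "cpu probe" is just a variable.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical cpu kinds; a real resolver would probe the hardware. */
enum cpu_kind { CPU_A, CPU_B, CPU_OTHER };

typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Counters so a caller can observe which variant ran and how often
 * the resolver was consulted. */
static int calls_a, calls_b, calls_default, resolver_runs;

static void *copy_a(void *d, const void *s, size_t n) { calls_a++; return memcpy(d, s, n); }
static void *copy_b(void *d, const void *s, size_t n) { calls_b++; return memcpy(d, s, n); }
static void *copy_default(void *d, const void *s, size_t n) { calls_default++; return memcpy(d, s, n); }

/* Stand-in for a hardware probe. */
static enum cpu_kind current_cpu = CPU_B;

/* Plays the role of the resolver: returns the address that should be used. */
static copy_fn resolve_copy(void)
{
    resolver_runs++;
    switch (current_cpu) {
    case CPU_A:
        return copy_a;
    case CPU_B:
        return copy_b;
    default:
        return copy_default;
    }
}

static void *copy_first_call(void *d, const void *s, size_t n);

/* Plays the role of the PLT slot: starts at a stub, later holds the
 * resolved address. */
static copy_fn copy_slot = copy_first_call;

/* Plays the role of the dynamic linker: bind once, then forward the call. */
static void *copy_first_call(void *d, const void *s, size_t n)
{
    copy_slot = resolve_copy();
    return copy_slot(d, s, n);
}

/* Public entry point: every call goes through the slot. */
void *my_copy(void *d, const void *s, size_t n)
{
    assert(d != NULL && s != NULL);
    return copy_slot(d, s, n);
}
```

The first my_copy() call runs the resolver and rewrites the slot; every later call jumps straight to the chosen variant. The real mechanism differs in that the dynamic linker (or the static CRT code) does the binding through relocations, so callers see an ordinary symbol.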

These magic ifunc things even work in static executables. How is that possible?

First, even though the final image is static, the linker arranges to still create PLT entries and dynamic sections for the STT_GNU_IFUNC relocations.

Next, the CRT files for static executables walk through the relocations in the static binary and resolve the STT_GNU_IFUNC symbols.

There are some thorny issues wrt. function pointer equality. To make that work, static references to STT_GNU_IFUNC symbols use the PLT address whereas non-static references do not (they get fully resolved).

Back to the reason I was so eager to implement this. On sparc we have four different sets of optimized memcpy/memset implementations in glibc (UltraSPARC-I/II, UltraSPARC-III, Niagara-T1, Niagara-T2). Right now the distributions thus have to build glibc four times each for 32-bit and 64-bit (for a total of 8 times).

With STT_GNU_IFUNC they will only need to build it once for 32-bit and once for 64-bit.

I've just recently posted patches for full support of STT_GNU_IFUNC symbols to the binutils and glibc lists.

February 07, 2010 03:46 PM

February 06, 2010

Linus Torvalds: Happy camper

I broke down and bought a Nexus One last week.

I got the original G1 phone from google when it came out, and I hardly ever used it. Why? I generally hate phones - they are irritating and disturb you as you work or read or whatever - and a cellphone to me is just an opportunity to be irritated wherever you are. Which is not a good thing.

At the same time I love the concept of having a phone that runs Linux, and I've had a number of them over the years (in addition to the G1, I had one of the early China-only Motorola Linux phones) etc. But my hatred of phones ends up resulting in me not really ever using them. The G1, for example, ended up being mostly used for playing Galaga and Solitaire on long flights, since I had almost no reason to carry it with me except when traveling.

But I have to admit, the Nexus One is a winner. I wasn't enthusiastic about buying a phone on the internet sight unseen, but the day it was reported that it finally had the pinch-to-zoom thing enabled, I decided to take the plunge. I've wanted to have a GPS unit for my car anyway, and I thought that google navigation might finally make a phone useful.

And it does. What a difference! I no longer feel like I'm dragging a phone with me "just in case" I would need to get in touch with somebody - now I'm having a useful (and admittedly pretty good-looking) gadget instead. The fact that you can use it as a phone too is kind of secondary.

February 06, 2010 01:22 PM

Pavel Machek: Welcome to ugly world of windows

So you want example sources, circa 100K total. You have to download the 15MB slac341.zip, which extracts into a 15MB Chronos-Setup.exe. That in turn blackmails you into accepting about 20 pages of ugly legalese presented in a tiny window. You get past that (in wine), select a directory, and then InstallJammer tells you that you don't have permissions to the directory, oops.

February 06, 2010 07:06 AM

February 05, 2010

Evgeniy Polyakov: Elliptics network got new on-disk format

Eventually any storage should go into production mode, which implies not only data storage itself but also access restrictions. Distributed hash table systems, like elliptics network, do not have dedicated servers which could store that information and manage access permissions, so each object should have its own set of rules. Without a proper security framework on top of the network media this will not guarantee the required data access granularity, but even in this model it is still possible to implement IO permissions to some degree.

Until now elliptics network did not have even a slight mechanism for doing this. And even the rudimentary metadata that was supported was stored in the transaction log and did not allow any kind of extensions or proper updates.

And although I have not yet written a single line of code to deal with metadata, I have already broken the old-style transaction logs, which now contain transaction information only. There are no metadata objects at all, but I will update the appropriate parts of the library to generate them and store them in separate entities.

It is possible to store metadata in different objects, like ones indexed by the hash of the original object's name plus some extension, but this will force the system to perform two lookups to find the needed object and its metadata.
Another way is to add a new object type alongside the existing transaction and history log objects - all metadata will be stored close to the object itself and could be fetched using only the object ID. In the filesystem backend, where each object is stored as a separate file, metadata will be indexed by the '.metadata' extension or similar - just like we have $ID.history for transaction logs. In the database backend (BDB and Tokyo Cabinet, although I am seriously considering dropping the former, since it is unacceptably slow compared to TC) it will be a separate table, indexed by the object ID.
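
A tiny sketch of the filesystem-backend naming idea, assuming the object ID is already available as a hex string. The helper name object_path, the directory layout and the exact suffix strings are placeholders for illustration, not the actual elliptics code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: builds "<dir>/<hex_id>.<suffix>" into out.
 * Returns the length of the path, or -1 if it does not fit in out_len bytes.
 */
int object_path(char *out, size_t out_len,
                const char *dir, const char *hex_id, const char *suffix)
{
    assert(out != NULL && dir != NULL && hex_id != NULL && suffix != NULL);

    int n = snprintf(out, out_len, "%s/%s.%s", dir, hex_id, suffix);
    if (n < 0 || (size_t)n >= out_len)
        return -1;      /* encoding error or truncated path */
    return n;
}
```

Both companions of an object then derive from the same ID, for example object_path(buf, sizeof buf, dir, id, "history") and object_path(buf, sizeof buf, dir, id, "metadata"), so no second hash lookup is needed to locate the metadata.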

Metadata will have a flexible format (maybe even a human-readable one based on strings?) to allow extensions without breaking backwards compatibility.

But first I should fix the background log checker. Although it currently syncs all kinds of objects (i.e. when some object is missing from the storage but a copy of it exists under a different ID, it will upload the missing data from that copy), it does so slightly the wrong way, namely messing with hashes and producing unneeded additional transaction references. When the checker is ready, the whole storage fsck process will just combine a log based on metadata objects and start the check process for it.

Stay tuned, we are very close to the next major release, which will draw the line of the serious features and changes!

February 05, 2010 11:45 PM

Matthew Garrett: Shaping young minds

I'm off to CMU at the weekend, in order to do a couple of talks on Monday (the 8th). I'll be giving an introduction to ACPI to the operating systems class in the morning, and an open presentation on Fedora, some of the challenges we face and how to get involved in Linux in the afternoon. This is as a result of our cooperation with CMU, which has led to things like the request on the right. How could we refuse?

February 05, 2010 06:54 PM

Evgeniy Polyakov: Opened skiing season

Of course it's downhill skiing; I used to hate cross-country skiing, having wasted 3 years in that section when I was at the university.

And actually I not only opened the season, but tried it for the first time. Some years ago I made a downhill run on a board, but I wore 'grinders' shoes instead of special shoes :)

Anyway, I do not know how to ski downhill, so I took an hour-long ski lesson and found myself unable to perform even the simplest things like V-deceleration (I do not know what it is called in English, but the feet are supposed to form a kind of V figure).
Apparently I found a way to learn this quickly - when you are flying into a wall at bone-breaking speed, it is better to find a way to decelerate and stop. One of them is to fall, but it has its own rather serious problems - pain, haematomas and ego drop.

The first training was rather painful and without interesting results, except that I found myself liking this stuff very much. So I went there again a day later and enjoyed the hell out of skiing from the top. Even multiple times down to the bottom without falls. And EVEN once (or at least half-once) I moved the way and at the speed I wanted.
I'm sure my carving and V-turns are ugly as hell, but I like how things go.

The good news is that the all-year hill center (well, I was too lazy to go to the real hills :) is very conveniently located between my home and office, so I can spend 1-2 morning hours there. I always wanted to be able to ski and it's quite possible now with a car.


Not me :)

February 05, 2010 03:13 AM

Harald Welte: Symbian is Open Source - Really?

In recent news, the Symbian Foundation announced that "All 108 packages containing the source code of the Symbian platform can now be downloaded from Symbian's developer web site". This is great news!

This morning I tried to look at the parts most interesting to me: phonesrv (implementing call engine, cell broadcast and SIM toolkit APIs) and poc (implementing push-to-talk). Their pages don't have the usual "source code" tab at the bottom right which links to mercurial and tarball download pages!

Either I'm too stupid, or there simply is no source code to be found for those two components. I'm quite sure something as essential as the APIs for making phone calls is considered part of the Symbian platform. So how does that match with the statement that all packages containing the Symbian platform can now be downloaded?

February 05, 2010 01:00 AM

February 04, 2010

Pavel Machek: turning $3000, 4GFLOPS Unix workstation into... way too heavy wristwatch

I started mychronos project on sourceforge. Laurent Arditi asked me what it is about, but I could not write him back, so... let me explain here.

For now I do not have the watch, so I set up a simple emulator... that will also make development easier later. It is extremely hacky, but works well enough to show time and allow you to set it. Just run it with input redirected from /dev/input/eventX -- from your keyboard. And you may want to comment out first few lines of main().

I'd like to make the firmware compilable under Linux, then extend it with new features ... like time of sunset/sunrise, etc.

BTW...does someone have official chronos firmware in .tar or .zip? All I could find was .exe, and I'd prefer not to install wine.

February 04, 2010 11:38 PM