Disks and equipment

Here is a quick overview of the disks used in my tests:

   Brand            Model            Capacity  Speed        Interface    Notes
   Western Digital  WD4000AAKS       400 GB    7,200 rpm    SATA 3 Gbps
   Western Digital  WD10EARS         1 TB      > 5,400 rpm  SATA 3 Gbps  1
   Western Digital  WD20EARS         2 TB      > 5,400 rpm  SATA 3 Gbps  2
   Western Digital  WD20EADS         2 TB      > 5,400 rpm  SATA 3 Gbps  3
   Hitachi          HDS722020ALA330  2 TB      7,200 rpm    SATA 3 Gbps  3

The computer runs FreeBSD 9 on an Intel E6600 with an ICH9 chipset and 4 GB RAM. For convenience, the disks were tested in an Akasa Duo Dock connected by eSATA cable to one of the ICH9 SATA ports.

1 Kindly provided by Alastair Hogge

2 Kindly provided by GetOnline Ltd.

3 Kindly provided by Dansk Scanning AS

Benchmarking Advanced Format drives

Important: due to a bug in my benchmark program, the tps numbers in this post are incorrect. See here for the correct numbers.

In the previous post, I discussed Western Digital's “Advanced Format” drives and the problems caused by their misreporting their real, physical sector size.

I wrote a benchmark utility, phybs, to demonstrate the performance penalty of unaligned accesses and uncover a drive's physical sector size. What it does is write blocks of zeroes of varying size at regular intervals. For each block size, it writes a total of 128 MB at intervals of four times the block size, first at offset 0 and then at offsets that vary from 512 bytes up to half of the block size.

With the default settings, the first pass will write 131,072 1,024-byte blocks at n × 4,096, and the second pass will do the same at n × 4,096 + 512. The third, fourth and fifth passes will write 65,536 2,048-byte blocks each at n × 8,192, n × 8,192 + 512 and n × 8,192 + 1,024. It will make four more passes with 4,096-byte blocks and five with 8,192-byte blocks.
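The pass schedule described above can be reproduced with a short sketch. This is not the benchmark program itself, only an illustration of the arithmetic; the function name `pass_plan` and its parameters are hypothetical, and it assumes that the offsets double from 512 bytes up to half the block size, which is consistent with the offsets listed above.

```python
def pass_plan(total=128 * 1024 * 1024, min_size=1024, max_size=8192,
              min_offset=512, step_factor=4):
    """Return one (count, size, offset, step) tuple per pass.

    Assumption: offsets are 0, then 512, 1024, ... doubling up to
    half of the block size.
    """
    passes = []
    size = min_size
    while size <= max_size:
        count = total // size          # number of blocks needed for 128 MB
        step = size * step_factor      # distance between consecutive writes
        offsets = [0]
        off = min_offset
        while off <= size // 2:
            offsets.append(off)
            off *= 2
        for off in offsets:
            passes.append((count, size, off, step))
        size *= 2
    return passes


plan = pass_plan()
for number, (count, size, offset, step) in enumerate(plan, start=1):
    print(f"{number:2d} {count:8d} {size:8d} {offset:8d} {step:8d}")
```

With the defaults this yields fourteen passes: two with 1,024-byte blocks, three with 2,048-byte blocks, four with 4,096-byte blocks and five with 8,192-byte blocks.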

Here's the idea: most passes will be very slow (up to half an hour per pass), but when we hit the right block size and alignment, performance will skyrocket. On, say, a WD20EARS with factory settings, passes 6 (4,096 bytes at offset 0), 10 (8,192 bytes at offset 0) and 14 (8,192 bytes at offset 4,096) should stand out from the crowd. In fact, here are the results for passes 6 through 9:

   count    size  offset    step        msec     tps    kBps
   32768    4096       0   16384       19503     138    6720
   32768    4096     512   16384     1216537       2     107
   32768    4096    1024   16384     1213479       2     108
   32768    4096    2048   16384     1214623       2     107

Pass 6 takes 20 seconds, while passes 7, 8 and 9 take about 20 minutes each.

Let me rephrase that: properly aligned non-sequential writes are faster than misaligned ones by a factor of sixty.

Sixty. Six zero.

We really, really need to get that fixed somehow.
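For anyone who wants to check the factor of sixty, here is a small sketch that recomputes the kBps column and the slowdown from the msec column of the WD20EARS table. The helper name `kbps` is made up for this example, and the truncation to a whole number is an assumption that happens to match the figures in the table.

```python
TOTAL_KB = 128 * 1024  # each pass writes 128 MB, expressed in kB


def kbps(msec):
    """Throughput of one pass in kB/s, truncated to an integer (assumed)."""
    return TOTAL_KB * 1000 // msec


aligned_msec = 19503        # pass 6: 4,096 bytes at offset 0
misaligned_msec = 1216537   # pass 7: 4,096 bytes at offset 512

slowdown = misaligned_msec / aligned_msec

print(kbps(aligned_msec))
print(kbps(misaligned_msec))
print(round(slowdown, 1))
```

The ratio comes out slightly above sixty.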

That's not the whole story, though. Let's see how it compares to a 7,200 rpm, 2 TB Hitachi Deskstar (HDS722020ALA330) with 512-byte physical sectors:

   count    size  offset    step        msec     tps    kBps
   32768    4096       0   16384        8803     307   14889
   32768    4096     512   16384        8701     310   15063
   32768    4096    1024   16384        8735     309   15004
   32768    4096    2048   16384        8705     310   15056

The Hitachi blows through the test so fast you don't even have time to make yourself a cup of coffee, let alone drink it.

This is a 7,200 rpm, 400 GB Caviar SE16 (WD4000AAKS)—more than three years old, so don't expect too much:

   count    size  offset    step        msec     tps    kBps
   32768    4096       0   16384       21348     126    6139
   32768    4096     512   16384       21674     124    6047
   32768    4096    1024   16384       20799     129    6301
   32768    4096    2048   16384       21031     128    6232

So, about the same as we get from the WD20EARS with aligned writes.

Now, here's the kicker. The last drive in my test lineup is a WD20EADS—almost the same as the WD20EARS, but with 512-byte sectors and only 32 MB cache (although cache doesn't mean anything here—I made sure my test program writes enough data to blow through the cache on every pass).

   count    size  offset    step        msec     tps    kBps
   32768    4096       0   16384       22811     118    5745
   32768    4096     512   16384       19552     138    6703
   32768    4096    1024   16384       36945      73    3547
   32768    4096    2048   16384       50102      53    2616

Ouch. It's not just slow, it's also very inconsistent. I have no idea what to make of that.

Note 1: I did not mention rotational speed for the WD Green disks, because Western Digital themselves do not specify one; the spec sheet just says “IntelliPower”. Not sure what to make of that, either. Tom's Hardware contradict themselves, saying in one review that it means 5,400, and in another that it means it varies. Meanwhile, my supplier claim the WD20EARS rotates at 7,200 rpm. Go figure.

Note 2: I also have a 1 TB WD10EARS, but I haven't tested it yet. I expect it to perform pretty much as well (or as poorly, depending on your perspective) as the WD20EARS.


Update: the results for the WD10EARS are in. Strangely, it is much faster at unaligned writes than the WD20EARS, although it's a little slower at aligned writes.

   count    size  offset    step        msec     tps    kBps
   32768    4096       0   16384       23105     116    5672
   32768    4096     512   16384       79285      34    1653
   32768    4096    1024   16384       75814      35    1728
   32768    4096    2048   16384       79920      33    1640

A naïve sequential-write benchmark (diskinfo -t) suggests that it's about 20% slower overall. It is possible that both disks use a striped layout internally, so the WD20EARS gets better results because it has more platters. If that is the case, it should be possible to modify phybs to detect the stripe size.

Exploring WD Advanced Format drives

I've been playing with WD Green disks, trying to solve the 4,096-byte sector problem. To summarize, Western Digital have started to move from 512-byte sectors to 4,096-byte sectors in order to reduce overhead and thereby increase the amount of data that can be stored on the same number of platters at the same density. These disks (specifically, the EARS and AARS series) emulate 512-byte sectors for compatibility with older BIOSes and operating systems, but the problem is that they report 512-byte logical and physical sectors instead of 512/4,096.

If the length of a write operation is not a multiple of 4,096, or it does not begin at an address divisible by 4,096, either the beginning or the end of the operation, or both, will cover only part of a sector. This requires the disk to do a read-modify-write operation, meaning that it has to read a complete 4,096-byte sector, update parts of it, and write it back. This is extremely inefficient, as I will demonstrate later.
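The rule in the previous paragraph can be written down as a short sketch. It is a model of the reasoning only, not of any drive's firmware; the function name `partial_sectors` is hypothetical, and offsets and lengths are given in bytes.

```python
PHYS = 4096  # physical sector size in bytes


def partial_sectors(offset, length, phys=PHYS):
    """Count the physical sectors that a write covers only partially.

    Each such sector would need a read-modify-write cycle.
    """
    if length <= 0:
        return 0
    end = offset + length
    first = offset // phys          # physical sector holding the first byte
    last = (end - 1) // phys        # physical sector holding the last byte
    head_partial = offset % phys != 0
    tail_partial = end % phys != 0
    if first == last:
        # The whole write lies inside a single physical sector.
        return 1 if (head_partial or tail_partial) else 0
    return int(head_partial) + int(tail_partial)


print(partial_sectors(0, 4096))     # aligned start, whole sector
print(partial_sectors(512, 4096))   # straddles two physical sectors
print(partial_sectors(0, 1024))     # aligned start, but shorter than a sector
```

A 4,096-byte write at offset 512 touches two physical sectors and covers neither of them completely, so both the beginning and the end of the operation need a read-modify-write.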

The reason why this matters so much is subtle. For efficiency reasons, most modern filesystems use on-disk structures of 4,096 bytes or more, so it shouldn't matter, right? But on PCs, for legacy reasons, the first filesystem on a disk (or rather, the first partition) usually starts at sector 63, and 63 × 512 is not a multiple of 4,096. This means that every write operation will be misaligned.

In most cases, you can work around this by making sure, when you partition a new disk, that the first partition starts on a 4,096-byte boundary (say, sector 64 instead of 63). In addition to that, the WD EARS and AARS disks have a jumper setting that makes the disk offset every read or write operation by exactly one logical sector, so what the computer thinks is logical sector 0 is actually logical sector 1, and what the computer thinks is logical sector 63 is actually logical sector 64. Unfortunately, this means that with the jumper set, systems that use the whole disk, starting at address 0, or that already take care to align their writes on 4,096-byte boundaries, are screwed.
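Here is a minimal sketch of the alignment arithmetic for partition starts, with and without the jumper. The function name `is_aligned` and the `jumper` flag are made up for this example, and the jumper is modelled simply as "add one logical sector to every address", as described above.

```python
LOGICAL = 512    # logical sector size reported by the drive, in bytes
PHYSICAL = 4096  # actual physical sector size, in bytes


def is_aligned(lba, jumper=False):
    """True if logical sector `lba` starts on a 4,096-byte boundary.

    Assumption: with the jumper set, the drive adds one logical sector
    to every address it receives.
    """
    actual = lba + 1 if jumper else lba
    return (actual * LOGICAL) % PHYSICAL == 0


# Byte address of sector 63 and its remainder modulo 4,096
print(63 * LOGICAL, (63 * LOGICAL) % PHYSICAL)

for lba in (0, 63, 64):
    print(lba, is_aligned(lba), is_aligned(lba, jumper=True))
```

Sector 63 is misaligned without the jumper and aligned with it, while sectors 0 and 64 go the other way, which is why the jumper hurts anyone who uses the whole disk or already aligns their writes.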

There is another problem: ZFS. ZFS operates on variable-sized blocks of any power of two between 512 bytes and 128 kilobytes. The only way to prevent ZFS from using block sizes smaller than 4,096 bytes is to build your vdevs from devices which advertise 4,096-byte sectors.

The ideal solution is to either a) force the disk to advertise its true physical sector size, or b) hack FreeBSD so it recognizes disks with 4,096-byte sectors.

Regarding the first option, it might be possible to lobby Western Digital to release a firmware upgrade like they did for the auto-idle issue.

As for the second solution, there is an important question: should we do this unconditionally? If we do, then misaligned filesystems on existing disks will become inaccessible. However, one could argue that those filesystems are already essentially unusable due to atrocious performance.

Old history

I am the current maintainer of OpenSSH for FreeBSD, and have been since 2002. I am also the author and maintainer of the PAM implementation used by FreeBSD, and of several of the accompanying PAM modules. Finally, I was a member of the FreeBSD Security Team for several years, served as Assistant Security Officer and Acting Security Officer, and authored or co-authored around 20 security advisories between 2002 and 2004.

I have been asked to comment on SecurityFocus advisories 7467 and 7482, regarding timing attacks against certain versions of OpenSSH that were distributed with FreeBSD 4.x and 5.x releases.

The short version is that no FreeBSD 4.x or 5.x release was ever vulnerable. Read on for the long version.

The two issues are closely related, albeit not identical. They both allow an attacker to confirm information that they already have, or that they have guessed, and the attack mechanism is roughly the same. In both cases, a different code path is taken depending on the outcome of the authentication process. In the affected versions of OpenSSH, there are three possible outcomes:

  1. non-existent user: no attempt to verify password
  2. existent user, wrong password: authentication fails
  3. existent user, correct password: authentication succeeds

Additionally, the user may be administratively blocked—for instance, if PermitRootLogin is off, root will not be able to log in at all. This is the default in FreeBSD.

An attacker that can distinguish between cases 1 and 2 can verify that a particular user exists on the target system. This is the core of issue 7467.

An attacker that can distinguish between cases 2 and 3, even if they do not ultimately gain access, can verify that a particular password is correct for a particular user on the target system. This is the core of issue 7482.

Outwardly, OpenSSH will behave in the exact same manner in all three cases: if you try to log in as a non-existent user, or as a user that is administratively blocked, you will still be asked for your password.

Internally, however, the code paths are different. In case 1, OpenSSH will not invoke PAM authentication to verify the password provided by the client, since it already knows the outcome. In cases 2 and 3, it will invoke PAM authentication. The attack relies on being able to determine whether PAM authentication was invoked, and whether it was successful, even if you are unable to log in.

Here is the crux of the matter: some operating systems have PAM modules that will delay for two seconds when authentication fails. Therefore:

  • An attacker that attempts to log on as a particular user with a made-up password will see a two-second delay after each password prompt if and only if the user exists.
  • An attacker that attempts to log on as a user which they know exists, but which is administratively blocked (e.g. root, see above), will see a two-second delay after each password prompt if and only if they typed the wrong password.
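The reasoning in the two points above can be captured in a toy model. This is not OpenSSH or PAM code; the function `attempt`, the user table and the `blocked` set are all hypothetical, and the delay is returned as a number rather than actually slept.

```python
FAIL_DELAY = 2.0  # seconds; the delay some PAM modules add on failure


def attempt(users, blocked, user, password, fail_delay=FAIL_DELAY):
    """Model one password attempt against an affected OpenSSH version.

    Returns (logged_in, delay): whether the login succeeds and how long
    the client waits after the password prompt.
    """
    if user not in users:
        # Case 1: PAM is never invoked, so there is no failure delay.
        return False, 0.0
    if users[user] != password:
        # Case 2: PAM is invoked and fails.
        return False, fail_delay
    # Case 3: PAM succeeds, but a blocked user is refused anyway.
    return user not in blocked, 0.0


users = {"alice": "s3cret", "root": "hunter2"}   # made-up accounts
blocked = {"root"}                               # e.g. PermitRootLogin off

print(attempt(users, blocked, "bob", "guess"))     # no such user
print(attempt(users, blocked, "alice", "guess"))   # user exists, wrong password
print(attempt(users, blocked, "root", "guess"))    # blocked, wrong password
print(attempt(users, blocked, "root", "hunter2"))  # blocked, right password
```

All four attempts are refused, but only the second and third show the delay. Calling `attempt` with `fail_delay=0.0` models a system whose PAM modules do not delay on failure, in which case the four attempts look the same to the client.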

FreeBSD's PAM modules do not delay when authentication fails. There will still be minute differences in timing, but they will be too small to measure across the network.

Therefore, FreeBSD is not vulnerable to these attacks.

Finally, I should mention that the first issue was addressed in OpenSSH 3.6.1p2 by invoking PAM authentication even for non-existent users, and the second in OpenSSH 3.9p1 by passing an invalid password to PAM when someone tries to log in as root on a system where PermitRootLogin is off.

[edited for clarity on 2008-08-04]

Ten years

That's how long, to the day, I have been a FreeBSD committer.

Ten years seems like a long time when you write it down on paper, or say it out loud, or try to imagine who and where you will be in ten years' time; but when I think back on my time as a FreeBSD committer, it's hard to believe it's really been that long.

The strangest part is seeing younger (or rather, more recently anointed) committers defer to me. I'm not the old tenured professor! I'm not the sage on the mountain! Look at phk, he's the old fart, not me! I'm still a rookie! I practically haven't done anything for the project! I mean, apart from libfetch, and pseudofs, and the PAM stack, and OpenSSH, and the Tinderbox, and stints as Bugmeister and Security Officer, and...

This is where my train of thoughts derails, when I realize how much I've actually done (although I don't even come close to people like phk, jhb, or rwatson), and oh shit, it's actually been ten years!

Update: when I told my wife about this, her immediate reaction was “and they say men can't commit to anything...”

I broke Béranger’s heart

Béranger, the author of the long rant on which I have commented twice before, seems deeply hurt by my comments. Deeply enough, at least, to spend most of his after-game report lambasting me, and to post a complaint on freebsd-advocacy.

Read it if you like. He deliberately misunderstands me, twists my words (including some from private conversation), pounces on strawmen, and still can't understand that the FreeBSD Foundation is a different entity from the FreeBSD Project, because apparently if the Foundation licenses and distributes software that runs on FreeBSD but isn't included in FreeBSD, then the Foundation is FreeBSD.

And he still can't get my name right.

I won't bother rebutting.

SATA is not SCSI… or is it?

One further comment on The sorry state of open source today, which I did not want to include in my previous entry as I felt it would distract from my main point, which was the inaccuracies in the author's discussion of FreeBSD.

On page 19, Béranger discusses problems with the disk drivers in Linux 2.6.20. These problems are real (though hopefully transient), and I have myself been bitten by them, as on one machine, Ubuntu's linux-image-2.6.20-14-386 would not recognize the disks at all; I could boot an older kernel, but then of course nvidia-glx, which had been updated to match the newer non-working kernel, would not load.

Where Béranger stumbles is where he asserts—or implies—that there are fundamental differences between PATA, SATA and SCSI, and that it therefore does not make sense to use similar names (/dev/sdX) for them all.

The reality is far more fluid. Not only is the line between these technologies shifting, it is also gradually disappearing. With the arrival of Serial Attached SCSI, for instance, SATA and SCSI drives share the same cables, connectors and electrical specs, and can be attached to the same controller (although the protocol remains different). Going further back, ATAPI (the standard for attaching removable-media drives to the ATA bus) uses SCSI commands for everything except actual data transfer.

The questions Béranger should be asking are not why PATA, SATA and SCSI hard disks need to share a name, but rather:

  • why they haven't done so since the start,
  • why the common name should refer explicitly to only one of these standards (sd is short for "SCSI disk"), and
  • why removable-media and fixed-media drives shared a name (/dev/hdX) for so long, especially when that name refers explicitly to fixed-media disks (hd being short for "hard disk").

Compare with FreeBSD, which also uses different names for ATA and SCSI devices, but at least differentiates between the various types of media: ad ("ATA disk") and da ("direct access") for PATA/SATA and SCSI fixed-media disks respectively, acd and cd for CD and DVD drives, ast and st for tape streamers. There is work under way to bring the ATA driver into the CAM framework (which was introduced in 3.0 to clean up SCSI device handling), erasing these differences.

The sorry state of The Jem Report

Jem Matzan's The Jem Report is running a so-called editorial by Radu-Cristian Fotescu (a.k.a. Béranger) titled The sorry state of open source today. I say so-called, because it is more of a rant than an editorial: 26 pages long and not entirely coherent.

I won't waste your time with a point-by-point rebuttal of this piece, not least because most of what he writes is pure opinion and interpretation. I don't necessarily agree with it—I find him a little too radical and a little too confrontational—but he's entitled to it.

(I do agree with his views on the differences between the GPL and the BSD license, but that's neither here nor there)

What I take exception to are factual errors in his discussion of *BSD, and specifically of FreeBSD.

First, on page 7, he writes:

[members of] the *BSD family [are] either backed by 501(c)(3) non-profit organizations like The FreeBSD Foundation or the NetBSD Project, or the task of individuals like Theo de Raadt for OpenBSD and Matt Dillon for DragonFly BSD

This is inaccurate as far as FreeBSD goes. The FreeBSD Foundation was established because a recognized legal entity (which the FreeBSD Project itself is not) was required for three purposes:

The Foundation contributes to the Project in the sense that it owns some of the hardware that the Project uses, organizes and funds events such as the Developer Summits, which are held twice a year at BSDCan and EuroBSDCon, and occasionally funds development work.

This does not mean that the FreeBSD Project is backed by the FreeBSD Foundation. The Foundation does not control the Project in any way; it does not control the Project's servers and repositories, it does not approve new committers, write roadmaps or schedule releases. The democratically elected Core Team does that.

This also invalidates the following quote from page 3:

even the FreeBSD foundation [sic] is an American subject

The implication is that since the FreeBSD Foundation is subject to American law, the FreeBSD Project is subject to American software patents.

Once again, the Foundation does not control FreeBSD. Moreover, the bulk of FreeBSD development today—at least the cutting-edge work—happens in Europe (especially Eastern Europe, but there are also strong contingents in Denmark, Italy and the UK) and Asia.

The NetBSD situation is different, as The NetBSD Foundation, Inc. appoints the NetBSD Core Group and various executive committees, and handles applications from new developers, who must sign a membership agreement with the Foundation before obtaining write access to the repositories.

The next objection I have relates to the following quotes from pages 10 and 22:

Maybe the first sign was when FreeBSD tried to mimic Linux and to be "more user-friendly". It was rather wrong, and I can only hope they realized the mistake. The gradual recovery of the once legendary FreeBSD stability can be seen with 6.1 and 6.2. Hopefully the trend will not be reversed again.
Unfortunately, starting with 5.0, the quality and stability of FreeBSD have deteriorated, or maybe it was my hardware that was less supported. Compromises have been made to match the popularity of Linux, and the price to pay was rather high.

These two paragraphs are based on a completely incorrect understanding of what happened between FreeBSD 4 and FreeBSD 6. User friendliness, popularity, Linux envy: none of this ever played any part. I will attempt to briefly explain what actually happened.

FreeBSD first gained support for symmetric multiprocessing in 3.0. The approach chosen then was a simplistic one: the entire kernel was serialized under a so-called "giant lock". This saved a lot of work, as there was no need to make the various kernel subsystems reentrant. It also meant that FreeBSD's SMP performance was not very good, except for computationally intensive tasks (i.e. tasks which did not involve the kernel much), but it was better than nothing.

After FreeBSD 4 was released, work was started on a project called SMPng, which would replace the giant lock with a number of fine-grained locks, allowing separate kernel threads to run simultaneously on separate processors, performing separate tasks.

This is where two important mistakes were made.

Firstly, the release schedule was feature-bound rather than time-bound. This meant that FreeBSD 5 would be released "when it was done", not "when it was time". Consequently, there was no deadline for developers to focus on, and little effort was made to partition work into manageable pieces to avoid having the source tree in an unreleasable state for long periods of time.

Secondly, and most importantly, it was decided very early on to go for a complicated M:N scheduling model, based on a variant of scheduler activations called KSE. With KSE, a userland process with N threads (1 <= N) would receive M kernel threads (1 <= M <= N) on which to run. The userland threads library would be responsible for scheduling application threads on top of the kernel threads. Within the kernel, threads would be scheduled using a two-tier system involving KSE groups (groups of threads belonging to the same process) and individual threads.

Scheduler activations had academic backing and the M:N model had proved workable in Solaris, and NetBSD was moving in the same direction, but this decision was made before there was even proof-of-concept code to demonstrate its viability in FreeBSD.

(Ironically, around that time, Sun decided to abandon the M:N model in Solaris, and NetBSD is now moving to 1:1)

To cut a long story short, KSE was a disaster. It took years to implement, delaying other work which depended on it, and it never worked properly. FreeBSD 5 was going nowhere fast until a developer who until then had not been involved in either SMPng or KSE lost patience and implemented a 1:1 thread library on top of KSE. In the end, KSE was never even fully implemented, and it is now being slowly replaced with a pure 1:1 model.

So there you have it. FreeBSD 5 was the result of a failed experiment in M:N threading. FreeBSD 6 was an attempt to correct those mistakes, and FreeBSD 7 will dispose of KSE entirely and complete the fine-grained locking work that started with FreeBSD 5.

Finally, the FreeBSD Project learned an expensive lesson with FreeBSD 5. We now have a separate repository where experimental code can mature and high-risk changes can be tried out before entering CVS. We have also switched our release model to time-based releases rather than feature-based releases. If, as a release date approaches, we realize that a particular feature won't be ready in time, we simply leave that feature out of the release rather than change our schedule.

The sorry state of The Jem Report

Jem Matzan's The Jem Report is running a so-called editorial by Radu-Cristian Fotescu (a.k.a. Béranger) titled The sorry state of open source today. I say so-called because it is more of a rant than an editorial: 26 pages long and not entirely coherent.

I won't waste your time with a point-by-point rebuttal of this piece, not least because most of what he writes is pure opinion and interpretation. I don't necessarily agree with it—I find him a little too radical and a little too confrontational—but he's entitled to it.

(I do agree with his views on the differences between the GPL and the BSD license, but that's neither here nor there.)

What I take exception to are factual errors in his discussion of *BSD, and specifically of FreeBSD.

First, on page 7, he writes:

[members of] the *BSD family [are] either backed by 501(c)(3) non-profit organizations like The FreeBSD Foundation or the NetBSD Project, or the task of individuals like Theo de Raadt for OpenBSD and Matt Dillon for DragonFly BSD

This is inaccurate as far as FreeBSD goes. The FreeBSD Foundation was established because a recognized legal entity (which the FreeBSD Project itself is not) was required for three purposes:

The Foundation contributes to the Project in the sense that it owns some of the hardware that the Project uses, organizes and funds events such as the Developer Summits, which are held twice a year at BSDCan and EuroBSDCon, and occasionally funds development work.

This does not mean that the FreeBSD Project is backed by the FreeBSD Foundation. The Foundation does not control the Project in any way; it does not control the Project's servers and repositories, it does not approve new committers, write roadmaps or schedule releases. The democratically elected Core Team does that.

This also invalidates the following quote from page 3:

even the FreeBSD foundation [sic] is an American subject

The implication is that since the FreeBSD Foundation is subject to American law, the FreeBSD Project is subject to American software patents.

Once again, the Foundation does not control FreeBSD. Moreover, the bulk of FreeBSD development today—at least the cutting-edge work—happens in Europe (especially Eastern Europe, but there are also strong contingents in Denmark, Italy and the UK) and Asia.

The NetBSD situation is different, as The NetBSD Foundation, Inc. appoints the NetBSD Core Group and various executive committees, and handles applications from new developers, who must sign a membership agreement with the Foundation before obtaining write access to the repositories.

The next objection I have relates to the following quotes from pages 10 and 22:

Maybe the first sign was when FreeBSD tried to mimic Linux and to be "more user-friendly". It was rather wrong, and I can only hope they realized the mistake. The gradual recovery of the once legendary FreeBSD stability can be seen with 6.1 and 6.2. Hopefully the trend will not be reversed again.

Unfortunately, starting with 5.0, the quality and stability of FreeBSD have deteriorated, or maybe it was my hardware that was less supported. Compromises have been made to match the popularity of Linux, and the price to pay was rather high.

These two paragraphs are based on a completely incorrect understanding of what happened between FreeBSD 4 and FreeBSD 6. User friendliness, popularity, Linux envy: none of this ever played any part. I will attempt to briefly explain what actually happened.

FreeBSD first gained support for symmetric multiprocessing in 3.0. The approach chosen then was a simplistic one: the entire kernel was serialized under a so-called "giant lock". This saved a lot of work, as there was no need to make the various kernel subsystems reentrant. It also meant that FreeBSD's SMP performance was not very good, except for computationally intensive tasks (i.e. tasks which did not involve the kernel much), but it was better than nothing.

After FreeBSD 4 was released, work was started on a project called SMPng, which would replace the giant lock with a number of fine-grained locks, allowing separate kernel threads to run simultaneously on separate processors, performing separate tasks.
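To make the difference between the two locking strategies concrete, here is a toy sketch in Python. It is not FreeBSD code: the class names, the `can_enter_while_busy` helper and the "vfs" and "net" subsystem names are all made up for illustration. It only shows that under a single giant lock a second caller cannot enter an unrelated subsystem while the first is busy, whereas with one lock per subsystem it can.

```python
import threading


class GiantLockKernel:
    """Toy model: every subsystem shares the same single lock."""

    def __init__(self, subsystems):
        giant = threading.Lock()
        self.locks = {name: giant for name in subsystems}


class FineGrainedKernel:
    """Toy model: each subsystem is protected by its own lock."""

    def __init__(self, subsystems):
        self.locks = {name: threading.Lock() for name in subsystems}


def can_enter_while_busy(kernel, busy, wanted):
    """Return True if `wanted` can be entered while `busy` is locked."""
    with kernel.locks[busy]:
        # Non-blocking attempt, so a held lock yields False instead of a hang.
        acquired = kernel.locks[wanted].acquire(blocking=False)
        if acquired:
            kernel.locks[wanted].release()
        return acquired


SUBSYSTEMS = ["vfs", "net"]  # hypothetical subsystem names

print(can_enter_while_busy(GiantLockKernel(SUBSYSTEMS), "vfs", "net"))    # False
print(can_enter_while_busy(FineGrainedKernel(SUBSYSTEMS), "vfs", "net"))  # True
```

The real work in SMPng was of course far harder than swapping one lock for many: each subsystem had to be made safe for concurrent entry first, which is exactly the effort the giant lock had allowed the 3.0 developers to postpone.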

This is where two important mistakes were made.

Firstly, the release schedule was feature-bound rather than time-bound. This meant that FreeBSD 5 would be released "when it was done", not "when it was time". Consequently, there was no deadline for developers to focus on, and little effort was made to partition work into manageable pieces to avoid having the source tree in an unreleasable state for long periods of time.

Secondly, and most importantly, it was decided very early on to go for a complicated M:N scheduling model, based on a variant of scheduler activations called KSE. With KSE, a userland process with N threads (1 <= N) would receive M kernel threads (1 <= M <= N) on which to run. The userland threads library would be responsible for scheduling application threads on top of the kernel threads. Within the kernel, threads would be scheduled using a two-tier system involving KSE groups (groups of threads belonging to the same process) and individual threads.
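The M:N constraint can be illustrated with a few lines of Python. This is only a toy, and an assumption on my part as to how best to picture it: the function name is hypothetical and the round-robin placement is chosen for simplicity, whereas KSE's actual userland scheduler was far more involved. The point is just that N application threads share M kernel threads with 1 <= M <= N, and that 1:1 is the special case M = N.

```python
def assign_user_threads(n_user, m_kernel):
    """Spread n_user application threads over m_kernel kernel threads.

    Returns one list per kernel thread holding the application thread
    numbers it would run. Round-robin placement is an illustrative choice.
    """
    if not 1 <= m_kernel <= n_user:
        raise ValueError("need 1 <= M <= N")
    slots = [[] for _ in range(m_kernel)]
    for thread in range(n_user):
        slots[thread % m_kernel].append(thread)
    return slots


# M:N with N = 5, M = 2: the threads library must multiplex in userland.
print(assign_user_threads(5, 2))  # [[0, 2, 4], [1, 3]]

# 1:1 (M = N = 3): every application thread has its own kernel thread.
print(assign_user_threads(3, 3))  # [[0], [1], [2]]
```

In the 1:1 case there is nothing left for the userland library to schedule, which is a large part of why that model is so much simpler to get right.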

Scheduler activations had academic backing and the M:N model had proved workable in Solaris, and NetBSD was moving in the same direction, but this decision was made before there was even proof-of-concept code to demonstrate its viability in FreeBSD.

(Ironically, around that time, Sun decided to abandon the M:N model in Solaris, and NetBSD is now moving to 1:1.)

To cut a long story short, KSE was a disaster. It took years to implement, delaying other work which depended on it, and it never worked properly. FreeBSD 5 was going nowhere fast until a developer who until then had not been involved in either SMPng or KSE lost patience and implemented a 1:1 thread library on top of KSE. In the end, KSE was never even fully implemented, and it is now being slowly replaced with a pure 1:1 model.

So there you have it. FreeBSD 5 was the result of a failed experiment in M:N threading. FreeBSD 6 was an attempt to correct those mistakes, and FreeBSD 7 will dispose of KSE entirely and complete the fine-grained locking work that started with FreeBSD 5.

Finally, the FreeBSD Project learned an expensive lesson with FreeBSD 5. We now have a separate repository where experimental code can mature and high-risk changes can be tried out before entering CVS. We have also switched our release model to time-based releases rather than feature-based releases. If, as a release date approaches, we realize that a particular feature won't be ready in time, we simply leave that feature out of the release rather than change our schedule.