A Security Officer bulletin was issued recently to announce the end-of-life for FreeBSD 6.x, which is an opportunity to discuss how the FreeBSD release cycle generally works (for those not familiar with it).
A while ago a wind turbine was installed not far from my place. It is far enough away not to disturb us, yet near enough for us to notice that it turns a lot (IIRC I have seen it standing still only once).
This triggered a question. How much energy would such a device (smaller of course) produce at my place?
The answer depends on several factors: the wind speed, the wind direction and the wind-speed-to-power-output curve of the device. If you take a device which rotates around the vertical axis instead of the horizontal axis, the wind direction can be taken out of the question (probably not completely, but to answer my question this simplification should be ok). The power-output curve depends on the device, and I hope it is easy to get from the vendors. The remaining open question is the wind speed at my place. Is there enough wind with enough speed?
To answer this question I bought a weather station with an anemometer (wind speed sensor). I searched a little before deciding on a specific one (actually I bought three of them: some coworkers got interested too, but they only found much more expensive ones, so soon there will be three more weather stations in use in Belgium, France and Germany). The main point is that I can connect it to a USB port of a PC, and there is some software for Linux to read out the data. It also comes with some other outdoor sensors (temperature, rain, wind direction, humidity, …) and an indoor control unit with some internal sensors (temperature, humidity). The user interface is mainly the touchscreen of the control unit. There is also some Windows software, which is needed to program the interval at which measurements are taken and saved in the control unit.
It seems the weather station is produced by Fine Offset Electronics Co.,Ltd and sold under different brands in different locations. The Linux software can read all of them, as the vendor and product IDs are not changed.
Porting the software was easy. It uses libusb, and I just had to correct a little problem with the non-portable functions it uses (I asked about them on usb@, and the response was that they had just been implemented at my request and would be committed to HEAD soon). I made a little patch so that the software only uses them when available (if you have not loaded the USB HID driver, you do not need to care about them) and committed it to the Ports Collection as astro/fowsr.
Now I just need to attach the outdoor sensors at the place where I would put the vertical axis wind turbine, install some toolkit which takes a series of measurements and displays them as a nice graph (while keeping all data values), and write some glue code to feed the output of fowsr to it. After a year I can then calculate how much energy a given wind turbine would have produced during the year and calculate its return on investment.
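The calculation at the end of that year could look roughly like the following Python sketch. It is only an illustration under stated assumptions: the (seconds, m/s) sample format and the power curve are hypothetical placeholders, not fowsr's actual output or any vendor's data, and the curve is interpolated linearly between its points.

```python
"""Rough energy estimate for a small wind turbine from anemometer samples.

Hypothetical assumptions (not from fowsr or any vendor):
- samples are (interval_seconds, wind_speed_m_s) pairs;
- the power curve is a list of (wind_speed_m_s, power_watts) points,
  sorted by wind speed, with distinct speeds.
"""
from bisect import bisect_right


def power_at(speed, curve):
    """Linearly interpolate the power curve; 0 W outside its range."""
    speeds = [s for s, _ in curve]
    if speed < speeds[0] or speed > speeds[-1]:
        return 0.0  # below cut-in or above cut-out speed
    i = bisect_right(speeds, speed)
    if i == len(speeds):
        return float(curve[-1][1])  # exactly at the last curve point
    (s0, p0), (s1, p1) = curve[i - 1], curve[i]
    return p0 + (p1 - p0) * (speed - s0) / (s1 - s0)


def energy_kwh(samples, curve):
    """Sum power * time over all samples and convert joules to kWh."""
    joules = sum(power_at(v, curve) * dt for dt, v in samples)
    return joules / 3.6e6


if __name__ == "__main__":
    # Hypothetical curve: cut-in 3 m/s, 1 kW from 10 m/s, cut-out above 25 m/s.
    curve = [(3.0, 0.0), (5.0, 200.0), (10.0, 1000.0), (25.0, 1000.0)]
    print(energy_kwh([(3600, 7.5)], curve))  # one hour at 7.5 m/s
```

Energy is power integrated over time, so each sample contributes its interpolated power multiplied by its interval; the division by 3.6 × 10^6 converts joules to kWh.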
The Linux software also references several weather sites; for some of them you can even get an iGoogle widget, so that you can view the data from wherever you want (as long as you have a suitable internet connection). This is also something I will have a look at later.
Note to users in Europe: the device also comes with a DCF77 receiver. As the time is distributed in UTC+1 (or UTC+2, depending on daylight saving time), you should set the timezone relative to this, not to plain UTC (so for me, living in that same timezone, the setting should be ‘0’).
I've seen an HP "LeftHand" / StorageWorks P4000 SAN device recently and got quite a good impression of it. One thing that occurred to me is: why didn't anyone try this before? Both Linux (to a lesser extent) and FreeBSD (to a somewhat greater one) contain the pieces for it, and have done so for some years now. In fact, several people have built such setups privately or internally for their companies, but there has apparently never been a concentrated effort to sell one.
If you need to set a user's password hash, as I do, the following link might save you some time:
http://projects.puppetlabs.com/projects/1/wiki/User_And_Homedir_Recipe_Patterns
I am working on getting something along these lines included in Puppet proper, but until then, the recipes at that link will have to do.
It has been two years since I had a fleet of FreeBSD machines under my control. It has been two months since I lost the videocard of my FreeBSD desktop at home. It has been a good year since a hardware problem on my laptop started making X hang it very quickly. I have not done any FreeBSD development at home for a good couple of months, in fact not since I bought an iPhone and a Mac mini. I don't read the mailing lists anymore, I rarely answer emails related to it anymore, and I haven't visited the IRC channels in months. I think it is a thing of the past for me.
Completely? No. I still use it on my laptop, in console mode, while traveling from home to work and vice versa. I still update the misc/zoneinfo port and the timezone data on it. I still use it at home to read my email, to IRC and to do things for which I don't have the software on the Mac mini. I made an app for the iPhone called "The FreeBSD Project" which covers some of the social aspects of FreeBSD and the people around it.
Today I've decided to hand over some of the tasks I did for the FreeBSD project. One is the GNATS Auto Assign project for the PRs related to the ports system, most likely going into the capable hands of the Bugmeister team. The next one will be a batch of ports, which are up for grabs, except the software I made myself and misc/zoneinfo, which I still manage to keep up to date.
I will see if I can get the FreeBSD Multimedia Resources back on the air; it has been stale for a good 15 months. Feel free to point the finger at me if you need to.
Before, I read the news feeds I am interested in via the Firefox plugin “brief”. It did all I wanted it to do, but all the data and metadata (all the feeds and read items) lived in only one browser, so I could not share the state between work and home.
Now I have installed rnews on my webserver. It is multi-user capable, so multiple people can read the feeds they are interested in without the need for multiple installations. I can use it from any place where I have an internet connection, without losing the state.
It is in the FreeBSD Ports Collection as www/rnews.
Martin Matuska has issued two calls for testers with patches that improve various aspects of ZFS performance.
The patches are basically ports of OpenSolaris' code and should improve write performance and stat() syscall performance. The patches are available for testing on 9-CURRENT and 8-STABLE.
Sławek Żak - NoSQL
I bought two Western Digital 2 TB disks the other day in order to increase storage capacity, planning to put a ZFS mirror on them. I then discovered that the disks use a new drive format called "Advanced Format". This format basically increases the sector size from 512 to 4096 bytes.
The problem is that the disks report their sector size to be 512 rather than 4096 in order for them to work well with existing operating systems. The issues with these disks are discussed here and here.
To summarize, this results in two main problems:
1. Partitioning tools operate on 512-byte "logical" sectors, which may result in a partition starting at a physical sector that is not aligned to 4096 bytes. If the partitioning tools are not updated to align partitions to 4k, a single request may cause writes to more than one physical sector.
2. File systems and other disk consumers think the underlying device has a 512-byte sector size, and issue requests smaller than 4096 bytes. For writes this is catastrophic, because in order to write only part of a 4096-byte sector, the disk has to read the sector and modify the part that changed before writing it back (read-modify-write).
Dag Erling Smørgrav made a tool, phybs, to benchmark disk performance using aligned and misaligned writes (it is mentioned in his post above; check it out with svn co svn://svn.freebsd.org/base/user/des/phybs). Here are the results:
```
nobby# ./phybs -w /dev/gpt/storage0
 count size offset  step   msec tps kBps
131072 1024      0  4096 131771  16  994
131072 1024    512  4096 136005  16  963
 65536 2048      0  8192  74762  14 1753
 65536 2048    512  8192  71407  15 1835
 65536 2048   1024  8192  73432  15 1784
 32768 4096      0 16384  20710 130 6328
 32768 4096    512 16384  61987  43 2114
 32768 4096   1024 16384  62719  43 2089
 32768 4096   2048 16384  61089  44 2145
 16384 8192      0 32768  14238 245 9205
 16384 8192    512 32768  53348  65 2456
 16384 8192   1024 32768  52868  66 2479
 16384 8192   2048 32768  50914  68 2574
```
Clearly, using blocks smaller than 4k results in bad performance whether or not they are aligned. With blocks of 4k or larger, aligned writes are three to almost four times as fast as misaligned ones.
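The comparison can also be done mechanically. Below is a small Python sketch that parses phybs output in the column layout shown above and, for each block size, divides the aligned throughput by the slowest misaligned one. The parsing rules are an assumption about the output format, not something phybs documents.

```python
"""Summarise phybs output: aligned vs. misaligned throughput per block size.

Assumes whitespace-separated rows of seven integers
(count size offset step msec tps kBps); anything else is skipped.
"""


def parse_phybs(text):
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the command line, the header and any non-numeric lines.
        if len(fields) != 7 or not all(f.isdigit() for f in fields):
            continue
        _count, size, offset, _step, _msec, _tps, kbps = map(int, fields)
        rows.append({"size": size, "offset": offset, "kBps": kbps})
    return rows


def aligned_speedup(rows):
    """Map block size -> aligned kBps / slowest misaligned kBps."""
    result = {}
    for size in sorted({r["size"] for r in rows}):
        same = [r for r in rows if r["size"] == size]
        aligned = [r["kBps"] for r in same if r["offset"] == 0]
        misaligned = [r["kBps"] for r in same if r["offset"] != 0]
        if aligned and misaligned:
            result[size] = aligned[0] / min(misaligned)
    return result
```

On the full run above this gives about 3.0 for 4,096-byte blocks and 3.7 for 8,192-byte blocks, while the sub-4k sizes stay close to 1, i.e. they are slow whether aligned or not.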
The way I solved this in FreeBSD was to partition the disk manually with gpart and set the partition start to a multiple of 8 (8 * 512 = 4096). All partitions on the disk should start at a sector number that is a multiple of 8.
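The rounding is simple arithmetic. This Python sketch assumes 512-byte logical sectors and rounds a proposed start sector up to the next multiple of 8. The command builder and the device name ada1 in the example are hypothetical illustrations, not the exact commands used here.

```python
"""Round a partition's start sector up to a 4,096-byte boundary.

Assumes 512-byte logical sectors, as these drives report. The device name
in the example ("ada1") is a hypothetical placeholder.
"""
LOGICAL_SECTOR = 512
PHYSICAL_SECTOR = 4096
SECTORS_PER_PHYS = PHYSICAL_SECTOR // LOGICAL_SECTOR  # 8


def align_up(lba, multiple=SECTORS_PER_PHYS):
    """Smallest LBA >= lba that is a multiple of `multiple` (ceiling division)."""
    return -(-lba // multiple) * multiple


def gpart_add_cmd(start_lba, size_sectors, ptype, provider):
    """Build a gpart(8) 'add' command line with an aligned start sector."""
    start = align_up(start_lba)
    return f"gpart add -b {start} -s {size_sectors} -t {ptype} {provider}"


print(align_up(63))  # -> 64
print(gpart_add_cmd(34, 1000, "freebsd-zfs", "ada1"))
```

Sector 34 is the first usable sector on a GPT disk with the usual 128-entry partition table, so rounding it up to 40 is the earliest aligned start there.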
ZFS uses variable block sizes for its requests, which can pose a problem when the underlying provider reports a sector size of 512 bytes. In order to override this, I used gnop(8), which can create a provider on top of another provider with different characteristics: gnop create -o 4096 -S 4096
The -o parameter makes sure that the new provider does not conflict with the original provider when ZFS tries to detect any filesystems on the disk. The -S parameter sets the sector size of the new provider to 4096, which makes sure that all requests going to the disk from ZFS will be multiples of 4k.
For UFS, the default fixed block size is 16k, so there should be no worries about it using smaller block sizes. Moreover, newfs provides a -S parameter, which overrides the sector size of the underlying provider. I have not tried using UFS on these disks, but I don't see any reason why it should not work.
I made a small modification to phybs to verify the function of jumpers 7-8 on the WD Advanced Format drives (see here and here). It is supposed to cause the disk to internally shift every write by one sector, so that a write to sector 63 (where the first partition on a PC normally starts) actually goes to sector 64, which coincides with the beginning of a physical 4,096-byte sector. These numbers confirm this:
```
count size offset  step  msec tps kBps
32768 4096      0 16384 78631  34 1666
32768 4096    512 16384 79880  33 1640
32768 4096   1024 16384 73164  36 1791
32768 4096   1536 16384 77727  34 1686
32768 4096   2048 16384 76975  35 1702
32768 4096   2560 16384 74970  36 1748
32768 4096   3072 16384 79379  34 1651
32768 4096   3584 16384 28094  96 4665
```
The firmware on the disk shifts everything forward by 512 bytes, so all these passes are unaligned except the last one, because 3,584 + 512 = 4,096.
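The arithmetic can be checked with a few lines of Python, assuming (as described above) that the jumper makes the firmware add exactly 512 bytes to every address before it reaches the medium.

```python
"""Which phybs offsets become 4k-aligned with jumpers 7-8 fitted?

Assumes the firmware adds one 512-byte logical sector to every address.
"""
JUMPER_SHIFT = 512
PHYS = 4096


def physical_offset(offset, step=16384, n=0):
    """Byte address on the medium for write number n at the given offset."""
    return n * step + offset + JUMPER_SHIFT


def aligned_offsets(offsets):
    """Offsets whose shifted address lands on a 4,096-byte boundary."""
    return [o for o in offsets if physical_offset(o) % PHYS == 0]


print(aligned_offsets(range(0, 4096, 512)))  # -> [3584]
```

The same shift turns sector 63 (byte 32,256) into byte 32,768, which is 4k-aligned, but it also pushes writes at offset 0 off alignment.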
Doesn't everyone?
I'm ashamed to say mine is not even half full, and two of the computers in it aren't in rack-mount enclosures.
Also shown are some of the disks I've benchmarked. The one in the dock is a Western Digital WD20EARS, the other three are (from front to back) a WD10EARS, a WD20EADS and a Hitachi HDS722020ALA330.
Here is a quick overview of the disks used in my tests:
| Brand | Model | Capacity | Speed | Interface | Notes |
|---|---|---|---|---|---|
| Western Digital | WD4000AAKS | 400 GB | 7,200 rpm | SATA 3 Gbps | |
| Western Digital | WD10EARS | 1 TB | > 5,400 rpm | SATA 3 Gbps | 1 |
| Western Digital | WD20EARS | 2 TB | > 5,400 rpm | SATA 3 Gbps | 2 |
| Western Digital | WD20EADS | 2 TB | > 5,400 rpm | SATA 3 Gbps | 3 |
| Hitachi | HDS722020ALA330 | 2 TB | 7,200 rpm | SATA 3 Gbps | 3 |
The computer runs FreeBSD 9 on an Intel E6600 with an ICH9 chipset and 4 GB RAM. For convenience, the disks were tested in an Akasa Duo Dock connected by eSATA cable to one of the ICH9 SATA ports.
1 Kindly provided by Alastair Hogge
2 Kindly provided by GetOnline Ltd.
3 Kindly provided by Dansk Scanning AS
In the previous post, I discussed Western Digital's “Advanced Format” drives and the problems caused by their misreporting their real, physical sector size.
I wrote a benchmark utility to demonstrate the performance penalty of unaligned accesses and to uncover a drive's physical sector size. It writes blocks of zeroes of varying size at regular intervals. For each block size, it writes a total of 128 MB at intervals of four times the block size, first at offset 0 and then at offsets starting at 512 bytes and doubling up to half of the block size.
With the default settings, the first pass will write 131,072 1,024-byte blocks at n × 4,096, and the second pass will do the same at n × 4,096 + 512. The third, fourth and fifth passes will write 65,536 2,048-byte blocks each at n × 8,192, n × 8,192 + 512 and n × 8,192 + 1,024. It will make four more passes with 4,096-byte blocks and five with 8,192-byte blocks.
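The pass schedule can be reconstructed from this description. The following Python sketch is not taken from the phybs source; it simply assumes 128 MB per block size, a step of four block sizes, and offsets of 0 followed by 512 bytes doubling up to half the block size, which matches the passes listed above.

```python
"""Reconstruct the phybs pass schedule from the prose description.

Not phybs' actual code: 128 MB per block size, writes spaced four block
sizes apart, offsets 0, 512, 1024, ... up to half the block size.
"""
TOTAL_BYTES = 128 * 1024 * 1024


def schedule(sizes=(1024, 2048, 4096, 8192)):
    passes = []
    for size in sizes:
        offsets = [0]
        off = 512
        while off <= size // 2:
            offsets.append(off)
            off *= 2
        for off in offsets:
            # (count, size, offset, step): write n lands at n * step + offset
            passes.append((TOTAL_BYTES // size, size, off, 4 * size))
    return passes


for number, (count, size, offset, step) in enumerate(schedule(), start=1):
    print(f"pass {number:2}: {count:6} x {size:4} bytes at n x {step} + {offset}")
```

This yields 14 passes, with pass 6 being 4,096 bytes at offset 0 and pass 14 being 8,192 bytes at offset 4,096.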
Here's the idea: most passes will be very slow (up to half an hour per pass), but when we hit the right block size and alignment, performance will skyrocket. So on, let's say, a WD20EARS with factory settings, passes 6 (4,096 bytes at offset 0), 10 (8,192 bytes at offset 0) and 14 (8,192 bytes at offset 4,096) should stand out from the crowd. In fact, here are the results for passes 6 through 9:
```
count size offset  step    msec tps kBps
32768 4096      0 16384   19503 138 6720
32768 4096    512 16384 1216537   2  107
32768 4096   1024 16384 1213479   2  108
32768 4096   2048 16384 1214623   2  107
```
Pass 6 takes 20 seconds, while passes 7, 8 and 9 take 20 minutes each.
Let me rephrase that: properly aligned non-sequential writes are faster than misaligned ones by a factor of sixty.
Sixty. Six zero.
We really, really need to get that fixed somehow.
That's not the whole story, though. Let's see how it compares to a 7,200 rpm, 2 TB Hitachi Deskstar (HDS722020ALA330) with 512-byte physical sectors:
```
count size offset  step msec tps  kBps
32768 4096      0 16384 8803 307 14889
32768 4096    512 16384 8701 310 15063
32768 4096   1024 16384 8735 309 15004
32768 4096   2048 16384 8705 310 15056
```
The Hitachi blows through the test so fast you don't even have time to make yourself a cup of coffee, let alone drink it.
This is a 7,200 rpm, 400 GB Caviar SE16 (WD4000AAKS), which is more than three years old, so don't expect too much:
```
count size offset  step  msec tps kBps
32768 4096      0 16384 21348 126 6139
32768 4096    512 16384 21674 124 6047
32768 4096   1024 16384 20799 129 6301
32768 4096   2048 16384 21031 128 6232
```
So, about the same as we get from the WD20EARS with aligned writes.
Now, here's the kicker. The last drive in my test lineup is a WD20EADS: almost the same as the WD20EARS, but with 512-byte sectors and only 32 MB cache (although cache doesn't mean anything here, as I made sure my test program writes enough data to blow through the cache on every pass).
```
count size offset  step  msec tps kBps
32768 4096      0 16384 22811 118 5745
32768 4096    512 16384 19552 138 6703
32768 4096   1024 16384 36945  73 3547
32768 4096   2048 16384 50102  53 2616
```
Ouch. It's not just slow, it's also very inconsistent. I have no idea what to make of that.
Note 1: I did not mention rotational speed for the WD Green disks, because Western Digital themselves do not specify one; the spec sheet just says “IntelliPower”. Not sure what to make of that, either. Tom's Hardware contradict themselves, saying in one review that it means 5,400, and in another that it means it varies. Meanwhile, my supplier claims the WD20EARS rotates at 7,200 rpm. Go figure.
Note 2: I also have a 1 TB WD10EARS, but I haven't tested it yet. I expect it to perform pretty much as well (or as poorly, depending on your perspective) as the WD20EARS.
Update: the results for the WD10EARS are in. Strangely, it is much faster at unaligned writes than the WD20EARS, although it's a little slower at aligned writes.
```
count size offset  step  msec tps kBps
32768 4096      0 16384 23105 116 5672
32768 4096    512 16384 79285  34 1653
32768 4096   1024 16384 75814  35 1728
32768 4096   2048 16384 79920  33 1640
```
A naïve sequential-write benchmark (diskinfo -t) suggests that it's about 20% slower overall. It is possible that both disks use a striped layout internally, so the WD20EARS gets better results because it has more platters. If that is the case, it should be possible to modify phybs to detect the stripe size.
I've been playing with WD Green disks, trying to solve the 4,096-byte sector problem. To summarize, Western Digital have started to move from 512-byte sectors to 4,096-byte sectors in order to reduce overhead and thereby increase the amount of data that can be stored on the same number of platters with the same density. These disks (specifically, the EARS and AARS series) emulate 512-byte sectors for compatibility with older BIOSes and operating systems, but the problem is that they report 512-byte logical and physical sectors instead of 512/4,096.
If the length of a write operation is not a multiple of 4,096, or it does not begin at an address divisible by 4,096, either the beginning or the end of the operation, or both, will cover only part of a sector. This requires the disk to do a read-modify-write operation, meaning that it has to read a complete 4,096-byte sector, update parts of it, and write it back. This is extremely inefficient, as I will demonstrate later.
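To make the cost concrete, here is a Python sketch that counts how many 4,096-byte physical sectors a single write touches and how many of those are only partly covered and would need a read-modify-write. It models only the arithmetic described above, assumes a non-zero length, and does not talk to a disk.

```python
"""Count physical sectors touched by a write, and how many need
read-modify-write because the write covers them only partly.

Assumes 4,096-byte physical sectors and length > 0.
"""
PHYS = 4096


def rmw_sectors(offset, length, phys=PHYS):
    """Return (sectors touched, sectors needing read-modify-write)."""
    end = offset + length                      # one past the last byte
    first, last = offset // phys, (end - 1) // phys
    touched = last - first + 1
    partial = set()
    if offset % phys:
        partial.add(first)                     # head starts mid-sector
    if end % phys:
        partial.add(last)                      # tail ends mid-sector
    return touched, len(partial)


print(rmw_sectors(63 * 512, 4096))  # -> (2, 2)
print(rmw_sectors(64 * 512, 4096))  # -> (1, 0)
```

A 4k write starting at sector 63 straddles two physical sectors and forces a read-modify-write of both, while the same write starting at sector 64 touches exactly one sector and needs none.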
The reason why this matters so much is subtle. For efficiency reasons, most modern filesystems use on-disk structures of 4,096 bytes or more, so it shouldn't matter, right? But on PCs, for legacy reasons, the first filesystem on a disk (or rather, the first partition) usually starts at sector 63, and 63 × 512 is not a multiple of 4,096. This means that every write operation will be misaligned.
In most cases, you can work around this by making sure, when you partition a new disk, that the first partition starts on a 4,096-byte boundary - say, sector 64 instead of 63. In addition to that, the WD EARS and AARS disks have a jumper setting that makes the disk offset every read or write operation by exactly one logical sector, so what the computer thinks is logical sector 0 is actually logical sector 1, and what the computer thinks is logical sector 63 is actually logical sector 64. Unfortunately, this means that systems that use the whole disk, starting at address 0, or that already take care to align their writes on 4096-byte boundaries, are screwed.
There is another problem: ZFS. ZFS operates on variable-sized blocks of any power of two between 512 bytes and 128 kilobytes. The only way to prevent ZFS from using block sizes smaller than 4,096 bytes is to build your vdevs from devices which advertise 4,096-byte sectors.
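A simplified Python model of that statement follows. It assumes, as the paragraph says, that ZFS uses power-of-two block sizes from the advertised sector size (at least 512 bytes) up to 128 KiB, and lists which of those are smaller than a 4,096-byte physical sector. It is an illustration, not ZFS code.

```python
"""Simplified model: which ZFS block sizes fall below the physical sector?"""
MIN_BLOCK = 512          # smallest ZFS block size, per the text
MAX_BLOCK = 128 * 1024   # largest ZFS block size, per the text


def zfs_block_sizes(advertised_sector):
    """Power-of-two block sizes from the advertised sector size up to 128 KiB."""
    size = max(MIN_BLOCK, advertised_sector)
    sizes = []
    while size <= MAX_BLOCK:
        sizes.append(size)
        size *= 2
    return sizes


def rmw_prone_sizes(advertised_sector, physical_sector=4096):
    """Block sizes smaller than the real physical sector."""
    return [s for s in zfs_block_sizes(advertised_sector) if s < physical_sector]


print(rmw_prone_sizes(512))   # -> [512, 1024, 2048]
print(rmw_prone_sizes(4096))  # -> []
```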
The ideal solution is to either a) force the disk to advertise its true physical sector size, or b) hack FreeBSD so it recognizes disks with 4,096-byte sectors.
Regarding the first option, it might be possible to lobby Western Digital to release a firmware upgrade like they did for the auto-idle issue.
As for the second solution, there is an important question: should we do this unconditionally? If we do, then misaligned filesystems on existing disks will become inaccessible. However, one could argue that those filesystems are already essentially unusable due to atrocious performance.
Now that the press release is out, I am finally able to talk freely about a project I have been working on for close to a year now.
Pretty much since The NYC BSD Users Group was formed, The NY Internet Company has donated a full cabinet and a 10 Mbps internet connection to NYCBUG. We used that space to host our website and mailing lists, hardware for developers and mirrors for all the major BSD projects.
In October of 2009, I received an email inviting me to a grand opening party at NYI’s new state-of-the-art data center located in Bridgewater, NJ. I asked some folks on core@ if they thought it would be worthwhile to approach NYI to see if they would be willing to donate a few cabinets so we could build out a FreeBSD mirror on the east coast. gnn, jhb and I had a very informal meeting with Phil from NYI, and when we asked him if they would be willing to provide us with a few cabinets, some power and bandwidth, he said yes without thought or hesitation. Putting a mirror of FreeBSD.org on the east coast quickly became a real possibility.
Fast forward to today: with very generous donations from NYI, iXsystems, Juniper Networks and Exatrol, we are pleased to announce the FreeBSD east coast mirror.
We now have a fully distributed infrastructure that allows us to remain online in the event of a disaster at the west coast facility, spreads our traffic between two different locations, provides us with enough power and bandwidth to scale out our package building cluster, and also puts an official FreeBSD mirror much closer to people in Europe.
However, I feel that the most impressive and interesting feature of this datacenter is that it is composed entirely of products based on FreeBSD in some way, shape or form.
The number of companies choosing to use FreeBSD in their products when they care about performance, reliability and scalability is growing. We are now at the point where you can deploy an entire datacenter without having to deal with non-Unix-like operating systems. Not only does this east coast mirror add necessary and critical infrastructure for the continued development of FreeBSD, but it can also become a model for future data centers you deploy.
In the near future, I plan to write more about the build out of this mirror as well as some of the technology behind it. Stay tuned.
I wasn't able to make it to BSDCan this year due to family commitments, but I did make it to another open source conference later this summer that I also wrote about on Google's open source blog.
Kirk and I haven't worked closely together but we both do our best at evangelizing BSD and open source inside our respective corners of the company. It's great to see his post about all the excellent work happening in the BSD community on a corporate blog.
From an aesthetic point of view it sucks, but it works and it's everything I need:
```
Info : JTAG tap: avr32.cpu tap/device found: 0x21e8203f (mfg: 0x01f, part: 0x1e82, ver: 0x2)
Warn : JTAG tap: avr32.cpu UNEXPECTED: 0x21e8203f (mfg: 0x01f, part: 0x1e82, ver: 0x2)
```
The industry is moving toward 4k-sectored hard disk drives, a bit late but better late than never. However, in the interim period where compatibility with 4k-sector drives is not widespread, drives such as WD's EARS series use internal translation to present themselves as 512-byte sector drives, leading to unexpectedly horrible write performance. As a band-aid I'm proposing an almost trivial patch to glabel which allows a "forced sector size" to be specified, so that it can present 4k sectors to the file system even if the physical drive doesn't. See also: glabel.