FreeBSD From the Trenches: Using autofs(5) to Mount Removable Media

This next FreeBSD From the Trenches story comes to us from Edward Tomasz Napierała, who shares his work on the new FreeBSD automounter.

My big project for 2014 was the new FreeBSD automounter.  Like any proper FreeBSD Foundation sponsored project, it included the usual kind of documentation - man pages and the Handbook chapter.  But there is no document that shows how it works inside, from the point of view of an advanced system administrator or a power user.

So, here it is.  The article demonstrates how modular the automounter is, and how easy it is to adapt to any mount-related situation you might have, using the recently added removable media support as an example.  (And it shows some related mechanisms as a bonus.)

autofs(5) Basics

The purpose of autofs(5) is to mount filesystems on access, in a way that's transparent to the application. In other words, filesystems get mounted when they are first accessed, and then unmounted after some time passes. The application trying to access the filesystem doesn't even notice this event, apart from a slight delay on first access.  It's a mechanism similar to ones available in other systems, in particular to OS X.  It's a completely independent implementation, it's just that OS X is the other operating system I use.

Automounting requires cooperation of four things: the kernel filesystem, autofs.ko, which is responsible, among other things, for "pausing" the application until the filesystem is actually there; the automountd(8) daemon, which is the component that retrieves configuration information from maps (this includes fetching it from remote sources, such as LDAP) and actually mounts the filesystems; the automount(8) utility for various administrative purposes; and then the autounmountd(8) daemon to, well, unmount the filesystems mounted by automountd(8) after a timeout.

Setting it up is fairly simple: you obviously need to have autofs(5) enabled in /etc/rc.conf:
autofs_enable="YES"
And you need to have the autofs(5) daemons running - just like other daemons in FreeBSD, those will get started at system bootup if autofs_enable was set; otherwise you need to start them by hand:
# /etc/rc.d/automount start
# /etc/rc.d/automountd start
# /etc/rc.d/autounmountd start
The kernel driver will get loaded automatically; you can see it in kldstat(8) output.

autofs(5) and Removable Media

Note that at the time of this writing, this is only available in FreeBSD 11-CURRENT. This will change soon.

The main configuration file for autofs(5) is /etc/auto_master; you need to uncomment this line:
/media -media -nosuid
This basically says that there is a /media directory, and automount will mount the "-media" map there, and everything that gets mounted there will have the "nosuid" mount option, for security reasons.

If you already had autofs(5) running before uncommenting the line, you must refresh its configuration by running automount(8) as root; run it as "automount -v" for a detailed explanation of what it does.  It looks like this:
# automount -v
automount: parsing auto_master file at "/etc/auto_master"
automount: done parsing "/etc/auto_master"
automount: unmounting stale autofs mounts
automount: skipping /, filesystem type is not autofs
automount: skipping /dev, filesystem type is not autofs
automount: leaving autofs mounted on /net
automount: mounting new autofs mounts
automount: autofs already mounted on /net
automount: nothing mounted on /media; mounting
automount: mounting map -media on /media, prefix "/media", options "nosuid"
If you run mount(8), you will see so-called "trigger nodes" of type autofs(5):
# mount
/dev/ada0p2 on / (ufs, local, noatime, journaled soft-updates)
devfs on /dev (devfs, local, multilabel)
map -hosts on /net (autofs)
map -media on /media (autofs)

Basic usage

With all that done, plug a drive into USB, and here is what happens in a real-world case:
[trasz@brick:~]% ll /media
total 9
drwxr-xr-x 3 root wheel 512 Feb 24 12:54 .
drwxr-xr-x 30 root wheel 1024 Feb 24 12:28 ..
drwxr-xr-x 1 root wheel 4096 Jan 1 1980 ADATA UFD
drwxr-xr-x 3 root wheel 512 Feb 24 12:54 md0
[trasz@brick:~]% cd /media/ADATA UFD
[trasz@brick:/media/ADATA UFD]% ll
total 10117
drwxr-xr-x 1 root wheel 4096 Jan 1 1980 .
drwxr-xr-x 3 root wheel 512 Feb 24 12:54 ..
drwxr-xr-x 1 root wheel 4096 Nov 24 00:03 .Spotlight-V100
drwxr-xr-x 1 root wheel 4096 Nov 24 00:03 .Trashes
-rwxr-xr-x 1 root wheel 4096 Nov 24 00:03 ._.Trashes
drwxr-xr-x 1 root wheel 4096 Jan 13 11:24 .fseventsd
drwxr-xr-x 1 root wheel 4096 Nov 22 22:44 Bonus
-rwxr-xr-x 1 root wheel 3309568 Nov 24 14:50 DSC05996.JPG
-rwxr-xr-x 1 root wheel 4063232 Nov 24 14:50 DSC05997.JPG
-rwxr-xr-x 1 root wheel 2953199 Nov 25 21:40 DSC05998.JPG
drwxr-xr-x 1 root wheel 4096 Nov 22 18:24 Meshuggah
drwxr-xr-x 1 root wheel 4096 Nov 22 21:06 System Volume Information
[trasz@brick:/media/ADATA UFD]% mount
/dev/ada0p2 on / (ufs, local, noatime, journaled soft-updates)
devfs on /dev (devfs, local, multilabel)
map -hosts on /net (autofs)
map -media on /media (autofs)
/dev/da0s1 on /media/ADATA UFD (msdosfs, local, nosuid, automounted)
[trasz@brick:/media/ADATA UFD]% cd /
[trasz@brick:/media/ADATA UFD]% sudo automount -u
[trasz@brick:/media/ADATA UFD]% mount
/dev/ada0p2 on / (ufs, local, noatime, journaled soft-updates)
devfs on /dev (devfs, local, multilabel)
map -hosts on /net (autofs)
map -media on /media (autofs)
Two things to notice here: first, the "ADATA UFD" is a factory default
filesystem label on the flash drive.  If there was no filesystem label,
autofs(5) would use device name instead - in this case, that would
be "da0s1".  Second - if you don't want to wait for autounmountd(8)
to unmount the automounted volume, you can use "automount -u".  Or
"automount -fu", if you want to force unmount.

Not So Basic Usage

Take a close look at the directory listing for /media in the previous example. Did you notice the "md0" there?  It looks like a device node for a memory disk (md(4)), but it is a directory.  That's a leftover from my earlier experimentation, and it shows an interesting feature of the autofs(5)-based automounter: it's not limited to removable media; it can mount everything that's available for mounting.  In this case it's a memory disk (a kind of ramdisk, see "man mdconfig").  It can also be an iSCSI LUN.  And, of course, removable media.  How does that work?


In FreeBSD, GEOM is a name of what could otherwise be called a block device layer.  It's a piece of code that manages all the "disk-like devices", both physical and virtual: SATA/SAS/FC/NVME/USB drives, memory disks, iSCSI LUNs, partitions, encrypted GELI volumes etc.

GEOM has another meaning: an instance of GEOM class.  The "class" here means the "kind" of device, and the instance is an actual device of that kind. It's easiest to explain it with an example:
# geom disk list
Geom name: cd0
1. Name: cd0
   Mediasize: 0 (0B)
   Sectorsize: 2048
   Mode: r0w0e0
   ident: (null)
   fwsectors: 0
   fwheads: 0

Geom name: ada0
1. Name: ada0
   Mediasize: 250059350016 (233G)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r2w2e3
   descr: Samsung SSD 850 EVO 250GB
   lunid: 5002538da000f602
   ident: S21PNSAFC02149R
   fwsectors: 63
   fwheads: 16

Geom name: da0
1. Name: da0
   Mediasize: 7654604800 (7.1G)
   Sectorsize: 512
   Mode: r0w0e0
   descr: ADATA USB Flash Drive
   lunname: USB MEMORY BAR
   lunid: 2020030102060804
   ident: 14A0711312300023
   fwsectors: 63
   fwheads: 255

# geom part list
Geom name: ada0
modified: false
state: OK
fwheads: 16
fwsectors: 63
last: 488397127
first: 34
entries: 128
scheme: GPT
1. Name: ada0p1
   Mediasize: 65536 (64K)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 1024
   Mode: r0w0e0
   rawuuid: 42dc1b8b-c49b-11e3-8066-001c257ac65f
   rawtype: 83bd6b9d-7f41-11dc-be0b-001560b84f0f
   label: (null)
   length: 65536
   offset: 17408
   type: freebsd-boot
   index: 1
   end: 161
   start: 34
2. Name: ada0p2
   Mediasize: 236223201280 (220G)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 1024
   Mode: r1w1e1
   rawuuid: 42dc921f-c49b-11e3-8066-001c257ac65f
   rawtype: 516e7cb6-6ecf-11d6-8ff8-00022d09712b
   label: (null)
   length: 236223201280
   offset: 82944
   type: freebsd-ufs
   index: 2
   end: 461373601
   start: 162
3. Name: ada0p3
   Mediasize: 13836045312 (13G)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 1024
   Mode: r1w1e0
   rawuuid: 21a8eef9-a0d4-11e4-ab80-001c257ac65f
   rawtype: 516e7cb5-6ecf-11d6-8ff8-00022d09712b
   label: (null)
   length: 13836045312
   offset: 236223284224
   type: freebsd-swap
   index: 3
   end: 488397127
   start: 461373602
1. Name: ada0
   Mediasize: 250059350016 (233G)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r2w2e3

Geom name: da0
modified: false
state: OK
fwheads: 255
fwsectors: 63
last: 14950399
first: 1
entries: 4
scheme: MBR
1. Name: da0s1
   Mediasize: 7654576128 (7.1G)
   Sectorsize: 512
   Stripesize: 0
   Stripeoffset: 28672
   Mode: r0w0e0
   rawtype: 12
   length: 7654576128
   offset: 28672
   type: !12
   index: 1
   end: 14950399
   start: 56
1. Name: da0
   Mediasize: 7654604800 (7.1G)
   Sectorsize: 512
   Mode: r0w0e0
See?  I've used the geom(8) command to get the information about two GEOM classes: "disk", and "part".  The first one returned information about three instances of the disk class: the DVD drive, the SSD, and the flash.  The second one returned information on the partitions known to the system. Everything that is potentially mountable - a physical disk, a partition, encrypted ELI volume, multipath device, RAID3 volume, memory disks, even volume labels - it all has its GEOM class and can be queried in a similar way.  To see all the GEOM instances in the running system, use:
# sysctl kern.geom.conftxt
Now, notice the "Mode" lines.  Like the one for ada0: "r2w2e3".  Those are three usage counters for the ada0 GEOM: read, write, and exclusive.  They are non-zero, because ada0 is used: there are three partitions on it; three instances of the PART GEOM class hold it open.  The partitions, just like any other GEOM nodes, have their own counters.  Take a look at the first one, ada0p1: the mode there is "r0w0e0".  This means it's not opened by anything.  It's, in other words, available for mounting.  If you check the MD GEOM class:
# geom md list 
Geom name: md0
1. Name: md0
   Mediasize: 1073741824 (1.0G)
   Sectorsize: 512
   Mode: r0w0e0
   type: swap
   access: read-write
   compression: off
   length: 1073741824
   fwsectors: 0
   fwheads: 0
   unit: 0
You will see the same thing: it's not opened.  That's the first thing the autofs(5) "-media" map checks for: zero access counts; if the counts are not zero, it means the node is used by something: it's either mounted (like ada0p2, mounted on /), or there is something "on top of it" - like ada0.
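The access-count check described above can be sketched in a few lines of sh(1). This is only an illustration: the function name geom_mode_state is made up for this example and is not part of autofs(5) or the "-media" map, and the mode strings are simply copied from the listings above. Zero counts are only the first test; as the next paragraphs show, the node also has to contain a recognizable filesystem.

```sh
#!/bin/sh
# Hypothetical helper, not part of autofs(5): classify a GEOM "Mode" string.
# Prints "free" when all three counters (read, write, exclusive) are zero,
# "in-use" for any other well-formed mode, and "invalid" otherwise.
geom_mode_state() {
    case "$1" in
        r0w0e0) echo "free" ;;
        r[0-9]*w[0-9]*e[0-9]*) echo "in-use" ;;
        *) echo "invalid"; return 1 ;;
    esac
}

# Names and modes copied from the geom(8) output shown above.
for pair in ada0:r2w2e3 ada0p1:r0w0e0 ada0p2:r1w1e1 md0:r0w0e0; do
    name=${pair%%:*}
    mode=${pair#*:}
    printf '%s %s %s\n' "$name" "$mode" "$(geom_mode_state "$mode")"
done
```

Run as is, it reports ada0 and ada0p2 as in use, and ada0p1 and md0 as free.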

But why is there no /media/ada0p1?  Because it's not mountable; there is no filesystem there.  It's a boot loader partition.  How does autofs(5) figure it out?


Before we can do anything with a filesystem, we need to determine what kind of filesystem it is - and whether it actually is a supported filesystem in the first place.  That means we need a piece of code that can take a look at it and determine if it has a format it recognizes.

It is possible to use file(1) for this, e.g.:
# file -s /dev/md0
Vermaden's sysutils/automount port uses this approach.  There are a few problems with doing it this way, though.  First, the output, for a typical FAT filesystem, looks like this:
/dev/md0: DOS/MBR boot sector, code offset 0x3c+2, 
OEM-ID "BSD4.4  ", sectors/cluster 32, root entries
512, sectors/FAT 256, sectors/track 63, heads 255,
sectors 2097144 (volumes > 32 MB) , serial number
0x668a120e, unlabeled, FAT (16 bit)
It's not particularly easy to parse.  It's even harder to extract the volume label.

Second, file(1) can recognize all kinds of file types, from JPEG to 6502 assembly.  This means that if there is some strange data on the removable media instead of the filesystem we expect, file(1) will output something our script wasn't tested against, making the first problem even harder.

Third, file(1) has had its share of security bugs, e.g. CVE-2014-1943, CVE-2014-9620, or CVE-2014-3710.

For these reasons I've decided the proper fix would be to just write a new utility. The strange name - "fstyp" - comes from the utility of the same name, installed by default on Solaris, IRIX, OS X, and perhaps most other UNIX systems.

fstyp(8) addresses the file(1) issues: the output is easily parsable (just a filesystem name, one word), it only recognizes filesystems supported by FreeBSD, and it uses Capsicum sandboxing to make sure that even if there is a vulnerability, its impact is limited to incorrectly reporting the filesystem type.  Capsicum is a good topic for another article, but in short - in FreeBSD, every process can enter what's called "capability mode". It's one way - a process can enter it, but there is no way to exit it.  Child processes inherit the mode. In capability mode, the kernel will deny all attempts to open new files, create sockets, attach shared memory segments etc., but the process is pretty much free to do anything it likes with the file descriptors it already had open before entering capability mode - and it can receive other file descriptors over a UNIX socket.  So, the fstyp(8) utility opens the device file, then calls cap_enter(2), which switches it into capability mode, and then continues execution, reading from the device to determine what's there.  Should it be compromised, it won't be able to execute /bin/sh, it won't be able to open a socket to transmit the data to some external host, etc.

The "-media" Map

Those are the components underneath the autofs(5), but how does it fit together? Let's start with the actual map.  In FreeBSD, special maps (the ones with names starting with "-") are just executables in /etc/autofs/:
# ls -al /etc/autofs
total 36
drwxr-xr-x 2 root wheel 512 Feb 14 21:18 .
drwxr-xr-x 25 root wheel 3072 Feb 24 11:22 ..
-rwxr-xr-x 1 root wheel 1010 Oct 17 11:26 include_ldap
-rwxr-xr-x 1 trasz wheel 43 Aug 17 2014 include_nis
-rwxr-xr-x 1 root wheel 367 Oct 17 11:26 special_hosts
-rwxr-xr-x 1 root wheel 2294 Dec 6 10:15 special_media
-rwxr-xr-x 1 root wheel 355 Feb 14 21:17 special_noauto
-rwxr-xr-x 1 root wheel 97 Oct 17 11:26 special_null
-rwxr-xr-x 1 root wheel 357 Aug 22 2014 special_smb
See the special_media?  That's the one.  It's a shell script.  The reason it's in /etc is that the system administrator can modify it if required, or add new special maps.

Now, let's try to run it by hand, as root:
# /etc/autofs/special_media

# /etc/autofs/special_media md0
-fstype=msdosfs,nosuid  :/dev/md0
That's exactly how automountd(8) uses it, after the kernel component notifies it that it needs the /media directory taken care of.  It's described in more detail in the auto_master(5) manual page.  The shell script is pretty well commented, and I don't think there is any point in explaining it here.

Bottom line:
the core autofs itself doesn't know anything about removable devices; the special map "-media" does: it queries GEOM for the list of all disk-like nodes that are not in use, and then uses fstyp(8) to determine whether they contain a useful filesystem.  UNIX.  Modularity.  Plain text.  ;-)
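To make the division of labour concrete, here is a minimal sketch of the last step a map like "-media" performs: turning a device name and a detected filesystem type into a map entry of the form shown in the special_media output above. It is not the real /etc/autofs/special_media; the function name media_map_entry is made up, and the filesystem type is passed in as an argument, whereas the real map obtains it by running fstyp(8).

```sh
#!/bin/sh
# Illustrative only; the real logic lives in /etc/autofs/special_media.
# media_map_entry DEVICE FSTYPE
# Prints an automounter map entry for DEVICE, or fails if FSTYPE is empty,
# which stands in here for "no supported filesystem was detected".
media_map_entry() {
    dev=$1
    fstype=$2
    [ -n "$fstype" ] || return 1
    printf '%s\n' "-fstype=${fstype},nosuid :/dev/${dev}"
}

# Same device and type as in the example run above.
media_map_entry md0 msdosfs
```

A node without a recognized filesystem simply produces no entry, which is why ada0p1 never shows up under /media.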


Now, let's create a second memory disk, 1GB in size (the "1g" below) to see if it all works as intended:
# mdconfig -s1g
# newfs_msdos /dev/md1
newfs_msdos: cannot get number of sectors per track:
Operation not supported
newfs_msdos: cannot get number of heads:
Operation not supported
newfs_msdos: trim 8 sectors to adjust to a multiple of 63
/dev/md1: 2096576 sectors in 65518 FAT16 clusters
(16384 bytes/cluster)
BytesPerSec=512 SecPerClust=32 ResSectors=1 FATs=2
RootDirEnts=512 Media=0xf0 FATsecs=256 SecPerTrack=63
Heads=255 HiddenSecs=0 HugeSectors=2097144

# ll /media
total 5
drwxr-xr-x 3 root wheel 512 Feb 24 12:25 .
drwxr-xr-x 30 root wheel 1024 Feb 23 09:04 ..
drwxr-xr-x 1 root wheel 4096 Jan 1 1980 ADATA UFD
drwxr-xr-x 3 root wheel 512 Feb 24 12:25 md0
Whoops.  Where is /media/md1?

There is one more mechanism for the whole thing to work correctly: the autofs(5) cache needs to be dealt with.

The first paragraph mentioned that it's automountd(8) that does all the map parsing - including running the /etc/autofs/special_media - and actual mounting. Doing it every time someone accesses the /media directory - or any directory, for that matter - would kill performance.  For this reason, after the kernel component asks automountd(8) to do its magic, it doesn't do that again until some time later.  In most cases it doesn't matter - the list of NFS exports for a given host doesn't change too often - but in the case of removable media it's not acceptable.  The cache needs to be flushed, using "automount -c".  After that, the subsequent lookup in /media will trigger automountd(8), which will query the devices list and refresh the directory contents.

This obviously needs to happen automatically.  And if you actually went and opened /etc/auto_master in a text editor, you would have noticed this:
# When using the -media special map, make sure to edit devd.conf(5)
# to move the call to "automount -c" out of the comments section.
devd(8) is a daemon responsible for listening for notifications from the kernel and running whatever is configured in its config, /etc/devd.conf. There are all kinds of things there, from running utilities to upload firmware for various USB devices, to launching moused(8) when a mouse gets connected, to switching power profiles, to... discarding autofs(5) caches.  It looks like this:
notify 100 {
match "system" "GEOM";
match "subsystem" "DEV";
action "/usr/sbin/automount -c";
};
If you do "man devd.conf", you will see the description of those events. Note that, just like the "-media" map works the same way for flash drives and encrypted volumes over multipath over iSCSI, this mechanism does not care about any specific hardware either.


There are two limitations, really.  First: you need to run 11-CURRENT.  Second: the nodes in /media never disappear.  I expect to merge this support to 10-STABLE after the second issue is addressed.

cache line aliasing effects, or "why is freebsd slower than linux?"

There were some threads on the FreeBSD/DragonflyBSD mailing lists a few years ago (2012?) which talked about some math benchmarks being much slower on FreeBSD/DragonflyBSD versus Linux.

When the same benchmark is run on FreeBSD/DragonflyBSD using the Linux layer (ie, a linux binary compiled for linux, but run on BSD) it gives the same or better behaviour.

Some digging was done, and it turned out it was due to memory allocation patterns and memory layout. The jemalloc library allocates large chunks at page aligned boundaries, whereas the allocator in glibc under Linux does not.

I've put the code online in the hope that others can test and verify this:

The branch 'local/freebsd' has my local change to allow the allocator offset to be specified. The offset compounds on each allocation - so with an 'n' byte offset, the first allocation is 0 bytes offset from the page boundary, the next is 'n' bytes offset from the page boundary, the next is '2n' bytes offset, etc.
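The compounding can be illustrated with a few lines of shell arithmetic. This is a sketch of the arithmetic only, not code from the branch: the function name alloc_offset is made up, the 4096-byte page size is an assumption, and so is the idea that the offset wraps around at the page size.

```sh
#!/bin/sh
# Assumed page size; 4096 bytes is common, but it is an assumption here.
PAGE_SIZE=4096

# alloc_offset N K: offset from the page boundary of the K-th allocation
# (K counts from 0) when each allocation adds N more bytes of offset.
# Wrapping at PAGE_SIZE is an assumption of this sketch.
alloc_offset() {
    echo $(( ($1 * $2) % PAGE_SIZE ))
}

for k in 0 1 2 3; do
    echo "allocation $k: $(alloc_offset 127 "$k") bytes past the page boundary"
done
```

With a 127 byte offset the first four allocations land 0, 127, 254 and 381 bytes past a page boundary, so they no longer all start at the same position within a page.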

You can experiment with different values and get completely different behavioural results. It's non-trivial: there's a 100% speedup by using a 127 byte offset for each allocation, versus a 0 byte offset.

I'd like to investigate cache line aliasing effects further. There was work done a few years ago to offset mbuf headers in the FreeBSD kernel so they weren't all page-aligned or 256/512/1024 byte aligned - and apparently this gave a significant performance improvement. But it wasn't folded into FreeBSD. What I'd like to do is come up with some better strategies / profiling guides for identifying when this is actually happening so the underlying objects being accessed can be adjusted.

So - if anyone out there has any tips, hints or suggestions on how to do this, please let me know. I'd like to document and automate this testing.

A look at the upcoming features for 10.1.2

If you’ve been an EDGE user in the past few weeks, or following our Roadmap items for the upcoming 10.1.2 release, you may have noticed a number of new security and privacy related items. I wanted to take a moment to clarify what some of these new features are and what they will do.


– PersonaCrypt –

The first of the new features is a new CLI utility called personacrypt. This command allows the creation and use of GELI-backed encrypted external media for a user's $HOME directory. We are using it internally to keep our user profiles on USB 3.0 256GB hybrid SSD / flash memory sticks (the Corsair Flash Voyager GTX specifically). This is tied into the PCDM login manager and the user manager, so when you create a new user account, you can opt to keep all your personal data on any external device. The device is formatted with GPT / GELI / ZFS, and is decrypted at login via the GUI, after entering your encryption key, along with the normal user password.

Additionally, the personacrypt command uses GELI’s ability to split the key into two parts. One being your passphrase, and the other being a key stored on disk. Without both of these parts, the media cannot be decrypted. This means if somebody steals the key and manages to get your password, it is still worthless without the system it was “paired” with. PersonaCrypt will also allow exporting / importing this key data, so you can “pair” the key with other systems.

– Tor Mode –

We’ve added a new ability to the System Updater Tray, so you can, with a single click, toggle between running in Tor mode and regular “Open” mode. Switching to Tor mode will do the following:

1. Launch the Tor daemon, and connect to the Tor network
2. Re-write all the IPFW rules, blocking all outgoing / incoming traffic, except for traffic to and from the Tor daemon
3. Re-route all DNS / TCP requests through Tor using its transparent proxy support

This allows applications on the system to now connect to the internet through Tor, without needing explicit SOCKS proxy support.

Obviously this alone isn’t enough to keep your identity safe on the Internet. We highly recommend that you read through their excellent FAQ and wiki articles on the subject.

– Stealth Mode –

One of the features we just added to personacrypt is something we are calling “stealth” mode. It is integrated into PCDM, and does the following:

During login, if stealth mode is selected, the user's $HOME directory will be mounted on a GELI-backed ZVOL with GELI’s onetime key encryption. This $HOME directory is set up with the default /usr/share/skel data, and is pretty much a “blank” slate, allowing you to log in and run apps as if on a fresh system each time. At logout the dataset is destroyed, or should the system be rebooted, the onetime key is lost, rendering the data useless. Think of it as a web browser’s “private” mode, except for your entire desktop session.

– LibreSSL –

We’ve made the switchover to convert our ports to use LibreSSL by default instead of the base system's OpenSSL. (Thanks to Bernard Spil for his work on this.) Our hope is that LibreSSL will help improve system security, and reduce the number of OpenSSL exploits that our packages may be vulnerable to.

– Encrypted Backups –

The Life-Preserver utility has had the ability for a while now to replicate your system to another box running FreeBSD, such as FreeNAS. This backup is done via ZFS send/recv using SSH, but the data on the remote end was stored unencrypted and could be read by whoever was administering the remote box. To provide an extra measure of security to backups, we are in the process of adding support for fully encrypted backups, using GELI-backed iSCSI volumes. This allows us to use ZFS send/recv over the wire, with all the data leaving the box already being encrypted via GELI. Your data on the remote side is fully encrypted, and only accessible with the key file you have stored on the client side. This is still in active development and should show up in the EDGE repo in the upcoming weeks, along with some additional details on usage.



We hope you’ve enjoyed this sneak peek of what's happening with PC-BSD development right now. As always, we love people to test these features in our EDGE repo, and let us know of issues via our bug tracker:

HDMI progress and 4DCAPE-43 support for Beaglebone Black

HDMI support for Beaglebone Black is stable now and supports reading EDID; you can get the patch here. Before committing it I’d like to make interoperability between the HDMI framer and the FB/LCD drivers as generic as possible, and for this I need at least one more system with working HDMI to find common patterns. For this purpose I picked up an i.MX6-based Hummingboard and am now trying to get video output working on it. There is some minor progress, but it seems that before getting to HDMI/IPU I need to do some work on the clock management part of the system. So it’s going to be some time before I see the first pixels on my monitor.

I also got 4DCAPE-43, neat Beaglebone Black LCD cape by 4D Systems. Patch (download here) for it is pretty minimal: VT support in kernel config, panel info and pins configuration in dts, and one improvement in GPIO driver (setting default values for OUT GPIO pins). None of the other features except reset button work yet but getting GPIO keyboard working would be an interesting project by itself.

Here is the picture of cape in action:


FreeBSD From the Trenches: ZFS, and How to Make a Foot Cannon

This month's story comes to us from Glen Barber, UNIX Systems Administrator.

The ZFS filesystem is well regarded for its robustness and extensive feature set.

Its robustness can be haunting, however, if a mistake is made.  I learned this the hard way through a seemingly innocent typo, a mistake I certainly will not soon repeat.

We use ZFS almost exclusively in the FreeBSD cluster.  I say "almost" because there is one remaining machine that does not use ZFS, because the machine is too underpowered to handle it.

All machines are installed in a netboot environment while logged in at the serial console, providing the utilities necessary for extremely customizable installations.  Most of the installations I have performed on machines in the cluster have been pseudo-scripted, with subtle differences depending on the machine, such as if the disks are da(4) or ada(4), the number of disks, how much space to allocate for swap, the number of ZFS pools, and so on.

For the most part, a basic installation would be done with a very simple sh(1) script that looks something like:
# for i in $(sysctl -n kern.disks); do 
gpart create -s gpt $i; [...]; done
Nothing too fancy at all.

Most times I would copy/paste from an installation script I've used for years, other times I would manually type the commands.  It really depended on what the end result was supposed to be, as far as configuration.

When I installed the FreeBSD Foundation's new server, I typed the commands manually.  You might ask, "Why did you do it this way?"  To this day, I cannot answer that question.  But if I didn't, this story would be far less interesting.

The machine was installed like this, almost verbatim:
# for i in $(sysctl -n kern.disks); do
gpart create -s gpt /dev/${i};
gpart add -t freebsd-boot -s 512k -i 1 /dev/${i};
gpart bootcode -b /boot/pmbr \
-p /boot/gptzfsboot -i 1 /dev/${i};
gpart add -t freebsd-swap -s 16G -i 2 /dev/${i};
gpart add -t freebsd-zfs -i 3 /dev/${i};
done
# zpool create zroot mirror /dev/ada0 /dev/ada1
# for i in tmp var var/tmp var/log \
var/db usr usr/local usr/home; do
zfs create -o atime=off zroot/${i};
done
This creates the GPT partition scheme for all available hard disks, writes the partition layout to the disks, writes the GPT boot code to the first partition on each disk, and allocates the swap space and ZFS space.  Then it creates the ZFS pool named 'zroot' configured as a mirror, and creates the ZFS datasets in the new pool.

The problem is not too obvious unless you are looking for it specifically, but instead of using the 'freebsd-zfs' GPT partitions, which are /dev/ada0p3 and /dev/ada1p3, I created the pool on the full disk (/dev/ada0 and /dev/ada1).
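A small guard in an install script could catch this kind of typo before it reaches zpool(8). The following is a sketch under assumptions, not something the cluster scripts are known to contain: the function name all_partitions is made up, and the pattern only recognizes GPT-style names such as ada0p3. It prints the command it would run instead of running it.

```sh
#!/bin/sh
# Hypothetical pre-flight check, not part of zpool(8): succeed only if every
# argument names a GPT partition (e.g. /dev/ada0p3), not a whole disk
# (e.g. /dev/ada0).
all_partitions() {
    for dev in "$@"; do
        case "$dev" in
            /dev/*[0-9]p[0-9]*) ;;   # looks like <disk><unit>p<index>
            *) echo "refusing whole disk: $dev" >&2; return 1 ;;
        esac
    done
}

# Only print the command; running it is left to the administrator.
if all_partitions /dev/ada0p3 /dev/ada1p3; then
    echo "ok: zpool create zroot mirror /dev/ada0p3 /dev/ada1p3"
fi
```

Called with /dev/ada0 and /dev/ada1, as in the transcript above, the check fails and nothing is printed on standard output.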

Simple enough to fix, right?  Destroy the 'zroot' pool, destroy the GPT partition layout to be safe, and create it again with the correct arguments to 'zpool create'.

So, that's what I did.

Luckily I wasn't ready to put this machine into production yet.  I still wanted to do some basic stress testing on the machine before moving anything critical to it.

Fast forward about a month.

After being satisfied that the machine did not have any obvious stability problems, such as faulty RAM for example, and after having lowered the relevant TTL entries in DNS, I decided to do one more upgrade on the machine before beginning the independent service migrations to the new machine.

This is where things started to go wrong.  Fast.

The source-based upgrade finished, and I rebooted the machine.  In another terminal, attached to the serial console, I saw the machine proceed through the normal reboot routines, killing running services, syncing buffers, and so on.

After the machine completed POST routines, everything went dark.  The machine did not respond to serial console input, and as far as I could tell, this was not due to a change caused by the update.

I should note that, by nature, I am a paranoid sysadmin.  This is a good quality, in my opinion, because I habitually go out of my way to make sure any situation is recoverable if something goes wrong.  Suspecting I did something wrong, I immediately began reviewing the history recorded while being logged in at the console.  Nothing looked suspicious.  This upgrade should have "just worked."

I remotely power-cycled the machine, and booted into our netboot environment to investigate further.

I immediately knew something had gone wrong after importing the 'zroot' pool into a temporary location and seeing several tell-tale signs.  For starters, /etc/rc.conf had a timestamp that predated the machine even being shipped to the colocation facility.  More confusingly, /usr/obj was empty, as if the 'buildworld/buildkernel'-style upgrade that took place less than an hour prior had never happened.

Then panic ensued.  The machine didn't panic -- I did.

Everything was gone.

Every configuration change since the initial install, every jail that was created, every package that was installed.  All of it.  Just gone.

While investigating, I sent a heads-up to the other cluster administrators in case there was an issue that affected other installations.  As investigation progressed, Peter realized he had seen this exact behavior in the past, and provided an example scenario with which it could occur.

It was exactly what I had done - used the raw disk for the ZFS pool instead of the 'freebsd-zfs' GPT partition.

So, what's the problem?

The problem is 'zpool destroy' does not implicitly delete pool metadata from the disks, so as far as ZFS is concerned, I had two different ZFS pools, both named 'zroot', which confused the boot blocks just enough to import the wrong pool at boot.  Well, it didn't just import the wrong pool, it imported an empty pool.

Worse yet, because I had allocated the partitions in the order of 'freebsd-boot', 'freebsd-swap', and 'freebsd-zfs', and that 'freebsd-swap' consisted of 16GB, the swap partition had more than enough space to hold on to the metadata from the pool I did not want to exist.  There was no way to force one pool to be chosen over the other, and worse, no way to tell which pool would be chosen by the loader.

The only good news at this point was that the machine was not yet in production.

How do you fix this, then?

Peter had a suggestion, since he has run into this before.  Reboot the machine into the netboot environment, and try to force the correct pool into being imported by forcibly removing all device entries for the disks and retrying the ZFS pool import.  This would be done by running:
# rm -f /dev/gptid/* /dev/diskid/* /dev/ada?
# zpool import -o altroot=/tmp/zroot zroot
Unfortunately, the wrong pool was imported again, most likely (though unconfirmed) because such a large amount of swap had been allocated on the disks.
# zpool status
zroot ONLINE
mirror-0 ONLINE
Then I realized the partition table was also corrupt.

After several attempts to coerce the correct pool to import, I became increasingly uncomfortable with leaving the machine in this condition. At this point, there was only one solution - wipe the disks, and start over.

Ultimately, despite disliking the solution, that is what I did to correct the problem, though at the time, I was unaware of the 'labelclear' command to zpool(8), which would have wiped the ZFS pool metadata from the disks.  But at that point, I was not going to take any chances either way.
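
As a rough sketch of what the 'labelclear' step could look like, the helper below only prints the commands for review instead of running them. The function name labelclear_plan and the device names ada0 and ada1 are placeholders, not taken from the incident above. Such a step most likely belongs on the raw disks right after 'zpool destroy' and before any partitioning, since clearing labels later could overwrite a partition table that has since been written to the same disk.

```shell
#!/bin/sh
# Dry run: print, but do not execute, commands for inspecting and
# clearing leftover ZFS labels. Device names are placeholders.
labelclear_plan() {
    for dev in "$@"; do
        # zdb -l dumps any ZFS labels still present on the device.
        echo "zdb -l /dev/${dev}"
        # zpool labelclear wipes them (-f exists for labels that look
        # like they belong to an exported or foreign pool).
        echo "zpool labelclear /dev/${dev}"
    done
}

labelclear_plan ada0 ada1
```

Reviewing the 'zdb -l' output first shows which pool name and GUID a label belongs to, which helps avoid clearing the wrong device.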

The takeaway is that, however innocent a mistake may appear at first, when dealing with metadata stored on disk devices, it will surely come back to haunt you sooner or later.

SCALE 13x Trip Report: Michael Dexter

The Foundation recently sponsored Michael Dexter to attend SCALE 13x. Michael provides the following trip report:

SCALE 13x was the 13th Southern California Linux Expo and took place February 19th through 22nd in Los Angeles, California. Despite its name, this year's event showed sincere outreach to the BSD community, as demonstrated by two booths and several BSD-related talks. The first booth featured FreeBSD, the FreeBSD Foundation, FreeNAS, PC-BSD and pfSense while the second featured OpenBSD and NetBSD. Both booths were filled with familiar faces including Dru Lavigne, Denise Ebery, Matt Olander, James Nixon, David Maxwell, Brooke and Seth and two toddlers!
[Photo: The FreeBSD Booth Crew. Photo courtesy of iXsystems]

The variety of booth visitors was very familiar for SCALE: a mix of students, consultants, open source developers and military/aerospace contractors. I heard lots of "I got started on FreeBSD" and "I use FreeNAS" plus the occasional "When can we have a military-certified BSD so we can stop using Linux?" The last one is something I have heard at every SCALE I have attended and is representative of the region. Hats off to the SCALE organizers for also attracting such a diverse audience.

The BSD-related talk topics included David Maxwell's newly-released pipecut that he debuted at MeetBSD, Brooks Davis' talk on the BERI CPU that he is working on with Robert Watson, Dru Lavigne's talk on new FreeNAS 9.3 features and my talk on FreeBSD Virtualization Options. There were also many overlapping talks such as those on various system containers, embedded systems and of course Brendan Gregg's talk on systems performance. Brendan kindly updated the Netflix statistics that I was already going to address and both Bryan Smith and Randal Schwartz had great user questions. It truly was a pleasure to speak at SCALE and my sincerest thanks to Brendan for live Tweeting my talk.

Impressively, some SCALE speakers were in their teens and the overall outreach to kids was great, including an evening kids-only event. The BSD Certification Group scheduled a BSDA exam but alas it was poorly attended. I humbly invite you to take the BSDA exam if you have not done so already and ask that you help spread the word whenever you get a chance.

In a community where we often preach to the converted, I find SCALE to be a very receptive venue for outreach and encourage you to attend and consider submitting a BSD-related talk to SCALE 14x. Special thanks to Gareth Greenaway for reaching out to the BSD community and for the great attitude demonstrated by his team of volunteers. Finally, I would like to thank the FreeBSD Foundation for covering my air travel and O'Reilly Media for allowing me to share a room with one of their amazing team members.

FreeBSD on the POWER8: it’s alive!

A post to freebsd-ppc from a couple of months ago asked if we had support for POWER8 and offered to provide remote access to anyone interested in working on it. I was sufficiently intrigued that I approached the FreeBSD powerpc hackers to ask about it, and was informed that it'd be nice, but we didn't have hardware.

After a bit of wrangling of hardware logistics and with the FreeBSD Foundation purchasing a box, a Tyan POWER8 evaluation server appeared. Nathan Whitehorn started poking at it and managed to get a basic "hello world" going, but stalled on issues with the Linux KVM virtualisation environment.

Fast forward a few weeks - he's figured out the KVM issues, their lack of support for some mandated hypervisor APIs and other bugs - FreeBSD now boots inside of the hypervisor environment and seems stable enough to do development on.

He then found the existing powerpc pmap (physical memory management) code wasn't very SMP friendly - it works fine on one and two CPU powerpc machines, but this POWER8 evaluation board is a 4-core, 32-thread CPU. So a few days of development went by and he rewrote most of the pmap code to use much finer-grained locking and scale much, much better than the existing code. (He also found the PS3 hypervisor layer isn't thread-safe.)

What's been done thus far?

  • FreeBSD boots inside the hypervisor environment;
  • Virtualised console, networking and storage all work;
  • (in progress) new, scalable pmap implementation;
  • Initial support for the Vector-Scalar Extension (VSX) that's found on POWER7 and POWER8.

So, I'm impressed. Nathan's done a fantastic job bringing the whole thing up. There's some further work on the new powerpc technology that needs doing (things like the new vector processing units, performance counter support and such) and I'm sure Justin and Nathan will poke powerpc dtrace support into further good shape. I'm going to see if we can fit a Chelsio 40G NIC into one of these and work with their developers to fix any endian/busdma issues that crop up, and then do some network stack scaling testing with it. There's also the missing hardware/hypervisor support to run FreeBSD on the bare metal, which would be a fantastic achievement.

Now I kind of want some larger POWER8 hardware.

Lumina Desktop 0.8.2 Released!

The next version of the Lumina desktop environment has just been released! Version 0.8.2 is mainly a “spit-and-polish” release: focusing on bugfixes, overall appearances, and interface layout/design. The FreeBSD port has already been updated to the new version, and the PC-BSD “Edge” repository will be making the new version available within the next day or two (packages building now). If you are creating/distributing your own packages, you can find the source code for this release in the “qt5/0.8.2” branch in the Lumina repository on GitHub.

The major difference that people will notice is that the themes/colors distributed with the desktop have been greatly improved, and I have included a few examples below. The full details about the changes in this release are listed at the bottom of the announcement.

Reminder: The Lumina desktop environment is still considered to be “beta-quality”, so if you find things that either don’t work or don’t work well, please report them on the PC-BSD bug tracker so that they can get fixed as soon as possible.


Changes from 0.8.1 -> 0.8.2:

New Utility: “lumina-info”
  • This utility provides basic information about the current version of the Lumina desktop as well as links to various information online (source repository, list of contributors, bug tracker, etc…).
  • New Color Scheme: PCBSD10-Default (the default for new PC-BSD users)
  • Lumina-[Red/Green/Gold/Purple] color schemes updated.
  • “Lumina-Default” and “None” theme templates updated significantly.
  • Make it easier for a distributor to set a default theme/color by just supplying the name instead of the full path (since paths can be different on different OSes)
  • Fix issue with some fonts not loading properly.
  • Distinguish between “Save” and “Apply” in the theme template editor (“Save” to update the file but keep the editor open, “Apply” to update the file and close the editor). This makes it much easier to test changes to the current theme template as you don’t need to close/re-open the editor to try it out.
  • Add the ability to distinguish between transparent/opaque desktop plugins in the theme template.
Configuration Utility Changes:
  • Add a new dialog for selecting plugins of all kinds. This makes it easier to browse through and read the descriptions of the plugins before actually selecting one.
  • Re-arrange the interface quite a bit. Now the general page arrangements make much more sense (Appearance [wallpaper/theme], Interface [desktop/panels], Applications [auto-start/file defaults], Shortcuts, Session [general options/window system] )
  • Add the ability to set your preferred time/date format(s) in the general session options.
  • Add the ability to reset your preferred file manager/terminal/web browser/email client back to defaults.
Insight File Manager Changes:
  • Make all the file operations happen in a separate thread (does not lock the UI any more).
  • Clean up the detection routine for Qt-editable files (for rotating images in the slideshow in particular).
  • When removing a file in the slideshow, make it simply go to the next/previous image instead of the beginning of the list.
  • Make the ZFS snapshot search functionality a lot faster if just refreshing the current directory.
Desktop Session Changes:
  • For vertical panels, have the clock plugin try to use vertical space instead of horizontal.
  • Clean up a few desktop plugin stability issues.
  • Streamline the session cleanup functionality.
  • Fix some issues with 2nd panel functionality.
  • Reduce the number of backend filesystem watchers (only one per session instead of one per screen now).
Lumina Search Changes:
  • When searching for files/directories, make the initial starting point user-configurable as well as add the ability to exclude directories from the search routine (also configurable by the user).


There will be a FreeBSD booth during SCALE, to be held at the Hilton LAX in Los Angeles from Thursday, February 19 to Sunday, February 22. This event has a small entrance fee and registration is required. The FreeBSD booth will be next to the OpenBSD booth. As usual, we’ll be giving away cool swag, PC-BSD DVDs, FreeNAS CDs, and brochures and accepting donations for the FreeBSD Foundation.

There will be several BSD-related talks at this event. Dru Lavigne will present “What’s New in FreeNAS 9.3” at 13:30 on Saturday, Brooks Davis will present “Booting FreeBSD on BERI” at 14:30 on Saturday, and Michael Dexter will present “FreeBSD Virtualization Options” at 16:30 on Saturday.

The BSDA certification exam will be available at noon on Sunday. You can register and pay for the exam here.

Testing the DRM update

The DRM update patch is now ready for review and testing.

As you can see, the patch is huge and I don’t expect a deep review of it. When I started to work on this, I tried to import some bits here and there. But I ended up porting every DRM file from Linux 3.8 from scratch. Many symbols were renamed (structure names, members, variables, etc.), functions were reordered or moved between files, new helpers and macros were added. The end result is that the diff with Linux is greatly reduced.

Changes are mostly under the hood; the most important ones are described below. Besides a couple of Radeon PCI IDs, drivers are largely untouched. Don’t expect Haswell support or Radeon power management in this patch, for instance.

I think the two changes which could cause stability issues are:

  • A new lock, global to the DRM subsystem, was introduced. This lock has existed in Linux for a long time, but was not used in the original port of DRM: it was replaced by the lock protecting accesses to the structure representing a device. This incompatible change required us to sometimes unlock/relock the device lock to avoid recursion. Now that the global DRM lock is restored, we may have issues in some places where the device lock is now missing.
  • Return values were hopefully sanitized. On Linux, functions return negative integers to indicate an error code. In FreeBSD, we do the opposite. The original DRM port modified many places to use positive integers. I restored negative integers to reduce the diff. Now, integers are converted from/to positive integers only when they come from a FreeBSD API or when they are returned to other FreeBSD subsystems. I reviewed every return value to make sure we use the proper sign, but I’m sure there are bugs in this area.

Now, the new things! First, support for the setmaster/dropmaster ioctls was introduced. A master is the application controlling the video device. The most common master is the X.Org server. Those ioctls allow masters to tell when they no longer use the video device so another master can use it. This can be used to run two X.Org servers for instance: when a user vt-switches, the currently displayed X.Org server drops its master privilege, allowing the second server to become the master.

The other new feature is the support for minor devices. Historically, a video device is accessed through /dev/dri/card0 for instance. As shown in the previous paragraph, if two masters compete for the same device, only one is allowed to execute commands on the video device. It could make sense for two X.Org servers. It is less than ideal if you want to run multiple OpenCL programs. Minor devices were introduced to present several /dev entries for a given video device. With this patch, you will see /dev/dri/card0 and /dev/dri/controlD64. The second dev entry is not used currently. When we update the drivers to a more recent Linux version and gain support for render nodes, another minor device will appear: /dev/dri/renderD128. Render nodes are used to grant applications such as computation programs limited access to the video device. The X.Org server remains the master application, and an OpenCL program can use the video card without having to ask anything of the server. It doesn’t even need a running server. But this feature is not for now: we need to update the drivers first.
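
To make the naming concrete, here is a small sketch of how the entries for a given card index could be derived. It assumes the Linux DRM convention that control nodes are numbered from 64 and render nodes from 128; the text above only confirms those numbers for the first card. The function name drm_nodes is made up for illustration.

```shell
#!/bin/sh
# Print the /dev entries one video card could expose, assuming control
# minors start at 64 and render minors start at 128.
drm_nodes() {
    idx=$1
    echo "/dev/dri/card${idx}"                # primary node, used by the master
    echo "/dev/dri/controlD$((64 + idx))"     # control node, currently unused
    echo "/dev/dri/renderD$((128 + idx))"     # render node, needs newer drivers
}

drm_nodes 0
```

For index 0 this prints the three names mentioned above: card0, controlD64 and renderD128.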

TDMA (somewhat) working on AR9380 chips

(Wow, I have a lot of posts to write to catch up on things.)

I've just brought up FreeBSD's TDMA support on the AR9380 chipset. Specifically, the AR9331, since I have a Carambola 2 on me today.

It was pretty simple to bring up - I was missing the beacon configuration HAL call that the TDMA code expected. It's only used by the TDMA code - the STA and AP modes rely on the normal HAL beacon methods that date back to the Atheros HAL.

The only problem - it seems something is up with ANI (noise immunity) and sensitivity on at least the AR9331. It doesn't seem to behave well on slightly loaded channels and thus the beacons don't always go out when they're supposed to.

But, if you've been wanting to play with TDMA on the later Atheros chips, now you can!

PC-BSD 11.0-CURRENT Images Now Available!

Kris just announced on the mailing list that new current images are now available! Check out the info below!

The PC-BSD project is pleased to announce the availability of our first
images based upon FreeBSD 11.0-CURRENT!

WARNING: These images are considered “bleeding-edge” and should be
treated as such.

The DVD/USB ISO files can now be downloaded from the following URL:

We hope to continue rolling these -CURRENT images as a way for testers
and developers to try out both FreeBSD and PC-BSD bleeding-edge features,
often months before a planned release. These images include a full PKG
repository compiled for that month's image. Users of this system will
also be able to “upgrade” when the next monthly image is published.

– Reporting Bugs –

This is a great way to test features and report bugs well before the
release cycle begins for the next major .0 release.

For bugs in PC-BSD, please report to:

For FreeBSD / Port / Kernel / World bugs, please report to: