I’d like to share how we set up new servers at my company.
As you know, sysinstall currently doesn’t support ZFS or GEOM configuration, but I’ll show you how to convert a system installed with sysinstall into a ZFS-only server. Even if you don’t want to follow all the steps, please take a look at the ZFS dataset layout, which after several modifications I consider quite optimal for FreeBSD.
After completing all the steps below, your system will use GPT partitions, encrypted and mirrored swap, and a mirrored ZFS system pool.
I’m assuming your server contains two identical disks (ada0 and ada1).
Start by installing FreeBSD on the first disk using the regular installation CD/DVD. Choose exactly one slice and exactly one partition. Reboot.
Your system is now up and running, booted from a single UFS file system.
Add the following lines to /boot/loader.conf:
vm.kmem_size="6G" # This should be 150% of your RAM.
vfs.zfs.arc_max="3G" # This should be a little less than the amount of your RAM.
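As a rough illustration of the two rules of thumb in the comments above, here is a minimal sh sketch that derives both tunables from the amount of RAM. The function name zfs_tunables and the 1 GB margin used for "a little less" are assumptions made for this example, not part of the original setup.

```shell
#!/bin/sh
# Hypothetical helper: print loader.conf tunables for a machine with $1 GB of RAM.
zfs_tunables() {
    ram_gb=$1
    kmem_gb=$(( ram_gb * 3 / 2 ))   # 150% of RAM
    arc_gb=$(( ram_gb - 1 ))        # "a little less" than RAM; the 1 GB margin is an assumption
    printf 'vm.kmem_size="%sG"\n' "$kmem_gb"
    printf 'vfs.zfs.arc_max="%sG"\n' "$arc_gb"
}

zfs_tunables 4
```

For a 4 GB machine this prints the same two values as the loader.conf lines above.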
Partition the second disk:
# gpart create -s GPT ada1
# gpart add -b 34 -s 128 -t freebsd-boot ada1
# gpart add -s 2g -t freebsd-swap -l swap1 ada1
# gpart add -t freebsd-zfs -l system1 ada1
# gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada1
Create swap on the second partition and the ZFS system pool on the third partition:
# gmirror label -F -h -b round-robin swap /dev/gpt/swap1
# zpool create -O mountpoint=/mnt -O atime=off -O setuid=off -O canmount=off system /dev/gpt/system1
Create the root file system and update /etc/fstab:
# cat > /etc/fstab
system/rootfs / zfs rw,noatime 0 0
/dev/mirror/swap.eli none swap sw 0 0
# zfs create -o mountpoint=legacy -o setuid=on system/rootfs
# zpool set bootfs=system/rootfs system
# mount -t zfs system/rootfs /mnt
Create the rest of the file systems (everything should be mounted below /mnt/):
# zfs create system/root
# zfs create -o compress=lzjb system/tmp
# chmod 1777 /mnt/tmp
# zfs create -o canmount=off system/usr
# zfs create -o setuid=on system/usr/local
# zfs create -o compress=gzip system/usr/src
# zfs create -o compress=lzjb system/usr/obj
# zfs create -o compress=gzip system/usr/ports
# zfs create -o compress=off system/usr/ports/distfiles
# zfs create -o canmount=off system/var
# zfs create -o compress=gzip system/var/log
# zfs create -o compress=lzjb system/var/audit
# zfs create -o compress=lzjb system/var/tmp
# chmod 1777 /mnt/var/tmp
# zfs create -o canmount=off system/usr/home
# zfs create system/usr/home/pjd
(create file systems for all your users)
Enable ZFS in /etc/rc.conf:
# echo 'zfs_enable="YES"' >> /etc/rc.conf
I recommend setting the ports work directory to /usr/obj:
# echo WRKDIRPREFIX=/usr/obj >> /etc/make.conf
Copy the entire system to the second disk (note that there are two dashes before one-file-system):
# cd /
# tar -c --one-file-system -f - . | tar xpf - -C /mnt/
Unmount the ZFS file systems and change the pool mountpoint to /. It will be inherited by all file systems:
# zfs umount -a
# umount /mnt
# zfs set mountpoint=/ system
Reboot. If your machine booted fine (it should, but…) you will see the following in the mount output:
system/rootfs on / (zfs, local, noatime)
devfs on /dev (devfs, local)
system/root on /root (zfs, local, noatime, nosuid)
system/tmp on /tmp (zfs, local, noatime, nosuid)
system/usr/home/pjd on /usr/home/pjd (zfs, local, noatime, nosuid)
system/usr/obj on /usr/obj (zfs, local, noatime, nosuid)
system/usr/ports on /usr/ports (zfs, local, noatime, nosuid)
system/usr/ports/distfiles on /usr/ports/distfiles (zfs, local, noatime, nosuid)
system/usr/src on /usr/src (zfs, local, noatime, nosuid)
system/var/audit on /var/audit (zfs, local, noatime, nosuid)
system/var/log on /var/log (zfs, local, noatime, nosuid)
system/var/tmp on /var/tmp (zfs, local, noatime, nosuid)
# swapctl -l
Device:               1024-blocks     Used:
/dev/mirror/swap.eli      4194300      4888
Now we need to attach the first disk (ada0):
# dd if=/dev/zero of=/dev/ada0 count=79
# gpart create -s GPT ada0
# gpart add -b 34 -s 128 -t freebsd-boot ada0
# gpart add -s 2g -t freebsd-swap -l swap0 ada0
# gpart add -t freebsd-zfs -l system0 ada0
# gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0
# gmirror insert -h -p 1 swap /dev/gpt/swap0
(wait for gmirror to synchronize swap)
# gmirror status
       Name    Status  Components
mirror/swap  COMPLETE  gpt/swap1
# zpool attach system /dev/gpt/system1 /dev/gpt/system0
(wait for pool to resilver)
# zpool status
 scrub: resilver completed after 0h2m with 0 errors on Mon Aug 2 11:28:45 2010
        NAME             STATE     READ WRITE CKSUM
        system           ONLINE       0     0     0
          mirror         ONLINE       0     0     0
            gpt/system1  ONLINE       0     0     0  55,5M resilvered
            gpt/system0  ONLINE       0     0     0  1,67G resilvered
errors: No known data errors
That’s all folks.
BTW, because /var/log/ is compressed by ZFS, you can turn off log compression on rotation in /etc/newsyslog.conf.
Below you can find a few reasons why I use the proposed dataset layout:
- The mountpoint property is not manipulated on any dataset except the system dataset, which starts the tree. When you manage ZFS, it can be confusing when a dataset’s place in the ZFS hierarchy doesn’t match its mount point. The mountpoint property is always inherited, so if you change it for some reason, you know that all datasets below will inherit it properly and no other changes are needed.
- Note that system/usr and system/var are not mounted. The /usr/bin, /usr/sbin, /var/named, etc. directories all belong to the system/rootfs dataset. When you upgrade your system, it should be enough to snapshot the system/rootfs dataset alone and roll back only that dataset if something goes wrong.
- In the proposed layout you can set setuid to off for all datasets except system/rootfs and system/usr/local.
- You have /usr/src, /usr/obj, /tmp, /var/tmp, /var/log and /var/audit compressed.
- Having the entire OS in one dataset makes it easy to create jails from it. When I create a new jail, I clone system/rootfs to system/jails/<name> and create the following additional datasets for the jail:
- system/jails/<name>/usr (set canmount to off)
- system/jails/<name>/usr/work (I build ports here)
- system/jails/<name>/root (jailed root home directory)
This way you can make the system/jails/<name> dataset read-only, and when you update your main system you perform the following steps to update a jail:
- stop the jail
- rename system/jails/<name> to system/jails/<name>_old
- clone system/jails/<name> again from system/rootfs
- move all datasets that are now below system/jails/<name>_old under the newly cloned system/jails/<name>
- update system/jails/<name>/etc
- update system/jails/<name>/var
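The jail update steps above can be sketched as a dry run that only prints the zfs commands it would issue. This is a sketch under assumptions, not my actual script: the function name jail_update_plan and the snapshot-name argument are hypothetical, the snapshot step is added because a clone needs a snapshot as its origin, and stopping the jail and updating etc and var are left out.

```shell
#!/bin/sh
# Dry-run sketch: prints the zfs commands for refreshing a jail instead of running them.
# jail_update_plan and its snapshot-name argument are hypothetical.
jail_update_plan() {
    name=$1
    snap=$2
    jail=system/jails/$name
    # Renaming the jail dataset takes its children (usr, usr/work, root) along.
    echo "zfs rename $jail ${jail}_old"
    # A clone is always created from a snapshot.
    echo "zfs snapshot system/rootfs@$snap"
    echo "zfs clone system/rootfs@$snap $jail"
    # Move the per-jail datasets under the fresh clone; usr/work follows usr.
    for child in usr root; do
        echo "zfs rename ${jail}_old/$child $jail/$child"
    done
}

jail_update_plan www 20100802
```

Review the printed commands before running any of them by hand. Once the children have been moved, only the old clone of the root file system remains in system/jails/<name>_old.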
I have been using geli to encrypt a partition on my laptop for a very long time.
The only problem comes when I need to suspend the laptop (yes, suspend works almost like a charm on my T43): I need to detach my encrypted partition first. It would be more or less safe to leave it attached, as I lock my console with the ‘vlock -a’ command, so the only thing a thief can do is turn off the laptop, which removes the keys from memory. Still, leaving the partition attached with all the keys in memory doesn’t seem right…
BTW, ‘vlock -a’ is really nice, because when everything is locked, it will reset the system when one tries to enter DDB. Not sure if that is intended behaviour, but it is very useful.
Detaching the encrypted partition is a bit of a PITA, as I keep a lot of stuff there. Before I can unmount the file system and detach it, I need to go through all my xterms and cd out of directories on the encrypted file system, close all encrypted files, etc.
I decided to implement suspend and resume subcommands for geli(8). Before I suspend my laptop I execute ‘geli suspend’. This command tells the GELI GEOM class in the kernel to remove all sensitive information and delay all further I/O requests until the ‘geli resume’ (or ‘geli detach’) command is called. This way I don’t need to unmount the file system sitting on top of the encrypted partition. When I execute ‘geli resume’ after waking the laptop, I provide my password just as for ‘geli attach’, which allows the GEOM class to recreate all the keys in the kernel and start the I/O traffic again.
The tricky part is not to suspend a provider which ‘geli resume’ needs to access, because you will simply deadlock your system. For example, it most likely won’t work for a fully encrypted disk. One way to fix this is to join the functionality of the suspend and resume subcommands, i.e. ‘geli suspend’ would ask for the passphrase itself (without the need to read or execute anything), and the passphrase can be given after resuming the laptop. I haven’t decided what to do about that yet. The code is in my perforce tree for now and will probably be committed after RELENG_7 is branched.
The ZFS file system is nicely integrated with Solaris zones. It occurred to me long ago that it would be very cool to integrate ZFS with FreeBSD jails.
People have been talking for a long time about making jails more featureful, but mostly on the networking front (my multiple-IPs-in-jail patch, Marko Zec’s vimage, etc.), and they forget about storage.
Unfortunately we can’t allow jail administrators to access raw disks/partitions (GEOM providers), create file systems on them and mount those file systems. The reason is that we can’t handle corrupted file system metadata: someone inside a jail could write garbage to a file system partition and panic the entire system.
And here comes ZFS. ZFS can very nicely be used for such needs, because we operate on raw disks only when we operate on pools. When we operate on file systems, we don’t have access to raw disks anymore.
The basic idea is to allow pool management only from outside a jail, and assign ZFS file systems to the jails.
main# zfs create tank/jail
main# zfs set jailed=on tank/jail
main# zfs jail 1 tank/jail
Now from within a jail we can:
jail# zfs create tank/jail/home
jail# zfs create tank/jail/home/pjd
jail# zfs create tank/jail/home/pawel
jail# zfs snapshot tank/jail/home/pjd@backup
jail# zfs destroy tank/jail/home/pawel
jail# zfs rollback tank/jail/home/pjd@backup
jail# zfs set compression=on tank/jail/home/pjd
I’ll use this slot to keep a list of dataset names accessible inside the jail.
Good, now we need to allow mounting and unmounting file systems from within a jail.
I added the security.jail.mount_allowed sysctl, which, when set to 1, grants the PRIV_VFS_MOUNT and PRIV_VFS_UNMOUNT privileges to the jailed root. I’m not really happy with this sysctl, because currently it allows mounting file systems like devfs and procfs from within a jail, which is very wrong, especially for devfs.
We also need to set security.jail.enforce_statfs to 0, so the zfs command can see mounted file systems.
What is described up to this point is already implemented and works.
I’d also like to add a ‘jails’ property, which tells which jails the given dataset is attached to.
BIO_DELETE is yet another I/O request. The two most famous I/O requests are of course BIO_READ and BIO_WRITE.
BIO_DELETE can basically be used to say “I don’t need this range of data anymore”. The underlying provider that receives such a request can do various things, depending on its purpose. The geli(8) GEOM class, which implements disk encryption, can, for example, fill the range with random bits. A memory- or swap-backed md(4) device could just free the memory. Unfortunately there is currently no support for BIO_DELETE on the file system side. When UFS frees some blocks, it should send BIO_DELETE.
And this is basically what I recently implemented. Actually, it was surprisingly easy to implement, but it is not implemented correctly yet. The problem is that UFS uses the buffer cache for writes. If it sends a delayed write request, which is meant to update the blocks used by the given inode and the free blocks bitmap, I can’t send BIO_DELETE immediately. Sending BIO_DELETE right away means that if a system crash or a power failure occurs between BIO_DELETE and the inode update, we may end up with a file pointing to garbage.
Why is BIO_DELETE worth implementing? Because once it’s in place we can implement a lot of really nice features on top of it. Let me name a few:
- an md swap-backed temporary file system where memory is freed when you delete a file,
- block deallocation for gvirstor; gvirstor is a GEOM class implemented by Ivan Voras during Google SoC, which lets you define a very large virtual provider backed by limited physical storage, to which more can be added when needed; unfortunately, currently when a file is deleted, gvirstor has no way to reassign those blocks elsewhere,
- a GEOM compression layer; actually it could be implemented without BIO_DELETE, but a compression layer is about saving space, right? Actually, adding compression to gvirstor may not be a bad idea,
- maybe gjournal can hold its journal in free blocks and migrate the journal when blocks are allocated? :) Silly idea, but with BIO_DELETE in place we can tell which blocks are really used by the file system, and this is very powerful information.
NetBSD has a UFS implementation that allows mounting UFS file systems created on an architecture with different endianness. In other words, one can do newfs on sparc64 and mount it on i386 or vice versa. Very cool. It works by detecting which endianness the file system uses and byteswapping fields in UFS structures at run-time as needed. I wanted to see how hard it would be to implement something similar on FreeBSD. After one night of hacking, mounting file systems read-only seems to work fine. I decided to work a bit lower than NetBSD and I replace bread()s with special functions that byteswap fields when needed. This saves quite a lot of code, but not everything can be implemented that way. I can byteswap the superblock, cylinder groups and inodes, but I can’t do the same for dirents, because ufs_readdir() uses plain VOP_READ() to read directory entries, so I need to do the same as NetBSD does for dirents. My method is most likely slower than NetBSD’s, because when the file system reads one block of inodes, I byteswap them all even though only one inode may be used later. We will see if the performance impact is too high at some point. On the other hand, you probably don’t want to use this functionality very often.
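To make the detection step concrete, here is a small sh sketch of the idea. It is not the kernel code, which is C. The function names swap32 and fs_byte_order are made up for this example; 0x19540119 is the UFS2 superblock magic number. If the magic read in host byte order only matches after its bytes are reversed, the file system was created on a machine with the opposite endianness.

```shell
#!/bin/sh
# Reverse the byte order of a 32-bit value; prints 8 hex digits.
swap32() {
    x=$(( $1 ))
    printf '%08x\n' $(( ((x & 0xff) << 24) | ((x & 0xff00) << 8) | ((x >> 8) & 0xff00) | ((x >> 24) & 0xff) ))
}

# Classify a superblock magic number as read in host byte order.
# Only the UFS2 magic is checked in this sketch.
fs_byte_order() {
    magic=$(( $1 ))
    if [ "$magic" -eq $(( 0x19540119 )) ]; then
        echo native
    elif [ "$(swap32 "$magic")" = 19540119 ]; then
        echo swapped
    else
        echo unknown
    fi
}

fs_byte_order 0x19015419
```

The same swap would then have to be applied to every multi-byte field of the superblock, cylinder groups and inodes, which is the work the byteswapping read functions take on.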
I’ve spent some time working on a test suite that verifies whether a file system works correctly. It mostly checks POSIX compliance and works for FreeBSD/UFS, FreeBSD/ZFS, Solaris/UFS and Solaris/ZFS. The list of system calls tested is as follows: chflags, chmod, chown, link, mkdir, mkfifo, open, rename, rmdir, symlink, truncate, unlink. There are 3438 regression tests in 184 files and believe me, this was really boring work, but very educational on the other hand. All those tests are already committed to FreeBSD’s HEAD branch under src/tools/regression/fstest/. During the work I also updated many manual pages. At some point I’m planning to make this test suite work on Darwin and Linux, but I’m not sure yet when exactly.
The main motivation for this work was that there is no free POSIX compliance test suite, AFAIK. Shame on you.
This is my first entry about ZFS. I’m not going to describe what ZFS is, etc., but I need some place to write about my progress and this seems to be the right place.
As you may or may not know, I’m porting the ZFS file system to FreeBSD. The port is almost finished. Something like 98% of the functionality is implemented and works. You can read more about ZFS on the OpenSolaris page and more about my port on various FreeBSD mailing lists. Today I finished NFS support, so you can now NFS mount ZFS file systems from FreeBSD. The remaining part I coded today was readdir functionality for GFS. GFS (Generic pseudo-FileSystem) is a Solaris framework for virtual file systems. ZFS uses this framework to create the .zfs/ directory where snapshots are placed. From now on you can list the .zfs/ and .zfs/snapshot/ directories via NFS too. This was the only missing piece of NFS support.
Ok, the work on fast fsck is finished. I integrated it with fsck_ffs(8).
I also changed the way a gjournaled file system is created. Now you need to use the ‘-j’ option of newfs(8) to create a gjournaled file system. You can also use ‘tunefs -j enable’ on an existing file system. You don’t need to provide the ‘gjournal’ mount option any more, but when a file system is gjournaled it will be visible in the mount options.
I removed the hack gjournal used to find gjournaled file systems. It is much nicer now.
Currently fsck scans all cylinder groups, which is fast, but it could be a lot faster if I could add a cs_nunrefs field to the csum structure. The file system keeps statistics about all cylinder groups after the super-block, and knowing which cylinder groups have unreferenced inodes would save us a lot of seeking. Unfortunately that would change the UFS on-disk layout, which is not acceptable at this point.
After publishing the patches I implemented a few suggestions from Robert Watson.
I also implemented better handling of journal overflow. Now one can define the point at which a journal switch is forced. It is set as a percentage via kern.geom.journal.force_switch. Not ideal, but much better than panicking.
BIO_FLUSH support was added to da(4) disks.
I’m in the process of reimplementing how removed, but still open objects (files, directories) are handled in case of a power failure or system crash.
Currently there is a magic .deleted/ directory where such objects are moved. On boot mount(8) can simply ‘rm -rf .deleted/*’.
The solution I’m currently implementing is to increase a counter in the cylinder group (CG) when such a situation takes place (an object is removed, but still open). When the object is closed, I simply decrease the counter. On boot, a much faster fsck version checks the counter in every CG and if it is greater than 0, it scans all inodes in the CG. If it finds an allocated inode with linkcnt=0, it cleans the inode, i.e. removes all allocated blocks, marks it as unallocated, etc.
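The selection step of that fast check can be modelled with a toy sh filter. It has nothing to do with the real fsck code: the input format (one "CG number, counter" pair per line) and the function name cgs_to_scan are made up for illustration. It only shows that cylinder groups with a zero counter are skipped entirely.

```shell
#!/bin/sh
# Toy model of the fast check: read "cg_number counter" pairs on stdin and
# print the cylinder groups whose inodes would need a full scan.
# The input format and the function name are made up for illustration.
cgs_to_scan() {
    while read -r cg count; do
        if [ "$count" -gt 0 ]; then
            echo "$cg"
        fi
    done
}

printf '0 0\n1 2\n2 0\n3 1\n' | cgs_to_scan
```

With this input only cylinder groups 1 and 3 are printed, so only their inodes would be searched for allocated inodes with a zero link count.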
Because of the complexity of the UFS on-disk layout, it is not a trivial task. To remove an inode completely I need to:
1. Go through the list of direct block pointers and mark the blocks as free in the free-block bitmaps.
2. Do the same as in 1, but for indirect blocks (single, double and triple).
3. Watch out for fragments. Fragments are only used for small files (which use only directly pointed blocks) and a fragment can only be at the end of the file.
4. Update block and fragment counters in three places.
5. Update the cluster bitmap.
6. Update the pointer to the last used block and fragment if needed.
7. Zero-fill the entire inode.
8. Mark the inode as free in the inode bitmap.
9. Update inode-related counters in three places.
10. Update the pointer to the last used inode.
11. Pray for success.
Most of the things I already implemented, but this is not the code I’d like to show. I’m thinking about extending libufs(3), because right now its usefulness is very limited.