ZFS and FreeBSD jails

March 18th, 2007 by pjd

The ZFS file system is nicely integrated with Solaris zones. It occurred to me a long time ago that it would be very cool to integrate ZFS with FreeBSD jails.

People have been talking for a long time about making jails more featureful, but mostly on the networking front (my multiple-IPs-in-jail patch, Marko Zec’s vimage, etc.), and they forget about storage.


Unfortunately we can’t allow jail administrators to access raw disks or partitions (GEOM providers), create file systems on them and mount those file systems. The reason is that we can’t handle corrupted file system metadata, so someone inside a jail could write garbage to a file system partition and panic the entire system.


And here comes ZFS. ZFS can very nicely be used for such needs, because we operate on raw disks only when we operate on pools. When we operate on file systems, we don’t have access to raw disks anymore.


The basic idea is to allow pool management only from outside a jail, and assign ZFS file systems to the jails.


For example:
main# zfs create tank/jail

main# zfs set jailed=on tank/jail

main# zfs jail 1 tank/jail

Now from within a jail we can:
jail# zfs create tank/jail/home

jail# zfs create tank/jail/home/pjd

jail# zfs create tank/jail/home/pawel

jail# zfs snapshot tank/jail/home/pjd@backup

jail# zfs destroy tank/jail/home/pawel

jail# zfs rollback tank/jail/home/pjd@backup

jail# zfs set compression=on tank/jail/home/pjd
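
The rule these commands rely on can be pictured as a simple name check: the jail may manage the dataset attached to it and everything beneath it, and nothing else in the pool. The sketch below is only a toy model of that rule in Python; the function name and the prefix test are assumptions made for illustration, not how the kernel code decides it.

```python
# Toy model (assumption): a jail may operate on a dataset that is attached
# to it, on any descendant of that dataset, and on their snapshots.

def is_visible(dataset, attached):
    """Return True if `dataset` is an attached dataset or lies beneath one."""
    for root in attached:
        if dataset == root:
            return True
        if dataset.startswith(root + "/") or dataset.startswith(root + "@"):
            return True
    return False


attached_to_jail_1 = {"tank/jail"}  # the effect of: zfs jail 1 tank/jail

print(is_visible("tank/jail/home/pjd", attached_to_jail_1))
print(is_visible("tank/jail/home/pjd@backup", attached_to_jail_1))
print(is_visible("tank/jailbreak", attached_to_jail_1))  # similar name, other dataset
print(is_visible("tank", attached_to_jail_1))            # the pool itself stays outside
```

Comparing against the root plus a separator, rather than a bare string prefix, is what keeps a sibling such as tank/jailbreak out.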

I’ll use this slot to keep a list of dataset names accessible inside the jail.

Good, now we need to allow mounting and unmounting file systems from within a jail.

I added the security.jail.mount_allowed sysctl, which, when set to 1, grants the PRIV_VFS_MOUNT and PRIV_VFS_UNMOUNT privileges to jailed root. I’m not really happy with this sysctl, because currently it allows mounting file systems like devfs and procfs from within a jail, which is very wrong, especially for devfs.


We also need to set security.jail.enforce_statfs to 0, so the zfs command can see mounted file systems.


Everything described up to this point is already implemented and works.

I’d also like to add a ’jails’ property, which tells which jails the given dataset is attached to.

UFS and BIO_DELETE

January 30th, 2007 by pjd

BIO_DELETE is yet another I/O request. The two most famous I/O requests are of course BIO_READ and BIO_WRITE.

BIO_DELETE can basically be used to say “I don’t need this range of data anymore”. The underlying provider that receives such a request can do various things, depending on its purpose. The geli(8) GEOM class, which implements disk encryption, can for example fill the range with random bits. A memory- or swap-backed md(4) device could just free the memory. Unfortunately there is currently no support for BIO_DELETE on the file system side. When UFS frees some blocks, it should send BIO_DELETE.

And this is basically what I recently implemented. Actually, it was surprisingly easy, but it is not implemented correctly yet. The problem is that UFS uses the buffer cache for writes. If it issues a delayed write request, which is meant to update the blocks used by the given inode and the free blocks bitmap, I can’t send BIO_DELETE immediately. Sending BIO_DELETE right away means that if a system crash or a power failure occurs between the BIO_DELETE and the inode update, we may end up with a file pointing to garbage.

Why is BIO_DELETE worth implementing? Because once it’s in place we can implement a lot of really nice features on top of it. Let me name a few:
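
To make the idea concrete, here is a minimal sketch in Python of a memory-backed provider that reacts to a delete request by freeing the storage behind the range. The class and method names are made up for illustration; this is a toy model of the behaviour described above, not the md(4) implementation.

```python
class MemoryDisk:
    """Toy sparse memory-backed provider (hypothetical, loosely modelled on md(4))."""

    def __init__(self, sector_size=512):
        self.sector_size = sector_size
        self.sectors = {}  # sector number -> bytes; absent sectors read as zeros

    def write(self, sector, data):
        if len(data) != self.sector_size:
            raise ValueError("writes must be exactly one sector long")
        self.sectors[sector] = bytes(data)

    def read(self, sector):
        return self.sectors.get(sector, bytes(self.sector_size))

    def delete(self, first, count):
        """Handle a BIO_DELETE-like request: forget the range and free its memory."""
        for sector in range(first, first + count):
            self.sectors.pop(sector, None)

    def allocated_bytes(self):
        return len(self.sectors) * self.sector_size


disk = MemoryDisk()
for n in range(8):
    disk.write(n, b"\xaa" * 512)

disk.delete(2, 4)  # "I don't need sectors 2..5 anymore"
print(disk.allocated_bytes())  # only sectors 0, 1, 6 and 7 still hold memory
```

Without a delete request the provider has no way to know that sectors 2 to 5 are unused, so it has to keep their memory forever.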

– md swap-backed temporary file system, on which memory is freed when you delete a file,

– block deallocation for gvirstor; gvirstor is a GEOM class implemented by Ivan Voras during Google SoC, which lets you initially define a very large virtual provider with limited physical storage behind it and add more storage when needed; unfortunately, when a file is deleted, gvirstor currently has no way to reassign those blocks elsewhere,

– GEOM compression layer; actually it could be implemented without BIO_DELETE, but a compression layer is about saving space, right? Actually, adding compression to gvirstor may not be a bad idea,

– maybe gjournal can hold its journal in free blocks and migrate the journal when those blocks are allocated? :) Silly idea, but with BIO_DELETE in place we can tell which blocks are really used by the file system, and this is very powerful information.

Endian-independent UFS.

January 28th, 2007 by pjd

NetBSD has a UFS implementation that can mount UFS file systems created on an architecture with different endianness. In other words, one can do newfs on sparc64 and mount it on i386 or vice versa. Very cool. It works by detecting which endianness the file system uses and byteswapping fields in UFS structures at run-time as needed.

I wanted to see how hard it would be to implement something similar on FreeBSD. After one night of hacking, mounting file systems read-only seems to work fine. I decided to work a bit lower than NetBSD and replaced bread()s with special functions that byteswap fields when needed. This saves quite a lot of code, but not everything can be implemented that way. I can byteswap the superblock, cylinder groups and inodes, but I can’t do the same for dirents, because ufs_readdir() uses plain VOP_READ() to read directory entries, so for dirents I need to do the same thing NetBSD does.

My method is most likely slower than NetBSD’s, because when the file system reads one block of inodes, I byteswap them all even though only one inode may be used later. We will see at some point if the performance impact is too high. On the other hand, you probably don’t want to use this functionality very often.
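
The detection step can be sketched in a few lines of Python: try to decode the magic number in both byte orders and keep the one that matches. The three-field layout below is a made-up, heavily simplified stand-in for the real superblock, and read_superblock() is a hypothetical helper; only the UFS2 magic value is real.

```python
import struct

FS_UFS2_MAGIC = 0x19540119  # UFS2 superblock magic number

# Hypothetical, heavily simplified "superblock": magic, block size, fragment size.
LAYOUT = "III"


def read_superblock(raw):
    """Decode raw bytes, byteswapping if the file system has foreign endianness."""
    for order in ("<", ">"):  # little-endian first, then big-endian
        magic, bsize, fsize = struct.unpack(order + LAYOUT, raw)
        if magic == FS_UFS2_MAGIC:
            return {"byteorder": order, "bsize": bsize, "fsize": fsize}
    raise ValueError("not a recognised superblock")


# The same logical superblock as written by a big-endian and a little-endian machine.
big = struct.pack(">III", FS_UFS2_MAGIC, 16384, 2048)
little = struct.pack("<III", FS_UFS2_MAGIC, 16384, 2048)

print(read_superblock(big))
print(read_superblock(little))
```

This works because the magic number is not symmetric under byte reversal, so only one of the two decodings can match.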

Regression tests for file systems.

January 28th, 2007 by pjd

I’ve spent some time working on a test suite that verifies that a file system works correctly. It mostly checks POSIX compliance and works for FreeBSD/UFS, FreeBSD/ZFS, Solaris/UFS and Solaris/ZFS. The list of system calls tested is as follows: chflags, chmod, chown, link, mkdir, mkfifo, open, rename, rmdir, symlink, truncate, unlink. There are 3438 regression tests in 184 files and believe me, this was really boring work, but very educational on the other hand. All those tests are already committed to FreeBSD’s HEAD branch under src/tools/regression/fstest/. During the work I also updated many manual pages. At some point I’m planning to make this test suite work on Darwin and Linux, but I’m not sure yet when exactly.

The main motivation for this work was that there is no free POSIX compliance test suite, AFAIK. Shame on you.

ZFS progress.

January 28th, 2007 by pjd

This is my first entry about ZFS. I’m not going to describe what ZFS is, etc.; I just need some place to write about my progress and this seems to be the right place.

As you may or may not know, I’m porting the ZFS file system to FreeBSD. The port is almost finished. Something like 98% of the functionality is implemented and works. You can read more about ZFS on the OpenSolaris page and more about my port on various FreeBSD mailing lists. Today I finished NFS support, so you can now NFS mount ZFS file systems from FreeBSD. The remaining part I coded today was readdir functionality for GFS. GFS (Generic pseudo-FileSystem) is a Solaris framework for virtual file systems. ZFS uses this framework to create the .zfs/ directory where snapshots are placed. From now on you can list the .zfs/ and .zfs/snapshot/ directories via NFS too. This was the only missing piece of NFS support.

GJournal progress…

July 12th, 2006 by pjd

Ok, the work on fast fsck is finished. I integrated it with fsck_ffs(8).

I also changed the way a gjournaled file system is created. Now you need to use the ‘-j’ option for newfs(8) to create a gjournaled file system. You can also use ‘tunefs -j enable’ on an existing file system. You don’t need to provide the ‘gjournal’ mount option any more, but when a file system is gjournaled it will be visible in the mount options.

I removed the hack gjournal used to find gjournaled file systems. It is much nicer now.

Currently fsck will scan all cylinder groups, which is fast, but it could be a lot faster if I could add a cs_nunrefs field to the csum structure. The file system keeps statistics about all cylinder groups after the super-block, and knowing which cylinder groups have unreferenced inodes would save us a lot of seeking. Unfortunately it would change the UFS on-disk layout, which is not acceptable at this point.

GJournal status.

July 6th, 2006 by pjd

After publishing the patches I implemented a few suggestions from Robert Watson.

I also implemented better handling of journal overflow. Now one can define at which point a journal switch is forced. This is defined as a percentage via kern.geom.journal.force_switch. Not ideal, but much better than panicking.
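
As a rough illustration of what such a threshold means, the check boils down to comparing the journal’s fill level with the configured percentage. The function below is a hypothetical Python sketch; the exact comparison and the values used are assumptions for the example, not taken from the gjournal code.

```python
def journal_switch_needed(used, size, force_switch_percent):
    """True once the active journal is at least force_switch_percent full.

    Integer arithmetic only, so no rounding surprises near the threshold.
    """
    return used * 100 >= size * force_switch_percent


MIB = 1024 * 1024
size = 1024 * MIB  # a 1 GiB journal (example value)
threshold = 70     # example setting, not necessarily the default

print(journal_switch_needed(700 * MIB, size, threshold))  # about 68% full
print(journal_switch_needed(720 * MIB, size, threshold))  # just over 70% full
```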

BIO_FLUSH support was added to da(4) disks.

I’m in the process of reimplementing how objects (files, directories) that were removed but are still open are handled in case of a power failure or system crash.

Currently there is a magic .deleted/ directory where such objects are moved. On boot mount(8) can simply ‘rm -rf .deleted/*’.

The solution I’m implementing currently is to increase a counter in the cylinder group (CG) when such a situation takes place (an object is removed, but still open). When the object is closed, I simply decrease the counter. On boot, a much faster fsck version checks the counter in every CG and, if it is greater than 0, scans all inodes in the CG. If it finds an allocated inode with linkcnt=0, it cleans the inode, i.e. removes all allocated blocks, marks it as unallocated, etc.
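
The counter scheme can be modelled in a few lines of Python. Everything here (class, function and field names) is hypothetical and only mirrors the description above: the counter goes up when an open object is removed, down when it is finally closed, and after a crash only the groups with a non-zero counter get scanned.

```python
class CylinderGroup:
    """Toy cylinder group: allocated inodes plus the per-CG counter."""

    def __init__(self):
        self.linkcnt = {}  # allocated inode number -> link count
        self.unrefs = 0    # objects removed while still open


def unlink_open(cg, ino):
    """The last name of a still-open object is removed."""
    cg.linkcnt[ino] = 0
    cg.unrefs += 1


def last_close(cg, ino):
    """The object is finally closed, so it can be freed the normal way."""
    del cg.linkcnt[ino]
    cg.unrefs -= 1


def fast_fsck(groups):
    """Clean up after a crash; return how many groups had to be scanned."""
    scanned = 0
    for cg in groups:
        if cg.unrefs == 0:
            continue  # nothing was left dangling here, skip the whole CG
        scanned += 1
        for ino in [i for i, n in cg.linkcnt.items() if n == 0]:
            del cg.linkcnt[ino]  # stands in for the full inode cleanup
        cg.unrefs = 0
    return scanned


groups = [CylinderGroup() for _ in range(4)]
for n, cg in enumerate(groups):
    cg.linkcnt = {n * 100 + i: 1 for i in range(3)}

unlink_open(groups[0], 1)    # removed while open ...
last_close(groups[0], 1)     # ... and closed normally before the crash
unlink_open(groups[2], 201)  # removed while open, then the power fails

print(fast_fsck(groups))     # number of cylinder groups that had to be scanned
print(sorted(groups[2].linkcnt))
```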

Because of the complexity of UFS on-disk layout it is not a trivial task. To remove an inode completely I need to:

1. Go through the list of direct block pointers and mark the blocks as free in the free block bitmap.

2. The same as in 1, but for indirect blocks (single, double and triple).

3. Watch out for fragments. Fragments are only used for small files (which use only directly pointed blocks) and a fragment can only be at the end of the file.

4. Update block and fragment counters in three places.

5. Update the cluster bitmap.

6. Update the pointer to the last used block and fragment if needed.

7. Zero-fill the entire inode.

8. Mark the inode as free in the inode bitmap.

9. Update inode-related counters in three places.

10. Update the pointer to the last used inode.

11. Pray for success.
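
A heavily simplified sketch of a few of these steps, in Python, may help to picture the bookkeeping. It covers only direct blocks (step 1), one counter instead of three (steps 4 and 9), clearing the inode (step 7) and the inode bitmap (step 8); indirect blocks, fragments, clusters and the last-used pointers are left out. The dictionary layout is invented for the example and has nothing to do with the real UFS on-disk structures.

```python
def remove_inode(fs, ino):
    """Free an inode in a toy file system model (direct blocks only)."""
    inode = fs["inodes"][ino]

    # Step 1: mark every direct block as free (0 means "no block here").
    for blk in inode["direct"]:
        if blk != 0:
            fs["block_free"][blk] = True
            fs["nbfree"] += 1  # step 4, with one counter instead of three

    # Step 7: wipe the inode contents.
    inode["direct"] = [0] * len(inode["direct"])
    inode["size"] = 0

    # Steps 8 and 9: mark the inode free and update the inode counter.
    fs["inode_free"][ino] = True
    fs["nifree"] += 1


fs = {
    # 16 blocks; block 0 is never handed out, blocks 3, 4 and 7 are in use.
    "block_free": [b not in (0, 3, 4, 7) for b in range(16)],
    "nbfree": 12,
    # 8 inodes; only inode 5 is allocated.
    "inode_free": [i != 5 for i in range(8)],
    "nifree": 7,
    "inodes": {5: {"direct": [3, 4, 7] + [0] * 9, "size": 3 * 16384}},
}

remove_inode(fs, 5)
print(fs["nbfree"], fs["nifree"])
```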

Most of this is already implemented, but this is not code I’d like to show. I’m thinking about extending libufs(3), because right now its usefulness is very limited.

First post.

July 6th, 2006 by pjd

This is my first entry in my first blog. Let’s see how it looks.