BIO_DELETE is yet another I/O request. The two most famous I/O requests are of course BIO_READ and BIO_WRITE.
BIO_DELETE can basically be used to say “I don’t need this range of data anymore”. The underlying provider that receives such a request can do various things, depending on its purpose. The geli(8) GEOM class, which implements disk encryption, can for example fill the range with random bits. A memory- or swap-backed md(4) device could just free the memory. Unfortunately, there is currently no support for BIO_DELETE on the file system side. When UFS frees some blocks, it should send BIO_DELETE.
And this is basically what I recently implemented. It was surprisingly easy to implement, but it is not implemented correctly yet. The problem is that UFS uses the buffer cache for writes. If it issues a delayed write request, which is meant to update the blocks used by the given inode and the free-block bitmap, I can’t send BIO_DELETE immediately. Sending BIO_DELETE right away means that if a system crash or a power failure occurs between the BIO_DELETE and the inode update, we may end up with a file pointing to garbage.
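To make the ordering problem concrete, here is a minimal user-space sketch of one possible fix: remember freed ranges and only report them downward once the metadata write has completed. This is not the actual UFS code; `delete_queue`, `queue_delete()`, `metadata_write_done()` and `issue_delete()` are hypothetical names invented for illustration, and `issue_delete()` merely counts bytes where a real implementation would send BIO_DELETE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 16

/* A block range the file system has freed but not yet reported downward. */
struct pending_delete {
	uint64_t offset;
	uint64_t length;
};

/* Hypothetical queue of ranges waiting for the metadata write. */
struct delete_queue {
	struct pending_delete items[MAX_PENDING];
	size_t count;
};

/* Stand-in for sending BIO_DELETE: just count the bytes reported. */
static uint64_t bytes_deleted;

static void
issue_delete(uint64_t offset, uint64_t length)
{
	(void)offset;
	bytes_deleted += length;
}

/* Called when blocks are freed: remember the range, send nothing yet. */
static int
queue_delete(struct delete_queue *q, uint64_t offset, uint64_t length)
{
	if (q->count == MAX_PENDING)
		return (-1);
	q->items[q->count].offset = offset;
	q->items[q->count].length = length;
	q->count++;
	return (0);
}

/*
 * Called once the delayed write of the inode and the free-block bitmap
 * has reached the disk. Only now is it safe to tell the provider that
 * the ranges are unused, because no on-disk inode points at them.
 */
static void
metadata_write_done(struct delete_queue *q)
{
	for (size_t i = 0; i < q->count; i++)
		issue_delete(q->items[i].offset, q->items[i].length);
	q->count = 0;
}
```

With this shape, a crash before `metadata_write_done()` runs only loses the hint that the blocks are unused; the data the old inode points at is still intact.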
Why is BIO_DELETE worth implementing? Because once it’s in place we can implement a lot of really nice features on top of it. Let me name a few:
– md swap-backed temporary file system, on which when you delete a file, memory is freed,
– block deallocation for gvirstor; gvirstor is a GEOM class implemented by Ivan Voras during Google SoC, which lets you initially define a very large virtual provider backed by limited physical storage, to which more can be added when needed; unfortunately, currently when a file is deleted, gvirstor has no way to reassign those blocks elsewhere,
– GEOM compression layer; it could be implemented without BIO_DELETE, but a compression layer is about saving space, right? Actually, adding compression to gvirstor may not be a bad idea,
– maybe gjournal can hold its journal in free blocks and migrate the journal when blocks are allocated? :) Silly idea, but with BIO_DELETE in place we can tell which blocks are really used by the file system, and this is very powerful information.
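The first item on the list can be illustrated with a toy memory-backed disk. This is only a user-space sketch, not md(4) code: `toy_disk`, `toy_io()`, `toy_allocated()` and the `TOY_*` commands are hypothetical names, and the sector size and count are arbitrary. The point is that a delete request lets the device give memory back, which a plain write of zeros would not.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SECTOR_SIZE 512
#define NSECTORS    8

/* Toy equivalents of BIO_READ, BIO_WRITE and BIO_DELETE. */
enum toy_cmd { TOY_READ, TOY_WRITE, TOY_DELETE };

struct toy_disk {
	unsigned char *sectors[NSECTORS];	/* NULL: no memory allocated */
};

/* Count the sectors that currently hold memory. */
static int
toy_allocated(const struct toy_disk *d)
{
	int n = 0;

	for (int i = 0; i < NSECTORS; i++)
		if (d->sectors[i] != NULL)
			n++;
	return (n);
}

/* Handle one request; buf is unused (and may be NULL) for TOY_DELETE. */
static int
toy_io(struct toy_disk *d, enum toy_cmd cmd, int sector, unsigned char *buf)
{
	if (sector < 0 || sector >= NSECTORS)
		return (-1);
	switch (cmd) {
	case TOY_WRITE:
		/* Allocate backing memory lazily, on first write. */
		if (d->sectors[sector] == NULL) {
			d->sectors[sector] = malloc(SECTOR_SIZE);
			if (d->sectors[sector] == NULL)
				return (-1);
		}
		memcpy(d->sectors[sector], buf, SECTOR_SIZE);
		return (0);
	case TOY_READ:
		/* Never-written or deleted sectors read back as zeros. */
		if (d->sectors[sector] == NULL)
			memset(buf, 0, SECTOR_SIZE);
		else
			memcpy(buf, d->sectors[sector], SECTOR_SIZE);
		return (0);
	case TOY_DELETE:
		/* The range is no longer needed: release the memory. */
		free(d->sectors[sector]);	/* free(NULL) is a no-op */
		d->sectors[sector] = NULL;
		return (0);
	}
	return (-1);
}
```

Writing a sector and then deleting it brings `toy_allocated()` back down, which is exactly what deleting a file on a swap-backed temporary file system should achieve.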
NetBSD has a UFS implementation that can mount UFS file systems created on an architecture with different endianness. In other words, one can do newfs on sparc64 and mount the result on i386, or vice versa. Very cool. It works by detecting which endianness the file system uses and byteswapping fields in UFS structures at run-time as needed. I wanted to see how hard it would be to implement something similar on FreeBSD. After one night of hacking, mounting file systems read-only seems to work fine. I decided to work a bit lower than NetBSD: I replace bread()s with special functions that byteswap fields when needed. This saves quite a lot of code, but not everything can be implemented that way. I can byteswap the superblock, cylinder groups and inodes, but I can’t do the same for dirents, because ufs_readdir() uses plain VOP_READ() to read directory entries, so for dirents I need to do the same as NetBSD does. My method is most likely slower than NetBSD’s, because when the file system reads one block of inodes, I byteswap them all, even though only one inode may be used later. We will see at some point if the performance impact is too high. On the other hand, you probably don’t want to use this functionality very often.
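The detect-and-swap idea can be sketched in a few lines of portable C. This is not the real UFS superblock: `toy_sb`, `TOY_MAGIC` and `toy_sb_fixup()` are hypothetical, and the magic value is an arbitrary constant chosen so that its byteswapped form differs from the native one. A real read wrapper would apply the same treatment to every multi-byte field of the structure it has just read.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAGIC 0x11223344u	/* arbitrary magic for this sketch */

/* A made-up, much simplified "superblock". */
struct toy_sb {
	uint32_t magic;
	uint32_t block_size;
	uint64_t total_blocks;
};

static uint32_t
swap32(uint32_t v)
{
	return ((v >> 24) | ((v >> 8) & 0x0000ff00u) |
	    ((v << 8) & 0x00ff0000u) | (v << 24));
}

static uint64_t
swap64(uint64_t v)
{
	/* Swap each half, then exchange the halves. */
	return (((uint64_t)swap32((uint32_t)v) << 32) |
	    swap32((uint32_t)(v >> 32)));
}

/*
 * Inspect a freshly read superblock. Returns 0 if it is already in
 * native byte order, 1 if it was in the opposite order and has been
 * swapped in place, and -1 if the magic matches neither order.
 */
static int
toy_sb_fixup(struct toy_sb *sb)
{
	if (sb->magic == TOY_MAGIC)
		return (0);
	if (sb->magic != swap32(TOY_MAGIC))
		return (-1);
	sb->magic = swap32(sb->magic);
	sb->block_size = swap32(sb->block_size);
	sb->total_blocks = swap64(sb->total_blocks);
	return (1);
}
```

Swapping a whole structure right after the read keeps the rest of the code unaware of endianness, at the cost of swapping fields that may never be looked at, which is the trade-off described above for blocks of inodes.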
I’ve spent some time working on a test suite that verifies whether a file system works correctly. It mostly checks POSIX compliance and works for FreeBSD/UFS, FreeBSD/ZFS, Solaris/UFS and Solaris/ZFS. The list of system calls tested is as follows: chflags, chmod, chown, link, mkdir, mkfifo, open, rename, rmdir, symlink, truncate, unlink. There are 3438 regression tests in 184 files and believe me, this was really boring work, but very educational on the other hand. All those tests are already committed to FreeBSD’s HEAD branch under src/tools/regression/fstest/. During the work I also updated many manual pages. At some point I’m planning to make this test suite work on Darwin and Linux, but I’m not sure yet when exactly.
The main motivation for this work was that there is no free POSIX compliance test suite, AFAIK. Shame on you.
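To give a flavour of what such a check looks like, here is a small stand-alone sketch in C. It is not taken from the fstest suite and does not reflect how that suite is structured; `try_mkdir()`, `try_rmdir()` and `check_mkdir_rmdir()` are hypothetical helpers, and the use of /tmp as a scratch location is an assumption. It verifies two error codes that POSIX specifies: mkdir on an existing path fails with EEXIST, and rmdir on a missing path fails with ENOENT.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Run mkdir(2); return 0 on success or the errno value on failure. */
static int
try_mkdir(const char *path, mode_t mode)
{
	return (mkdir(path, mode) == 0 ? 0 : errno);
}

/* Run rmdir(2); return 0 on success or the errno value on failure. */
static int
try_rmdir(const char *path)
{
	return (rmdir(path) == 0 ? 0 : errno);
}

/*
 * Check a few mkdir/rmdir results inside a fresh scratch directory.
 * Returns the number of failed checks, or -1 if the scratch directory
 * could not be created.
 */
static int
check_mkdir_rmdir(void)
{
	char base[] = "/tmp/fscheck.XXXXXX";
	char path[64];
	int failed = 0;

	if (mkdtemp(base) == NULL)
		return (-1);
	snprintf(path, sizeof(path), "%s/dir", base);

	failed += (try_mkdir(path, 0755) != 0);		/* first mkdir works */
	failed += (try_mkdir(path, 0755) != EEXIST);	/* second must fail */
	failed += (try_rmdir(path) != 0);		/* rmdir works */
	failed += (try_rmdir(path) != ENOENT);		/* now it is gone */

	rmdir(base);	/* clean up the scratch directory */
	return (failed);
}
```

Running `check_mkdir_rmdir()` with the scratch directory placed on the file system under test exercises that file system’s implementation; a return value of 0 means every check passed.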
This is my first entry about ZFS. I’m not going to describe what ZFS is here; I just need some place to write about my progress, and this seems to be the right place.
As you may or may not know, I’m porting the ZFS file system to FreeBSD. The port is almost finished: something like 98% of the functionality is implemented and works. You can read more about ZFS on the OpenSolaris page and more about my port on various FreeBSD mailing lists. Today I finished NFS support, so you can now NFS-mount ZFS file systems from FreeBSD. The remaining part I coded today was readdir functionality for GFS. GFS (Generic pseudo-FileSystem) is a Solaris framework for virtual file systems. ZFS uses this framework to create the .zfs/ directory where snapshots are placed. From now on you can list the .zfs/ and .zfs/snapshot/ directories via NFS too. This was the only missing piece of NFS support.