Brendan Gregg of Sun Oracle fame made a good explanation how to visualize latency to get a better understanding of what is going on (and as such about how to solve bottlenecks). I have seen all this already in various posts in his blog and in the Analytics package in an OpenStorage presentation, but the ACM article summarizes it very good.
Unfortunately Analytics is AFAIK not available in OpenSolaris, so we can not go out and adapt it for FreeBSD (which would probably require to port/implement some additional dtrace stuff/probes). I am sure something like this would be very interesting to all those companies which use FreeBSD in an appliance (regardless if it is a storage appliance like NetApp, or a network appliance like a Cisco/Juniper router, or anything else which has to perform good).
At work we have the situation of a slow application. The vendor of the custom application insists that the ZFS (Solaris 10u8) and the Oracle DB are badly tuned for the application. Part of their tuning is to limit the ARC to 1 GB (our max size is 24 GB on this machine). One problem we see is that there are many write operations (rounded values: 1k ops for up to 100 MB) and the DB is complaining that the logwriter is not able to write out the data fast enough. At the same time our database admins see a lot of commits and/or rollbacks so that the archive log grows very fast to 1.5 GB. The funny thing isâ€¦ the performance tests are supposed to only cover SELECTs and small UPDATEs.
I proposed to reduce the zfs_txg_timeout from the default value of 30 to some seconds (and as no reboot is needed like for the max arc size, this can be done fast instead of waiting some minutes for the boot-checks of the M5000). The first try was to reduce it to 5 seconds and it improved the situation. The DB still complained about not being able to write out the logs fast enough, but it did not do it as often as before. To make the vendor happy we reduced the max arc size and tested again. First we have not seen any complains from the DB anymore, which looked strange to me because my understanding of the ARC (and the description of the ZFS Evil Tuning Guide regarding the max size setting) suggest that this should not show this behavior we have seen, but the machine was also rebooted for this, so there could also be another explanation.
Luckily we found out that our testing infrastructure had a problem so that only a fraction of the performance test was performed. This morning the people responsible for that made some changes and now the DB is complaining again.
This is what I expected. To make sure I fully understand the ARC, I had a look at the theory behind it at the IBM research center. There are some papers which explain how to extend a cache which uses the LRU replacement policy with some lines of code to an ARC. It looks like it would be an improvement to have a look at which places in FreeBSD a LRU policy is used to test if an ARC would improve the cache hit rate. From reading the paper it looks like there are a lot of places where this should be the case. The authors also provide two adaptive extensions to the CLOCK algorithm (used in various OS in the VM subsystem) which indicate that such an approach could be beneficial for a VM system. I already contacted Alan (the FreeBSD one) and asked if he knows about it and if it could be beneficial for FreeBSD.