Brendan Gregg of Sun/Oracle fame wrote a good explanation of how to visualize latency to get a better understanding of what is going on (and thus of how to resolve bottlenecks). I had already seen all of this in various posts on his blog and in the Analytics package in an OpenStorage presentation, but the ACM article summarizes it very well.
Unfortunately Analytics is AFAIK not available in OpenSolaris, so we cannot go out and adapt it for FreeBSD (which would probably require porting or implementing some additional DTrace probes). I am sure something like this would be very interesting to all those companies which use FreeBSD in an appliance (regardless of whether it is a storage appliance like NetApp, a network appliance like a Cisco/Juniper router, or anything else which has to perform well).
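To give an idea of what such a latency visualization boils down to, here is a minimal sketch of a latency heat map in Python. This is not the Analytics implementation and not code from the article; the function names, the `(timestamp, latency)` sample format and the ASCII shading are my own assumptions for illustration. The idea is just to count samples per (time bucket, latency bucket) cell and to draw busier cells darker.

```python
from collections import Counter


def latency_heatmap(samples, time_step, latency_step):
    """Count samples per (time bucket, latency bucket) cell.

    samples is an iterable of (timestamp, latency) pairs; the units are
    whatever the caller uses (hypothetical format, e.g. seconds and ms).
    """
    counts = Counter()
    for timestamp, latency in samples:
        cell = (int(timestamp // time_step), int(latency // latency_step))
        counts[cell] += 1
    return counts


def render(counts, shades=" .:*#"):
    """Render the cells as ASCII art: time on x, latency on y (high on top)."""
    if not counts:
        return ""
    max_t = max(t for t, _ in counts)
    max_l = max(l for _, l in counts)
    peak = max(counts.values())
    rows = []
    for l in range(max_l, -1, -1):
        row = ""
        for t in range(max_t + 1):
            n = counts.get((t, l), 0)
            if n == 0:
                idx = 0  # empty cell stays blank
            else:
                # Scale 1..peak onto the non-blank shades.
                idx = 1 + (n - 1) * (len(shades) - 2) // max(peak - 1, 1)
            row += shades[idx]
        rows.append(row)
    return "\n".join(rows)


if __name__ == "__main__":
    demo = [(0.1, 1.0), (0.2, 1.5), (0.3, 9.0), (1.5, 2.0)]
    print(render(latency_heatmap(demo, time_step=1, latency_step=1)))
```

The interesting part is that outliers (like the single slow sample in the demo data) stay visible as their own cell instead of disappearing in an average.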
At work we have the situation of a slow application. The vendor of the custom application insists that ZFS (Solaris 10u8) and the Oracle DB are badly tuned for the application. Part of their tuning is to limit the ARC to 1 GB (our max size is 24 GB on this machine). One problem we see is that there are many write operations (rounded values: 1k ops for up to 100 MB) and the DB is complaining that the logwriter is not able to write out the data fast enough. At the same time our database admins see a lot of commits and/or rollbacks, so that the archive log grows very fast to 1.5 GB. The funny thing is… the performance tests are supposed to cover only SELECTs and small UPDATEs.
I proposed to reduce zfs_txg_timeout from the default value of 30 seconds to a few seconds (and since, unlike for the max ARC size, no reboot is needed, this can be done quickly instead of waiting some minutes for the boot checks of the M5000). The first try was to reduce it to 5 seconds, and it improved the situation. The DB still complained about not being able to write out the logs fast enough, but not as often as before. To make the vendor happy we reduced the max ARC size and tested again. At first we did not see any complaints from the DB anymore, which looked strange to me, because my understanding of the ARC (and the description in the ZFS Evil Tuning Guide regarding the max size setting) suggests that this setting should not cause the behavior we saw. But the machine was also rebooted for this, so there could be another explanation.
Luckily we found out that our testing infrastructure had a problem, so that only a fraction of the performance test was actually run. This morning the people responsible for it made some changes, and now the DB is complaining again.
This is what I expected. To make sure I fully understand the ARC, I had a look at the theory behind it from the IBM research center. There are some papers which explain how to extend a cache which uses the LRU replacement policy into an ARC with a few lines of code. It looks like it would be worthwhile to check in which places FreeBSD uses an LRU policy and to test whether an ARC would improve the cache hit rate there. From reading the paper it looks like there are a lot of places where this should be the case. The authors also provide two adaptive extensions to the CLOCK algorithm (used in the VM subsystem of various OSes), which indicates that such an approach could be beneficial for a VM system. I already contacted Alan (the FreeBSD one) and asked if he knows about it and if it could be beneficial for FreeBSD.
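To illustrate how little is needed on top of plain LRU lists, here is a minimal Python sketch of the ARC replacement policy as I understand it from the paper. It is not the ZFS code and not a FreeBSD patch. The class and method names are my own, it only tracks keys (no cached values), and it uses integer arithmetic for the adaptation step, which is a simplification of mine. T1/T2 hold the cached keys (seen once / seen at least twice), B1/B2 are the "ghost" lists of recently evicted keys, and p is the adaptive target size of T1.

```python
from collections import OrderedDict


class ARCCache:
    """Sketch of the ARC policy; tracks keys only, capacity must be >= 1."""

    def __init__(self, capacity):
        self.c = capacity
        self.p = 0                 # adaptive target size for t1
        self.t1 = OrderedDict()    # cached, seen once recently
        self.t2 = OrderedDict()    # cached, seen at least twice
        self.b1 = OrderedDict()    # ghosts of keys evicted from t1
        self.b2 = OrderedDict()    # ghosts of keys evicted from t2

    def _replace(self, in_b2):
        """Evict one cached key into the matching ghost list."""
        if self.t1 and (len(self.t1) > self.p or
                        (in_b2 and len(self.t1) == self.p)):
            key, _ = self.t1.popitem(last=False)   # LRU end of t1
            self.b1[key] = None
        else:
            key, _ = self.t2.popitem(last=False)   # LRU end of t2
            self.b2[key] = None

    def access(self, key):
        """Record an access to key; return True on a cache hit."""
        if key in self.t1:                 # hit: promote to t2
            del self.t1[key]
            self.t2[key] = None
            return True
        if key in self.t2:                 # hit: refresh position in t2
            self.t2.move_to_end(key)
            return True
        if key in self.b1:                 # ghost hit: recency was undersized
            step = max(len(self.b2) // len(self.b1), 1)
            self.p = min(self.c, self.p + step)
            self._replace(False)
            del self.b1[key]
            self.t2[key] = None
            return False
        if key in self.b2:                 # ghost hit: frequency was undersized
            step = max(len(self.b1) // len(self.b2), 1)
            self.p = max(0, self.p - step)
            self._replace(True)
            del self.b2[key]
            self.t2[key] = None
            return False
        # Completely new key.
        l1 = len(self.t1) + len(self.b1)
        total = l1 + len(self.t2) + len(self.b2)
        if l1 == self.c:
            if len(self.t1) < self.c:
                self.b1.popitem(last=False)
                self._replace(False)
            else:
                self.t1.popitem(last=False)    # b1 is empty here
        elif total >= self.c:
            if total == 2 * self.c:
                self.b2.popitem(last=False)
            self._replace(False)
        self.t1[key] = None
        return False
```

With `cache = ARCCache(2)`, feeding it the keys of an access trace through `cache.access(key)` and counting the `True` results gives the hit rate, which can then be compared against a plain LRU of the same size for the same trace.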
At work (client site) SUN gave a presentation about their OpenStorage products (Sun Storage 7000 Unified Storage Systems) today.
From a technology point of view, the software side is nothing new to me. Using SSDs as a read/write cache for ZFS is something we can (partly) already do since at least Solaris 10u6 or in FreeBSD. (10u6 is the lowest Solaris 10 version we have installed here, so I cannot quickly check whether the ZIL can be on a separate disk in earlier versions of Solaris, and I think we have to wait until we have updated to Solaris 10u8 before we can have the L2ARC on a separate disk.) All the other nice ZFS features available in the OpenStorage web interface are also not surprising.
But the demonstration with the Storage Simulator impressed me. The interaction with Windows via CIFS makes the older versions of files in snapshots available in Windows (I assume this is the Volume Shadow Copy feature of Windows), and the statistics available via DTrace in the web interface are also impressive. All this technology seems to be well integrated into an easy-to-use package for heterogeneous environments. If you wanted to set up something like this by hand, you would need a lot of knowledge about a lot of stuff (and in the FreeBSD case, you would probably need to augment the kernel with additional DTrace probes to get a similar granularity of statistics), which is not something a small company is willing to pay for.
I know that I can get a lot of information with DTrace (from time to time I have some free cycles to extend the FreeBSD DTrace implementation with additional probes for the linuxulator), but what they did with DTrace in the OpenStorage software is great. If you try to do this at home yourself, you need some time to implement something like this (I do not think you can just take the DTrace scripts and run them on FreeBSD; it would probably take some weeks until it works).
It is also the first time I have seen SUN's new CIFS implementation in ZFS live in action. It looks well done. Integration with AD looks easier than doing it by hand in Samba (at least from looking at the OpenStorage web interface). If we could get this in FreeBSD… it would rock!
The entire OpenStorage web interface looks usable. I think SUN has a product there which allows them to enter new markets, a product which they can sell to companies which have not bought anything from SUN before (even Windows-only companies). I think even those Windows admins who never touch a command line interface (read: the low-level ones; not comparable at all with the really high-profile Windows admins of our client) could get this up and running.
As it seems at the moment, our client will get a Sun Storage F5100 Flash Array for technology evaluation at the beginning of next year. Unfortunately the technology looks too easy to handle, so I assume I will have to take care of more complex things when this machine arrives…
I am having a hard time finding spare time at the moment, though.