I have the habit of using chmod with the relative notation (e.g. g+w or a+r or go-w or similar) instead of the absolute one (e.g. 0640 or u=rw,g=r,o=). Recently I had to chmod a lot of files. As usual I was using the relative notation. With a lot of files, this took a lot of time. Time was not really an issue, so I did not stop it to restart with a better performing command (e.g. find /path -type f -print0 | xargs -0 chmod 0644; find /path -type d -print0 | xargs -0 chmod 0755), but I thought a little tips&tricks posting may be in order, as not everyone knows the difference.
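Written out as a runnable sketch (the paths here are just examples for a throw-away demo tree):

```shell
# Create a small example tree to chmod (hypothetical demo path).
mkdir -p /tmp/chmod-demo/sub
touch /tmp/chmod-demo/file1 /tmp/chmod-demo/sub/file2

# Absolute notation, one pass for files and one for directories.
# -print0 / -0 keep filenames with spaces or newlines intact.
find /tmp/chmod-demo -type f -print0 | xargs -0 chmod 0644
find /tmp/chmod-demo -type d -print0 | xargs -0 chmod 0755
```

Because xargs batches many files into one chmod invocation, this also saves a lot of fork/exec overhead compared to chmod -R or find -exec with one call per file.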
The relative notation
When you specify g+w, it means to add write access for the group, but keep everything else like it is. Naturally this means that chmod first has to look up the current access rights. So for each write of the new mode, there has to be a read request first.
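A small demo of that read-modify-write (again with a hypothetical temp file): chmod has to stat() the file, combine the old mode with g+w, and only then write the result back.

```shell
umask 022
touch /tmp/rel-demo
chmod 0644 /tmp/rel-demo   # start from a known absolute mode
chmod g+w /tmp/rel-demo    # read old mode 0644, add group write -> 0664
stat -c %a /tmp/rel-demo   # shows the combined result
```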
The absolute notation
The absolute notation is what most people are used to (at least the numeric one). It does not need to read the access rights before changing them, so there is less I/O to be done to get what you want. The drawback is that it is not so nice for recursive changes. You do not want to have the x-bit for data files, but you need it for directories. If you only have a tree with data files where you want to have a uniform access mode, the example above via find is probably faster (for sure if the directory meta-data is still in RAM).
If you have a mix of binaries and data, it is a little bit more tricky to come up with a way which is faster. If the data files follow a name pattern, you can use it in the find.
And if you have a non-uniform access for the group bits and want to make sure the owner has write access to everything, it may be faster to use the relative notation than to find a replacement command-sequence with the absolute notation.
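For the mixed case with a name pattern, a sketch could look like this (file names and the demo path are made up for illustration):

```shell
# Hypothetical tree with a mix of data files and executables.
mkdir -p /tmp/mix-demo
touch /tmp/mix-demo/data.txt /tmp/mix-demo/tool.sh

# Use the name pattern to give data files and executables
# different absolute modes, still avoiding the read-before-write.
find /tmp/mix-demo -type f -name '*.txt' -print0 | xargs -0 chmod 0644
find /tmp/mix-demo -type f -name '*.sh'  -print0 | xargs -0 chmod 0755
```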
The FreeBSD Linux compatibility environment currently uses RPMs from Fedora 10. Unfortunately, Fedora 10 has been end-of-life for a while now. For one of the RPMs (the pango one) we were aware of a security vulnerability. As we do not know if it is feasible to update the linuxulator ports to something more recent, I decided to set up a VM with Fedora 10 and generate a new RPM for the linux-f10-pango port. Thanks to Luchesar V. ILIEV for explaining to me how to do this.
Setup of the VM
I used VirtualBox 4.0.4 on a Solaris 10 x86 machine. I configured a fixed-size disk of 16 GB and kept the default network setup (after installing the guest tools / kernel modules I switched to virtio, as I was not able to do anything useful besides a ping) and the default RAM size. The CD/DVD drive was configured to use the image of the full Fedora 10 DVD for i386 systems.
Setup of Fedora 10
Booting the VM from the DVD leads to the graphical Fedora 10 install software (after choosing to install a new system on the console). There I accepted all the defaults, except for the software to install. I deselected the Office and Productivity group and selected the Software Development group. When I was asked if I wanted to install some additional RPMs, I had a look at the complete list and installed some I thought were necessary. I do not remember anymore which ones I chose, but everything which looks related to RPM building is a good candidate.
After a while the install will be finished and you can boot into the new system (eject the DVD from the drive before the reboot). After the reboot, choose to install the Guest Additions in the menu of the VM. This should mount the ISO image in the VM. As root, execute the file for Linux. This will build some kernel modules for better integration (e.g. seamless integration of the mouse between your desktop and the VM). At this point I rebooted and configured virtio as the NIC. I also had to configure the network settings by hand, as the GUI tool did not save all the settings correctly.
Update and install of required RPMs
After the VM was up and the network configured, I updated the entire system (choose System Update in the menu). To update the pango port, I had to install the libthai-devel RPM. I had the RPM for it (and all the files I needed to build a new pango RPM) already downloaded, so I did a “yum install /path/to/rpm”.
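The rebuild itself then boils down to something like the following; the SRPM file name is hypothetical (the exact version depends on what the Fedora 10 updates tree provides), but the commands are the standard RPM workflow:

```shell
# Install the build dependency (path/name as downloaded beforehand).
yum install libthai-devel

# Rebuild binary RPMs from the patched source RPM; the exact
# pango version string below is made up for illustration.
rpmbuild --rebuild pango-1.22.x-y.fc10.src.rpm

# The resulting binary RPMs end up under the RPMS/ directory of
# the rpmbuild tree (e.g. ~/rpmbuild/RPMS/i386/).
```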
Today I stumbled again over some heat maps from Brendan Gregg (of DTrace fame). This time it was the PDF of his presentation at the LISA 2010 conference. It shows nicely how he plans to evolve them from a single machine (like in Analytics for Oracle storage products) to the cloud. It is a very good overview of what kind of intuitive performance visualization you can do with this.
I would love to see something for FreeBSD (or other performance data). Maybe someone could take the DTraceTazTool as a base or hint and write something generic which works for a lot of things…
There are just too many nice and interesting things out there, and not enough time for all of them.
In short: ZFS guarantees that the NFS client does not experience silent corruption of data (in case of an NFS server crash, loss of data which the client assumes is already on disk). A recommendation is to enable the disk cache for disks which are completely used by ZFS, as ZFS (unlike UFS) is aware of disk caches. This increases the performance to what UFS delivers in the NFS case.
There is no in-depth description of what it means that ZFS is aware of disk caches, but I think this is a reference to the fact that ZFS sends a flush command to the disk at the right moments. Leaving aside the fact that there are disks out there which lie to you about this (they report the flush command as finished when it is not), this would mean that this is supported in FreeBSD too.
So everyone who is currently disabling the ZIL to get better NFS performance (and accepting silent data corruption on the client side): move your zpool to dedicated (no other real FS than ZFS; swap and dump devices are OK) disks (honest ones) and enable the disk caches instead of disabling the ZIL.
I also recommend that people who already have ZFS on dedicated (and honest) disks check whether the disk caches are enabled.
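On FreeBSD, a quick way to check this for ATA disks attached via CAM could look like the following (ada0 is an example device name):

```shell
# The identify page of the disk shows whether the write cache
# feature is supported and currently enabled.
camcontrol identify ada0 | grep -i 'write cache'

# This sysctl controls the driver default for the write cache
# on ada(4) disks (1 means leave it enabled).
sysctl kern.cam.ada.write_cache
```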
Brendan Gregg of Sun/Oracle fame gave a good explanation of how to visualize latency to get a better understanding of what is going on (and as such of how to solve bottlenecks). I have seen all this already in various posts on his blog and in the Analytics package in an OpenStorage presentation, but the ACM article summarizes it very well.
Unfortunately Analytics is AFAIK not available in OpenSolaris, so we can not go out and adapt it for FreeBSD (which would probably require porting/implementing some additional DTrace stuff/probes). I am sure something like this would be very interesting to all those companies which use FreeBSD in an appliance (regardless of whether it is a storage appliance like NetApp, or a network appliance like a Cisco/Juniper router, or anything else which has to perform well).
At work we have the situation of a slow application. The vendor of the custom application insists that the ZFS (Solaris 10u8) and the Oracle DB are badly tuned for the application. Part of their tuning is to limit the ARC to 1 GB (our max size is 24 GB on this machine). One problem we see is that there are many write operations (rounded values: 1k ops for up to 100 MB) and the DB is complaining that the logwriter is not able to write out the data fast enough. At the same time our database admins see a lot of commits and/or rollbacks so that the archive log grows very fast to 1.5 GB. The funny thing is… the performance tests are supposed to only cover SELECTs and small UPDATEs.
I proposed to reduce zfs_txg_timeout from the default value of 30 to a few seconds (and as no reboot is needed, unlike for the max ARC size, this can be done fast instead of waiting some minutes for the boot-checks of the M5000). The first try was to reduce it to 5 seconds, and it improved the situation. The DB still complained about not being able to write out the logs fast enough, but it did not do it as often as before. To make the vendor happy we reduced the max ARC size and tested again. At first we did not see any complaints from the DB anymore, which looked strange to me, because my understanding of the ARC (and the description in the ZFS Evil Tuning Guide regarding the max size setting) suggests that we should not have seen this behavior, but the machine was also rebooted for this, so there could also be another explanation.
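On Solaris the live change can be done via mdb on the running kernel, roughly like this (run as root; the value 5 matches the first try described above):

```shell
# Read the current value of zfs_txg_timeout (decimal output).
echo 'zfs_txg_timeout/D' | mdb -k

# Write the decimal value 5 into the live kernel, no reboot needed.
echo 'zfs_txg_timeout/W0t5' | mdb -kw
```

Note that a change made this way does not survive a reboot; to make it permanent it would have to go into /etc/system.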
Luckily we found out that our testing infrastructure had a problem so that only a fraction of the performance test was performed. This morning the people responsible for that made some changes and now the DB is complaining again.
This is what I expected. To make sure I fully understand the ARC, I had a look at the theory behind it from the IBM research center. There are some papers which explain how to extend a cache which uses the LRU replacement policy, with only a few lines of code, into an ARC. It looks like it would be worthwhile to have a look at which places in FreeBSD an LRU policy is used, to test if an ARC would improve the cache hit rate. From reading the papers it looks like there are a lot of places where this should be the case. The authors also provide two adaptive extensions to the CLOCK algorithm (used in various OSes in the VM subsystem) which indicate that such an approach could be beneficial for a VM system. I already contacted Alan (the FreeBSD one) and asked if he knows about it and if it could be beneficial for FreeBSD.
Currently I am playing around a little bit with my ZFS setup. I want to make it faster, but I do not want to spend a lot of money.
The disks are connected to an ICH5 controller, so an obvious improvement would be either to buy a controller for the PCI slot which is able to do NCQ with the SATA disks (a siis(4) based one is not cheap), or to buy a new system which comes with a chipset that knows how to do NCQ (this would mean new RAM, a new CPU, a new MB and maybe even a new PSU). A new controller is a little bit expensive for the old system which I want to tune. A new system would be nice, and reading about the specs of new systems makes me want to get a Core i5 system. The problem is that I think the current offers of mainboards for this are far from good. The system should be a little bit future-proof, as I would like to use it for about 5 years or more (the current system is somewhere between 5 and 6 years old). This means it should have SATA-3 and USB 3, but when I look at what is offered currently, it looks like there are only beta-versions of hardware with SATA-3 and USB 3 support available on the market (according to tests there is a lot of variance in the max speed the controllers are able to achieve, bugs in the BIOS, or the controllers are attached to a slow bus which prevents using the full bandwidth). So it will not be a new system soon.
As I had a 1 GB USB stick around, I decided to attach it to one of the EHCI USB ports and use it as a cache device for ZFS. If someone wants to try this too, be careful with the USB ports. My mainboard has only 2 USB ports connected to an EHCI, the rest are UHCI ones. This means that only 2 USB ports are fast (sort of… about 40 MByte/s in practice), the rest are only usable for slow things like a mouse, a keyboard or a serial line.
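Adding the stick as an L2ARC cache device is a one-liner; the pool name and the device name below are examples for my setup, not something you can copy blindly:

```shell
# "tank" is a placeholder pool name, da0 the USB stick as it
# shows up on FreeBSD. WARNING: this wipes whatever is on da0.
zpool add tank cache da0

# Verify: the stick appears under a separate "cache" section.
zpool status tank
```

Unlike a log device, a cache device only holds copies of data that is also on the pool disks, so losing the cheap USB stick does not endanger the pool.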
Be warned, this will not give you a lot of bandwidth (if you have a fast USB stick, the roughly 40 MByte/s of the EHCI are the limit, which prevents a big streaming bandwidth), but the latency of the cache device is great when doing small random I/O. When I run gstat and have a look at how long a read operation takes for each involved device, I see something between 3 msec and 20 msec for the harddisks (depending on whether they are reading something at the current head position, or the harddisk needs to seek around a lot). For the cache device (the USB stick) I see something between about 1 msec and 5 msec. That is a third to a quarter of the latency of the harddisks.