Category Archives: Solaris

HOWTO: creating your own updated linux RPM for the FreeBSD linuxulator

Background info

The FreeBSD linux compatibility environment currently uses RPMs from Fedora 10. Unfortunately Fedora 10 has been end of life for a while. For one of the RPMs (the pango one) we were aware of a security vulnerability. As we do not know if it is feasible to update the linuxulator ports to something more recent, I decided to set up a VM with Fedora 10 and generate a new RPM for the linux-f10-pango port. Thanks to Luchesar V. ILIEV for explaining to me how to do this.

Setup of the VM

I used VirtualBox 4.0.4 on a Solaris 10 x86 machine. I configured a fixed size disk of 16 GB and kept the defaults for the RAM size and the network setup. After installing the guest tools / kernel modules I switched the NIC to virtio, as with the default setup I was not able to do anything useful besides a ping. The CD/DVD drive was configured to use the image of the full Fedora 10 DVD for i386 systems.

Setup of Fedora 10

Booting the VM from the DVD leads to the graphical Fedora 10 installer (after choosing to install a new system on the console). There I accepted all the defaults, except for the software to install. I deselected the Office and Productivity group and selected the Software Development group. When I was asked if I wanted to install some additional RPMs, I had a look at the complete list and installed some I thought were necessary. I do not remember which ones I chose, but everything which looks related to RPM building is a good candidate.

After a while the install will be finished and you can boot into the new system (eject the DVD from the drive before the reboot). After the reboot, choose to install the Guest Additions in the menu of the VM. This should mount the ISO image in the VM. As root, execute the file for Linux. This will build some kernel modules for better integration (e.g. seamless integration of the mouse between your desktop and the VM). At this point I rebooted and configured virtio as the NIC. I also had to configure the network settings by hand, as the GUI tool did not save all the settings correctly.

Update and install of required RPMs

After the VM was up and the network configured, I updated the entire system (choose System Update in the menu). To update the pango port, I had to install the libthai-devel RPM. I had the RPM for it (and all the files I needed to build a new pango RPM) already downloaded, so I did a “yum install /path/to/rpm”. At this point I was ready to create the RPM build environment.

The RPM build environment

As a normal user I executed the command rpmdev-setuptree, which creates the directory rpmbuild and populates it with some subdirectories. Now you just need to find a suitable .spec file and put it into rpmbuild/SPECS, put the sources (and maybe patches referenced in the .spec file) into rpmbuild/SOURCES, and you are ready to go (I patched pango.spec for a more recent pango version, basically just changing the version numbers). If you want to have a custom packager and vendor attribute in the RPM, you can add a line for each to ~/.rpmmacros, e.g. %packager yournamehere and %vendor whateverisappropriate. I used my @FreeBSD.org email address as the packager, and FreeBSD as the vendor.

Building an RPM

I used rpmbuild -ba --target i386-redhat-linux-gnu --clean rpmbuild/SPECS/pango.spec to build the new pango RPM. If everything is OK, the resulting RPMs (a source RPM, a devel RPM, a debuginfo RPM and the RPM for the binaries) are in rpmbuild/RPMS and rpmbuild/SRPMS. For a FreeBSD port we just need the source RPM (to comply with the (L)GPL) and the RPM for the binaries.
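If you want to script the two steps above (the ~/.rpmmacros entries and the rpmbuild call), a minimal sketch could look like this. The helper names, the example packager and the example vendor are made up for illustration. Only the rpmbuild options themselves come from this post.

```python
from pathlib import Path


def write_rpmmacros(path, packager, vendor):
    """Append %packager and %vendor lines to the given rpmmacros file."""
    lines = [f"%packager {packager}", f"%vendor {vendor}"]
    with open(path, "a", encoding="utf-8") as fh:
        fh.write("\n".join(lines) + "\n")
    return lines


def rpmbuild_command(spec, target="i386-redhat-linux-gnu"):
    """Return the rpmbuild argument list used in this post.

    -ba      build binary and source packages
    --target build for the given platform string
    --clean  remove the build tree afterwards
    """
    return ["rpmbuild", "-ba", "--target", target, "--clean", str(spec)]


# Only print the command here; nothing is executed or written.
cmd = rpmbuild_command(Path("rpmbuild/SPECS/pango.spec"))
print(" ".join(cmd))
```

On the Fedora VM you would call write_rpmmacros() with the path to your own ~/.rpmmacros and hand the list to subprocess.run(cmd, check=True). The sketch only prints the command so that it can be tried anywhere.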

Additional info

The i386-redhat-linux-gnu string which is used for the --target option of the rpmbuild command is what seems to be used to build the Fedora 10 RPMs. After building pango, the RPM has i686-pc-linux-gnu in some filenames instead (the default value for this setup). The binaries seem to be compiled for i386, so there should be no problem even for old systems.

HeatMaps again…

Today I stumbled again over some HeatMaps from Brendan Gregg (of DTrace fame). This time it was the PDF of his presentation at the LISA 2010 conference. It shows nicely how he plans to evolve them from a single machine (like in Analytics for Oracle Storage products) to the cloud. It is a very good overview of what kind of intuitive performance visualization you can do with this.

I would love to see something for FreeBSD (or other performance data). Maybe someone could take the DTraceTazTool as a base or hint and write something generic which works for a lot of things…

There are just too many nice and interesting things out there, and not enough time for all of them.

ZFS and NFS / on-disk-cache

On the FreeBSD mailing lists I stumbled over a post which refers to a blog post which describes why ZFS seems to be slow (on Solaris).

In short: ZFS guarantees that the NFS client does not experience silent corruption of data (NFS server crash and loss of data which is supposed to be already on disk for the client). A recommendation is to enable the disk-cache for disks which are completely used by ZFS, as ZFS (unlike UFS) is aware of disk-caches. This increases the performance to what UFS is delivering in the NFS case.

There is no in-depth description of what it means that ZFS is aware of disk-caches, but I think this is a reference to the fact that ZFS sends a flush command to the disk at the right moments. Leaving aside the fact that there are disks out there which lie to you about this (they report that the flush command has finished when it has not), this would mean that this is supported in FreeBSD too.

So everyone who is currently disabling the ZIL to get better NFS performance (and accepting silent data corruption on the client side): move your zpool to dedicated disks (no other real FS than ZFS; swap and dump devices are OK) which are honest about flushes, and enable the disk-caches instead of disabling the ZIL.

I also recommend that people who already have ZFS on dedicated (and honest) disks check whether the disk-caches are enabled.

Understanding latency

Brendan Gregg of Sun/Oracle fame wrote a good explanation of how to visualize latency to get a better understanding of what is going on (and as such of how to solve bottlenecks). I have seen all this already in various posts in his blog and in the Analytics package in an OpenStorage presentation, but the ACM article summarizes it very well.
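To make the idea a bit more concrete: a latency heat map puts time on the x-axis, latency on the y-axis, and uses the color of each cell to show how many I/Os fell into it. Below is a minimal sketch of that bucketing with a crude text rendering. The sample data, the bucket sizes and the function names are invented for illustration and are not taken from the ACM article or from Analytics.

```python
from collections import Counter


def latency_heatmap(samples, time_step_s=1.0, latency_step_ms=1.0):
    """Count I/Os per (time bucket, latency bucket) cell.

    samples: iterable of (completion_time_s, latency_ms) pairs.
    """
    cells = Counter()
    for t, lat in samples:
        cells[(int(t // time_step_s), int(lat // latency_step_ms))] += 1
    return cells


def render(cells, shades=" .:*#"):
    """Text rendering: one row per latency bucket, darker means more I/Os."""
    if not cells:
        return ""
    max_t = max(t for t, _ in cells)
    max_l = max(l for _, l in cells)
    peak = max(cells.values())
    rows = []
    for l in range(max_l, -1, -1):          # highest latency on top
        row = ""
        for t in range(max_t + 1):
            n = cells.get((t, l), 0)
            # Empty cells stay blank, the fullest cell gets the darkest shade.
            idx = 0 if n == 0 else 1 + (n - 1) * (len(shades) - 2) // max(peak - 1, 1)
            row += shades[idx]
        rows.append(f"{l:3d} ms |{row}")
    return "\n".join(rows)


# Made-up samples: mostly fast I/Os plus a few slow outliers.
samples = [(0.1, 0.6), (0.2, 0.7), (0.4, 0.5), (0.9, 12.0), (1.1, 0.8),
           (1.5, 3.2), (1.7, 0.4), (2.3, 0.6), (2.6, 15.5)]
cells = latency_heatmap(samples)
print(render(cells))
```

The point of the visualization is that the slow outliers stay visible as separate cells, while an average over the same samples would hide them.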

Unfortunately Analytics is AFAIK not available in OpenSolaris, so we cannot go out and adapt it for FreeBSD (which would probably require porting/implementing some additional DTrace stuff/probes). I am sure something like this would be very interesting to all those companies which use FreeBSD in an appliance (regardless of whether it is a storage appliance like NetApp, a network appliance like a Cisco/Juniper router, or anything else which has to perform well).

ARC (adaptive replacement cache) explained

At work we have the situation of a slow application. The vendor of the custom application insists that the ZFS (Solaris 10u8) and the Oracle DB are badly tuned for the application. Part of their tuning is to limit the ARC to 1 GB (our max size is 24 GB on this machine). One problem we see is that there are many write operations (rounded values: 1k ops for up to 100 MB) and the DB is complaining that the logwriter is not able to write out the data fast enough. At the same time our database admins see a lot of commits and/or rollbacks so that the archive log grows very fast to 1.5 GB. The funny thing is… the performance tests are supposed to only cover SELECTs and small UPDATEs.

I proposed to reduce zfs_txg_timeout from the default value of 30 to a few seconds (and as no reboot is needed, unlike for the max ARC size, this can be done fast instead of waiting some minutes for the boot-checks of the M5000). The first try was to reduce it to 5 seconds, and it improved the situation. The DB still complained about not being able to write out the logs fast enough, but it did not do so as often as before. To make the vendor happy we reduced the max ARC size and tested again. At first we did not see any complaints from the DB anymore. This looked strange to me, because my understanding of the ARC (and the description of the max size setting in the ZFS Evil Tuning Guide) suggests that it should not produce the behavior we saw. But the machine was also rebooted for this, so there could be another explanation.

Luckily we found out that our testing infrastructure had a problem so that only a fraction of the performance test was performed. This morning the people responsible for that made some changes and now the DB is complaining again.

This is what I expected. To make sure I fully understand the ARC, I had a look at the theory behind it at the IBM research center. There are some papers which explain how to extend a cache which uses the LRU replacement policy into an ARC with a few lines of code. It would be worthwhile to look at the places in FreeBSD where an LRU policy is used, and to test whether an ARC would improve the cache hit rate there. From reading the paper it looks like there are a lot of places where this should be the case. The authors also provide two adaptive extensions to the CLOCK algorithm (used in the VM subsystem of various OSes) which indicate that such an approach could be beneficial for a VM system. I already contacted Alan (the FreeBSD one) and asked if he knows about it and if it could be beneficial for FreeBSD.
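For those who prefer code over papers, here is a minimal sketch of the ARC replacement policy, following the pseudocode of the Megiddo and Modha paper as I understand it. It tracks keys only (no data) and is not the ZFS implementation, which is a modified variant. The class name, the small LRU helper used for comparison and the access trace are made up for illustration.

```python
from collections import OrderedDict


class ARCCache:
    """Keys-only ARC sketch: T1/T2 are resident, B1/B2 are ghost lists."""

    def __init__(self, capacity):
        self.c = capacity
        self.p = 0.0               # adaptive target size of T1
        self.t1 = OrderedDict()    # resident, seen once recently
        self.t2 = OrderedDict()    # resident, seen at least twice
        self.b1 = OrderedDict()    # ghosts of keys evicted from T1
        self.b2 = OrderedDict()    # ghosts of keys evicted from T2

    def _replace(self, in_b2):
        """Evict the LRU key of T1 or T2 into the matching ghost list."""
        if self.t1 and (len(self.t1) > self.p
                        or (in_b2 and len(self.t1) == self.p)):
            key, _ = self.t1.popitem(last=False)
            self.b1[key] = None
        else:
            key, _ = self.t2.popitem(last=False)
            self.b2[key] = None

    def access(self, key):
        """Record an access and return True on a cache hit."""
        if key in self.t1:                      # second access: promote
            del self.t1[key]
            self.t2[key] = None
            return True
        if key in self.t2:
            self.t2.move_to_end(key)
            return True
        if key in self.b1:                      # ghost hit: favor recency
            self.p = min(self.p + max(1.0, len(self.b2) / len(self.b1)), self.c)
            self._replace(False)
            del self.b1[key]
            self.t2[key] = None
            return False
        if key in self.b2:                      # ghost hit: favor frequency
            self.p = max(self.p - max(1.0, len(self.b1) / len(self.b2)), 0.0)
            self._replace(True)
            del self.b2[key]
            self.t2[key] = None
            return False
        # Complete miss: make room, then insert at the MRU end of T1.
        l1 = len(self.t1) + len(self.b1)
        total = l1 + len(self.t2) + len(self.b2)
        if l1 == self.c:
            if len(self.t1) < self.c:
                self.b1.popitem(last=False)
                self._replace(False)
            else:
                self.t1.popitem(last=False)
        elif total >= self.c:
            if total == 2 * self.c:
                self.b2.popitem(last=False)
            self._replace(False)
        self.t1[key] = None
        return False


def lru_hits(trace, capacity):
    """Hit count of a plain LRU cache, for comparison."""
    cache, hits = OrderedDict(), 0
    for key in trace:
        if key in cache:
            cache.move_to_end(key)
            hits += 1
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)
            cache[key] = None
    return hits


# Two hot keys, then a one-time scan of ten cold keys, then the hot keys again.
trace = ["h0", "h1", "h0", "h1"] + [f"s{i}" for i in range(10)] + ["h0", "h1"]
arc = ARCCache(4)
arc_hits = sum(arc.access(k) for k in trace)
print("ARC hits:", arc_hits, "LRU hits:", lru_hits(trace, 4))
```

The trace is chosen to show the scan resistance: the one-time scan only churns through T1, so the two hot keys in T2 survive it, while a plain LRU of the same size has thrown them out by the end of the scan.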

Making ZFS faster…

I am currently playing around a little bit with my ZFS setup. I want to make it faster, but I do not want to spend a lot of money.

The disks are connected to an ICH 5 controller, so an obvious improvement would be to either buy a controller for the PCI slot which is able to do NCQ with the SATA disks (a siis(4) based one is not cheap), or to buy a new system which comes with a chipset which knows how to do NCQ (this would mean new RAM, new CPU, new MB and maybe even a new PSU). A new controller is a little bit expensive for the old system which I want to tune. A new system would be nice, and reading about the specs of new systems makes me want to get a Core i5 system. The problem is that I think the current mainboard offerings for this are far from good. The system should be a little bit future proof, as I would like to use it for about 5 years or more (the current system is somewhere between 5 and 6 years old). This means it should have SATA-3 and USB 3, but when I look at what is offered currently, it looks like there are only beta-versions of hardware with SATA-3 and USB 3 support available on the market (according to tests there is a lot of variance in the max speed the controllers are able to achieve, there are bugs in the BIOS, or the controllers are attached to a slow bus which prevents using the full bandwidth). So it will not be a new system soon.

As I had a 1 GB USB stick around, I decided to attach it to one of the EHCI USB ports and use it as a cache device for ZFS. If someone wants to try this too, be careful with the USB ports. My mainboard has only 2 USB ports connected to an EHCI, the rest are UHCI ones. This means that only 2 USB ports are fast (sort of… roughly 40 MByte/s in practice), the rest are only usable for slow things like a mouse, a keyboard or a serial line.

Be warned, this will not give you a lot of bandwidth (even if you have a fast USB stick, the roughly 40 MByte/s of the EHCI are the limit which prevents a big streaming bandwidth), but the latency of the cache device is great when doing small random IO. When I run gstat and look at how long a read operation takes for each involved device, I see something between 3 msec and 20 msec for the harddisks (depending on whether they are reading something at the current head position, or need to seek around a lot). For the cache device (the USB stick) I see something between around 1 msec and 5 msec. That is a third to a quarter of the latency of the harddisks.

With a “zfs send” I see about 300 IOPS per harddisk (3 disks in a RAIDZ). Obviously this is an optimum streaming case where the disks do not need to seek around a lot. You see this in the low latency, which is about 2 msec in this case. In the random-read case, for example when you run a find, the disks cannot sustain this number of IOPS, as they need to seek around. And here the USB stick shines. I’ve seen up to 1600 IOPS on it while running a find (if the corresponding data is in the cache, of course). This was with something between 0.5 and 0.8 msec of latency.
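The latency and IOPS numbers above fit together with some simple arithmetic. If a device services one request at a time (an assumption, but a plausible one here, as the ICH 5 does no NCQ), then a latency of L msec caps it at 1000 / L requests per second. A small sketch, with a made-up function name:

```python
def iops_ceiling(latency_ms):
    """Upper bound on I/Os per second when requests are serviced one at a time."""
    return 1000.0 / latency_ms


# Latencies as reported by gstat in this post.
cases = [
    ("harddisk, streaming (zfs send)", 2.0),
    ("harddisk, random read, best case", 3.0),
    ("harddisk, random read, worst case", 20.0),
    ("USB stick during find, slow end", 0.8),
    ("USB stick during find, fast end", 0.5),
]
for label, latency_ms in cases:
    print(f"{label}: {latency_ms} msec -> at most {iops_ceiling(latency_ms):.0f} IOPS")
```

The observed 300 IOPS per harddisk stay below the ceiling for 2 msec, and the 1600 IOPS seen on the USB stick lie between the ceilings for 0.8 and 0.5 msec, so the measurements are consistent with each other.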

This is the machine at home which is taking care of my mail (incoming and outgoing SMTP, IMAP and Webmail), has a squid proxy and acts as a file server. There are not many users (just me and my wife) and there is no regular usage pattern for all those services. Because of this I did not do any benchmark to see how much time I can gain with various workloads (and I am not interested in some artificial performance numbers of my webmail session, as the browsing experience is highly subjective in this case). For this system a 1 GB USB stick (which was just collecting dust before) seems to be a cheap way to improve the response time for frequently used small data. When I use the webmail interface now, my subjective impression is that it is faster. I am talking about listing emails (subject, date, sender, size) and displaying the content of some emails. FYI, my maildir storage has 849 MB with 35000 files in 91 folders.

Bottom line is: do not expect a lot of bandwidth increase with this, but if you have a workload which generates random read requests and you want to decrease the read latency, it could be a cheap solution to add a (big) USB stick as a cache device.

SUN OpenStorage presentation

At work (client site) SUN made a presentation about their OpenStorage products (Sun Storage 7000 Unified Storage Systems) today.

From a technology point of view, the software side is nothing new to me. Using SSDs for ZFS as a read-/write-cache is something we can do (partly) already since at least Solaris 10u6 (that is the lowest Solaris 10 version we have installed here, so I cannot check quickly if the ZIL can be on a separate disk in previous versions of Solaris, but I think we have to wait until we have updated to Solaris 10u8 before we can have the L2ARC on a separate disk) or in FreeBSD. All other nice ZFS features available in the OpenStorage web interface are also not surprising.

But the demonstration with the Storage Simulator impressed me. The interaction with Windows via CIFS makes the older versions of files in snapshots available in Windows (I assume this is the Volume Shadow Copy feature of Windows), and the statistics available via DTrace in the web interface are also impressive. All this technology seems to be well integrated into an easy to use package for heterogeneous environments. If you would like to set up something like this by hand, you would need to have a lot of knowledge about a lot of stuff (and in the FreeBSD case, you would probably need to augment the kernel with additional DTrace probes to be able to get a similar granularity of the statistics), which is nothing a small company is willing to pay for.

I know that I can get a lot of information with DTrace (from time to time I have some free cycles to extend the FreeBSD DTrace implementation with additional DTrace probes for the linuxulator), but what they did with DTrace in the OpenStorage software is great. If you try to do this at home yourself, you need some time to implement something like this (I do not think you can take the DTrace scripts and run them on FreeBSD, this will probably take some weeks until it works).

It is also the first time I see this new CIFS implementation from SUN in ZFS live in action. It looks well done. Integration with AD looks easier than doing it by hand in Samba (at least from looking at the OpenStorage web interface). If we could get this in FreeBSD… it would rock!

The entire OpenStorage web interface looks usable. I think SUN has a product there which allows them to enter new markets. A product which they can sell to companies which did not buy something from SUN before (even Windows-only companies). I think even those Windows admins which never touch a command line interface (read: the low-level ones; not comparable at all with the really high-profile Windows admins of our client) could be able to get this up and running.

As it seems at the moment, our client will get a Sun Storage F5100 Flash Array for technology evaluation in the beginning of next year. Unfortunately the technology looks too easy to handle, so I assume I have to take care of more complex things when this machine arrives… :(
