A while ago I committed the linuxulator DTrace probes I talked about earlier. I waited a little before making this announcement to make sure I had not broken anything. Nobody has complained so far, so I assume nothing obviously bad crept in.
The more than 500 probes I committed do not cover the entire linuxulator, but they are a good start. Adding new ones is straightforward. If someone is interested in a junior-kernel-hacker task, this would be one. Just ask me (or ask on emulation@), and I can guide you through it.
I updated my linuxulator-dtrace patch to a recent -current. I have already compiled it on i386, and arundel@ has compiled it on amd64. I counted more than 500 new DTrace probes. Now that DTrace rescans for SDT probes when a kernel module is loaded, there is no kernel panic anymore when the linux module is loaded after the DTrace modules and you want to use DTrace. I will try to commit this on the morning of a day when I can fix things during the day, in case some problems show up which I did not notice during my testing.
I set the expiration date of linux_base-fc4 (only used by 7.x and upstream way past its EoL) and all dependent ports. It is set to the EoL of the last 7.x release, which can not use a later linux_base port. I also added a comment which explains that the date is the EoL of the last 7.x release.
programming errors (errors which can happen if someone made a mistake, but should not happen)
The programming-error checks give hints about userland programming errors, or about the reason for an error return value caused by a resource shortage or perhaps a wrong combination of parameters. An example error message for this case is “Application %s issued a sysctl which failed the length restrictions.\nThe length passed is %d, the min length supported is 1 and the max length supported is %d.\n”.
The stats-script (tailored specially to the linuxulator, but this can easily be extended to the rest of the kernel) can report about:
number of calls to a kernel function per executable binary (not per PID!): this shows where an optimization would be beneficial for a given application
graph of CPU time spent in kernel functions per executable binary: together with the number of calls to this function, this helps determine whether a kernel optimization would be beneficial or is possible for a given application
graph of longest running (CPU-time!) kernel function in total
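As a rough illustration of the three reports above, here is a minimal D sketch. It is not the stats-script itself: it uses the generic fbt provider instead of the linuxulator SDT probes, and it assumes the linuxulator functions live in a kernel module named "linux". The aggregation names are made up for this example.

```d
#!/usr/sbin/dtrace -s
/*
 * Sketch only: uses the generic fbt provider, not the linuxulator
 * SDT probes. Assumption: the module name is "linux".
 */
#pragma D option quiet

fbt:linux::entry
{
	/* Calls per executable name (not per PID) and kernel function. */
	@calls[execname, probefunc] = count();
	/* vtimestamp only advances while this thread is on a CPU. */
	self->start[probefunc] = vtimestamp;
}

fbt:linux::return
/self->start[probefunc]/
{
	this->delta = vtimestamp - self->start[probefunc];
	/* Power-of-two histogram of CPU time (ns) per executable and function. */
	@cpu[execname, probefunc] = quantize(this->delta);
	/* Longest single invocation per function, over all executables. */
	@longest[probefunc] = max(this->delta);
	/* Release the dynamic variable again. */
	self->start[probefunc] = 0;
}
```

Run it as root and stop it with Ctrl-C; the aggregations are printed on exit. The measured time includes the time spent in functions called from the traced function, so the numbers are inclusive.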
Unfortunately this can not be committed to HEAD as-is. The DTrace SDT provider can not handle probes which are added to the kernel after the SDT provider is already loaded. This means that you either have to compile the linuxulator statically into the kernel, or you have to load the SDT kernel module after the linuxulator module is loaded. If you do not respect this, you get a kernel panic on first access of one of the providers in the linuxulator (AFAIR this includes listing the probes available in the kernel).
I forward ported my DTrace probes for the FreeBSD linuxulator from a 2008 -current to a recent -current. I have not covered the complete FreeBSD linuxulator, but a big part is already done. I can check the major locks in the linuxulator, trace futexes, and I have a D-script which yells at a lot of errors which could happen but should not.
Some of my D-scripts need some changes, as real-world testing showed that they are not really working as expected. They can get overwhelmed by the amount of speculation and dynamic variables (error message: dynamic variable drops with non-empty dirty list). For the dynamic variables problem I found a discussion on the net with some suggestions. For the speculation part I expect similar tuning-possibilities.
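The kind of tuning meant here is usually done with DTrace options at the top of a D-script. The fragment below is only a sketch: the option names are standard DTrace options, but the values are illustrative assumptions and would need to be adjusted to the actual load.

```d
/* More space for dynamic (thread-local and associative) variables. */
#pragma D option dynvarsize=16m
/* Clean up released dynamic variables more often than the default. */
#pragma D option cleanrate=333hz
/* Allow more concurrent speculations, with a larger buffer for each. */
#pragma D option nspec=16
#pragma D option specsize=8m
```

Setting a dynamic variable back to 0 as soon as it is no longer needed also reduces the pressure that leads to drops.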
Unfortunately the D-script which checks the internal locks fails to compile. It seems there is a little misunderstanding on my side about how the D language is supposed to work.
I will try to find some time later to have a look at those problems.
During my development I stumbled over some generic DTrace problems with the SDT provider I use for my probes:
If you load the Linux module after the SDT module, your system will panic as soon as you want to access some probes, e.g. “dtrace -l” will panic the system. Loading the Linux module before the SDT module prevents the panic.
Unloading the SDT module while the Linux module with the SDT probes is still loaded panics the system too. Do not unload the Linux module if you run with my patch.
According to avg@ those are known problems, but I think nobody is working on this. This is bad, because this means I can not commit my current patchset.
If someone wants to try the new DTrace probes for the linuxulator, feel free to go to http://www.Leidinger.net/FreeBSD/current-patches/ and download linuxulator-dtrace.diff. I do not offer a working hyperlink here on purpose: the SDT bugs can hurt if you are not careful, and I want to make the use of this patch a strong opt-in because of this. If the patch hurts you, it is your fault; you have been warned.
Rui Paulo recently committed to FreeBSD HEAD the userland DTrace support, marking the completion of this FreeBSD Foundation sponsored project. Rui summarizes his work as follows:
DTrace, which originated on Solaris 10, is a comprehensive tracing framework that allows the instrumentation of software. FreeBSD has had DTrace support since 7.0, but until now tracing userland programs was not possible. Now that this project is complete, anyone can use DTrace with userland programs.
Tracing and instrumenting userland programs is very important because it allows the understanding of what's going on, especially on highly complex systems such as databases, web servers, and language interpreters. Since DTrace on FreeBSD now has the ability to instrument both the kernel and the userland program, you can get very meaningful data on how your program is behaving and why.
Companies building products on FreeBSD now have the ability to create better products and find out about problems faster than before.
This project focused on allowing the creation of DTrace pid probes, userland statically defined probes (aka USDT), importing plockstat (a DTrace utility to measure lock contention in the pthread library), importing dtruss (a system call tracing utility similar to ktrace) and enabling FreeBSD DTrace support on MySQL and PostgreSQL.
Merges to upcoming FreeBSD releases (8.2 and 7.4) are still being thought out, but it is likely they will happen.
Rui Paulo has been awarded a grant to add DTrace userland support to FreeBSD.
DTrace is a general purpose and lightweight tracing framework that allows administrators, developers and users to investigate causes of system failure or performance bottlenecks. The FreeBSD operating system has had support for kernel-only DTrace since FreeBSD 8.0, but DTrace userland support was missing. Having userland support in DTrace allows inspection of userland software itself and its correlation with the kernel, thus allowing a much better picture of what exactly is going on behind the scenes.
This project will first concentrate on adding libproc support for symbol to address mapping, address to symbol mapping, breakpoint setup and the rtld interactions with DTrace. Next it will focus on DTrace process control, importing the pid provider and adapting it to FreeBSD and porting the userland statically defined probe provider (usdt). Finally it will bring in the plockstat provider.
"By having userland DTrace support, companies can make their products perform much better on FreeBSD due to the fact that they now have access to this amazing tool," said FreeBSD developer Rui Paulo. He also said, "When we mix the userland support with the kernel side DTrace support, we can also make FreeBSD a better operating system because we can investigate performance bottlenecks much easier."
The project should be completed by September 2010.
Last month the FreeBSD DevSummit was once again held the two days prior to BSDCan. While the DevSummit is aimed primarily at FreeBSD Developers, some invitees were from other organizations that use, contribute to, or are otherwise interested in the development of FreeBSD. Such a mix offers opportunities to discuss pain points and ways to collaborate.
One of the invited speakers was Greg Smith from 2ndQuadrant, a company that provides professional services and support for PostgreSQL. Greg wrote about his experience at the DevSummit on the 2ndQuadrant blog and has given permission to repost that entry here. It should be noted that the FreeBSD Foundation is currently funding a project for userspace DTrace.
This week I did something I'd prefer to never repeat: I left the country, did something useful, and made it back again in the same day. The occasion was the FreeBSD Developer Summit, held just before BSDCan--the convention that happens in Ottawa the week before PGCon every year. So I get to head right back again next week, but stay a while that time.
The FreeBSD developers were nice enough to sponsor my trip so that we could talk about both the business and technical hurdles that I felt were keeping the sort of companies I work with from deploying their databases on FreeBSD more often than they do. My slightly updated slides are available on our talks page; I cleaned up a couple of things from what was presented (the most important rewording I'll talk about below).
I was very pleased at how friendly and receptive the developers were even to some of my critical comments. FreeBSD and PostgreSQL have very like-minded communities: an open-for-any-purpose BSD license, academic roots, developers focused on stability, and even a strong documentation culture. There's been plenty of cross-over too.
Much of the PostgreSQL infrastructure has been run using FreeBSD jails for quite some time (although plans are moving to use more Debian in its place, details on why at Inside the PostgreSQL Project Infrastructure). My running joke during the talk was that if PostgreSQL developers are eating their own dog food by deploying critical infrastructure that depends on the database, much of that has been served in a FreeBSD bowl. (The lunch at the conference session was pizza, a much better choice.)
And there's been plenty of FreeBSD development that's used PostgreSQL benchmarking as a measuring stick for the success of their advances. The very popular Introducing FreeBSD 7.0 slides not only showed FreeBSD achieving performance parity with Linux in that release, they also doubled as a document showing how PostgreSQL outscales MySQL. Cheers all around for community driven, BSD licensed code.
One bit of audience contention during my talk came from my assertion that not having support for Emulex fiber channel cards in FreeBSD was preventing a significant amount of "big iron" adoption for databases, due to their perception as the market leader for connecting up expensive hardware like SANs. The guys from FreeBSD hardware and support vendor iXsystems called me out on that, suggesting that the alternative vendor here--QLogic--is both completely trusted by the big boys and has top notch FreeBSD driver quality.
I did a bit more research into whether I was suffering from sampling bias from the set of people I'd talked to about this, and it looks like that was the case. While Emulex claims they've been named Sun's "Best-in-Class Supplier for OEM products", and all the Sun FC cards I've personally run into came from them, there are tons of Sun rebrands of both Emulex and QLogic cards. Same thing is true at all the other vendors I mentioned in my talk; you can get FC cards from both manufacturers via HP and Dell too. I think my general point, that not supporting both Emulex and QLogic hurts the perception of FreeBSD as a serious choice for large businesses, still stands; it's just not quite as bad as I'd feared. Accordingly, I tweaked the wording in the slides I'm publishing, to better match reality here than the ones I presented.
In addition to the solid core they've been growing for years, FreeBSD's license has allowed them to incorporate two very valuable features Sun released as open-source, ZFS and DTrace, into their operating system, both of which are incompatible with Linux's license and are extremely valuable for PostgreSQL deployments. It's still not ideal yet; FreeBSD DTrace can currently be used only by root for example. Limitations such as these have in the past kept me from being particularly motivated to work with FreeBSD. The existence of a free commercial Solaris that ran on generic hardware, combined with the steady progress and open enough community around OpenSolaris, satisfied my needs better. While not many of my PostgreSQL installations have been on Solaris, it has a monopoly share for hosting the terabyte scale databases I've worked with. High quality filesystem snapshots via ZFS and the additional peace of mind you get from disk block checksums alone justified those platform decisions.
The problem today is that hating everything about how Oracle does business is what got me working with PostgreSQL in the first place, and now that they own Sun they're doing the same things to Solaris. No more Solaris on non-Sun hardware, serious cutbacks on the open-source version (OpenSolaris looks like a walking corpse to me), cutting off even basic OS patches unless you have a support contract--that's what we've seen just in the first round from Oracle here. Solaris isn't free in any sense of the word again; we're right back to the same dynamics that pushed me away from them and toward Linux fifteen years ago.
But I continue to be disappointed at how little focus there is on quality control in Linux. How poorly the filesystem mechanics work for the sorts of database work I do doesn't help either. The Linux OOM killer might as well be named the Linux PostgreSQL Hater for how it acts on my servers. And those sexy Solaris features I know work so well for databases are still not there (even if SystemTap is getting better at DTrace emulation).
Meanwhile, FreeBSD has the whole "free" thing sorted out right in their name, and their quality control paranoia is similar to that of your typical good DBA. It looks to me like they're very close to fully assimilating ZFS and DTrace to the point where they can start improving them, rather than just working on getting the original feature set Solaris already had complete and the matching code stable. I think all of us who work on business critical PostgreSQL deployments and who value free software should do a sanity check on just what dog food we're chewing on, and start making sure there's a FreeBSD bowl there at least sometimes. From what I heard this week, the FreeBSD developers are gearing up for another round of chewing on ours too. They're looking into database oriented performance improvements as part of future development, and they're not any happier about using MySQL for that than I am about running PostgreSQL on Solaris. Looks like it might be bowls of dog food all around. Nobody said that leading the software industry was going to be tasty.