
Finding low-hanging fruit with PMC, or "O(wtf)"?

I've lately been focusing on performance counter stuff on Sandy Bridge (Xeon and non-Xeon). Part of this has been fixing some of the counters that were wrong. Part has been digesting the Intel tuning guides and the Intel micro-architecture documentation for Sandy Bridge. It's a little different to the older-school, pipeline-driven architectures that rule the MIPS world.

So, I fired up some of my scripts on a live cache pushing a whole lot of live Netflix video traffic. The scripts use the PMC framework in global counter mode rather than sampling mode, so it's cheap to do and doesn't affect performance.

What I found:

  1. The pipeline slots per cycle metric is around 16% - so there's a lot of stalling going on.
  2. There's a lot of memory traffic going on - around 50% of clock cycles are spent in LLC_MISS - ie, it wasn't in L1, L2 or L3/LLC (last-level cache) and thus has to be fetched from memory.

So, I started digging into why there were so many memory accesses. It turns out the biggest abuser was the cross-CPU IPI involved in synchronising page mapping tables - there are a few places calling pmap_invalidate_range() as part of sendfile() buffer completion, and this was causing issues. I pointed this out and someone else addressed it internally. (Ideally, if the IO path uses unmapped buffers on amd64, there shouldn't be any need to map them in and out of KVA.) I think that saved about 4% of total clock cycles spent stalled.

Then I found a lot of stalling going on in the mwait and ACPI sleep path. It turns out that these paths involve ISA space IO port accesses. These are .. very slow. I've just flipped my testing over to use HLT instead of mwait.

Next - flowtable had been turned on during experimentation and I had noticed that the flowtable expire/flush code would periodically spike up. It spiked up more when more clients and more TCP flows were connected. It showed up in both memory accesses and clock cycles busy PMCs - and the reason made me laugh out loud.

The flowtable uses a bitstring_t - effectively an array of bytes treated as a bitmap, like select() FD_SET's - and would walk this to look for flows to expire.

The expiry code would walk the list looking for flows to expire - it would loop over the entire set, calling ffs() over the whole set to look for the next new flow to check.

.. so for each flow it checked, it looped over the whole set again. O(n^2). Right there, in the flow cleaning path. Doing byte-offset fetches rather than 32-bit fetches. Everything about it was ridiculous. As we scaled up to serve more flows, the flowcleaner CPU cycle count was spiking really, really hard.

I pointed this out in an email to my coworkers and fell asleep. It was fixed when I awoke - a co-worker had made it correctly O(n) whilst I was sleeping. It's now totally disappeared from the CPU cycle and stall analysis.

So, I've just been chipping away at things here and there. There are some larger scale issues that I really want to address but I'd like to make sure all the odd, silly and remaining low hanging fruit are addressed. Then comes the fun stuff.

Profiling on superscalar architectures, or "no, instruction counts don’t necessarily matter nowadays.."

I could spend a lot of time brain dumping various things to do with profiling on anything more recent than Pentium 1 (or, well, anything remotely to do with MIPS of that era, but I digress.) In any case, there's plenty of useful writings about it so I'm not going to - just jump over to .

However, I thought I'd provide a basic example of where "instructions" doesn't actually matter, as well as a shortcoming of the current FreeBSD tools.

My network testing stack does a whole lot of read() and write() syscalls to achieve its goal. For those who know what's going on, I hope you know where I'm taking this. Anyway..

Firstly, the standard: "pmcstat -S instructions -T". This prints a "top"-like output counting instructions retired.
Figure 1. # pmcstat -S instructions -T -w 5

This looks like the contention is in the mutexes protecting socket receive and the TCP output path. Sure, but why is it contending?

The problem with profiling on instructions retired is that it hides any issues to do with stalls. There are a bunch of potential stall sources - memory reads, memory writes, data that running instructions need not being in cache. They're generally side-effects of operations not being able to complete in time (eg if you have a whole lot of completed operations that need to push results out to memory to continue, but there's no free bandwidth to queue the memory writes), but sometimes it's just straight bulk memory copies.

If you're interested about the Intel microarchitecture and how all of these pieces fit together to process an instruction stream in parallel, with all of the memory controller paths coming in and out, have a read of this: .

Ok, so let's look at general stalls. There's a bunch of L1, L2 and LLC (last-level cache - think "L3" here) operations that can be looked at, as well as stuff that FreeBSD's PMC framework doesn't support - notably some of the stuff on the "uncore", the shared cache and pipelines between all cores on a socket. It supports the events implemented using MSRs, but not events implemented using the PCIe configuration space.

So, without further ado:

Figure 2. # pmcstat -S RESOURCE_STALLS.ANY -T -w 5

Yup. This looks much more like what I'd expect. The CPU is stalled doing copyout(). This is a mostly-read() workload, so that's what I'd expect. mb_free_ext() is interesting; I'll go look at that.

Now, copyout() is doing a bulk copy. So, yes - I'd expect that to be hurting. mb_free_ext() shouldn't be doing very much work though - I'll do some digging to see what's going on there.

The final output is from the Intel performance tuning overview tools. You can find them here - . There's a nice overview tool (pcm.x) which will output the basic system overview. I like this; it gives a very simple overview of how efficient things are.
Figure 3. "pcm.x 1" running on FreeBSD-10.

Now, this isn't a stock version of pcm.x - I've hacked it up to look slightly saner when doing live reporting - but it still provides exactly the same output in that format. Note the instructions per CPU cycle and the amount of cache misses. It's .. very inefficient. Tsk.

So in summary - don't just do instruction count based profiling. You want to establish whether there are any memory and cache bottlenecks. If you're doing HPC, you want to also check to see if you're hitting SSE, FPU, divider unit and other kinds of math processing stalls.

Now - what would I like to see in FreeBSD?

  • The hwpmc framework needs to grow support for more of the socket and system events - right now it's very "core" focused.
  • Some better top-level tools to provide a system summary like Intel's pcm.x tool would be nice.
  • Some better documentation (read: better than just this wiki page!) looking at how to actually profile and understand system behaviour would be desirable.
  • .. and tying this into dtrace would be great too.
  • Once we get some (more) NUMA awareness, it would be great to have the uncore stuff reporting on things like QPI traffic, cache line and memory accesses from remote sockets/cores, and other stuff to do with NUMA allocation and awareness.
Mostly, however, I'd like to get this stuff into the hands of developers sooner rather than later so they can start running this as a normal day to day thing.