I've lately been focusing on performance counter stuff on Sandy Bridge (Xeon and non-Xeon.) Part of this has been fixing some of the counters that were wrong. Part has been digesting the Intel tuning guides and the Intel micro-architecture for Sandy Bridge. It's a little different to the older school pipeline driven architecture that rules the MIPS world.
So, I fired up some of my scripts (at http://github.com/erikarn/hwpmc) on a live cache pushing a whole lot of live video netflix traffic. The scripts use the PMC framework in global counter mode rather than sampling mode, so it's cheap to do and doesn't affect performance.
What I found:
- The pipeline slots per cycle metric is around 16% - so there's a lot of stalling going on.
- There's a lot of memory traffic going on - around 50% of clock cycles are spent in LLC_MISS - ie, it wasn't in L1, L2 or L3/LLC (last-level cache) and thus has to be fetched from memory.
So, I started digging into why there were so many memory accesses. It turns out the biggest abuser was the cross-CPU IPI involved in synchronising page mapping tables - there are a few places calling pmap_invalidate_range() as part of sendfile() buffer completion and this was causing issues. I pointed this out, someone else has addressed it internally. (Ideally if the IO path uses unmapped buffers on amd64, there shouldn't be any need to map them in and out of KVA.) I think that saved about 4% of total clock cycles spent being stalled.
Then I found a lot of stalling going on in the mwait and ACPI sleep path. It turns out that these paths seem to involve doing ISA space IO port accesses. These are .. very slow. I've just flipped my testing over to use no mwait and use HLT.
Next - flowtable had been turned on during experimentation and I had noticed that the flowtable expire/flush code would periodically spike up. It spiked up more when more clients and more TCP flows were connected. It showed up in both memory accesses and clock cycles busy PMCs - and the reason made me laugh out loud.
The flowtable uses a bitstring_t - effectively an array of bytes treated as a bitmap, like select() FD_SET's - and would walk this to look for flows to expire.
The expiry code would walk the list looking for flows to expire - it would loop over the entire set, calling ffs() over the whole set to look for the next new flow to check.
.. so looping over looping over the whole set. O(n^2). Right there, in the flow cleaning path. Doing byte offset fetches, rather than 32-bit fetches. Everything about it was ridiculous. As we scaled up to serve more flows the flowcleaner CPU cycle count was spiking really, really hard.
I pointed this out in an email to my coworkers and fell asleep. It was fixed when I awoke - a co-worker fixed it to be correctly O(n) whilst I was sleeping. It's now totally disappeared from the CPU cycle and stall analysis.
So, I've just been chipping away at things here and there. There are some larger scale issues that I really want to address but I'd like to make sure all the odd, silly and remaining low hanging fruit are addressed. Then comes the fun stuff.