Author Archives: adrian

Porting a wifi driver from OpenBSD – AR9170

I told myself a long, long time ago that I really don't want to be working on USB wireless. It's not that I dislike USB or wireless; it's just that the hoops required to get it all working in a stable way are quite a lot to keep in your mind. But, I decided recently that it's about time I learnt how it worked and I was very sad that we still didn't have any working USB wifi devices that also operated with 802.11n.

So, I picked a NIC and dove in.

I picked if_rsu(4) - it's the RTL8188SU / RTL8192SU series hardware from Realtek. It turned out I chose reasonably well.

First off - it's a "fullmac" device - meaning that outside of a handful of things, the device firmware offloads a lot of the 802.11 complications. The driver does hardware initialisation and the wireless stack speaks WPA/WPA2/etc for negotiating encryption, but the hardware handles scanning, authentication, 802.11n aggregation negotiation and most management frame work.

Secondly - it's ported from OpenBSD. The OpenBSD folk do a good job of getting drivers up and running, but there tend to be some sharp edges and the 802.11n bits just don't work.

So, besides currently doing encryption in software, the rsu(4) driver behaves rather well. I'll write a separate article about that. This article is about the AR9170, or otus(4) driver in FreeBSD/OpenBSD parlance.

Now, the AR9170 is a ZyDAS device with an Atheros 802.11n PHY and radio. It's quite a hybrid beast. It's also buggy - there are issues with QoS frames and 802.11n aggregation that make it impossible to behave well. So, for now I'm treating it like a 11abg device and I'll worry about 802.11n when someone gives me patches to make it work.

The OpenBSD driver is based on the initial otus driver that Atheros provided to the Linux developers circa 2009. The firmware blob is closed and very old - the ar9170fw project is still out there on the internet (and I have a mirror), but I can't get it to build on a recent FreeBSD install, so a firmware update will take time. But, it does seem to work.

There are a few pieces to think about when porting a USB driver. The biggest piece is that it's not memory mapped IO or IO port based - everything is a message. There are USB device control commands you can send which will sleep until they're done, but the majority of stuff is done using bulk transmit and receive endpoints and that's all conveniently asynchronous. But it complicates things in the driver world.

Memory mapped and IO port drivers treat device IO as this magical "I do it, then the next instruction executes when it's done" mostly serialised paradigm. It's a lie, of course - the Intel x86 CPUs will pretend things are occurring in a specific order, but a lot of platforms require you to mark memory as uncached or use memory / cache flush operations to ensure things go out to the device in any particularly controlled manner. But USB doesn't - outside of USB control transfers, USB devices tend to look like remote network devices and this includes register accesses. Now, the RTL8188SU driver (rsu(4)) implements the firmware upload and register accesses using control transfers, so it's all pretty easy to get the driver initialisation and attaching working before you care about the asynchronous parts. But the AR9170 driver implements register accesses as firmware commands - and so I have to get a lot more of the stack up and working first.

So, here's what I did.

First up - I commented out almost all of the device driver, and focused on getting the probe, attach and detach methods working. That wasn't too hard. But yes, almost all the code was commented out.

Next up was firmware loading. This was done using control transfers, so I didn't have to worry about implementing the bulk transmit and receive endpoint handling. I had to convert the firmware load path to the FreeBSD firmware API rather than the OpenBSD API, but that was mostly trivial.

Then I realised I wasn't doing any driver locking - so yes, I ensured I did the bare minimum of driver locking required to stop the kernel panicking. OpenBSD doesn't use locks; it uses old style BSD spl() levels.

Next up was command transmit and receive. Now, I needed to set up the USB endpoints - which FreeBSD makes really easy to do using a structure to define what endpoints are what. It was pretty clean. The complicated bit is the bulk callback - it handles transfer statuses and transfer initiation. This is the bit that took me a little time to wrap my head around.

The USB stuff handles things in-sequence. Everything going to an endpoint here gets handled in the sequence you queue it. It also will process the bulk callback in a single worker thread taskqueue, rather than the driver author having to worry about creating their own worker threads. So, this is what you end up doing:

  • The bulk callback has three states: USB_ST_TRANSFERRED, USB_ST_SETUP, and everything else (error.)
  • USB_ST_TRANSFERRED says "I've finished a transfer".
  • USB_ST_SETUP says "I've been asked to initiate a transfer."
  • Any driver thread starts a transmit by calling usbd_transfer_start() on the usb_xfer struct, which will kick off a call into the bulk callback with USB_ST_SETUP.
  • So, the driver has to maintain its own queues of "pending", "active" and "waiting" transactions. "pending" is the queue to put outbound transmit messages on. "active" is the queue you put messages that you've submitted when USB_ST_SETUP is called. When USB_ST_TRANSFERRED or an error is called, you pop off the top entry from "active" and you finish with it, then you fall through to USB_ST_SETUP to start a new transfer.
It's a little complicated because you have to maintain your own submission queues in/out of the USB stack, but in practice it's just a linked list.
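As a rough sketch of the callback shape described above, here is a plain userland C model. It does not use the FreeBSD USB API: the state names (ST_SETUP, ST_TRANSFERRED, ST_ERROR), the structs and the helper functions are all stand-ins invented for illustration, and it assumes one transfer in flight per endpoint.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the USB stack's callback states. */
enum xfer_state { ST_SETUP, ST_TRANSFERRED, ST_ERROR };

struct cmd_buf {
    struct cmd_buf *next;
    int id;
    int done;   /* set once the stack has finished with the buffer */
    int error;  /* set if the transfer ended in an error */
};

/* Minimal FIFO: a singly linked list with head and tail pointers. */
struct cmd_queue {
    struct cmd_buf *head, *tail;
};

static void queue_push(struct cmd_queue *q, struct cmd_buf *b)
{
    assert(q != NULL && b != NULL);
    b->next = NULL;
    if (q->tail != NULL)
        q->tail->next = b;
    else
        q->head = b;
    q->tail = b;
}

static struct cmd_buf *queue_pop(struct cmd_queue *q)
{
    struct cmd_buf *b = q->head;

    if (b != NULL) {
        q->head = b->next;
        if (q->head == NULL)
            q->tail = NULL;
        b->next = NULL;
    }
    return b;
}

struct endpoint {
    struct cmd_queue pending; /* queued by the driver, not yet submitted */
    struct cmd_queue active;  /* handed to the USB stack */
};

/*
 * Shape of a bulk TX callback: finish with the completed buffer, then
 * fall through to the SETUP case to submit the next pending one.
 */
static void bulk_tx_callback(struct endpoint *ep, enum xfer_state st)
{
    struct cmd_buf *b;

    switch (st) {
    case ST_TRANSFERRED:
    case ST_ERROR:
        b = queue_pop(&ep->active);
        if (b != NULL) {
            b->done = 1;
            b->error = (st == ST_ERROR);
        }
        /* FALLTHROUGH */
    case ST_SETUP:
        /* Only submit when nothing is in flight on this endpoint. */
        if (ep->active.head == NULL) {
            b = queue_pop(&ep->pending);
            if (b != NULL)
                queue_push(&ep->active, b);
        }
        break;
    }
}
```

A driver thread would push a buffer onto "pending" and then trigger the ST_SETUP case; in the real driver that trigger is the usbd_transfer_start() call mentioned above.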

So, I stole the framework from rsu(4) for buffer management, transmit submission and completion. It submitted things fine. I also registered buffers for receive, and .. nothing happened. I would send a PING message to the firmware to see if it was awake, and I'd get nothing from the receive pipe.

Then I remembered an interesting bug from when I tried this in 2012 - the AR9170 firmware required the IRQ endpoint to be set up, even though no interrupt messages were ever posted. So, I set the endpoint up, started reception on it.. and now I started to see receive messages. My PING messages were being PONG'ed.

But here's the first complication - although everything is asynchronous here, a lot of places want to send a command and wait for a response. For the PING command it's waiting for a matching PONG response. For setting frequency, starting calibration, etc, you get back interesting status from the firmware. But for things like register read commands, you have to wait until you get the register value back before you can continue. We need to be able to put the caller to sleep until the response comes back, or some timeout occurs.

So, cmd_otus() submits a transfer buffer and then will msleep() on it for up to a second, waiting for a response. When a command is transmitted, a couple of things can occur:
  • Once the transfer succeeds, if the command needs no response then we just send a wakeup to notify the sender that we've sent it, and we free the buffer.
  • If the transfer succeeds but the command needs a response, then we put it on the "waiting" queue.
Then in the receive path we pull out firmware notifications and if they're responses we copy the response into the caller's provided buffer, call wakeup() to wake up the caller, and free the buffer.
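The routing of a command after its transfer completes can be modelled in a few lines of userland C. This is only a sketch under assumptions: the struct and function names are made up, the "woken" flag stands in for the wakeup() call on a sleeping caller, and responses are assumed to come back in the same order the commands were sent.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct cmd {
    struct cmd *next;
    int needs_response;
    void *resp_buf;   /* caller-provided buffer, may be NULL */
    size_t resp_len;  /* size of resp_buf in bytes */
    int woken;        /* stands in for wakeup() on the sleeping caller */
};

/* The "waiting" queue: commands sent, response not yet seen. */
struct waitq {
    struct cmd *head, *tail;
};

static void waitq_push(struct waitq *q, struct cmd *c)
{
    c->next = NULL;
    if (q->tail != NULL)
        q->tail->next = c;
    else
        q->head = c;
    q->tail = c;
}

static struct cmd *waitq_pop(struct waitq *q)
{
    struct cmd *c = q->head;

    if (c != NULL) {
        q->head = c->next;
        if (q->head == NULL)
            q->tail = NULL;
        c->next = NULL;
    }
    return c;
}

/* Called when the bulk transmit of a command has succeeded. */
static void cmd_tx_done(struct waitq *waiting, struct cmd *c)
{
    assert(waiting != NULL && c != NULL);
    if (!c->needs_response) {
        c->woken = 1;           /* sender only wanted "it was sent" */
        return;
    }
    waitq_push(waiting, c);     /* park it until the response arrives */
}

/*
 * Called from the receive path for a firmware response.
 * Returns 0 if a waiting command was completed, -1 if nobody was waiting.
 */
static int cmd_rx_response(struct waitq *waiting, const void *data, size_t len)
{
    struct cmd *c = waitq_pop(waiting);

    if (c == NULL)
        return -1;
    if (c->resp_buf != NULL) {
        size_t n = len < c->resp_len ? len : c->resp_len;
        memcpy(c->resp_buf, data, n);   /* never overrun the caller */
    }
    c->woken = 1;
    return 0;
}
```

The timeout side (the caller sleeping for up to a second) is left out here; in the sketch a caller would simply check "woken" after its sleep returns.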

OpenBSD cheats - it only has one single outstanding command buffer for all threads to use.

The tricky, unimplemented bit here is error handling - if I yank out a NIC during active commands then the driver will sleep for a second, wake up with an error and pass an error back. But, the rest of the driver doesn't know anything was sleeping, so state gets freed from underneath it. I need to go and add what OpenBSD does - refcount when the driver is entered from say, the transmit and ioctl paths, and then upon detach just wait for pending things to finish before freeing.

Ok, so that got command transmit/receive and sleep/wake notification working okay. Next up is packet reception and basic initialization. That was mostly the same - the same hardware bits are needed, the same 802.11 packet format is needed for the stack. The main differences here were in the OpenBSD versus FreeBSD net80211 interface layout - FreeBSD has vaps (virtual access points, etc) but OpenBSD does not. OpenBSD's net80211 is still pre-vap, so there's only one interface. This required a little bit of splitting to put the vap bits in vap routines, and driver bits in the driver. The notable exceptions are vap_create, vap_destroy and newstate.

Next up was realising OpenBSD is also still driving 802.11 state from the driver, not from the net80211 stack. FreeBSD drives the state changes and tells the driver what to do. That required me undoing some manual state transitions (eg otus_init() setting the state to SCAN or RUN depending upon the interface mode) and just letting net80211 do it.

So, net80211 created a vap, called otus_init(), then brought up the interface, set the initial vap state to SCAN via a call to newstate and started changing channels. This worked fine. I had some locking concerns - check the driver to see what I did. It was pretty straightforward.

And then yes - because the receive path was pretty simple and I got straight 802.11 frames back, yes, I started seeing beacons in a tcpdump session. This was great.

Then I ripped up a bunch of callback code that isn't needed. A few years ago FreeBSD's USB drivers maintained their own taskqueue to defer things like crypto key setting, state changes and such. Now net80211 has a per-device taskqueue that it runs these things on, and a lot of the driver calls are done as deferred tasks. OpenBSD doesn't have this so the drivers create their own deferred task and async callback framework to schedule these. It's duplicated work and I removed all of that from the driver.

Next up is transmit. This is trickier for a few reasons.

First, FreeBSD no longer uses if_start() with queues provided by the network stack. I have to maintain my own queue and free net80211 node references as appropriate. It took a while to craft up a correctly behaving transmit side when I fixed rsu(4), so I just stole it for the AR9170. I'll describe that in a subsequent article about rsu.

FreeBSD's net80211 stack handles 802.11 encapsulation itself; we're not handed ethernet frames unless we ask for them. So, I don't call ieee80211_encap(). Yes, I do call for software encryption as required, and that was done.

The biggest sticking point is the rate control. FreeBSD's net80211 stack has a reasonable implementation of transmit rate control modules and it's per vap and per associated node. I don't have to do anything too manual for it. OpenBSD did a bunch of manual work to do the AMRR setup/teardown/updating, so I had to rip it out and call the ratectl init/destroy methods in the vap create/destroy methods.

Next up was what ni->ni_txrate represented. In OpenBSD it seems like an index into the rate control table. In FreeBSD it's the 802.11 rate to use! So, I ripped out a bunch of rate table stuff in the driver and replaced it with a couple of mapping functions to go 802.11 rate to AR9170 hardware rate. That worked like a charm, and transmit works fine.
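A mapping function of the kind described might look something like the sketch below. The 802.11 side uses the usual units of 500 kbit/s with the top bit as the "basic rate" flag; the hardware codes in the table are placeholders chosen for illustration and are not the AR9170's real values, and the function and macro names are invented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DOT11_RATE_BASIC 0x80u  /* "basic rate" flag bit in a rate byte */
#define DOT11_RATE_VAL   0x7fu  /* mask for the rate value itself */

struct rate_map {
    uint8_t dot11_rate; /* in units of 500 kbit/s: 2 = 1M, 108 = 54M */
    uint8_t hw_rate;    /* placeholder hardware code, NOT the real one */
};

static const struct rate_map rate_table[] = {
    /* CCK: 1, 2, 5.5, 11 Mbit/s */
    {   2,  0 }, {   4,  1 }, {  11,  2 }, {  22,  3 },
    /* OFDM: 6, 9, 12, 18, 24, 36, 48, 54 Mbit/s */
    {  12,  4 }, {  18,  5 }, {  24,  6 }, {  36,  7 },
    {  48,  8 }, {  72,  9 }, {  96, 10 }, { 108, 11 },
};

/*
 * Translate an 802.11 rate into a hardware rate code.
 * Returns 0 and stores the code in *hw on success, -1 if unknown.
 */
static int rate_to_hw(uint8_t rate, uint8_t *hw)
{
    size_t i;
    uint8_t val = rate & DOT11_RATE_VAL;    /* ignore the basic-rate flag */

    assert(hw != NULL);
    for (i = 0; i < sizeof(rate_table) / sizeof(rate_table[0]); i++) {
        if (rate_table[i].dot11_rate == val) {
            *hw = rate_table[i].hw_rate;
            return 0;
        }
    }
    return -1;
}
```

With only twelve legacy rates a linear scan is plenty; a failed lookup would normally fall back to the lowest rate rather than dropping the frame.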

The last annoying thing with transmit is how the firmware tells us about failed frames. We don't get a completion message upon each frame - the later firmware does this, but the original blob doesn't. We only get told upon retries and errors. So, I hacked up something where the transmit path counts outbound packets, the RX command path counts retries/errors, and each time I transmit a packet I update net80211 with the transmit/retry/error counts. This works pretty well.
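The counting scheme can be sketched roughly as follows. The names are hypothetical and the hand-off to net80211 is left out; the flush function just returns the numbers that would be reported. It also assumes that an error reported for a frame means that frame did not go out.

```c
#include <assert.h>
#include <stddef.h>

/* Counters accumulated since the last report. */
struct tx_counters {
    unsigned sent;    /* bumped by the transmit path */
    unsigned retries; /* bumped when the firmware reports a retry */
    unsigned errors;  /* bumped when the firmware reports a failure */
};

/* What gets handed to rate control on each report. */
struct tx_report {
    unsigned ok;
    unsigned fail;
    unsigned retries;
};

static void tx_note_sent(struct tx_counters *c)
{
    assert(c != NULL);
    c->sent++;
}

static void fw_note_retry(struct tx_counters *c)
{
    assert(c != NULL);
    c->retries++;
}

static void fw_note_error(struct tx_counters *c)
{
    assert(c != NULL);
    c->errors++;
}

/* Build a report from the counters and reset them for the next round. */
static struct tx_report tx_flush(struct tx_counters *c)
{
    struct tx_report r;

    assert(c != NULL);
    r.fail = c->errors;
    /* Errors can arrive late, so clamp rather than underflow. */
    r.ok = c->sent > c->errors ? c->sent - c->errors : 0;
    r.retries = c->retries;
    c->sent = c->retries = c->errors = 0;
    return r;
}
```

Since the firmware only speaks up on retries and errors, everything else that was sent is counted as a success by subtraction.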

Finally - teardown. The correct order for teardown is:
  • Shut down the MAC - eg, disable TX/RX DMA, etc
  • Disable the USB transfers, wait until they're done
  • Free the transmit/receive buffers and any net80211 node references they may have; and
  • then call ieee80211_ifdetach() to ensure vaps and the top level interface is destroyed.
The initial port called ieee80211_ifdetach() too early and the subsequent node references would refer to now-freed nodes and vaps, causing lots of hilarity.

And that's that. I haven't made 802.11n work; I haven't fixed up the radiotap support so received 802.11 packets in tcpdump actually provide the right rate/channel/etc. That's all details that I'll do when I feel like it. But, the driver is stable, there aren't any lock ordering issues that I've seen so far, and it actually behaves remarkably well.

FreeBSD now has NUMA? Why’d it take so long?

I just committed "NUMA" to FreeBSD. Well, no, I didn't. I did almost no actual NUMA-y work in FreeBSD. I just exposed the existing NUMA stuff in FreeBSD out and re-enabled it.

FreeBSD-9 introduced basic NUMA awareness in the physical allocator (sys/vm/vm_phys.c.) It implemented first-touch page allocation, and then fell back to searching through the domains, round-robin style. It wasn't perfect, but for some workloads it was apparently okay. But it had some shortcomings - it wasn't configurable, UMA and other subsystems didn't know about NUMA domains, and the scheduler really didn't know about NUMA domains. So I'm sure there are plenty of workloads which it didn't work for.

That was all ripped out before FreeBSD-10. FreeBSD-10 NUMA just implements round-robin physical page allocation. It still tracks the per-domain physical memory regions, but it doesn't do any kind of NUMA aware allocation. From what I can gather, it was removed until something 'better' would land.

However, nothing (yet) has landed. So I decided I'd take a look into it. I found that for a lot of simple workloads (ie, where you're doing lots of anonymous memory allocation - eg, you're doing math crunching) the FreeBSD-9 model works fine. It's also a perfectly good starting point for experimenting.

So all my NUMA work in -HEAD does is provide an API to exactly the above. It doesn't teach the kernel APIs about domain aware allocations - there's currently no way to ask for memory from a specific domain when calling UMA, or contigmalloc, etc. The scheduler doesn't know about NUMA, so threads/processes will migrate off-socket very quickly unless you explicitly limit things. Devices don't yet do NUMA local work - the ACPI code is in there to enumerate which NUMA domain they're in, but it's not used anywhere just yet.

Then what is it good for?

If you're doing math workloads where you read in data into memory, do a bunch of work, and spit it out - it works fine. If you're running bhyve instances, you can run them using numactl and have them pinned to a local NUMA domain. Those coarse-grained things work fine. You can also change the system default back to round-robin and use first-touch or fixed-domain for specific processes. It's useful for exactly the same subset of tasks as it was in FreeBSD-9, but now it's at least configurable.

So what's next?

Well, my main aim is to get the minimum done so kernel side work is NUMA aware. This includes UMA, contigmalloc, malloc, mbuf allocation and such. It'd be nice to tag VM objects with a domain allocation policy, but that's currently out of scope. I'd also like to plumb in domain configuration into devices and allow devices to allocate memory for different driver threads with different policies.

But the first thing that showed up is that KVA allocation and superpages get in the way of malloc/contigmalloc working. Allocating memory in FreeBSD first allocates KVA space, then back-fills it with pages. As far as malloc/contigmalloc is concerned, KVA is KVA and it finds the first available space in a time-fast way. It then backfills it with physical pages. The superpage reservation bits (sys/vm/vm_reserv.[ch]) join together regions that are contiguous and in the same superpage and turn it into an allocation from the same superpage. These have no idea about NUMA domains. So, if you allocate a 4KiB page via malloc() from domain 0 and then try to allocate a 4KiB page from domain 1, it will likely mess it up:

  • First page gets allocated - first KVA, then the underlying 2mb superpage is allocated and a 4k page is returned - from physical memory domain 0;
  • Second page gets allocated - first KVA, and if it's adjacent or within the same 2mb superpage as the above allocation, it'll "fake" the page allocation via refcounting and it'll really be that same underlying superpage - but it's from physical memory domain 0, not domain 1 as requested.
I have to teach both vm_reserv and the KVA allocator about NUMA domains, enough so domain specific allocations don't use KVA that's adjacent. It was suggested that I create a second layer of KVA allocators that allocate KVA from the main resource allocator in superpage chunks (here it's 2mb) and then I do domain-specific allocations from them. It'll change how things get fragmented a bit, but it does mean that I won't fall afoul of things.

So, I'll do the above as an experiment and I'll push the VM policy evaluation up a little into malloc/contigmalloc. I'll see how that experiment goes and I'll post diffs for testing/evaluation.

The importance of mentoring, or "how I got involved in FreeBSD"..

Here's how I was introduced into this UNIX world, or "wait, WHO was your WHAT?"

So, here's 11ish or so year old Adrian. It's the early 90s. I was hiding in my bedroom, trying to make another crystal set out of random parts and scraping away the paint at my windowsill. In walks my Aunty, who introduces her new boyfriend.

"Hi, I'm Julian." he said. That wasn't all that interesting.

"Oh, are you making a crystal set?" .. ok, so that was interesting.

And, that was that. Suddenly, someone role-model-y shows up in my life out of the blue. There I was, an 11 year old who felt very mostly alone most of the time, and someone shows up who I can look up to and think I can relate to. So, I'm a sponge for everything he shows me. Whenever he comes over, he has some new story to tell, some new thing to show me. He would show me better ways of building transistor switch circuits when I was in the "make large arcs with car alternator" phase of my early teens. And, when I saved up and bought a PC, he started to show me programming.

Now, I was already programming. My parents had saved up and bought me an Amstrad CPC464. We had a second-hand Commodore 64 for a short while, but that eventually somehow stopped working and I didn't have the clue to fix it. But I was programming Locomotive BASIC and dabbling in Z80 assembly when I was 12, and had "upgraded" to Turbo Pascal 6 when I hit high school. (Yes, school taught Turbo Pascal at Grade 10 level, and I decided to learn it a bit earlier. That's .. wow, that dates me.) I hadn't really stumbled into C yet. I had heard about it, but I didn't have anything that could write it.

Julian explained task switching to me one day during a walk along the beach. He explained that computers can just appear to be doing multiple things at once - but the CPU only does one thing at a time, and you can just switch things really quickly to give the appearance that it's multitasking. With that bright spark planted in my head, I went home and started dreaming up ways to make my Z80 based CPC do something like this.

My mother dragged me to McDonalds to apply for a job the moment I was legally able to (14 years, 9 months) and I saw a computer at a second hand shop - it was a $500 IBM PC/AT, with EGA monitor, two floppy disks and a printer. We put down a down-payment and I paid it off myself with my minimum wage money. Once I had that home I quickly erm, "acquired" a copy of Turbo Pascal for home and was off drawing funny little fractals.

So yes - it's Julian's fault I discovered FreeBSD. Yes, this is Julian Elischer. One day he showed me his computer, running something called BSD. He was trying to explain Bourne shell scripting and the installer. I nodded, very confused, and eventually went back to the VGA programming book he lent me. He also showed me fractint running in X on his monochrome 486 DX2-50 laptop. I had no idea what was going on under the scenes, only that the fractals were much more interesting than the ones I was drawing. So I took the VGA book home and started learning how to use the higher resolutions available. One thing stuck in my mind: so much bit-plane work. Ugh. One other thing stuck in my mind - reading from VGA memory is one of the slowest things you can do. Don't do it. Ever. (Do you hear that console driver authors? Don't do it. It's bad.)

One day he explained pointers to me. I had erm, "acquired" a copy of Turbo C 2.0 from a friend after failing to make much traction with the less friendly versions (Tiny C, for example.) I had coded up a few things, but I didn't really "get" it. So he sat me down with a pen and paper, and drew diagrams to explain what was going on. I remember that lightbulb going off in the back of my mind, as I dimly connected the whole idea of types and sizes together - and that was it. I was off and doing bad things to C code.

I eventually saved up enough for an updated 286 motherboard, then an updated graphics card (full VGA!), then a sound blaster card, and finally a 486-DX33 motherboard. He introduced me to his friend Peter (who had, and I believe still has, a rather extensive electronics collection) and handed me a FreeBSD-1.1 CDROM. I took it home, put it in, and .. it didn't do anything. My 486 had a soundblaster pro + CD-ROM, and .. well, FreeBSD-1.1 didn't speak to that hardware. So, I eventually put Slackware Linux 3.0 on the thing, and became a Linux nerd for a bit.

I did eventually try FreeBSD-1.1 on it - after putting a lot of FreeBSD bits on a lot of floppies - but I couldn't figure out what to do when it booted. This is going to sound silly - but the lack of colorls turned me off. I know, it seems silly now, but that's honestly why I went back to Slackware.

I eventually went back to FreeBSD in the 2.x era once I had an IDE CDROM and I was working part time at an ISP after (high) school finished. Yes, I figured out how to get colorls to work, I got in trouble disagreeing with a Michael (O, not M) at iiNet about Squid on Linux versus FreeBSD, and well.. stuff. Here was this 17yo kid disagreeing with things and acting like he knew everything. I'm sure it was endearing.

Fast-forward a couple years, and I had been hacking on FreeBSD here and there. I got in a little erm, "trouble" before I finished high school, which phk reminded me of - when they granted me a commit bit. I forget when this was, but I wouldn't have been much older than 20.

So - this is why mentoring kids is important. It may seem like a waste of time; it may seem like they don't understand, but we were all there once. We wanted someone to relate to, someone to look up to, and something interesting to do. Julian was that person for me, and I owe both him and my mother (of course) pretty much everything about my existence in this silly little computer industry.

(This is also why you don't skimp on hardware support for popular, if cheaper platforms and "shiny" looking features if you want people to adopt your stuff - but that's a different rant.)

Ok, that's done. I'm going back to hacking on VGA/VESA boot loader support for FreeBSD-HEAD. That's long overdue, and I want my pretty splash screen.

RTL-SDR on FreeBSD, or "hey, cool, I live near an airport, I wonder if ADSB works.."

I bought one of those cheap RTL-SDR units a few months ago. There's no real kernel code required for it - all of the rtl-sdr code just uses the generic USB userland API which is shared between many operating systems.

So, getting it going was pretty easy:

# pkg install rtl-sdr

Then, using it to test ADSB is pretty easy:

# rtl_adsb -V -S 

.. this is verbose and listens to short packets.

Where I live (near San Jose Airport!) I receive a lot of ADSB transmissions. It's quite interesting.

Ok, so next - what about something more GUI like? Someone's already done it, and there's already a package for it:

# pkg install dump1090
# dump1090 --net --aggressive

Then, point a web browser at http://localhost:8080/ and watch!

freebsd-wifi-build, or "wait, you can run freebsd on atheros MIPS access points? where do I get that?"

I've been running FreeBSD at home as my primary internet/wifi access for a few years now. It's cheap, it's easy to do, and I've tried very hard to wrap up the whole process into a mostly-simple build system that spits out a useful image to use.

It's pretty simple in concept - I take FreeBSD-HEAD, build it with some cut-down options, create a custom filesystem image with some custom boot scripts and a custom configuration file, and provide an image that you can TFTP (using a serial console and ethernet cable) or upload directly to the AP if it supports it.

The supported hardware list is here:

Now, it's not a huge list like OpenWRT, but that's mostly because I don't have an infinite supply of Atheros MIPS based routers. I think I'll get some of the TP-Link Archer series stuff next.

Building it is pretty simple:

You checkout the build repo, check out FreeBSD-HEAD, install a couple of packages, and run the build for your board. Once it's done, the images for your board appear in ../tftpboot/. There's a wiki page for each of the supported boards with a walkthrough with how to get FreeBSD going on it.

It comes up with 'user' and 'root' users, with no password. So, the first thing you should do after installation is telnet in, configure /etc/cfg/rc.conf with your actual LAN IPs, set the user/root passwords, and then 'cfg_save' to save things. Then, reboot and voila!

The configuration file format looks like FreeBSD but it isn't. I'm keeping it somewhat hierarchical-looking in naming but flat in implementation so I can migrate it to something like a sqlite or luci backend in the future.

It's good enough for me to be able to set up an AP to be a bridge with a management IP address and configure the ethernet switch. Others have added ipfw support to do NAT and firewalling - I'm going to add configuration rules for NAT, IPFW and routing soon so it's all integrated.

It's FreeBSD, all the way through:

$ uname -a
FreeBSD tl-wdr3600 11.0-CURRENT FreeBSD 11.0-CURRENT #0 r282406M: Wed May 6 22:27:16 PDT 2015 adrian@lucy-11i386:/usr/home/adrian/work/freebsd/head-embedded/obj/mips/mips.mips/usr/home/adrian/work/freebsd/head-embedded/src/sys/TL-WDR4300 mips
$ ifconfig wlan0 list sta
18:ee:69:15:f4:12 2 1 26M 37.0 45 2703 51888 EPS AQEHTRM RSN HTCAP WME
04:e5:36:0d:1b:0d 1 1 19M 23.0 15 1524 47072 EPS AQEPHTR RSN HTCAP WME
cc:3a:61:0e:33:a0 3 1 19M 32.0 30 2585 43072 EPS AQEPHTR RSN HTCAP WME
40:0e:85:1a:f1:69 4 1 19M 25.0 30 1138 54800 EPS AQEPHTR RSN HTCAP WME
00:0f:13:97:14:54 5 1 54M 30.0 45 1808 57424 EPS AE RSN
00:22:fa:c2:d1:20 6 1 26M 24.5 0 574 57776 EPS AQEHTRS RSN HTCAP WME

So if you'd like a FreeBSD based device to act as your home gateway, this is where you can start. It's not pfsense, but it's designed to run on things much smaller than pfsense supports and it's a good introduction into the world of FreeBSD embedded.

Intel DDIO, LLC cache, buffer alignment, prefetching, shared locks and packet rates.

I've been digging into the low level behaviour of high throughput packet classification and pushing for my job. The initial suggestion from everyone was "use netmap!" Which was cool, but it only seems to do fast packet work if you're only ever really flipping packets between receive and transmit rings. Once you start actually looking into the payload, you start having to take memory misses and things can slow down quite a bit. An L3 miss (ie, RAM access) on Sandy Bridge is ~50ns. (There are also costs involved in walking the TLB, but I won't cover that here.)

But! Intel has this magical thing called DDIO. In theory (and there's a lot of theory here), DMA is done via a small (~10%) fraction of LLC (L3) cache, which is shared between all cores. If the data is already in cache when the CPU accesses it, it will be quick. Also, if you then wish to DMA out data from something in cache, it doesn't have to get flushed to memory first - it's just DMAed straight out of cache.

However! When I was doing packet bridge testing (using netmap + bridge, 64 byte payloads), I noticed that I was using a significant amount of memory bandwidth. It wasn't quite at the rate of 10G worth of bridged data, but DDIO should be doing almost all of that work for me at 64 byte payloads.

So, to reproduce: run netmap bridge (eg 'bridge -i netmap:ix0 -i netmap:ix1') and run pkt-gen between two nodes.

This is the output of 'pcm-memory.x 1' from the intel-pcm toolkit (which is available as a binary package on FreeBSD.)

--                   System Read Throughput(MB/s):    300.68                  --
--                  System Write Throughput(MB/s):    970.81                  --
--                 System Memory Throughput(MB/s):   1271.48                  --

The first theory - the bridging isn't occurring fast enough to service what's in LLC before it gets flushed out by other packets. So, assume:

  1. It's 1/10th of the LLC - which is 1/10th of an 8 core * 2.5MB per core setup, is ~ 2MB.
  2. 64 byte payloads are being cached.
  3. Perfect (!) LLC use.
That's 32,768 packets at a time. Now, netmap is doing ~ 1000 packets a batch and it's keeping up line rate bridging on one core (~14 million packets per second), so it's not likely that.
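The arithmetic behind that figure is easy to check. The helper below is a made-up name, and it treats "2.5MB" as 2.5 MiB (2,621,440 bytes) per core, which is the reading that gives exactly 32,768.

```c
#include <assert.h>
#include <stdint.h>

/*
 * How many payloads fit in the DDIO share of the LLC, assuming
 * perfect cache use and nothing else competing for that share.
 */
static uint64_t ddio_packet_capacity(unsigned cores,
    uint64_t llc_bytes_per_core, unsigned ddio_divisor,
    unsigned payload_bytes)
{
    assert(ddio_divisor != 0 && payload_bytes != 0);
    return cores * llc_bytes_per_core / ddio_divisor / payload_bytes;
}
```

For 8 cores, 2,621,440 bytes per core, a 1/10th share and 64 byte payloads: 8 * 2,621,440 = 20,971,520 bytes of LLC, a tenth of that is 2,097,152 bytes (2 MiB), and 2,097,152 / 64 = 32,768 packets.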

Ok, so what if it's not perfect LLC usage?

Then I thought back to cache line aliasing and other issues that I've previously written about. What if the buffers are perfectly aligned (say, 2048 byte aligned)? The cache line aliasing effects should then manifest themselves as low LLC utilisation.

Luckily netmap has a twiddle - 'dev.netmap.buf_size' / 'dev.netmap.priv_buf_size'. They're both .. 2048. So yes, the default buffer sizes are aligned, and there's likely some very poor LLC utilisation going on.

So, I tried 1920 - that's 2048 - (2 * 64) - ie, two cache lines less than 2048.
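To see why the buffer stride matters, here is a deliberately simplified model. It assumes a cache that picks its set purely from the address divided by the 64 byte line size, modulo a hypothetical 2048 sets; real Intel LLCs hash addresses across slices, so treat this as an illustration of the effect rather than a model of the actual hardware. It counts how many distinct sets the first cache line of each buffer lands in when buffers are laid out back to back.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define LINE_SIZE 64u   /* cache line size in bytes */

/*
 * Count the distinct cache sets hit by the first line of each of
 * nbufs buffers placed back to back, stride bytes apart.
 * Returns 0 if the scratch allocation fails.
 */
static size_t distinct_sets(size_t stride, size_t nbufs, size_t nsets)
{
    unsigned char *seen;
    size_t i, count = 0;

    assert(nsets != 0);
    seen = calloc(nsets, 1);
    if (seen == NULL)
        return 0;
    for (i = 0; i < nbufs; i++) {
        size_t set = (i * stride / LINE_SIZE) % nsets;

        if (!seen[set]) {
            seen[set] = 1;
            count++;
        }
    }
    free(seen);
    return count;
}
```

In this model 4096 buffers at a 2048 byte stride only ever touch 64 of the 2048 sets, because each buffer start is exactly 32 lines after the previous one. At a 1920 byte stride the step is 30 lines, and the same 4096 buffers spread over 1024 sets. Since the packet headers live at the start of each buffer, that is where the pressure piles up.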

--                   System Read Throughput(MB/s):    104.92                  --
--                  System Write Throughput(MB/s):    382.32                  --
--                 System Memory Throughput(MB/s):    487.24                  --

It's now using significantly less memory bandwidth to do the same thing. I'm guessing this is because I'm now using the LLC much more efficiently.

Ok, so that's nice - but what about when it comes time to actually look at the packet contents to make decisions?

I've modified a copy of bridge to do a few things, mostly inspired by netmap-ipfw:
  • It does batch receive from netmap;
  • but it then looks at the ethernet header to decap that;
  • then it gets the IPv4 src/dst addresses;
  • .. and looks them up in a (very large) traditional hash table.
I also have a modified copy of pkt-gen that will use completely random source and destination IPv4 addresses and ports, so as to elicit some very terrible behaviour.

With an empty hash set, but still dereferencing the ethernet header and IPv4 source/destination, handling a packet at a time, no batching, no prefetching and only using one core/thread to run:

  • Bridges about 6.5 million pps;
  • .. maxes out the CPU core;
  • Memory access: 1000MB/sec read; 423MB/sec write (~1400MB/sec in total).
With the same setup at buf_size=1920:
  • Bridges around 10 million pps;
  • 98% of a CPU core;
  • Memory access: 125MB/sec read, 32MB/sec write, ~ 153MB/sec in total.
So, it's a significant drop in memory throughput and a massive increase in pps for a single core.

Ok, so most of the CPU time is now spent looking at the ethernet header in the demux routine and in the hash table lookup. It's a blank hash table, so it's just the memory access needed to see if the bucket has anything in it. I'm guessing it's because the CPU is loading in the ethernet and IP header into a cache line, so it's not already there from DDIO.

I next added in prefetching the ethernet header. I don't have the code to do that, so I can't report numbers at the moment. But what I did there was I looped over everything in the netmap RX ring, dereferenced the ethernet header, and then did per-packet processing. This was interesting, but I wanted to try batching out next. So, after some significant refactoring, I arranged the code to look like this:
  1. Pull in up to 1024 entries from the netmap receive ring;
  2. Loop through, up to 16 at a time, and place them in a batch
  3. For each batch do:
    1. For each packet in the batch: optional prefetch on the ethernet header
    2. For each packet in the batch: decapsulate ethernet/IP header;
    3. For each packet in the batch: optional prefetch on the hash table bucket head;
    4. For each packet in the batch: do hash table lookup, decide whether to forward/block
    5. For each packet in the batch: forward (ie, ignore the forward/block for now.)
I had things be optional so I could turn on/off prefetching and control the batch size.
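
The staged loop above can be sketched roughly as follows. This is only a model of the control flow: Python has no prefetch instruction, so the prefetch stages are marked as comments, and the packet tuples, the `table` set and the function name are all made up for illustration.

```python
def process_ring(packets, table, batch_size=4, max_pull=1024):
    """Staged batch processing over one pull from the receive ring.

    packets: list of (src_ip, dst_ip) tuples standing in for frames.
    table: set of IPv4 addresses standing in for the hash table.
    Returns a list of (src_ip, dst_ip, matched) tuples, one per packet.
    """
    pulled = packets[:max_pull]
    decisions = []
    for start in range(0, len(pulled), batch_size):
        batch = pulled[start:start + batch_size]
        # Stage 1: an optional prefetch of each ethernet header would go here.
        # Stage 2: decapsulate; the tuples here are already "decoded".
        decoded = [(src, dst) for src, dst in batch]
        # Stage 3: an optional prefetch of each hash bucket head would go here.
        # Stage 4: look up both addresses for every packet in the batch.
        verdicts = [(src in table) or (dst in table) for src, dst in decoded]
        # Stage 5: forward everything, just recording the verdict.
        decisions.extend((s, d, v) for (s, d), v in zip(decoded, verdicts))
    return decisions
```

The point of running each stage across the whole batch is that the memory accesses for several packets can be in flight at once, rather than each packet stalling on its own cache misses.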

So, with an empty hash table, no prefetching and only changing the batch size, at buf_size=1920:
  • Batch size of 1: 10 million pps;
  • Batch size of 2: 11.1 million pps;
  • Batch size of 4: 11.7 million pps.
Hm, that's cute. What about with prefetching of ethernet header? At buf_size=1920:
  • Batch size of 1: 10 million pps;
  • Batch size of 2: 10.8 million pps;
  • Batch size of 4: 11.5 million pps.
Ok, so that's not that useful. Prefetching on the bucket header here isn't worthwhile, because the buckets are all empty (and thus NULL pointers.)

But, I want to also be doing hash table lookups. I loaded in a reasonably large hash table set (~ 6 million entries), and I absolutely accept that a traditional hash table is not exactly memory or cache footprint happy. I was specifically after what the performance was like for a traditional hash table. Said hash table has 524,288 buckets, and each points to an array of IPv4 addresses to search. So yes, not very optimal by any measure, but it's the kind of thing you'd expect to find in an existing project.
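
For reference, here's a minimal sketch of that kind of table: a fixed array of buckets, each pointing at a flat array of IPv4 addresses that gets searched linearly. The class name and the multiplicative hash are assumptions for illustration, not what the test code uses.

```python
from array import array

class BucketTable:
    """Fixed bucket count; each bucket is a flat array of 32-bit IPv4 addresses."""

    def __init__(self, num_buckets=524288):
        self.num_buckets = num_buckets
        self.buckets = [None] * num_buckets  # empty buckets stay None (NULL)

    def _index(self, addr):
        # Hypothetical hash: a multiplicative mix, then reduce to a bucket.
        return ((addr * 2654435761) & 0xFFFFFFFF) % self.num_buckets

    def add(self, addr):
        i = self._index(addr)
        if self.buckets[i] is None:
            self.buckets[i] = array("I")
        self.buckets[i].append(addr)

    def contains(self, addr):
        # One dereference for the bucket head, then a linear scan.
        bucket = self.buckets[self._index(addr)]
        return bucket is not None and addr in bucket
```

Every lookup costs at least one access to the bucket head and, for a non-empty bucket, a second access to the address array, which is what makes it unfriendly to the cache once the table outgrows the LLC.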

With no prefetching, and a 6 million entry hash table:

At 2048 byte buffers:
  • Batch size of 1: 3.7 million pps;
  • Batch size of 2: 4.5 million pps;
  • Batch size of 4: 4.8 million pps.
At 1920 byte buffers:
  • Batch size of 1: 5 million pps;
  • Batch size of 2: 5.6 million pps;
  • Batch size of 4: 5.6 million pps.
That's a very inefficient hash table - each bucket is going to have around 11 IPv4 entries in it, and that's checking almost a cache line worth of IPv4 addresses in it. Not very nice. But, it's within a cache line worth of data, so in theory it's not too terrible.

What about with prefetching? All at 1920 byte buffers:
  • Batch size of 4, ethernet prefetching: 5.5 million pps
  • Batch size of 4, hash bucket prefetching: 7.7 million pps
  • Batch size of 4, ethernet + hash bucket prefetching: 7.5 million pps
So in this instance, there's no real benefit from doing prefetching on both.

For one last test, let's bump the bucket count from 524,288 to 2,097,152. These again are all at buf_size=1920:
  • Batch size of 1, no prefetching: 6.1 million pps;
  • Batch size of 2, no prefetching: 7.1 million pps;
  • Batch size of 4, no prefetching: 7.1 million pps;
  • Batch size of 4, hash bucket prefetching: 8.9 million pps.
Now, I didn't quite predict this. I figured that since I was reading in the full cache line anyway, having up to 11 entries in it to linearly check would be cheap. It turns out that no, that's not exactly true.
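
The occupancy numbers are easy to check. This is just the arithmetic, assuming roughly 6 million entries spread evenly and 4 bytes per IPv4 address:

```python
ENTRIES = 6_000_000      # approximate table size from the text
ADDR_BYTES = 4           # one IPv4 address
CACHE_LINE = 64

def bucket_stats(num_buckets):
    """Return (average entries per bucket, average bytes scanned per bucket)."""
    per_bucket = ENTRIES / num_buckets
    return per_bucket, per_bucket * ADDR_BYTES

for buckets in (524_288, 2_097_152):
    entries, nbytes = bucket_stats(buckets)
    print(f"{buckets} buckets: {entries:.1f} entries, {nbytes:.1f} bytes per bucket")
```

That works out to about 11.4 entries (roughly 46 bytes) per bucket for the smaller table and about 2.9 entries (roughly 11 bytes) for the larger one. Both fit inside a single 64 byte cache line, so the difference is in the scanning work rather than in extra lines being fetched.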

The difference between the naive way (no prefetching, no batching) and 4-packet batching with hash bucket prefetching is not trivial - it's ~ 50% faster. Going all the way to the larger bucket count was ~75% faster. Now, this hash implementation is not exactly cache footprint friendly - it's bigger than the LLC, so with random flows and thus no real useful cache behaviour it's going to degrade to quite a few memory accesses.

This has been quite a fun trip down the optimisation peephole. I'm going to spend a bunch of time writing down the hardware performance counters involved in analysing this stuff and I'll look to write a follow-up post with details about that.

One final thing: threads and locking. I wanted to clearly demonstrate the cost of shared read locks on a setup like this. There's been lots of discussions about the right kind of locking and concurrency strategies, so I figured I'd just do a simple test in this setup and explain how terrible it can get.

So, no read-locks between threads on the hash table, batch size of 4, hash bucket prefetching, buf_size=1920:
  • 1 thread: 8.9 million pps;
  • 4 threads: 12 million pps.
But with a read lock on the hash table lookups:
  • 1 thread: 7 million pps;
  • 4 threads: 4.7 million pps.
I'm guessing that as I add more threads, the performance will drop.

Even taking a rwlock as a reader lock in pthreads is expensive - it's purely just an atomic increment/decrement in FreeBSD, but it's still not free. I'm getting the lock once for two hash table lookups - ie, the source and destination IP hash table lookups are done under one lock. I'm sure if I took the lock for the whole batch hash table lookup it'd work out a little better on a small number of CPU cores, but I think this demonstrates my point - read locks aren't going to cut it when you have a frequently accessed thing to protect.

The best bit about this post? The prefetching, terrible (large) hash table performance and general cache abuse is not new. Doing batching on superscalar Intel CPUs is not new. Documenting DDIO effectiveness using non-power-of-two-aligned buffer sizes is new, but it's just a rehash of the existing cache aliasing effect. But, I now have a little test bed to experiment with these things without having to try and involve the rest of a kernel.

Yes, I'll publish code soon.

Using the arswitch ethernet switch on FreeBSD

I sat down a few weeks ago to make the AR8327 ethernet switch work and in doing so I wanted to add per-port and 802.1q VLAN support. It turned out that I .. didn't know as much as I thought I did about the etherswitch support. So, after a whole bunch of trial-and-error, I wrapped my head around things. This post is mostly a braindump so if I do forget I have something written down about it - at least until I turn it into a FreeBSD manpage.

There's three modes:
  • default - all ports are in the same VLAN;
  • per-port - each port can be in a VLAN 'group';
  • dot1q - each port can be in multiple VLAN groups, with 802.1q tagging going on.
The per-port VLAN group is for switches that don't have an arbitrary VLAN table - you just assign each port an ID from some low set of values (say, 16), and then the VLAN tag can either be added or not added. I think the RTL8366 switch is like this, but I'd have to check.

The dot1q VLAN is for switches that support multiple VLANs, each can have an arbitrary VLAN ID (0..4095) with optional other VLAN options (like tag-in-tag support.)

The etherswitch configuration side has a few options and they're supported by different hardware:
  • Each port has a port VLAN ID - this is the "native port" for dot1q support. I don't think it has any particular meaning in the per-port VLAN code in arswitch but I could be terribly wrong. I thought it did when I initially did the port, but the documentation is .. lacking.
  • Then there's a set of per-port flags - eg q-in-q, 802.1q tagging, etc.
  • Then there's the vlangroup - each vlangroup has a vlan ID, and then a set of port members. Each port member can be tagged or untagged.
This is where things get odd.

Firstly - the AR934x SoC switch support doesn't include VLANs. I need to add that. I'm not sure which side of the wall this falls.

The switches previous to the AR8327 support per-port and VLAN configuration, but they don't support per-port-per-VLAN tagging. Ie, you can configure 802.1q VLANs, and you can enable tagging on the port - but it tags all packets that aren't the port 'VLAN ID'.

The per-port VLAN ID seems ignored by the arswitch code - it's only used by the dot1q support.

So I think (and it hasn't yet been tested) that on the earlier switches, I can use per-port VLANs with tagging by:
  • Configuring per port vlans - "etherswitch config vlan_mode port"
  • Adding vlangroups as appropriate with membership - tag/untag doesn't matter
  • Set the CPU port up to have tagging - "etherswitch port0 addtag"
When configuring dot1q VLANs, the mode is "config vlan_mode dot1q" and the 802.1q VLAN IDs are used, but the above still holds - the port is tagged or untagged.

But on the AR8327, the VLAN map hardware actually supports enabling/disabling tagging on a per-port-per-VLAN basis. Ie, when the VLAN table is programmed with the port membership, it takes a list of both the ports and whether the ports are tagged/untagged/open/filtered. So, I don't think per-port VLAN tagging works - only dot1q tagging. Maybe I can make it work, but I haven't really sat down for long enough with the documentation to see what combinations are required.
  • Configure the hardware - "etherswitch config vlan_mode dot1q"
  • Add vlangroups as appropriate, set pvid as appropriate
  • For each vlangroup membership, the port can be tagged or untagged - eg to tag the cpu port 0, you'd use '0t' as the port member. That says "port0 is a member, and it's tagged."
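
To make the member notation concrete, here's a small sketch that models it. Only the 't' suffix meaning "tagged" comes from the description above; the comma-separated form and the function name are assumptions for illustration, and this is not the etherswitch code itself.

```python
def parse_members(spec):
    """Parse a vlangroup member string such as "0t,1,2".

    Returns a dict mapping port number -> True if tagged, False if untagged.
    The comma-separated layout is assumed for illustration.
    """
    members = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        tagged = item.endswith("t")
        port = int(item[:-1] if tagged else item)
        members[port] = tagged
    return members

print(parse_members("0t,1,2"))
```

So a group whose members are "0t,1,2" would have the CPU port 0 tagged, with ports 1 and 2 as untagged members.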
I still have a whole lot more to add - the ingress/egress filters aren't configurable, the per-port vlan stuff needs to be made much more sensible and consistent - and the AR934x SoC switch needs to support VLANs. Oh, and much more documentation. But, hey, I can get the thing spitting out VLAN tags, so when it's time to set up my home network with some VLANs, I'll be sure to document what I did and share it with everyone.

Cache Line Aliasing #2, or "What happens when you page align everything"

After a little more digging into the Intel performance side of things, I discovered one of the big reasons for the performance drop on this particular workload: how Intel CPUs do memory reordering.

The TL;DR is this - there's some hardware inside the Intel CPUs that tracks memory ordering and cache contents - but they don't use all the address bits.

The relevant chapter in the Intel optimisation guide is 3.6.8 - Capacity Limits and Aliasing in Caches. The specific thing I was hitting was in - Store Forwarding Aliasing.

Assembly/Compiler Coding Rule 56. (H impact, M generality) Avoid having a store followed by a non-dependent load with addresses that differ by a multiple of 4 KBytes. Also, lay out data or order computation to avoid having cache lines that have linear addresses that are a multiple of 64 KBytes apart in the same working set. Avoid having more than 4 cache lines that are some multiple of 2 KBytes apart in the same first-level cache working set, and avoid having more than 8 cache lines that are some multiple of 4 KBytes apart in the same first-level cache working set.

So, given this, what can be done? In this workload, a bunch of large matrices were allocated via jemalloc, which page aligns large allocations. In the default invocation of the benchmark (where the allocation padding size is 0), the memory access patterns showed a very large number of counter events on "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS" - which is the number of 64k address aliases on the Sandy Bridge Xeon processors I've been testing on. (The same occurs on Westmere, Ivy Bridge and Haswell.) As I vary the padding size, the address aliasing value drops, the memory access counters increase, and the general performance increases.
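
Here's a rough model of why page-aligned matrices trip over the 4 KByte rule. The allocator below is hypothetical (it is not jemalloc), the addresses are made up, and the way padding shifts each successive allocation is an assumption; the only thing it demonstrates is the low-12-bits comparison.

```python
PAGE = 4096

def aliases_4k(addr_a, addr_b):
    """True if two distinct addresses differ by a multiple of 4 KBytes."""
    return addr_a != addr_b and (addr_a - addr_b) % PAGE == 0

def alloc_bases(sizes, padding=0, start=0x10000000):
    """Model an allocator that page-aligns each allocation, then shifts it.

    The i-th allocation is shifted by i * padding bytes (assumed).
    """
    bases, cursor = [], start
    for i, size in enumerate(sizes):
        shift = i * padding
        bases.append(cursor + shift)
        # Advance to the next page boundary past this allocation.
        pages = (size + shift + PAGE - 1) // PAGE
        cursor += pages * PAGE
    return bases

aligned = alloc_bases([1 << 20] * 3)
padded = alloc_bases([1 << 20] * 3, padding=64)
print(aliases_4k(aligned[0] + 8, aligned[1] + 8))
print(aliases_4k(padded[0] + 8, padded[1] + 8))
```

With zero padding, element k of one matrix and element k of the next always sit a multiple of 4 KBytes apart, so a store to one followed by a load from the other looks like a possible dependency to the hardware. Any small shift breaks that pattern.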

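On the test boxes I have (running pmcstat -w 120 -C -p LD_BLOCKS_PARTIAL.ADDRESS_ALIAS ./himenobmtxpa M ), the results look like this. The columns are the padding size, the address alias event count and the benchmark result: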

0 217799413 830.995025
64 18138386 1624.296713
96 8876469 1662.486298
128 19281984 1645.370750
192 18247069 1643.119908
256 18511952 1661.426341
320 19636951 1674.154119
352 19716236 1686.694053
384 19684863 1681.110499
448 18189029 1683.163673
512 19380987 1691.937818

So there's still plenty of aliasing going on at different padding offsets, however it's a very marked drop between 0 and, well, anything.
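
To put numbers on that drop, here's a quick sketch using a few rows from the table. I'm reading the three columns as padding size, alias event count and benchmark score, and the function name is made up.

```python
# (padding bytes, LD_BLOCKS_PARTIAL.ADDRESS_ALIAS events, benchmark score)
rows = [
    (0, 217799413, 830.995025),
    (64, 18138386, 1624.296713),
    (96, 8876469, 1662.486298),
    (512, 19380987, 1691.937818),
]

def relative_to_baseline(rows):
    """Yield (padding, alias reduction factor, speedup) against the first row."""
    _, base_alias, base_score = rows[0]
    for padding, alias, score in rows[1:]:
        yield padding, base_alias / alias, score / base_score

for padding, fewer, faster in relative_to_baseline(rows):
    print(f"padding {padding}: {fewer:.1f}x fewer alias events, {faster:.2f}x the score")
```

Even the 64 byte padding row has roughly 12 times fewer alias events and nearly twice the benchmark score of the unpadded run.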

It turns out that someone's gone and done a bunch more digging into the effects of various CPU magic under the hood. The last paper in the list (Analysing Contextual Bias..) looks at Aliasing and Cache Effects and the effect of memory layout. There's some cute (and sobering!) analysis of the performance changes due to something as simple as the length of your login name in the UNIX environment. It's worth reading.

The summary? Maybe page alignment of all of your memory accesses isn't the way to go.

For further reading:

cache line aliasing effects, or "why is freebsd slower than linux?"

There were some threads on FreeBSD/DragonflyBSD mailing lists a few years ago (2012?) which talked about some math benchmarks being much slower on FreeBSD/DragonflyBSD versus Linux.

When the same benchmark is run on FreeBSD/DragonflyBSD using the Linux layer (ie, a linux binary compiled for linux, but run on BSD) it gives the same or better behaviour.

Some digging was done, and it turned out it was due to memory allocation patterns and memory layout. The jemalloc library allocates large chunks at page aligned boundaries, whereas the allocator in glibc under Linux does not.

I've put the code online in the hope that others can test and verify this:

The branch 'local/freebsd' has my local change to allow the allocator offset to be specified. The offset compounds on each allocation - so with an 'n' byte offset, the first allocation is 0 bytes offset from the page boundary, the next is 'n' bytes offset from the page boundary, the next is '2n' bytes offset, etc.
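
As a sketch of what that compounding looks like (the function name is made up, and wrapping at the page size is my assumption, since what happens once the offset passes 4096 isn't spelled out here):

```python
PAGE = 4096

def allocation_offsets(n, count):
    """Offset of each successive allocation from its page boundary.

    With an n byte offset the i-th allocation (counting from 0) sits
    i * n bytes past a page boundary; wrapping at PAGE is assumed.
    """
    return [(i * n) % PAGE for i in range(count)]

print(allocation_offsets(127, 4))
```

With a 127 byte offset the first four allocations land 0, 127, 254 and 381 bytes past a page boundary, so no two of them share the same low address bits.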

You can experiment with different values and get completely different behavioural results. It's non-trivial: there's a 100% speedup by using a 127 byte offset for each allocation, versus a 0 byte offset.

I'd like to investigate cache line aliasing effects further. There was work done a few years ago to offset mbuf headers in the FreeBSD kernel so they weren't all page-aligned or 256/512/1024 byte aligned - and apparently this gave a significant performance improvement. But it wasn't folded into FreeBSD. What I'd like to do is come up with some better strategies / profiling guides for identifying when this is actually happening so the underlying objects being accessed can be adjusted.

So - if anyone out there has any tips, hints or suggestions on how to do this, please let me know. I'd like to document and automate this testing.

FreeBSD on the POWER8: it’s alive!

A post to freebsd-ppc from a couple of months ago asked if we had support for POWER8 and offered to provide remote access to anyone interested in working on it. I was sufficiently intrigued that I approached the FreeBSD powerpc hackers to ask about it, and was informed that it'd be nice, but we didn't have hardware.

After a bit of wrangling of hardware logistics and with the FreeBSD Foundation purchasing a box, a Tyan POWER8 evaluation server appeared. Nathan Whitehorn started poking at it and managed to get a basic "hello world" going, but stalled on issues with the Linux KVM virtualisation environment.

Fast forward a few weeks - he's figured out the KVM issues, their lack of support for some mandated hypervisor APIs and other bugs - FreeBSD now boots inside of the hypervisor environment and seems stable enough to do development on.

He then found the existing powerpc pmap (physical memory management) code wasn't very SMP friendly - it works fine on one and two CPU powerpc machines, but this POWER8 evaluation board is a 4-core, 32-thread CPU. So a few days of development went by and he rewrote most of the pmap code to use much finer grained locking and scale much, much better than the existing code. (He also found the PS3 hypervisor layer isn't thread-safe.)

What's been done thus far?

  • FreeBSD boots inside the hypervisor environment;
  • Virtualised console, networking and storage all work;
  • (in progress) new, scalable pmap implementation;
  • Initial support for the Vector-Scalar Extension (VSX) that's found on POWER7 and POWER8.
So, I'm impressed. Nathan's done a fantastic job bringing the whole thing up. There's some further work on the new powerpc technology that needs doing (things like the new vector processing units, performance counter support and such) and I'm sure Justin and Nathan will poke powerpc dtrace support into further good shape. I'm going to see if we can fit a Chelsio 40G NIC into one of these and work with their developers to fix any endian/busdma issues that creep up, and then do some network stack scaling testing with it. There's also the missing hardware/hypervisor support to run FreeBSD on the bare metal, which would be a fantastic achievement.

Now I kind of want some larger POWER8 hardware.

TDMA (somewhat) working on AR9380 chips

(Wow, I have a lot of posts to write to catch up on things.)

I've just brought up FreeBSD's TDMA support on the AR9380 chipset. Specifically, the AR9331, since I have a Carambola 2 on me today.

It was pretty simple to bring up - I was missing the beacon configuration HAL call that the TDMA code expected. It's only used by the TDMA code - the STA and AP modes rely on the normal HAL beacon methods that date back to the Atheros HAL.

The only problem - it seems something is up with ANI (noise immunity) and sensitivity on at least the AR9331. It doesn't seem to behave well on slightly loaded channels and thus the beacons don't always go out when they're supposed to.

But, if you've been wanting to play with TDMA on the later Atheros chips, now you can!

On profiling HTTP, or "god damnit people, why are all the open source tools slow?"

Something that's been a challenge at work (and at other things in the past) has been "how do I generate enough traffic to test this thing?"

If you're running some public facing boxes then sure, you can do A/B testing. But what if you're not able to test it in the real world? What if you need to do testing before you ship, and the traffic levels have to be stupid high?

So, what do you do?

I've done this a few times. When doing squid and other reverse proxy development, I would run tools like apachebench, httperf, even web polygraph - but these things scaled poorly. They didn't handle tens of thousands of concurrent connections and scale to both slow and fast clients - their use of poll() and select() just wouldn't work out well.

Something I did at Netflix was to start building TCP testing tools that handle more than 65,000 concurrent sockets. My aim is much higher, but one has to start somewhere. I was testing out the network stack rather than specifically doing HTTP testing. Here at my current job, I'm much more interested in real HTTP and all processing.

I looked at what's out there, and it's not very pretty. I need to be able to do 10G of traffic, looking upwards towards 20G and 40G of HTTP in the future. After a little more digging into what was out there - and finding httperf actually reverted my changes to use libevent and went back to poll/select! - I decided it was about time I just started writing something minimal to stress test things and build upon it as the need arose. I want something that eventually ends up like web polygraph - multiple client/server sets with different URL choices from a pool, a variety of client IP addresses, and other things like how often to make the requests and other request pacing.

So, I grabbed libevent, libevhtp from Mark Ellzey and threw them together. It turned out okish - libevent/libevhtp still does a bunch of memcpy()'ing inside the buffer management routines that makes 40G on one box infeasible at the moment, but it's good enough to get a few gigabit of client traffic on one core. There were some hiccups which I'll cover below, but it's good enough to build upon.

What did I learn?
  • Well, it turns out the client code in libevhtp was a bit immature. Mark and I talked a bit about it on IRC and then I found there was an outstanding pull request that found and fixed a bunch of these. So, my code has turned into another thing - a libevhtp client and server test suite.
  • The libevhtp threading model is fine for a couple of CPUs, but it's the standardish *NIX model of "one thread does accept, farms work off to other threads." So it's not going to scale well at high request rates to multiple CPUs. That's cool; that's what the FreeBSD-HEAD RSS work is for.
  • There's memcpy()'ing in the libevhtp body handling code. It's not a big deal at 1G, but at 10G it's definitely noticeable. I've spoken to Mark about it.
But, it's a good starting point. Once the rest of the bugs get shaken out, it'll be a good high throughput HTTP traffic tester.

What would I do next, after the bugs?

  • the server will eventually grow the ability to generate controllable sized responses. That way the client can control how big a response to send and thus can create a mix of requests/replies.
  • .. and HTTP request body testing would be nice.
  • The client side needs to grow the ability to create client pools, like web polygraph, where certain subsets of clients get certain behaviours (like a pool of IPs to use, separate pool of URLs to fetch from, the time between each HTTP request, etc.)
The other trick is how to simulate lots (and I do mean lots) of IP addresses. I don't want to create separate loopback connections for each - that would be crazy. Instead, it'd be good to use the transparent interception support in FreeBSD IPFW that allows both connections from and connections to arbitrary IP addresses. A little trickery with IP routing so we don't need more than 1 ARP entry for each server and voila!

Oh, and the code?

… mmm emulators.

I occasionally get asked to test out FreeBSD/MIPS patches for people, as they don't have physical hardware present. I can understand that - the hardware is cheap and plentiful, but not everyone wants to have a spare access point around just to test out MIPS changes on.

However QEMU does a pretty good job of emulating MIPS if you're just testing out non-hardware patches. There's even instructions on the FreeBSD wiki for how to do this! So I decided to teach my wifi build system about the various QEMU MIPS emulator targets so it can spit out a kernel and mfsroot to use for QEMU.


It turns out that it wasn't all that hard. The main trick was to use qemu-devel, not qemu. There are bugs in the non-development QEMU branch that mean it works great for Linux but not FreeBSD.

The kernel configurations in FreeBSD had bitrotted a little bit (they were missing the random device, for example) but besides that the build, install and QEMU startup just worked. I now have FreeBSD/MIPS of each variety (32 bit, 64 bit, Little-Endian, Big-Endian) running under QEMU and building FreeBSD-HEAD as a basic test.

Next is figuring out how to build gdb to target each of the above and have it speak to the QEMU GDB stub. That should make it very easy to do MIPS platform debugging.

I also hear rumours about this stuff working somewhat for ARM and PPC, so I'll see how hard it is to run QEMU for those platforms and whether FreeBSD will just boot and run on each.

More RSS UDP tests – this time on a Dell R720

I've recently had the chance to run my RSS UDP test suite up on a pair of Dell R720s. They came with on-board 10G Intel NICs (ixgbe(4) in FreeBSD) so I figured I'd run my test suite up on it.

Thank you to the Enterprise Storage Division at Dell for providing hardware for me to develop on!

The config is like in the previous blog post, but now I have two 8-core Sandy Bridge Xeon CPUs to play with. To simplify things (and to not have to try and solve NUMA related issues) I'm running this on the first socket. The Intel NIC is attached to the first CPU socket.


  • CPU: Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz (2000.04-MHz K8-class CPU) x 2
  • RAM: 64GiB
  • HTT disabled

# ... until ncpus is tunable, make it use 8 buckets.

This time I want to test with 8 streams, so after some trial and error I found the right IPv4 addresses to use:

  • Server:
  • Client:,,,,,,,
The test was like before - the server ran one rss-udp-srv program that spawns one thread per RSS bucket. The client side runs rss-clt programs to generate traffic - but now there's eight of them instead of four.

The results are what I expected: the contention is in the same place (UDP receive) and it's per-core - it doesn't contend between CPU cores.

Each CPU is transmitting and receiving 215,000 510-byte UDP frames a second. It scales linearly - 1 CPU is 215,000 TX/RX frames a second. 8 CPUs is 215,000 TX/RX frames a second * 8. There's no degrading as the CPU core count increases.

That's 1.72 million packets per second. With 510-byte frames it's about 7 gigabits/sec in and out.
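
The arithmetic behind those figures, as a quick sketch (the function name is made up; the inputs are the numbers quoted above):

```python
def udp_rate(frames_per_cpu, cpus, frame_bytes):
    """Return (total packets/sec, gigabits/sec) for a linearly scaling workload."""
    pps = frames_per_cpu * cpus
    gbps = pps * frame_bytes * 8 / 1e9
    return pps, gbps

pps, gbps = udp_rate(215_000, 8, 510)
print(f"{pps / 1e6:.2f} Mpps, {gbps:.2f} Gbit/s")
```

Eight CPUs at 215,000 frames each gives 1.72 million packets per second, and at 510 bytes per frame that's just over 7 gigabits/sec in each direction.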

The other 8 cores are idle. Ideally we'd be able to run an application in those cores - so hopefully I can get my network / rss library up and running enough to prototype an RSS-aware memcached and see if it'll handle this particular workload.

It's a far cry from what I think we can likely achieve - but please keep in mind that I know I could do more awesome looking results with netmap, PF_RING or Intel's DPDK software. What I'm trying to do is push the existing kernel networking subsystem to its limits so the issues can be exposed and fixed.

So, where's the CPU going?

In the UDP server program (pid 1620), it looks thus:

# pmcstat -P CPU_CLK_UNHALTED_CORE -T -w 1 -p 1620
PMC: [CPU_CLK_UNHALTED_CORE] Samples: 34298 (100.0%) , 155 unresolved

  8.0 kernel     fget_unlocked        kern_sendit:4.2 kern_recvit:3.9
  7.0 kernel     copyout              soreceive_dgram:5.6 amd64_syscall:0.9
  3.6 kernel     __mtx_unlock_flags   ixgbe_mq_start
  3.5 kernel     copyin               m_uiotombuf:1.8 amd64_syscall:1.2
  3.4 kernel     memcpy               ip_output:2.9 ether_output:0.6
  3.4 kernel     toeplitz_hash        rss_hash_ip4_2tuple
  3.3 kernel     bcopy                rss_hash_ip4_2tuple:1.4 rss_proto_software_hash_v4:0.9
  3.0 kernel     _mtx_lock_spin_cooki pmclog_reserve
  2.7 kernel     udp_send             sosend_dgram
  2.5 kernel     ip_output            udp_send

In the NIC receive / transmit thread(s) (pid 12), it looks thus:

# pmcstat -P CPU_CLK_UNHALTED_CORE -T -w 1 -p 12

PMC: [CPU_CLK_UNHALTED_CORE] Samples: 79319 (100.0%) , 0 unresolved

 10.3 kernel     ixgbe_rxeof          ixgbe_msix_que
  9.3 kernel     __mtx_unlock_flags   ixgbe_rxeof:4.8 netisr_dispatch_src:2.1 in_pcblookup_mbuf:1.3
  8.3 kernel     __mtx_lock_flags     ixgbe_rxeof:2.8 netisr_dispatch_src:2.4 udp_append:1.2 in_pcblookup_mbuf:1.1 knote:0.6
  3.8 kernel     bcmp                 netisr_dispatch_src
  3.6 kernel     uma_zalloc_arg       sbappendaddr_locked_internal:2.0 m_getjcl:1.6
  3.4 kernel     ip_input             netisr_dispatch_src
  3.4 kernel     lock_profile_release __mtx_unlock_flags
  3.4 kernel     in_pcblookup_mbuf    udp_input
  3.0 kernel     ether_nh_input       netisr_dispatch_src
  2.4 kernel     udp_input            ip_input
  2.4 kernel     mb_free_ext          m_freem
  2.2 kernel     lock_profile_obtain_ __mtx_lock_flags
  2.1 kernel     ixgbe_refresh_mbufs  ixgbe_rxeof

It looks like there's some obvious optimisations to poke at (what the heck is fget_unlocked() doing up there?) and yes, copyout/copyin are really terrible but currently unavoidable. The toeplitz hash and bcopy aren't very nice but they're occurring in the transmit path because at the moment there's no nice way to efficiently set both the outbound RSS hash and RSS bucket ID and send to a non-connected socket destination (ie, specify the destination IP:port as part of the send.) There's also some lock contention that needs to be addressed.
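
For anyone wondering what the toeplitz hash is actually doing per packet, here's a minimal bit-at-a-time sketch of the standard RSS Toeplitz hash over an IPv4 source/destination pair. It is not the FreeBSD kernel code; the key is the widely used 40-byte example key from Microsoft's RSS verification documentation, and hashing source then destination is my assumption about what a 2-tuple hash covers.

```python
import socket

# Widely used 40-byte example RSS key (not necessarily what any given NIC uses).
RSS_KEY = bytes.fromhex(
    "6d5a56da255b0ec24167253d43a38fb0"
    "d0ca2bcbae7b30b477cb2da38030f20c"
    "6a42b73bbeac01fa"
)

def toeplitz_hash(key, data):
    """XOR together the 32-bit key windows selected by each set input bit."""
    if len(data) > len(key) - 4:
        raise ValueError("input too long for this key")
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    result = 0
    for i in range(len(data) * 8):
        if data[i // 8] & (0x80 >> (i % 8)):
            # 32-bit window of the key starting at bit i, MSB first.
            result ^= (key_int >> (key_bits - 32 - i)) & 0xFFFFFFFF
    return result

def hash_ip4_2tuple(src, dst, key=RSS_KEY):
    """Hash source then destination IPv4 address, in network byte order."""
    data = socket.inet_aton(src) + socket.inet_aton(dst)
    return toeplitz_hash(key, data)

print(hex(hash_ip4_2tuple("10.0.0.1", "10.0.0.2")))
```

A real implementation would process the input a byte or more at a time, but even then it's a fair amount of work to do in software on every transmitted packet, which is why it shows up in the profile.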

The output of the netisr queue statistics looks good:

root@abaddon:/home/adrian/git/github/erikarn/freebsd-rss # netstat -Q
Setting                        Current        Limit
Thread count                         8            8
Default queue limit                256        10240
Dispatch policy                 direct          n/a
Threads bound to CPUs          enabled          n/a

Name   Proto QLimit Policy Dispatch Flags
ip         1    256    cpu   hybrid   C--
igmp       2    256 source  default   ---
rtsock     3    256 source  default   ---
arp        4    256 source  default   ---
ether      5    256    cpu   direct   C--
ip6        6    256   flow  default   ---
ip_direct     9    256    cpu   hybrid   C--

WSID CPU   Name     Len WMark   Disp'd  HDisp'd   QDrops   Queued  Handled
   0   0   ip         0    25        0 839349259        0       49 839349308
   0   0   igmp       0     0        0        0        0        0        0
   0   0   rtsock     0     2        0        0        0       92       92
   0   0   arp        0     0      118        0        0        0      118
   0   0   ether      0     0 839349600        0        0        0 839349600
   0   0   ip6        0     0        0        0        0        0        0
   0   0   ip_direct     0     0        0        0        0        0        0
   1   1   ip         0    20        0 829928186        0      286 829928472
   1   1   igmp       0     0        0        0        0        0        0
   1   1   rtsock     0     0        0        0        0        0        0
   1   1   arp        0     0        0        0        0        0        0
   1   1   ether      0     0 829928672        0        0        0 829928672
   1   1   ip6        0     0        0        0        0        0        0
   1   1   ip_direct     0     0        0        0        0        0        0
   2   2   ip         0     0        0 835558437        0        0 835558437
   2   2   igmp       0     0        0        0        0        0        0
   2   2   rtsock     0     0        0        0        0        0        0
   2   2   arp        0     0        0        0        0        0        0
   2   2   ether      0     0 835558610        0        0        0 835558610
   2   2   ip6        0     0        0        0        0        0        0
   2   2   ip_direct     0     0        0        0        0        0        0
   3   3   ip         0     1        0 850271162        0       23 850271185
   3   3   igmp       0     0        0        0        0        0        0
   3   3   rtsock     0     0        0        0        0        0        0
   3   3   arp        0     0        0        0        0        0        0
   3   3   ether      0     0 850271163        0        0        0 850271163
   3   3   ip6        0     0        0        0        0        0        0
   3   3   ip_direct     0     0        0        0        0        0        0
   4   4   ip         0    23        0 817439448        0      345 817439793
   4   4   igmp       0     0        0        0        0        0        0
   4   4   rtsock     0     0        0        0        0        0        0
   4   4   arp        0     0        0        0        0        0        0
   4   4   ether      0     0 817439625        0        0        0 817439625
   4   4   ip6        0     0        0        0        0        0        0
   4   4   ip_direct     0     0        0        0        0        0        0
   5   5   ip         0    19        0 817862508        0      332 817862840
   5   5   igmp       0     0        0        0        0        0        0
   5   5   rtsock     0     0        0        0        0        0        0
   5   5   arp        0     0        0        0        0        0        0
   5   5   ether      0     0 817862675        0        0        0 817862675
   5   5   ip6        0     0        0        0        0        0        0
   5   5   ip_direct     0     0        0        0        0        0        0
   6   6   ip         0    19        0 817281399        0      457 817281856
   6   6   igmp       0     0        0        0        0        0        0
   6   6   rtsock     0     0        0        0        0        0        0
   6   6   arp        0     0        0        0        0        0        0
   6   6   ether      0     0 817281665        0        0        0 817281665
   6   6   ip6        0     0        0        0        0        0        0
   6   6   ip_direct     0     0        0        0        0        0        0
   7   7   ip         0     0        0 813562616        0        0 813562616
   7   7   igmp       0     0        0        0        0        0        0
   7   7   rtsock     0     0        0        0        0        0        0
   7   7   arp        0     0        0        0        0        0        0
   7   7   ether      0     0 813562620        0        0        0 813562620
   7   7   ip6        0     0        0        0        0        0        0
   7   7   ip_direct     0     0        0        0        0        0        0
root@abaddon:/home/adrian/git/github/erikarn/freebsd-rss # 

It looks like everything is being dispatched correctly; nothing is being queued and/or dropped.

But yes, we're running out of socket buffers because each core is 100% pinned:

root@abaddon:/home/adrian/git/github/erikarn/freebsd-rss # netstat -sp udp
        6773040390 datagrams received
        0 with incomplete header
        0 with bad data length field
        0 with bad checksum
        0 with no checksum
        17450880 dropped due to no socket
        136 broadcast/multicast datagrams undelivered
        1634117674 dropped due to full socket buffers
        0 not for hashed pcb
        5121471700 delivered
        5121471044 datagrams output
        0 times multicast source filter matched
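
As a quick sanity check on those counters, here's a small sketch of the arithmetic. The numbers are copied from the output above; the variable names are only illustrative.

```python
# Counters copied from the `netstat -sp udp` output above.
received = 6773040390
dropped_no_socket = 17450880
undelivered_bcast_mcast = 136
dropped_full_sockbuf = 1634117674
delivered = 5121471700

# Every received datagram is either delivered or counted by one of the
# drop / undelivered counters.
accounted = (delivered + dropped_no_socket
             + undelivered_bcast_mcast + dropped_full_sockbuf)
print("all datagrams accounted for:", accounted == received)

# Fraction of received datagrams lost because the socket buffer was full.
full_sockbuf_ratio = dropped_full_sockbuf / received
print(f"dropped due to full socket buffers: {full_sockbuf_ratio:.1%}")
```

So roughly a quarter of everything that arrived was thrown away at the socket buffer.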

There's definitely room for improvement.

UDP RSS update: ixgbe(4) turned out to have issues..

I started digging deeper into the RSS performance on my home test platform. Four cores and one (desktop) socket isn't all that much, but it's a good starting point for this.

It turns out that there was some lock contention inside netisr. Which made no sense, as RSS should be keeping all the flows local to each CPU.

After a bunch of digging, I discovered that the NIC was occasionally receiving packets into the wrong ring. Have a look at this:

Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100034:
m=0xfffff80047713d00; flowid=0x21f7db62; rxr->me=3
Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100034:
m=0xfffff8004742e100; flowid=0x21f7db62; rxr->me=3
Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100034:
m=0xfffff800474c2e00; flowid=0x21f7db62; rxr->me=3
Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100034:
m=0xfffff800474c5000; flowid=0x21f7db62; rxr->me=3
Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100034:
m=0xfffff8004742ec00; flowid=0x21f7db62; rxr->me=3
Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100032:
m=0xfffff8004727a700; flowid=0x335a5c03; rxr->me=2
Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100032:
m=0xfffff80006f11600; flowid=0x335a5c03; rxr->me=2
Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100032:
m=0xfffff80047279b00; flowid=0x335a5c03; rxr->me=2
Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100032:
m=0xfffff80006f0b700; flowid=0x335a5c03; rxr->me=2

The RX flowid was correct - I hashed the packets in software too and verified the software hash equaled the hardware hash. But they were turning up on the wrong receive queue. "rxr->me" is the queue id; the hardware should be selecting the queue from the low 7 bits of the hash, so a flowid ending in 0x3 should land on ring 3 and one ending in 0x2 on ring 2. In the log above they're the other way around.
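
Here's a minimal sketch of that check, using the two flowids from the log. The four-ring count and the round-robin fill of the 128-entry redirection table are assumptions for illustration, not something read back from the NIC.

```python
NUM_RINGS = 4      # assumption: one receive ring per core on the 4-core box
TABLE_SIZE = 128   # 7 bits of hash index the redirection table

# Assumption: the redirection table is filled round-robin across the rings.
redirection_table = [bucket % NUM_RINGS for bucket in range(TABLE_SIZE)]

def expected_ring(flowid: int) -> int:
    """Return the ring that the low 7 bits of the RSS hash should select."""
    return redirection_table[flowid & (TABLE_SIZE - 1)]

# (flowid, ring the packet actually arrived on), taken from the log above.
observed = [(0x21f7db62, 3), (0x335a5c03, 2)]
for flowid, ring in observed:
    print(f"flowid={flowid:#010x}: expected ring {expected_ring(flowid)}, "
          f"arrived on ring {ring}")
```

Under those assumptions both flows arrive on a ring other than the one their hash selects.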

It also only happened when I was sending traffic to more than one receive ring. Everything was okay if I just transmitted to a single receive ring.

Luckily for me, some developers from Verisign saw some odd behaviour in their TCP stress testing and had dug in a bit further. They were seeing corrupted frames on the receive side that looked a lot like internal NIC configuration state. They figured out that the ixgbe(4) driver wasn't initialising the flow director and receive units correctly - the FreeBSD driver was not correctly setting up the amount of memory each was allocated on the NIC and they were overlapping. They also found a handful of incorrectly handled errors and double-freed mbufs.

So, with that all fixed, their TCP problem went away and my UDP tests started properly behaving themselves. Now all the flows are ending up on the right CPUs.

The flow director code was also dynamically programming flows into the NIC to try and rebalance traffic. Trouble is, I think it's a bit buggy and it's likely not working well with large receive offload (LRO).

What's it mean for normal people? Well, it's fixed in FreeBSD-HEAD now. I'm hoping I or someone else will backport it to FreeBSD-10 soon. It fixes my UDP tests - now I hit around 1.3 million packets per second transmit and receive on my test rig; the server now has around 10-15% CPU free. It also fixed issues that Verisign were seeing with their high transaction rate TCP tests. I'm hoping that it fixes the odd corner cases that people have seen with Intel 10 gigabit hardware on FreeBSD and makes LRO generally more useful and stable.

Next up - some code refactoring, then finishing off IPv6 RSS!


Receive side scaling: testing UDP throughput

I think it's about time I shared some more details about the RSS stuff going into FreeBSD and how I'm testing it.

For now I'm focusing on IPv4 + UDP on the Intel 10GE NICs. The TCP side of things is done (and the IPv6 side of things works too!) but enough of the performance walls show up in the IPv4 UDP case that it's worth sticking to it for now.

I'm testing on a pair of 4-core boxes at home. They're not special - and they're very specifically not trying to be server-class hardware. I'd like to see where these bottlenecks are even at low core count.

The test setup in question:

Testing software:

  • It requires libevent2 - an updated copy; previous versions of libevent2 didn't handle FreeBSD specific errors gracefully and would early error out of the IO loop.


  • CPU: Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz (3292.59-MHz K8-class CPU)
  • There's no SMT/HTT, but I've disabled it in the BIOS just to be sure
  • 4GB RAM
  • FreeBSD-HEAD, amd64
  • NIC: 82599EB 10-Gigabit SFI/SFP+ Network Connection
  • ix0:

# for now redirect processing just makes the lock overhead suck even more.
# disable it.



# experiment with deferred dispatch for RSS

kernel config:

include GENERIC

device netmap
options RSS
options PCBGROUP

# in-system lock profiling

# Flowtable - the rtentry locking is a bit .. slow.
options   FLOWTABLE

# This debugging code has too much overhead to do accurate
# testing with.
nooptions         INVARIANTS
nooptions         INVARIANT_SUPPORT
nooptions         WITNESS
nooptions         WITNESS_SKIPSPIN

The server runs the "rss-udp-srv" process, which behaves like a multi-threaded UDP echo server on port 8080.


The client box is slightly more powerful to compensate for (currently) not using completely affinity-aware RSS UDP transmit code.

  • CPU: Intel(R) Core(TM) i5-4460  CPU @ 3.20GHz (3192.68-MHz K8-class CPU)
  • SMT/HTT: Disabled in BIOS
  • 8GB RAM
  • FreeBSD-HEAD amd64
  • Same kernel config, loader and sysctl config as the server
  • ix0: configured as,,,,
The client runs 'udp-clt' programs to source and sink traffic to the server.

Running things

The server-side simply runs the listen server, configured to respond to each frame:

$ rss-udp-srv 1

The client-side runs four copies of udp-clt, each from a different IP address. These are run in parallel (I do it in different screens, so I can quickly see what's going on):

$ ./udp-clt -l -r -p 8080 -n 10000000000 -s 510
$ ./udp-clt -l -r -p 8080 -n 10000000000 -s 510
$ ./udp-clt -l -r -p 8080 -n 10000000000 -s 510
$ ./udp-clt -l -r -p 8080 -n 10000000000 -s 510

The IP addresses are chosen so that the 2-tuple Toeplitz hash, using the default Microsoft key, maps them to different RSS buckets that live on individual CPUs.
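
To pick addresses like that, you can compute the hash in software first. What follows is a sketch of the Toeplitz hash over the IPv4 source and destination addresses, using the 40-byte verification key Microsoft publishes. The helper names, the 7-bit bucket mask and the example addresses are assumptions for illustration; the real bucket count comes from the RSS configuration.

```python
import socket

# The default RSS verification key published by Microsoft (40 bytes).
MS_RSS_KEY = bytes([
    0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2,
    0x41, 0x67, 0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0,
    0xd0, 0xca, 0x2b, 0xcb, 0xae, 0x7b, 0x30, 0xb4,
    0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30, 0xf2, 0x0c,
    0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa,
])

def toeplitz_hash(data: bytes, key: bytes = MS_RSS_KEY) -> int:
    """For every set input bit, XOR in the 32-bit key window starting there."""
    key_bits = len(key) * 8
    if len(data) * 8 > key_bits - 32:
        raise ValueError("input too long for this key")
    key_int = int.from_bytes(key, "big")
    result = 0
    for i in range(len(data) * 8):
        byte, bit = divmod(i, 8)
        if data[byte] & (0x80 >> bit):
            result ^= (key_int >> (key_bits - 32 - i)) & 0xFFFFFFFF
    return result

def ipv4_2tuple_hash(src: str, dst: str) -> int:
    """Hash the source then destination IPv4 address (the 2-tuple case)."""
    return toeplitz_hash(socket.inet_aton(src) + socket.inet_aton(dst))

def rss_bucket(hash_value: int, bucket_bits: int = 7) -> int:
    """Assumption: the bucket is the low `bucket_bits` bits of the hash."""
    return hash_value & ((1 << bucket_bits) - 1)

# Example addresses only; these are not the ones used in the test setup.
for src in ("10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"):
    h = ipv4_2tuple_hash(src, "10.0.0.100")
    print(src, hex(h), "bucket", rss_bucket(h))
```

Try candidate source addresses until each one lands in a bucket that maps to a different CPU.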

Results: Round one

When the server is responding to each frame, the following occurs. The numbers are "number of frames generated by the client (netstat)", "number of frames received by the server (netstat)", "number of frames seen by rss-udp-srv", "number of responses transmitted from rss-udp-srv", "number of frames seen by the server (netstat)"
  • 1 udp-clt process: 710,000; 710,000; 296,000; 283,000; 281,000
  • 2 udp-clt processes: 1,300,000; 1,300,000; 592,000; 592,000; 575,000
  • 3 udp-clt processes: 1,800,000; 1,800,000; 636,000; 636,000; 600,000
  • 4 udp-clt processes: 2,100,000; 2,100,000; 255,000; 255,000; 255,000
So, it's not actually linear past two cores. The question here is: why?

There are a couple of parts to this.

Firstly - I had left turbo boost on. What this translated to:

  • One core active: ~ 30% increase in clock speed
  • Two cores active: ~ 30% increase in clock speed
  • Three cores active: ~ 25% increase in clock speed
  • Four cores active: ~ 15% increase in clock speed.
Secondly and more importantly - I had left flow control enabled. This made a world of difference.

The revised results are mostly linear - with more active RSS buckets (and thus CPUs) things seem to get slightly more efficient:
  • 1 udp-clt process: 710,000; 710,000; 266,000; 266,000; 266,000
  • 2 udp-clt processes: 1,300,000; 1,300,000; 512,000; 512,000; 512,000
  • 3 udp-clt processes: 1,800,000; 1,800,000; 810,000; 810,000; 810,000
  • 4 udp-clt processes: 2,100,000; 2,100,000; 1,120,000; 1,120,000; 1,120,000
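
To put a number on "mostly linear", here's a small sketch that works out the per-core rate and the speedup over a single process. It uses the third column of the revised results (frames seen by rss-udp-srv) and assumes these are per-second rates.

```python
# (udp-clt processes, frames seen by rss-udp-srv) from the revised results.
results = [(1, 266_000), (2, 512_000), (3, 810_000), (4, 1_120_000)]

baseline = results[0][1]
per_core_rates = []
for procs, frames in results:
    per_core = frames // procs
    per_core_rates.append(per_core)
    speedup = frames / baseline
    print(f"{procs} process(es): {per_core:,} per core, "
          f"{speedup:.2f}x the single-process rate")
```

The per-core figure dips slightly at two processes and then climbs, which is the "slightly more efficient" effect.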

Finally, let's repeat the process but only receiving instead of also echoing back the packet to the client:

$ rss-udp-srv 0
  • 1 udp-clt process: 710,000; 710,000; 204,000
  • 2 udp-clt processes: 1,300,000; 1,300,000; 378,000
  • 3 udp-clt processes: 1,800,000; 1,800,000; 645,000
  • 4 udp-clt processes: 2,100,000; 2,100,000; 900,000
The receive-only workload is actually worse off versus the transmit + receive workload!

What's going on here?

Well, a little digging shows that in both instances - even with a single udp-clt thread running which means only one CPU on the server side is actually active! - there's active lock contention.

Here's an example dtrace output for measuring lock contention with only one active process, where one CPU is involved (and the other three are idle):

Receive only, 5 seconds:

root@adrian-hackbox:/home/adrian/git/github/erikarn/freebsd-rss # dtrace -n 'lockstat:::adaptive-block { @[stack()] = sum(arg1); }'
dtrace: description 'lockstat:::adaptive-block ' matched 1 probe


Transmit + receive, 5 seconds:

dtrace: description 'lockstat:::adaptive-block ' matched 1 probe




Somehow it seems there's less lock contention / blocking going on when both transmit and receive are running!

So then I dug into it using the lock profiling suite. This is for 5 seconds with receive-only traffic on a single RSS bucket / CPU (all other CPUs are idle):

# sysctl = 1; sleep 5 ; sysctl

root@adrian-hackbox:/home/adrian/git/github/erikarn/freebsd-rss # sysctl ; sleep 5 ; sysctl 1 -> 1 1 -> 0

root@adrian-hackbox:/home/adrian/git/github/erikarn/freebsd-rss # sysctl | head -2 ; sysctl | sort -nk4 | tail -10 
     max  wait_max       total  wait_total       count    avg wait_avg cnt_hold cnt_lock name
    1496         0       10900           0          28    389      0  0      0 /usr/home/adrian/work/freebsd/head/src/sys/dev/usb/usb_device.c:2755 (sx:USB config SX lock) 
       0         0          31           1          67      0      0  0      4 /usr/home/adrian/work/freebsd/head/src/sys/kern/sched_ule.c:888 (spin mutex:sched lock 2)
       0         0        2715           1       49740      0      0  0      7 /usr/home/adrian/work/freebsd/head/src/sys/dev/random/random_harvestq.c:294 (spin mutex:entropy harvest mutex)
       1         0          51           1         131      0      0  0      2 /usr/home/adrian/work/freebsd/head/src/sys/kern/sched_ule.c:1179 (spin mutex:sched lock 1)
       0         0          69           2         170      0      0  0      8 /usr/home/adrian/work/freebsd/head/src/sys/kern/sched_ule.c:886 (spin mutex:sched lock 2)
       0         0       40389           2      287649      0      0  0      8 /usr/home/adrian/work/freebsd/head/src/sys/kern/kern_intr.c:1359 (spin mutex:sched lock 2)
       0         2           2           4          12      0      0  0      2 /usr/home/adrian/work/freebsd/head/src/sys/dev/usb/usb_device.c:2762 (sleep mutex:Giant)
      15        20        6556         520        2254      2      0  0    105 /usr/home/adrian/work/freebsd/head/src/sys/dev/acpica/Osd/OsdSynch.c:535 (spin mutex:ACPI lock (0xfffff80002b10f00))
       4         5      195967       65888     3445501      0      0  0  28975 /usr/home/adrian/work/freebsd/head/src/sys/netinet/udp_usrreq.c:369 (sleep mutex:so_rcv)

Notice the lock contention for the so_rcv (socket receive buffer) handling? What's going on here is pretty amusing - it turns out that because there's so much receive traffic going on, the userland process receiving the data is being preempted by the NIC receive thread very often - and when this happens, there's a good chance it's going to be within the small window that the receive socket buffer lock is held. Once this happens, the NIC receive thread processes frames until it gets to one that requires it to grab the same sock buffer lock that is already held by userland - and it fails - so the NIC thread sleeps until the userland thread finishes consuming a packet. Then the CPU flips back to the NIC thread and continues processing a packet.

When the userland code is also transmitting frames it's increasing the amount of time in between socket receives and decreasing the probability of hitting the lock contention condition above.

Note there's no contention between CPUs here - this is entirely contention within a single CPU.

So for now I'm happy that the UDP IPv4 path is scaling well enough with RSS on a single core. The main performance problem here is the socket receive buffer locking (and, yes, copyin() / copyout().)


Receive Side Scaling: figuring out how to handle IP fragments

The TL;DR of this is - IP fragments are annoying.

If everything was awesome and there were never IP fragments, all TCP and UDP frames would always have the TCP/UDP header stamped on them, and the NIC could hash the TCP/UDP header in hardware to calculate the destination queue to receive traffic on.

However, everything isn't awesome and there will be cases where IP frames are fragmented. When this happens, the first fragment of the datagram has the IPv4 header and the TCP/UDP header - but the subsequent fragments only have the IPv4 header. That means there's not enough information in the rest of the fragments to hash them to the same hash value, and thus the same hardware queue, as the first fragment - only the first has the full IPv4+TCP/UDP information.

The Intel and Chelsio NICs will hash on all packets that are fragmented by only hashing on the IPv4 details. So, if it's a fragmented TCP or UDP frame, it will hash the first fragment the same as the others - it'll ignore the TCP/UDP details and only hash on the IPv4 frame. This means that all the fragments in a given IP datagram will hash to the same value and thus the same queue.

But if there is a mix of fragmented and non-fragmented packets in a given flow - for example, small versus larger UDP frames - then some may be hashed via the IPv4+TCP or IPv4+UDP details and some will just be hashed via the IPv4 details. This means that packets in the same flow will end up being received in different receive queues and are thus highly likely to be processed out of order.
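
Here's a small sketch of that hash-input selection. The Packet type and the hash_l4 switch are hypothetical names; the ports are carried on the fragment purely so both packets can be shown as belonging to one flow (real non-first fragments don't carry them).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Packet:
    src: str
    dst: str
    sport: int
    dport: int
    is_fragment: bool  # True for any fragment of a fragmented datagram

def hash_fields(pkt: Packet, hash_l4: bool) -> tuple:
    """Fields fed into the RSS hash, following the behaviour described above."""
    if pkt.is_fragment or not hash_l4:
        return (pkt.src, pkt.dst)                    # IPv4 details only
    return (pkt.src, pkt.dst, pkt.sport, pkt.dport)  # IPv4 + TCP/UDP details

small = Packet("10.0.0.1", "10.0.0.2", 5000, 8080, is_fragment=False)
large_frag = Packet("10.0.0.1", "10.0.0.2", 5000, 8080, is_fragment=True)

# With IPv4+UDP hashing on, one flow produces two different hash inputs.
same_with_l4 = hash_fields(small, True) == hash_fields(large_frag, True)
# With IPv4-only hashing, every packet in the flow uses the same input.
same_without_l4 = hash_fields(small, False) == hash_fields(large_frag, False)
print(same_with_l4, same_without_l4)
```

Different hash inputs mean (almost always) different hash values, and so potentially different queues for the same flow.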

The Linux Intel driver code flipped off IPv4+UDP hashing a while ago - they hash UDP frames by their IPv4 details only and then do whatever other load balancing in the kernel they choose. I found this and updated the FreeBSD drivers to do the same. This should result in fewer out of order UDP frames for UDP heavy workloads. I'm not sure about the Chelsio driver yet - when I convert it to the RSS framework it'll disable IPv4+UDP hashing if that isn't enabled at boot time. This is a good stop-gap, but it's not the whole story.

TCP is where it gets annoying. People don't want to flip off IPv4+TCP hashing as they're convinced that the TCP MSS negotiation and path-MTU discovery stuff will prevent there from being any IP fragmented TCP frames. But, well, that's not really viable in the real world. There are too many misconfigured networks out there and IP fragmentation does occur. So this is also a problem for TCP. This means that the IPv4 fragmented TCP frames in those sessions will come into another receive queue and CPU and this will show up as out of order data.

So, what's this all have to do with receive side scaling?

With RSS, there's a well defined hash for packets and a configuration for what the operating system and NICs are supposed to be doing. It's entirely possible that we'll configure IPv4+TCP to be hashed and also entirely possible we'll see IP fragments showing up on other CPUs. So in order to have the TCP stack run on the right CPU, the IP fragments need to be assembled on whichever CPU they're received upon and then re-injected into the correct destination queue to run on the correct CPU.

Fortunately the FreeBSD netisr scheme makes this easy.

So what I'm doing in my branch (and what will soon show up in -HEAD) is thus:

  • UDP is still hashed as IPv4-only frames for now. I'll change that later to hash on IPv4+UDP and have things reinjected on the correct destination RSS bucket / netisr queue / CPU.
  • I create one netisr thread, pinned to a CPU, for each RSS CPU that's defined.
    • Ideally I'd create one netisr thread for each RSS bucket and pin that, but that'll come later.
  • IP fragments will be hashed to whatever the IPv4 hash calculates, so fragment reassembly will occur on some CPU;
    • .. and it's the same CPU for all frames in a fragmented datagram.
  • Then when the fragment is reassembled, a software hash is calculated for the newly reassembled frame.
    • If RSS is configured to hash for IPv4 only, then it'll see that the hash on the reassembled datagram matches the configured hash for that packet type and reuse it.
    • So, if it's UDP right now, it'll see that UDP is only hashing on IPv4 details and reuse it.
    • .. but if IPv4+UDP hashing is configured, it'll software hash the packet and assign the new flow type and RSS hash.
  • Then, it'll reinject the frame into netisr to be requeued and reprocessed.
  • .. this uses the nh_m2cpuid function to calculate the destination CPU for the given RSS hash.
    • If it's handled on the same destination CPU then it'll be handled.
    • If it's handled on a different destination CPU then it'll be queued to that netisr and dispatched appropriately.
This works. It's not great, and I'd rather the IP fragment reassembly code was much more efficient, but it's correct. I'm going for correctness here to begin with.
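
The decision made after reassembly can be sketched roughly like this. It's a toy model, not the kernel code: CRC32 stands in for the Toeplitz hash, m2cpuid() is a simple modulo standing in for the nh_m2cpuid step, and the dictionary fields are hypothetical names.

```python
import zlib

NUM_CPUS = 4  # assumption: one pinned netisr thread per RSS CPU

def software_hash(fields: tuple) -> int:
    # Stand-in only: the real stack uses the Toeplitz hash, not CRC32.
    return zlib.crc32(repr(fields).encode())

def m2cpuid(rss_hash: int) -> int:
    # Stand-in for the nh_m2cpuid step: map an RSS hash to a CPU.
    return rss_hash % NUM_CPUS

def reinject(datagram: dict, current_cpu: int, hash_udp_4tuple: bool) -> str:
    """Decide what happens to a freshly reassembled datagram."""
    if hash_udp_4tuple:
        # Configured hash type differs from the IPv4-only one the fragments
        # arrived with, so rehash in software and update the flow type.
        fields = (datagram["src"], datagram["dst"],
                  datagram["sport"], datagram["dport"])
        datagram["hash"] = software_hash(fields)
        datagram["hash_type"] = "ipv4+udp"
    # Otherwise the IPv4-only hash already matches the config; reuse it.
    dest_cpu = m2cpuid(datagram["hash"])
    if dest_cpu == current_cpu:
        return f"dispatch directly on cpu {dest_cpu}"
    return f"queue to netisr on cpu {dest_cpu}"

datagram = {"src": "10.0.0.1", "dst": "10.0.0.2", "sport": 5000,
            "dport": 8080, "hash": 6, "hash_type": "ipv4"}
print(reinject(dict(datagram), current_cpu=2, hash_udp_4tuple=False))
print(reinject(dict(datagram), current_cpu=1, hash_udp_4tuple=False))
```

The first call stays on the CPU that did the reassembly; the second gets handed to another CPU's queue.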

Now, before you ask - yes, IPv6 has fragments and yes, I have to do the same thing for IPv6 flows. Most of the code is written.

Finally - the same thing applies to things like IPv4 tunnels, IPv6-in-IPv4 tunnels, IPSEC tunnels and the like. The NIC hashes the packets on the IPv4 header details but once the packet is de-encapsulated, it needs to be reinjected back into the correct CPU for further processing.

Application awareness of receive side scaling (RSS) on FreeBSD

Part of testing this receive side scaling work is designing a set of APIs that allow for some kind of affinity awareness. It's not easy - the general case is difficult and highly varying. But something has to be tested! So, where should it begin?

The main tricky part of this is the difference between incoming, outgoing and listening sockets.

For incoming traffic, the NIC has already calculated the RSS hash value and there's already a map between RSS hash and destination CPU. Well, destination queue to be much more precise; then there's a CPU for that queue.

For outgoing traffic, the thread(s) in question can be scheduled on any CPU core and as you have more cores, it's increasingly unlikely to be the right one. In FreeBSD, the default is to direct dispatch transmit related socket and protocol work in the thread that started it, save a handful of places like TCP timers. Once the driver if_transmit() method is called to transmit a frame it can check the mbuf to see what the flowid is and map that to a destination transmit queue. Before RSS, that's typically done to keep packets vaguely in some semblance of in-order behaviour - ie, for a given traffic flow between two endpoints (say, IP, or TCP, or UDP) the packets should be transmitted in-order. It wasn't really done for CPU affinity reasons.

Before RSS, there was no real consistency with how drivers hashed traffic upon receive, nor any rules on how it should select an outbound transmit queue for a given buffer. Most multi-queue drivers got it "mostly right". They definitely didn't try to make any CPU affinity choices - it was all done to preserve the in-order behaviour of traffic flows.

For an incoming socket, all the information about the destination CPU can be calculated from the RSS hash provided during frame reception. So, for TCP, the RSS hash for the received ACK during the three way handshake goes into the inpcb entry. For UDP it's not so simple (and the inpcb doesn't get a hash entry for UDP - I'll explain why below.)

For an outgoing socket, all the information about the eventual destination CPU isn't necessarily available. If the application knows the source/destination IP and source/destination port then it (or the kernel) can calculate the RSS hash that the hardware would calculate upon frame reception and use that to populate the inpcb. However this isn't typically known - frequently the source IP and port won't be explicitly defined and it'll be up to the kernel to choose them for the application. So, during socket creation, the destination CPU can't be known.

So to make it simple (and to make it simple for me to ensure the driver and protocol stack parts are working right) my focus has been on incoming sockets and incoming packets, rather than trying to handle outgoing sockets. I can handle outbound sockets easily enough - I just need to do a software hash calculation once all of the required information is available (ie, the source IP and port is selected) and populate the inpcb with that particular value. But I decided to not have to try and debug that at the same time as I debugged the driver side and the protocol stack side, so it's a "later" task.

For TCP, traffic for a given connection will use the same source/destination IP and source/destination port values. So for a given socket, it'll always hash to the same value. However, for UDP, it's quite possible to get UDP traffic from a variety of different source IP/ports and respond to a variety of different destination IP/ports. This means that the RSS hash value that we can store in the inpcb isn't at all guaranteed to be the same for all subsequent socket writes.

Ok, so given all of that above information, how exactly is this supposed to work?

Well, the slightly more interesting and pressing problem is how to break out incoming requests/packets to multiple receive threads. In traditional UNIX socket setups, there are a couple of common design patterns for farming off incoming requests to multiple worker threads:

  • There's one thread that just does accept() (for TCP) or recv() (for UDP) and it then farms off new connections to userland worker threads; or
  • There are multiple userland worker threads which all wait on a single socket for accept() or recv() - and hope that the OS will only wake up one thread to hand work to.
It turns out that the OS may wake up one thread at a time for accept() or recv() but then userland threads will sit in a loop trying to accept connections / packets - and then you tend to find they get called a lot only to find another worker thread that was running stole the workload. Oops.

I decided this wasn't really acceptable for the RSS work. I needed a way to redirect traffic to a thread that's also pinned to the same CPU as the receive RSS bucket. I decided the cheapest way would be to allow multiple PCB entries for the same socket details (eg, multiple TCP sockets listening on *:80). Since the PCBGROUPS code in this instance has one PCB hash per RSS bucket, all I had to do was to teach the stack that wildcard listen PCB entries (eg, *:80) could also exist in each PCB hash bucket and to use those in preference to the global PCB hash.

The idea behind this decision is pretty simple - Robert Watson already did all this great work in setting up and debugging PCBGROUPS and then made the RSS work leverage that. All I'd have to do is to have one userland thread in each RSS bucket and have the listen socket for that thread be in the RSS bucket. Then any incoming packet would first check the PCBGROUP that matched the RSS bucket indicated by the RSS hash from the hardware - and it'd find the "right" PCB entry in the "right" PCBGROUP PCB hash table for the "right" RSS bucket.
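
As a rough model of that lookup order (plain dictionaries, not the kernel's PCB structures), the sketch below keeps one table per RSS bucket plus a global table, and prefers a wildcard listener in the packet's own bucket. The bucket count, the modulo mapping and all the names are assumptions for illustration.

```python
NUM_BUCKETS = 4  # assumption: a small bucket count for illustration

# One PCB hash table per RSS bucket, plus the global table.
pcbgroups = [dict() for _ in range(NUM_BUCKETS)]
global_pcbs = {}

WILDCARD = "*"

def listen(port: int, owner: str, bucket=None) -> None:
    """Register a wildcard listener, either in one RSS bucket or globally."""
    table = global_pcbs if bucket is None else pcbgroups[bucket]
    table[(WILDCARD, port)] = owner

def lookup(rss_hash: int, port: int) -> str:
    """Prefer the wildcard listener in the packet's bucket, then the global one."""
    bucket = rss_hash % NUM_BUCKETS
    key = (WILDCARD, port)
    if key in pcbgroups[bucket]:
        return pcbgroups[bucket][key]
    return global_pcbs[key]

# One worker per bucket listening on *:80, and one ordinary listener on *:8080.
for b in range(NUM_BUCKETS):
    listen(80, f"worker-{b}", bucket=b)
listen(8080, "legacy-listener")

print(lookup(0x21f7db62, 80))    # served by the worker in the matching bucket
print(lookup(0x21f7db62, 8080))  # falls back to the global table
```

A socket that never asked for a bucket keeps working through the global table, which is why existing applications aren't affected.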

That's what I did for both TCP and UDP.

So the programming model is thus:

  • First, query the RSS sysctl (net.inet.rss) for the RSS configuration - this gives the number of RSS buckets and the RSS bucket -> CPU mapping.
  • Then create one worker thread per RSS bucket..
  • .. and pin each thread to the indicated CPU.
  • Next, each worker thread creates one listen socket..
  • .. sets the IP_BINDANY or IPV6_BINDANY option to indicate that there'll be multiple RSS entries bound to the given listen details (eg, binding to *:80);
  • .. then IP_RSS_LISTEN_BUCKET to set which RSS bucket the incoming socket should live in;
  • Then for UDP - call bind()
  • Or for TCP - call bind(), then call listen()
Each worker thread will then receive TCP connections / UDP frames that are local to that CPU. Writing data out the TCP socket will also stay local to that CPU. Writing UDP frames out doesn't - and I'm about to cover that.

Yes, it's annoying because now you're not just able to choose an IO model that's convenient for your application / coding style. Oops.

Ok, so what's up with UDP?

The problem with UDP is that outbound responses may be to an arbitrary destination setup and thus may actually be considered "local" to another CPU. Most common services don't do this - they'll send the UDP response to the same remote IP and port that it was sent from.

My plan for UDP (and TCP in some instances, see below!) is four-fold:

  • When receiving UDP frames, optionally mark them with RSS hash and flowid information.
  • When transmitting UDP frames, allow userspace to inform the kernel about a pre-calculated RSS hash / flow information.
  • For the fully-connected setup (ie, where a single socket is connect() ed to a given UDP remote IP:port and frame exchange only occurs between the fixed IP and port details) - cache the RSS flow information in the inpcb;
  • .. and for all other situations (if it's not connected, if there's no hint from userland, if it's going to a destination that isn't in the inpcb) - just do a software hash calculation on the outgoing details.
I mostly have the first two UDP options implemented (ie, where userland caches the information to re-use when transmitting the response) and I'll commit them to FreeBSD soon. The other two options are the "correct" way to do the default methods but it'll take some time to get right.
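
One way to picture how those options fit together on the transmit side is below. This is a sketch only: the order of preference, the function names and the CRC32 stand-in hash are all assumptions, not what the kernel code does.

```python
import zlib
from typing import Optional

def software_hash(src, sport, dst, dport) -> int:
    # Stand-in only: a real implementation would use the Toeplitz RSS hash.
    return zlib.crc32(f"{src}:{sport}->{dst}:{dport}".encode())

def select_tx_hash(local, remote, userland_hint: Optional[int] = None,
                   connected_peer=None, cached_hash: Optional[int] = None):
    """Pick the RSS hash for an outgoing UDP frame, most specific source first."""
    if userland_hint is not None:
        # Userland passed back the hash it was given on receive.
        return userland_hint, "hint from userland"
    if connected_peer == remote and cached_hash is not None:
        # Fully-connected socket talking to its fixed peer.
        return cached_hash, "cached in inpcb"
    # Everything else: calculate it from the outgoing details.
    return software_hash(*local, *remote), "software hash"

local = ("10.0.0.2", 8080)
remote = ("10.0.0.1", 5000)
print(select_tx_hash(local, remote, userland_hint=0x21f7db62))
print(select_tx_hash(local, remote, connected_peer=remote, cached_hash=7))
print(select_tx_hash(local, remote)[1])
```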

Ok, so does it work?

I don't have graphs. Mostly because I'm slack. I'll do up some before I present this - likely at BSDCan 2015.

My testing has been done with Intel 1G and 10G NICs on desktop Ivy Bridge 4-core hardware. So yes, server class hardware will behave better.

For incoming TCP workloads (eg a webserver) then yes, there's no lock contention between CPUs in the NIC driver or network stack any longer. The main lock contention between CPUs is the VM and allocator paths. If you're doing disk IO then that'll also show up.

For incoming UDP workloads, I've seen it scale linearly on 10G NICs (ixgbe(4)) from one to four cores. This is with no-defragmentation, 510 byte sized datagrams.

Ie, 1 core reception (ie, all flows to one core) was ~ 250,000 pps into userland with just straight UDP reception and no flow/hash information via recvmsg(); 135,000 pps into userland with UDP reception and flow/hash information via recvmsg().

4 core reception was ~ 1.1 million pps into userland, roughly ~ 255,000 pps per core. There's no contention between CPU cores at all.

Unfortunately what I was sending was markedly different. The driver quite happily received 1.1 million frames on one queue and up to 2.1 million when all four queues were busy. So there's definitely room for improvement.

Now, there is lock contention - it's just not between CPU cores. Now that I'm getting past the between-core contention, we see the within-core contention.

For TCP HTTP request reception and bulk response transmission, most of the contention I'm currently seeing is between the driver transmit paths. So, the following occurs:

  • TCP stack writes some data out;
  • NIC if_transmit() method is called;
  • It tries to grab the queue lock and succeeds;
It then appends the frame to the buf_ring and schedules a transmit out the NIC. This bit is fine.

But then whilst the transmit lock is held, because the driver is taking frames from the buf_ring to push into the NIC TX DMA queue:
  • The NIC queue interrupt fires, scheduling the software interrupt thread;
  • This pre-empts the existing running transmit thread;
  • The NIC code tries to grab the transmit lock to handle completed transmissions;
  • .. and it fails, because the code it preempted holds the transmit lock already.
So there's some context switching and thrashing going on there which needs to be addressed.
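
Here's a toy userland model of that sequence. Python threads are not kernel preemption and none of this is driver code; it only shows the shape of the problem, which is a second context trying to take a lock that the context it interrupted still holds.

```python
import threading

tx_lock = threading.Lock()
events = []

def tx_completion_handler():
    # Models the interrupt thread trying to take the transmit lock.
    if tx_lock.acquire(blocking=False):
        events.append("completion: got lock")
        tx_lock.release()
    else:
        events.append("completion: lock busy, has to wait for transmit path")

# The transmit path holds the lock while draining the buf_ring ...
with tx_lock:
    events.append("transmit: holding lock, draining buf_ring")
    # ... and the completion handler runs before the lock is released.
    handler = threading.Thread(target=tx_completion_handler)
    handler.start()
    handler.join()
events.append("transmit: released lock")

print("\n".join(events))
```

In the kernel the second thread doesn't just give up; it blocks, the CPU switches back to the transmit path, and then switches again once the lock is free. That's the thrashing.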

Ok, what about UDP? It turns out there's some lock contention with the socket receive buffer.

The soreceive_dgram() routine grabs the socket receive buffer lock (SOCKBUF_LOCK()) to see if there's anything to return. If not, and if it can sleep, it'll call sbwait() that will release the lock and msleep() waiting for the protocol stack to indicate that something has been received. However, since we're receiving packets at such a very high rate, it seems that the receive protocol path contends with the socket buffer lock that is held by the userland code trying to receive a datagram. It pre-empts the user thread, tries to grab the lock and fails - and then goes to sleep until the userland code finishes with the lock. soreceive_dgram() doesn't hold the lock for very long - but I do see upwards of a million context switches a second.

To wrap up - I'm pleased with how things are going. I've found and fixed some issues with the igb(4) and ixgbe(4) drivers that were partly my fault and the traffic is now quite happily and correctly being processed in parallel. There are issues with scaling within a core that are now being exposed and I'm glad to say I'm going to ignore them for now and focus on wrapping up what I've started.

There's a bunch more to talk about and I'm going to do it in follow-up posts.
  • what I'm going to do about UDP transmit in more detail;
  • what about creating outbound connections and how applications can be structured to handle this;
  • handling IP fragments and rehashing packets to be mostly in-order - and what happens when we can't guarantee ordering with the hardware hashing UDP frames to a 4-tuple;
  • CPU hash rebalancing - what if a specific bucket gets too much CPU load for some reason;
  • randomly creating a toeplitz RSS hash key at bootup and how that should be verified;
  • multi-socket CPU and IO domain awareness;
  • .. and whatever else I'm going to stumble across whilst I'm slowly fleshing this stuff out.
I hope to get the UDP transmit side of things completed in the next couple of weeks so I can teach memcached about TCP and UDP RSS. After that, who knows!

Hacking on Receive Side Scaling (RSS) on FreeBSD

RSS is a Microsoft invention that tries to keep a given TCP or UDP flow (and I think IP, but I haven't yet tried that) on a given CPU core. The idea is to try and keep both flow-local data and flow-local locking on a single CPU core, increasing the chances that data is hot in the CPU core cache and reducing the chance of lock contention.
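
To make that a bit more concrete, here's a minimal Python sketch of the Toeplitz hash that RSS uses over the IPv4 4-tuple, plus a mapping from hash to CPU. The key below is the widely published sample key from Microsoft's RSS verification material; the function names, the bucket table and the "mask the low bits" bucket selection are my own illustrative assumptions rather than what any particular NIC or the FreeBSD code does.

```python
import ipaddress
import struct

# Widely published 40-byte sample RSS key (assumption: your NIC may use another).
RSS_KEY = bytes.fromhex(
    "6d5a56da255b0ec24167253d43a38fb0"
    "d0ca2bcbae7b30b477cb2da38030f20c"
    "6a42b73bbeac01fa"
)


def toeplitz_hash(key, data):
    """XOR together the 32-bit key window at each set bit of the input."""
    key_bits = len(key) * 8
    if len(data) * 8 + 32 > key_bits + 1:
        raise ValueError("key too short for this input")
    key_int = int.from_bytes(key, "big")
    result = 0
    for i, byte in enumerate(data):
        for bit in range(8):
            if byte & (0x80 >> bit):
                # Window covering key bits [pos, pos + 31], MSB first.
                shift = key_bits - 32 - (i * 8 + bit)
                result ^= (key_int >> shift) & 0xFFFFFFFF
    return result


def rss_input_ipv4(src_ip, dst_ip, src_port, dst_port):
    """Hash input: source address, destination address, source port, destination port."""
    return (
        ipaddress.IPv4Address(src_ip).packed
        + ipaddress.IPv4Address(dst_ip).packed
        + struct.pack("!HH", src_port, dst_port)
    )


def flow_to_cpu(flow_hash, bucket_to_cpu):
    """Pick a bucket from the low hash bits, then look up its CPU.

    Assumes len(bucket_to_cpu) is a power of two (hypothetical table).
    """
    bucket = flow_hash & (len(bucket_to_cpu) - 1)
    return bucket_to_cpu[bucket]
```

Because the hash only depends on the 4-tuple and the key, every packet in a flow lands in the same bucket, and so on the same CPU, for as long as the bucket table stays put.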

You can find the RSS overview and programming details here:

RSS and supporting technology have been making their way into FreeBSD for quite some time but they're not in any real shape that application developers can take advantage of.

Firstly, there's "PCBGROUPS", which looks to group PCB (protocol control block) data for a connection local to a CPU. Instead of there being one global PCB table for the system (well, VIMAGE for FreeBSD - each virtual image instance has its own PCB table) with one lock protecting it, there are now multiple PCB tables, one per "thing". Here, the thing is whatever the kernel developer thinks is worth grouping them by.

Now, until the RSS work went in, this code was in FreeBSD but sat unused. A kernel developer could provide the hooks needed to map TCP (and maybe UDP later) flows to a "thing" and have that map to a PCB group table - but it required some glue to stamp incoming connections and outgoing packets with some identifier (which we call a "flowid" in FreeBSD) that can map to said "thing". Then whenever a PCB lookup was needed, it would first try the lookup in the PCB group table that the "flowid" maps to - if it was successful, it wouldn't have to use the global PCB table to do the lookup.
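
The lookup order described above can be sketched as follows. This is a toy Python model, not the kernel's data structures: PcbTables, group_for() and the modulo mapping from flowid to group are all hypothetical, and the real code has per-table locks that this sketch leaves out.

```python
class PcbTables:
    """Toy model: one global PCB table plus per-group tables keyed by flowid."""

    def __init__(self, ngroups):
        self.global_table = {}
        self.groups = [dict() for _ in range(ngroups)]
        self.global_lookups = 0  # how often we fell back to the global table

    def group_for(self, flowid):
        # Hypothetical flowid -> group mapping.
        return flowid % len(self.groups)

    def insert(self, four_tuple, pcb, flowid):
        # Connection setup touches the global table as well as the group table.
        self.global_table[four_tuple] = pcb
        self.groups[self.group_for(flowid)][four_tuple] = pcb

    def lookup(self, four_tuple, flowid):
        # Fast path: the group table the flowid maps to.
        pcb = self.groups[self.group_for(flowid)].get(four_tuple)
        if pcb is not None:
            return pcb
        # Slow path: fall back to the global table.
        self.global_lookups += 1
        return self.global_table.get(four_tuple)
```

A lookup that carries the right flowid never touches the global table; a lookup with a missing or wrong flowid still finds the PCB, but only by going through the global table.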

This is only good for established connections - creating and destroying a connection still requires manipulating that global PCB table and the single PCB table lock. I'm going to ignore fixing that for now, as that is a bigger issue.

Then Robert Watson added the RSS work done under contract to Juniper Networks, Inc. RSS provides one kind of mapping between the flowid from the NIC and which CPU to run work on. So that part worked great - but there wasn't any way for the application user to take advantage of it. Additionally, there's no driver awareness of it yet - I'll discuss this shortly.

So I grabbed a bunch of this work whilst at Netflix and tried to make sense of it. It turns out that if you can keep the work local to a CPU, a lot of the lock contention in the networking stack melts away. Here's what's going on:

  • The receive thread(s) in the NIC driver processing packets are typically doing direct dispatch to the network stack - so they're running the receive side of the TCP stack;
  • .. and the receive side of the network stack includes ACKs, which trigger the transmit side of the network stack;
  • There are typically some deferred thread(s) in the NIC driver transmitting packets to each NIC queue;
  • There are also application threads trying to queue data to the TCP socket, which also can dig into the socket and TCP stack state, which involves grabbing locks;
  • And there's also timers firing to update state, and doing this involves grabbing locks.
Without RSS and without lining everything up on CPU cores, all the above can run on different cores. Whenever any of them try running at the same time, lock contention can occur and that particular task can stall. If the lock contention blocks the transmit or receive NIC threads, then not only is that connection affected - the whole NIC processing is affected.

There's still lock contention in the network stack - especially if you're doing a lot of new, short connections. The good folk at Verisign are working on that particular corner of the problem so I'm happy to defer to them.

So, I ended up doing a bunch of little pieces to get this lined up right:
  • The per-CPU timer callwheels can now be optionally pinned to their CPU cores, so timer events running on CPU X actually do run on CPU X (yes, that was amusing to find..);
  • There's support in the TCP stack for per-CPU timers, but it's not enabled by default;
  • ... and it also didn't query RSS, netisr or anything to figure out how to map a flowid to a given CPU to run a timer on;
  • Then to make matters worse, incoming TCP sessions didn't have a flowid assigned to the PCB until after the first data packet was read - which meant that the initial timer work would all assume CPU 0 and any queries on that particular PCB would return flowid=0 - so it would not find it in the right PCBGROUP.
So those are fixed in FreeBSD-HEAD. The per-CPU TCP timer and pinned-CPU timers aren't enabled by default - I'll only flip that on when I'm confident that the RSS stuff is working.
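
Here's a toy Python model of that last flowid problem. Everything in it is hypothetical (the Pcb class, the modulo flowid-to-CPU mapping, the two accept functions); it only illustrates why arming the timer before the flowid is stamped pushes the initial timer work onto CPU 0.

```python
def cpu_for_flowid(flowid, ncpus):
    # Hypothetical mapping from flowid to CPU.
    return flowid % ncpus


class Pcb:
    def __init__(self):
        self.flowid = 0        # 0 here means "not assigned yet"
        self.timer_cpu = None  # CPU the connection's timer was armed on


def accept_late_stamp(nic_flowid, ncpus):
    """Old ordering: timer armed first, flowid only set at first data packet."""
    pcb = Pcb()
    pcb.timer_cpu = cpu_for_flowid(pcb.flowid, ncpus)  # flowid still 0
    pcb.flowid = nic_flowid
    return pcb


def accept_early_stamp(nic_flowid, ncpus):
    """Fixed ordering: flowid stamped at connection setup, then timer armed."""
    pcb = Pcb()
    pcb.flowid = nic_flowid
    pcb.timer_cpu = cpu_for_flowid(pcb.flowid, ncpus)
    return pcb
```

With the late ordering every new connection's timer ends up on CPU 0 regardless of which CPU the NIC steers its packets to; with the early ordering the timer follows the flow.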

So that lets all the RSS stuff correctly work. But there wasn't a nice way to query the per-connection flowid or RSS information. So I then extended netstat to have 'R' as a flag - it returns the flowid and the flowid type. I'll add RSS information once I have a nice way to extract it out in bulk. It's still a good diagnostic tool to ensure that the IPv4/IPv6 hashing is working correctly.

Then I had to teach a driver about RSS so I could actually test it all out. I have some igb(4) hardware at home, so I did the minimal work required to teach it about the RSS key and assigning things to the correct CPUs. It's still incomplete but it's good enough to get off the ground. I'll go into more details about the driver requirements in a follow-up blogpost.

Finally, how are application developers supposed to use it? I'll cover that particular bit in another follow-up blog post as there's quite a lot to cover there.