Author Archives: adrian

FreeBSD on the POWER8: it’s alive!

A post to freebsd-ppc from a couple of months ago asked if we had support for POWER8 and offered to provide remote access to anyone interested in working on it. I was sufficiently intrigued that I approached the FreeBSD powerpc hackers to ask about it, and was informed that it'd be nice, but we didn't have hardware.

After a bit of wrangling of hardware logistics and with the FreeBSD Foundation purchasing a box, a Tyan POWER8 evaluation server appeared. Nathan Whitehorn started poking at it and managed to get a basic "hello world" going, but stalled on issues with the Linux KVM virtualisation environment.

Fast forward a few weeks - he's figured out the KVM issues, their lack of support for some mandated hypervisor APIs and other bugs - FreeBSD now boots inside of the hypervisor environment and seems stable enough to do development on.

He then found the existing powerpc pmap (physical memory management) code wasn't very SMP friendly - it works fine on one and two CPU powerpc machines, but this POWER8 evaluation board is a 4-core, 32-thread CPU. So a few days of development went by and he rewrote most of the pmap code to be much more fine grained locked and scale much, much better than the existing code. (He also found the PS3 hypervisor layer isn't thread-safe.)

What's been done thus far?

  • FreeBSD boots inside the hypervisor environment;
  • Virtualised console, networking and storage all work;
  • (in progress) new, scalable pmap implementation;
  • Initial support for the Vector-Scalar Extension (VSX) that's found on POWER7 and POWER8.
So, I'm impressed. Nathan's done a fantastic job bringing the whole thing up. There's some further work on the new powerpc technology that needs doing (things like the new vector processing units, performance counter support and such) and I'm sure Justin and Nathan will poke powerpc dtrace support into further good shape. I'm going to see if we can fix a chelsio 40G NIC into one of these and work with their developers to fix any endian/busdma issues that creep up, and then do some network stack scaling testing with it. There's also the missing hardware/hypervisor support to run FreeBSD on the bare metal, which would be a fantastic achievement.

Now I kind of want some larger POWER8 hardware.

TDMA (somewhat) working on AR9380 chips

(Wow, I have a lot of posts to write to catch up on things.)

I've just brought up FreeBSD's TDMA support on the AR9380 chipset. Specifically, the AR9331, since I have a Carambola 2 on me today.

It was pretty simple to bring up - I was missing the beacon configuration HAL call that the TDMA code expected. It's only used by the TDMA code - the STA and AP modes rely on the normal HAL beacon methods that date back to the Atheros HAL.

The only problem - it seems something is up with ANI (noise immunity) and sensitivity on at least the AR9331. It doesn't seem to behave well on slightly loaded channels and thus the beacons don't always go out when they're supposed to.

But, if you've been wanting to play with TDMA on the later Atheros chips, now you can!

On profiling HTTP, or "god damnit people, why are all the open source tools slow?"

Something that's been a challenge at work (and at other things in the past) has been "how do I generate enough traffic to test this thing?"

If you're running some public facing boxes then sure, you can do A/B testing. But what if you're not able to test it in the real world? What if you need to do testing before you ship, and the traffic levels have to be stupid high?

So, what do you do?

I've done this a few times. When doing squid and other reverse proxy development, I would run tools like apachebench, httperf, even web polygraph - but these things scaled poorly. They didn't handle tens of thousands of concurrent connections and scale to both slow and fast clients - their use of poll() and select() just wouldn't work out well.

Something I did at Netflix was to start building TCP testing tools that more than 65,000 concurrent sockets. My aim is much higher, but one has to start somewhere. I was testing out the network stack rather than specifically doing HTTP testing. Here at my current job, I'm much more interested in real HTTP and all processing.

I looked at what's out there, and it's not very pretty. I need to be able to do 10G of traffic, looking upwards towards 20G and 40G of HTTP in the future. After a little more digging into what was out there - and finding httperf actually reverted my changes to use libevent and went back to poll/select! - I decided it was about time I just started writing something minimal to stress test things and build upon it as the need arose. I want something that eventually ends up like web polygraph - multiple client/server sets with different URL choices from a pool, a variety of client IP addresses, and other things like how often to make the requests and other request pacing.

So, I grabbed libevent, libevhtp from Mark Ellzey and threw them together. It turned out okish - libevent/libevhtp still does a bunch of memcpy()'ing inside the buffer management routines that makes 40G on one box infeasible at the moment, but it's good enough to get a few gigabit of client traffic on one core. There were some hiccups which I'll cover below, but it's good enough to build upon.

What did I learn?
  • Well, it turns out the client code in libevhtp was a bit immature. Mark and I talked a bit about it on IRC and then I found there was an outstanding pull request that found and fixed a bunch of these. So, my code has turned into another thing - a libevhtp client and server test suite.
  • The libevhtp threading model is fine for a couple of CPUs, but it's the standardish *NIX model of "one thread does accept, farms work off to other threads." So it's not going to scale well at high request rates to multiple CPUs. That's cool; that's what the FreeBSD-HEAD RSS work is for.
  •  There's memcpy()'ing in the libevhtp body handling code. It's not a big deal at 1G, but at 10G it's definitely noticeable. I've spoken to Mark about it.
But, it's a good starting point. Once the rest of the bugs get shaken out, it'll be a good high throughput HTTP traffic tester.

What would I do next, after the bugs?

  • the server will eventually grow the ability to generate controllable sized responses. That way the client can control how big a response to send and thus can create a mix of requests/replies.
  • .. and HTTP request body testing would be nice.
  • The client side needs to grow the ability to create client pools, like web polygraph, where certain subsets of clients get certain behaviours (like a pool of IPs to use, separate pool of URLs to fetch from, the time between each HTTP request, etc.)
The other trick is how to simulate lots (and I do mean lots) of IP addresses. I don't want to create separate loopback connections for each - that would be crazy. Instead, it'd be good to use the transparent interception support in FreeBSD IPFW that allows both connections from and connections to arbitrary IP addresses. A little trickery with IP routing so we don't need more than 1 ARP entry for each server and voila!

Oh, and the code?

… mmm emulators.

I occasionally get asked to test out FreeBSD/MIPS patches for people, as they don't have physical hardware present. I can understand that - the hardware is cheap and plentiful, but not everyone wants to have a spare access point around just to test out MIPS changes on.

However QEMU does a pretty good job of emulating MIPS if you're just testing out non-hardware patches. There's even instructions on the FreeBSD wiki for how to do this! So I decided to teach my wifi build system about the various QEMU MIPS emulator targets so it can spit out a kernel and mfsroot to use for QEMU.


It turns out that it wasn't all that hard. The main trick was to use qemu-devel, not qemu. There are bugs in the non-development QEMU branch that mean it works great for Linux but not FreeBSD.

The kernel configurations in FreeBSD had bitrotted a little bit (they were missing the random device, for example) but besides that the build, install and QEMU startup just worked. I now have FreeBSD/MIPS of each variety (32 bit, 64 bit, Little-Endian, Big-Endian) running under QEMU and building FreeBSD-HEAD as a basic test.

Next is figuring out how to build gdb to target each of the above and have it speak to the QEMU GDB stub. That should make it very easy to do MIPS platform debugging.

I also hear rumours about this stuff working somewhat for ARM and PPC, so I'll see how hard it is to run QEMU for those platforms and whether FreeBSD will just boot and run on each.

More RSS UDP tests – this time on a Dell R720

I've recently had the chance to run my RSS UDP test suite up on a pair of Dell R720s. They came with on-board 10G Intel NICs (ixgbe(4) in FreeBSD) so I figured I'd run my test suite up on it.

Thank you to the Enterprise Storage Division at Dell for providing hardware for me to develop on!

The config is like in the previous blog post, but now I have two 8-core Sandy Bridge Xeon CPUs to play with. To simply things (and to not have to try and solve NUMA related issues) I'm running this on the first socket. The Intel NIC is attached to the first CPU socket.


  • CPU: Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz (2000.04-MHz K8-class CPU) x 2
  • RAM: 64GiB
  • HTT disabled

# ... until ncpus is tunable, make it use 8 buckets.

This time I want to test with 8 streams, so after some trial and error I found the right IPv4 addresses to use:

  • Server:
  • Client:,,,,,,,
The test was like before - the server ran one rss-udp-srv program that spawns one thread per RSS bucket. The client side runs rss-clt programs to generate traffic - but now there's eight of them instead of four.

The results are what I expected: the contention is in the same place (UDP receive) and it's per-core - it doesn't contend between CPU cores.

Each CPU is transmitting and receiving 215,000 510-byte UDP frames a second. It scales linearly - 1 CPU is 215,000 TX/RX frames a second. 8 CPUs is 215,000 TX/RX frames a second * 8. There's no degrading as the CPU core count increases.

That's 1.72 million packets per second. At 510 bytes frames it's about 7 gigabits/sec in and out.

The other 8 cores are idle. Ideally we'd be able to run an application in those cores - so hopefully I can get my network / rss library up and running enough to prototype an RSS-aware memcached and see if it'll handle this particular workload.

It's a far cry from what I think we can likely achieve - but please keep in mind that I know I could do more awesome looking results with netmap, PF_RING or Intel's DPDK software. What I'm trying to do is push the existing kernel networking subsystem to its limits so the issues can be exposed and fixed.

So, where's the CPU going?

In the UDP server program (pid 1620), it looks thus:

# pmcstat -P CPU_CLK_UNHALTED_CORE -T -w 1 -p 1620
PMC: [CPU_CLK_UNHALTED_CORE] Samples: 34298 (100.0%) , 155 unresolved

  8.0 kernel     fget_unlocked        kern_sendit:4.2 kern_recvit:3.9
  7.0 kernel     copyout              soreceive_dgram:5.6 amd64_syscall:0.9
  3.6 kernel     __mtx_unlock_flags   ixgbe_mq_start
  3.5 kernel     copyin               m_uiotombuf:1.8 amd64_syscall:1.2
  3.4 kernel     memcpy               ip_output:2.9 ether_output:0.6
  3.4 kernel     toeplitz_hash        rss_hash_ip4_2tuple
  3.3 kernel     bcopy                rss_hash_ip4_2tuple:1.4 rss_proto_software_hash_v4:0.9
  3.0 kernel     _mtx_lock_spin_cooki pmclog_reserve
  2.7 kernel     udp_send             sosend_dgram
  2.5 kernel     ip_output            udp_send

In the NIC receive / transmit thread(s) (pid 12), it looks thus:

# pmcstat -P CPU_CLK_UNHALTED_CORE -T -w 1 -p 12

PMC: [CPU_CLK_UNHALTED_CORE] Samples: 79319 (100.0%) , 0 unresolved

 10.3 kernel     ixgbe_rxeof          ixgbe_msix_que
  9.3 kernel     __mtx_unlock_flags   ixgbe_rxeof:4.8 netisr_dispatch_src:2.1 in_pcblookup_mbuf:1.3
  8.3 kernel     __mtx_lock_flags     ixgbe_rxeof:2.8 netisr_dispatch_src:2.4 udp_append:1.2 in_pcblookup_mbuf:1.1 knote:0.6
  3.8 kernel     bcmp                 netisr_dispatch_src
  3.6 kernel     uma_zalloc_arg       sbappendaddr_locked_internal:2.0 m_getjcl:1.6
  3.4 kernel     ip_input             netisr_dispatch_src
  3.4 kernel     lock_profile_release __mtx_unlock_flags
  3.4 kernel     in_pcblookup_mbuf    udp_input
  3.0 kernel     ether_nh_input       netisr_dispatch_src
  2.4 kernel     udp_input            ip_input
  2.4 kernel     mb_free_ext          m_freem
  2.2 kernel     lock_profile_obtain_ __mtx_lock_flags
  2.1 kernel     ixgbe_refresh_mbufs  ixgbe_rxeof

It looks like there's some obvious optimisations to poke at (what the heck is fget_unlocked() doing up there?) and yes, copyout/copyin are really terrible but currently unavoidable. The toeplitz hash and bcopy aren't very nice but they're occuring in the transmit path because at the moment there's no nice way to efficiently set both the outbound RSS hash and RSS bucket ID and send to a non-connected socket destination (ie, specify the destination IP:port as part of the send.) There's also some lock contention that needs to be addressed.

The output of the netisr queue statistics looks good:

root@abaddon:/home/adrian/git/github/erikarn/freebsd-rss # netstat -Q
Setting                        Current        Limit
Thread count                         8            8
Default queue limit                256        10240
Dispatch policy                 direct          n/a
Threads bound to CPUs          enabled          n/a

Name   Proto QLimit Policy Dispatch Flags
ip         1    256    cpu   hybrid   C--
igmp       2    256 source  default   ---
rtsock     3    256 source  default   ---
arp        4    256 source  default   ---
ether      5    256    cpu   direct   C--
ip6        6    256   flow  default   ---
ip_direct     9    256    cpu   hybrid   C--

WSID CPU   Name     Len WMark   Disp'd  HDisp'd   QDrops   Queued  Handled
   0   0   ip         0    25        0 839349259        0       49 839349308
   0   0   igmp       0     0        0        0        0        0        0
   0   0   rtsock     0     2        0        0        0       92       92
   0   0   arp        0     0      118        0        0        0      118
   0   0   ether      0     0 839349600        0        0        0 839349600
   0   0   ip6        0     0        0        0        0        0        0
   0   0   ip_direct     0     0        0        0        0        0        0
   1   1   ip         0    20        0 829928186        0      286 829928472
   1   1   igmp       0     0        0        0        0        0        0
   1   1   rtsock     0     0        0        0        0        0        0
   1   1   arp        0     0        0        0        0        0        0
   1   1   ether      0     0 829928672        0        0        0 829928672
   1   1   ip6        0     0        0        0        0        0        0
   1   1   ip_direct     0     0        0        0        0        0        0
   2   2   ip         0     0        0 835558437        0        0 835558437
   2   2   igmp       0     0        0        0        0        0        0
   2   2   rtsock     0     0        0        0        0        0        0
   2   2   arp        0     0        0        0        0        0        0
   2   2   ether      0     0 835558610        0        0        0 835558610
   2   2   ip6        0     0        0        0        0        0        0
   2   2   ip_direct     0     0        0        0        0        0        0
   3   3   ip         0     1        0 850271162        0       23 850271185
   3   3   igmp       0     0        0        0        0        0        0
   3   3   rtsock     0     0        0        0        0        0        0
   3   3   arp        0     0        0        0        0        0        0
   3   3   ether      0     0 850271163        0        0        0 850271163
   3   3   ip6        0     0        0        0        0        0        0
   3   3   ip_direct     0     0        0        0        0        0        0
   4   4   ip         0    23        0 817439448        0      345 817439793
   4   4   igmp       0     0        0        0        0        0        0
   4   4   rtsock     0     0        0        0        0        0        0
   4   4   arp        0     0        0        0        0        0        0
   4   4   ether      0     0 817439625        0        0        0 817439625
   4   4   ip6        0     0        0        0        0        0        0
   4   4   ip_direct     0     0        0        0        0        0        0
   5   5   ip         0    19        0 817862508        0      332 817862840
   5   5   igmp       0     0        0        0        0        0        0
   5   5   rtsock     0     0        0        0        0        0        0
   5   5   arp        0     0        0        0        0        0        0
   5   5   ether      0     0 817862675        0        0        0 817862675
   5   5   ip6        0     0        0        0        0        0        0
   5   5   ip_direct     0     0        0        0        0        0        0
   6   6   ip         0    19        0 817281399        0      457 817281856
   6   6   igmp       0     0        0        0        0        0        0
   6   6   rtsock     0     0        0        0        0        0        0
   6   6   arp        0     0        0        0        0        0        0
   6   6   ether      0     0 817281665        0        0        0 817281665
   6   6   ip6        0     0        0        0        0        0        0
   6   6   ip_direct     0     0        0        0        0        0        0
   7   7   ip         0     0        0 813562616        0        0 813562616
   7   7   igmp       0     0        0        0        0        0        0
   7   7   rtsock     0     0        0        0        0        0        0
   7   7   arp        0     0        0        0        0        0        0
   7   7   ether      0     0 813562620        0        0        0 813562620
   7   7   ip6        0     0        0        0        0        0        0
   7   7   ip_direct     0     0        0        0        0        0        0
root@abaddon:/home/adrian/git/github/erikarn/freebsd-rss # 

It looks like everything is being dispatched correctly; nothing is being queued and/or dropped.

But yes, we're running out of socket buffers because each core is 100% pinned:

root@abaddon:/home/adrian/git/github/erikarn/freebsd-rss # netstat -sp udp
        6773040390 datagrams received
        0 with incomplete header
        0 with bad data length field
        0 with bad checksum
        0 with no checksum
        17450880 dropped due to no socket
        136 broadcast/multicast datagrams undelivered
        1634117674 dropped due to full socket buffers
        0 not for hashed pcb
        5121471700 delivered
        5121471044 datagrams output
        0 times multicast source filter matched

There's definitely room for improvement.

UDP RSS update: ixbge(4) turned out to have issues..

I started digging deeper into the RSS performance on my home test platform. Four cores and one (desktop) socket isn't all that much, but it's a good starting point for this.

It turns out that there was some lock contention inside netisr. Which made no sense, as RSS should be keeping all the flows local to each CPU.

After a bunch of digging, I discovered that the NIC was occasionally receiving packets into the wrong ring. Have a look at tihs:

Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100034:
m=0xfffff80047713d00; flowid=0x21f7db62; rxr->me=3
Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100034:
m=0xfffff8004742e100; flowid=0x21f7db62; rxr->me=3
Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100034:
m=0xfffff800474c2e00; flowid=0x21f7db62; rxr->me=3
Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100034:
m=0xfffff800474c5000; flowid=0x21f7db62; rxr->me=3
Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100034:
m=0xfffff8004742ec00; flowid=0x21f7db62; rxr->me=3
Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100032:
m=0xfffff8004727a700; flowid=0x335a5c03; rxr->me=2
Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100032:
m=0xfffff80006f11600; flowid=0x335a5c03; rxr->me=2
Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100032:
m=0xfffff80047279b00; flowid=0x335a5c03; rxr->me=2
Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100032:
m=0xfffff80006f0b700; flowid=0x335a5c03; rxr->me=2

The RX flowid was correct - I hashed the packets in software too and verified the software hash equaled the hardware hash. But they were turning up on the wrong receive queue. "rxr->me" is the queue id; the hardware should be hashing on the last 7 bits. 0x3 -> ring 3, 0x2 -> ring 2.

It also only happened when I was sending traffic to more than one receive ring. Everything was okay if I just transmitted to a single receive ring.

Luckily for me, some developers from Verisign saw some odd behaviour in their TCP stress testing and had dug in a bit further. They were seeing corrupted frames on the receive side that looked a lot like internal NIC configuration state. They figured out that the ixgbe(4) driver wasn't initialising the flow director and receive units correctly - the FreeBSD driver was not correctly setting up the amount of memory each was allocated on the NIC and they were overlapping. They also found a handful of incorrectly handled errors and double-freed mbufs.

So, with that all fixed, their TCP problem went away and my UDP tests started properly behaving themselves. Now all the flows are ending up on the right CPUs.

The flow director code was also dynamically programming flows into the NIC to try and rebalance traffic. Trouble is, I think it's a bit buggy and it's likely not working well with generic receive offload (LRO).

What's it mean for normal people? Well, it's fixed in FreeBSD-HEAD now. I'm hoping I or someone else will backport it to FreeBSD-10 soon. It fixes my UDP tests - now I hit around 1.3 million packets per second transmit and receive on my test rig; the server now has around 10-15% CPU free. It also fixed issues that Verisign were seeing with their high transaction rate TCP tests. I'm hoping that it fixes the odd corner cases that people have seen with Intel 10 gigabit hardware on FreeBSD and makes LRO generally more useful and stable.

Next up - some code refactoring, then finishing off IPv6 RSS!


Receive side scaling: testing UDP throughput

I think it's about time I shared some more details about the RSS stuff going into FreeBSD and how I'm testing it.

For now I'm focusing on IPv4 + UDP on the Intel 10GE NICs. The TCP side of things is done (and the IPv6 side of things works too!) but enough of the performance walls show up in the IPv4 UDP case that it's worth sticking to it for now.

I'm testing on a pair of 4-core boxes at home. They're not special - and they're very specifically not trying to be server-class hardware. I'd like to see where these bottlenecks are even at low core count.

The test setup in question:

Testing software:

  • It requires libevent2 - an updated copy; previous versions of libevent2 didn't handle FreeBSD specific errors gracefully and would early error out of the IO loop.


  • CPU: Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz (3292.59-MHz K8-class CPU)
  • There's no SMT/HTT, but I've disabled it in the BIOS just to be sure
  • 4GB RAM
  • FreeBSD-HEAD, amd64
  • NIC:  '82599EB 10-Gigabit SFI/SFP+ Network Connection
  • ix0:

# for now redirect processing just makes the lock overhead suck even more.
# disable it.



# experiment with deferred dispatch for RSS

kernel config:

include GENERIC

device netmap
options RSS
options PCBGROUP

# in-system lock profiling

# Flowtable - the rtentry locking is a bit .. slow.
options   FLOWTABLE

# This debugging code has too much overhead to do accurate
# testing with.
nooptions         INVARIANTS
nooptions         INVARIANT_SUPPORT
nooptions         WITNESS
nooptions         WITNESS_SKIPSPIN

The server runs the "rss-udp-srv" process, which behaves like a multi-threaded UDP echo server on port 8080.


The client box is slightly more powerful to compensate for (currently) not using completely affinity-aware RSS UDP transmit code.

  • CPU: Intel(R) Core(TM) i5-4460  CPU @ 3.20GHz (3192.68-MHz K8-class CPU)
  • SMT/HTT: Disabled in BIOS
  • 8GB RAM
  • FreeBSD-HEAD amd64
  • Same kernel config, loader and sysctl config as the server
  • ix0: configured as,,,,
The client runs 'udp-clt' programs to source and sink traffic to the server.

Running things

The server-side simply runs the listen server, configured to respond to each frame:

$ rss-udp-srv 1

The client-side runs four couples of udp-clt, each from different IP addresses. These are run in parallel (i do it in different screens, so I can quickly see what's going on):

$ ./udp-clt -l -r -p 8080 -n 10000000000 -s 510
$ ./udp-clt -l -r -p 8080 -n 10000000000 -s 510
$ ./udp-clt -l -r -p 8080 -n 10000000000 -s 510
$ ./udp-clt -l -r -p 8080 -n 10000000000 -s 510

The IP addresses are chosen so that the 2-tuple topelitz hash using the default Microsoft key hash to different RSS buckets that live on individual CPUs.

Results: Round one

When the server is responding to each frame, the following occurs. The numbers are "number of frames generated by the client (netstat)", "number of frames received by the server (netstat)", "number of frames seen by udp-rss-srv", "number of responses transmitted from udp-rss-srv", "number of frames seen by the server (netstat)"
  • 1 udp-clt process: 710,000; 710,000; 296,000; 283,000; 281,000
  • 2 udp-clt processes: 1,300,000; 1,300,000; 592,000; 592,000; 575,000
  • 3 udp-clt processes: 1,800,000; 1,800,000; 636,000; 636,000; 600,000
  • 4 udp-clt processes: 2,100,000; 2,100,000; 255,000; 255,000; 255,000
So, it's not actually linear past two cores. The question here is: why?

There are a couple of parts to this.

Firstly - I had left turbo boost on. What this translated to:

  • One core active: ~ 30% increase in clock speed
  • Two cores active: ~ 30% increase in clock speed
  • Three cores active: ~ 25% increase in clock speed
  • Four cores active: ~ 15% increase in clock speed.
Secondly and more importantly - I had left flow control enabled. This made a world of difference.

The revised results are mostly linear - with more active RSS buckets (and thus CPUs) things seem to get slightly more efficient:
  • 1 udp-clt process: 710,000; 710,000; 266,000; 266,000; 266,000
  • 2 udp-clt processes: 1,300,000; 1,300,000; 512,000; 512,000; 512,000
  • 3 udp-clt processes: 1,800,000; 1,800,000; 810,000; 810,000; 810,000
  • 4 udp-clt processes: 2,100,000; 2,100,000; 1,120,000; 1,120,000; 1,120,000

Finally, let's repeat the process but only receiving instead also echoing back the packet to the client:

$ rss-udp-srv 0
  • 1 udp-clt process: 710,000; 710,000; 204,000
  • 2 udp-clt processes: 1,300,000; 1,300,000; 378,000
  • 3 udp-clt processes: 1,800,000; 1,800,000; 645,000
  • 4 udp-clt processes: 2,100,000; 2,100,000; 900,000
The receive-only workload is actually worse off versus the transmit + receive workload!

What's going on here?

Well, a little digging shows that in both instances - even with a single udp-clt thread running which means only one CPU on the server side is actually active! - there's active lock contention.

Here's an example dtrace output for measuring lock contention with only one active process, where one CPU is involved (and the other three are idle):

Receive only, 5 seconds:

root@adrian-hackbox:/home/adrian/git/github/erikarn/freebsd-rss # dtrace -n 'lockstat:::adaptive-block { @[stack()] = sum(arg1); }'
dtrace: description 'lockstat:::adaptive-block ' matched 1 probe


Transmit + receive, 5 seconds:

dtrace: description 'lockstat:::adaptive-block ' matched 1 probe




Somehow it seems there's less lock contention / blocking going on when both transmit and receive is running!

So then I dug into it using the lock profiling suite. This is for 5 seconds with receive-only traffic on a single RSS bucket / CPU (all other CPUs are idle):

# sysctl = 1; sleep 5 ; sysctl

root@adrian-hackbox:/home/adrian/git/github/erikarn/freebsd-rss # sysctl ; sleep 5 ; sysctl 1 -> 1 1 -> 0

root@adrian-hackbox:/home/adrian/git/github/erikarn/freebsd-rss # sysctl | head -2 ; sysctl | sort -nk4 | tail -10 
     max  wait_max       total  wait_total       count    avg wait_avg cnt_hold cnt_lock name
    1496         0       10900           0          28    389      0  0      0 /usr/home/adrian/work/freebsd/head/src/sys/dev/usb/usb_device.c:2755 (sx:USB config SX lock) 
       0         0          31           1          67      0      0  0      4 /usr/home/adrian/work/freebsd/head/src/sys/kern/sched_ule.c:888 (spin mutex:sched lock 2)
       0         0        2715           1       49740      0      0  0      7 /usr/home/adrian/work/freebsd/head/src/sys/dev/random/random_harvestq.c:294 (spin mutex:entropy harvest mutex)
       1         0          51           1         131      0      0  0      2 /usr/home/adrian/work/freebsd/head/src/sys/kern/sched_ule.c:1179 (spin mutex:sched lock 1)
       0         0          69           2         170      0      0  0      8 /usr/home/adrian/work/freebsd/head/src/sys/kern/sched_ule.c:886 (spin mutex:sched lock 2)
       0         0       40389           2      287649      0      0  0      8 /usr/home/adrian/work/freebsd/head/src/sys/kern/kern_intr.c:1359 (spin mutex:sched lock 2)
       0         2           2           4          12      0      0  0      2 /usr/home/adrian/work/freebsd/head/src/sys/dev/usb/usb_device.c:2762 (sleep mutex:Giant)
      15        20        6556         520        2254      2      0  0    105 /usr/home/adrian/work/freebsd/head/src/sys/dev/acpica/Osd/OsdSynch.c:535 (spin mutex:ACPI lock (0xfffff80002b10f00))
       4         5      195967       65888     3445501      0      0  0  28975 /usr/home/adrian/work/freebsd/head/src/sys/netinet/udp_usrreq.c:369 (sleep mutex:so_rcv)

Notice the lock contention for the so_rcv (socket receive buffer) handling? What's going on here is pretty amusing - it turns out that because there's so much receive traffic going on, the userland process receiving the data is being preempted by the NIC receive thread very often - and when this happens, there's a good chance it's going to be within the small window that the receive socket buffer lock is held. Once this happens, the NIC receive thread processes frames until it gets to one that requires it to grab the same sock buffer lock that is already held by userland - and it fails - so the NIC thread sleeps until the userland thread finishes consuming a packet. Then the CPU flips back to the NIC thread and continues processing a packet.

When the userland code is also transmitting frames it's increasing the amount of time in between socket receives and decreasing the probability of hitting the lock contention condition above.

Note there's no contention between CPUs here - this is entirely contention within a single CPU.

So for now I'm happy that the UDP IPv4 path is scaling well enough with RSS on a single core. The main performance problem here is the socket receive buffer locking (and, yes, copyin() / copyout().)


Receive Side Scaling: figuring out how to handle IP fragments

The TL:DR; of this is - IP fragments are annoying.

If everything was awesome and there were never IP fragments, all TCP and UDP frames would always have the TCP/UDP header stamped on them, and the NIC could hash the TCP/UDP header in hardware to calculate the destination queue to receive traffic on.

However, everything isn't awesome and there will be cases where IP frames are fragmented. When this happens, the first frame in the fragment has the IPv4 header and the TCP/UDP header - but the subsequent fragments only have the IPv4 header. That means there's not enough information in the rest of the fragments to hash them to the same hash value and thus hardware queue as the first fragment - only the first has the full IPv4+TCP/UDP information.

The Intel and Chelsio NICs will hash on all packets that are fragmented by only hashing on the IPv4 details. So, if it's a fragmented TCP or UDP frame, it will hash the first fragment the same as the others - it'll ignore the TCP/UDP details and only hash on the IPv4 frame. This means that all the fragments in a given IP datagram will hash to the same value and thus the same queue.

But if there are a mix of fragmented and non-fragmented packets in a given flow - for example, small versus larger UDP frames - then some may be hashed via the IPv4+TCP or IPv4+UDP details and some will just be hashed via the IPv4 details. This means that packets in the same flow will end up being received in different receive queues and thus highly likely be processed out of order.

The Linux intel driver code flipped off IPv4+UDP hashing a while ago - they hash UDP frames by their IPv4 details only and then do whatever other load balancing in the kernel they choose. I found this and updated the FreeBSD drivers to do the same. This should result in less out of order UDP frames for UDP heavy workloads. I'm not sure about the Chelsio driver yet - when I convert it to the RSS framework it'll disable IPv4+UDP hashing if that isn't enabled at boot time. This is a good stop-gap, but it's not the whole story.

TCP is where it gets annoying. People don't want to flip off IPv4+TCP hashing as they're convinced that the TCP MSS negotiation and path-MTU discovery stuff will prevent there from being any IP fragmented TCP frames. But, well, that's not really viable in the real world. There are too many misconfigured networks out there and IP fragmentation does occur. So this is also a problem for TCP. This means that the IPv4 fragmented TCP frames in those sessions will come into another receive queue and CPU and this will show up as out of order data.

So, what's this all have to do with receive side scaling?

With RSS, there's a well defined hash for packets and a configuration for what the operating system and NICs are supposed to be doing. It's entirely possible that we'll configure IPv4+TCP to be hashed and also entirely possible we'll see IP fragments showing up on other CPUs. So in order to have the TCP stack run on the right CPU, the IP fragments need to be assembled on whichever CPU they're received upon and then re-injected into the correct destination queue to run on the correct CPU.

Fortunately the FreeBSD netisr scheme makes this easy.

So what I'm doing in my branch (and what will soon show up in -HEAD) is thus:

  • UDP is still hashed as IPv4-only frames for now. I'll change that later to hash on IPv4+UDP and have things reinjected on the correct destination RSS bucket / netisr queue / CPU.
  • I create one netisr thread, pinned to a CPU, for each RSS CPU that's defined.
    • Ideally I'd create one netisr thread for each RSS bucket and pin that, but that'll come later.
  • IP fragments will be hashed to whatever the IPv4 hash calculates, so fragment reassembly will occur on some CPU;
    • .. and it's the same CPU for all frames in a fragmented datagram.
  • Then when the fragment is reassembled, a software hash is calculated for the newly reassembled frame.
    • If RSS is configured to hash for IPv4 only, then it'll see that the hash on the reassembled datagram matches the configured hash for that packet type and reuse it.
    • So, if it's UDP right now, it'll see that UDP is only hashing on IPv4 details and reuse it.
    • .. but if IPv4+UDP hashing is configured, it'll software hash the packet and assign the new flow type and RSS hash.
  • Then, it'll reinject the frame into netisr to be requeued and reprocessed.
  • .. this uses the nh_m2cpuid function to calculate the destination CPU for the given RSS hash.
    • If it's handled on the same destination CPU then it'll be handled.
    • If it's handled on a different destination CPU then it'll be queued to that netisr and dispatched appropriately.
This works. It's not great, and I'd rather the IP fragment reassembly code was much more efficient, but it's correct. I'm going for correctness here to begin with.

Now, before you ask - yes, IPv6 has fragments and yes, I have to do the same thing for IPv6 flows. Most of the code is written.

Finally - the same thing applies to things like IPv4 tunnels, IPv6-in-IPv4 tunnels, IPSEC tunnels and the like. The NIC hashes the packets on the IPv4 header details but once the packet is de-encapsulated, it needs to be reinjected back into the correct CPU for further processing.

Application awareness of receive side scaling (RSS) on FreeBSD

Part of testing this receive side scaling work is designing a set of APIs that allow for some kind of affinity awareness. It's not easy - the general case is difficult and highly varying. But something has to be tested! So, where should it begin?

The main tricky part of this is the difference between incoming, outgoing and listening sockets.

For incoming traffic, the NIC has already calculated the RSS hash value and there's already a map between RSS hash and destination CPU. Well, destination queue to be much more precise; then there's a CPU for that queue.

For outgoing traffic, the thread(s) in question can be scheduled on any CPU core and as you have more cores, it's increasingly unlikely to be the right one. In FreeBSD, the default is to direct dispatch transmit related socket and protocol work in the thread that started it, save a handful of places like TCP timers. Once the driver if_transmit() method is called to transmit a frame it can check the mbuf to see what the flowid is and map that to a destination transmit queue. Before RSS, that's typically done to keep packets vaguely in some semblance of in-order behaviour - ie, for a given traffic flow between two endpoints (say, IP, or TCP, or UDP) the packets should be transmitted in-order. It wasn't really done for CPU affinity reasons.

Before RSS, there was no real consistency with how drivers hashed traffic upon receive, nor any rules on how it should select an outbound transmit queue for a given buffer. Most multi-queue drivers got it "mostly right". They definitely didn't try to make any CPU affinity choices - it was all done to preserve the in-order behaviour of traffic flows.

For an incoming socket, all the information about the destination CPU can be calculated from the RSS hash provided during frame reception. So, for TCP, the RSS hash for the received ACK during the three way handshake goes into the inpcb entry. For UDP it's not so simple (and the inpcb doesn't get a hash entry for UDP - I'll explain why below.)

For an outgoing socket, all the information about the eventual destination CPU isn't necessarily available. If the application knows the source/destination IP and source/destination port then it (or the kernel) can calculate the RSS hash that the hardware would calculate upon frame reception and use that to populate the inpcb. However this isn't typically known - frequently the source IP and port won't be explicitly defined and it'll be up to the kernel to choose them for the application. So, during socket creation, the destination CPU can't be known.

So to make it simple (and to make it simple for me to ensure the driver and protocol stack parts are working right) my focus has been on incoming sockets and incoming packets, rather than trying to handle outgoing sockets. I can handle outbound sockets easily enough - I just need to do a software hash calculation once all of the required information is available (ie, the source IP and port is selected) and populate the inpcb with that particular value. But I decided to not have to try and debug that at the same time as I debugged the driver side and the protocol stack side, so it's a "later" task.

For TCP, traffic for a given connection will use the same source/destination IP and source/destination port values. So for a given socket, it'll always hash to the same value. However, for UDP, it's quite possible to get UDP traffic from a variety of different source IP/ports and respond from a variety of different source/IP ports. This means that the RSS hash value that we can store in the inpcb isn't at all guaranteed to be the same for all subsequent socket writes.

Ok, so given all of that above information, how exactly is this supposed to work?

Well, the slightly more interesting and pressing problem is how to break out incoming requests/packets to multiple receive threads. In traditional UNIX socket setups, there are a couple of common design patterns for farming off incoming requests to multiple worker threads:

  • There's one thread that just does accept() (for TCP) or recv() (for UDP) and it then farms off new connections to userland worker threads; or
  • There are multiple userland worker threads which all wait on a single socket for accept() or recv() - and hope that the OS will only wake up one thread to hand work to.
It turns out that the OS may wake up one thread at a time for accept() or recv() but then userland threads will sit in a loop trying to accept connections / packets - and then you tend to find they get called a lot only to find another worker thread that was running stole the workload. Oops.

I decided this wasn't really acceptable for the RSS work. I needed a way to redirect traffic to a thread that's also pinned to the same CPU as the receive RSS bucket. I decided the cheapest way would be to allow multiple PCB entries for the same socket details (eg, multiple TCP sockets listening on *:80). Since the PCBGROUPS code in this instance has one PCB hash per RSS bucket, all I had to do was to teach the stack that wildcard listen PCB entries (eg, *:80) could also exist in each PCB hash bucket and to use those in preference to the global PCB hash.

The idea behind this decision is pretty simple - Robert Watson already did all this great work in setting up and debugging PCBGROUPS and then made the RSS work leverage that. All I'd have to do is to have one userland thread in each RSS bucket and have the listen socket for that thread be in the RSS bucket. Then any incoming packet would first check the PCBGROUP that matched the RSS bucket indicated by the RSS hash from the hardware - and it'd find the "right" PCB entry in the "right" PCBGROUP PCB has table for the "right" RSS bucket.

That's what I did for both TCP and UDP.

So the programming model is thus:

  • First, query the RSS sysctl (net.inet.rss) for the RSS configuration - this gives the number of RSS buckets and the RSS bucket -> CPU mapping.
  • Then create one worker thread per RSS bucket..
  • .. and pin each thread to the indicated CPU.
  • Next, each worker thread creates one listen socket..
  • .. sets the IP_BINDANY or IP6_BINDANY option to indicate that there'll be multiple RSS entries bound to the given listen details (eg, binding to *:80);
  • .. then IP_RSS_LISTEN_BUCKET to set which RSS bucket the incoming socket should live in;
  • Then for UDP - call bind()
  • Or for TCP - call bind(), then call listen()
Each worker thread will then receive TCP connections / UDP frames that are local to that CPU. Writing data out the TCP socket will also stay local to that CPU. Writing UDP frames out doesn't - and I'm about to cover that.

Yes, it's annoying because now you're not just able to choose an IO model that's convenient for your application / coding style. Oops.

Ok, so what's up with UDP?

The problem with UDP is that outbound responses may be to an arbitrary destination setup and thus may actually be considered "local" to another CPU. Most common services don't do this - they'll send the UDP response to the same remote IP and port that it was sent from.

My plan for UDP (and TCP in some instances, see below!) is four-fold:

  • When receiving UDP frames, optionally mark them with RSS hash and flowid information.
  • When transmitting UDP frames, allow userspace to inform the kernel about a pre-calculated RSS hash / flow information.
  • For the fully-connected setup (ie, where a single socket is connect() ed to a given UDP remote IP:port and frame exchange only occurs between the fixed IP and port details) - cache the RSS flow information in the inpcb;
  • .. and for all other situations (if it's not connected, if there's no hint from userland, if it's going to a destination that isn't in the inpcb) - just do a software hash calculation on the outgoing details.
I mostly have the the first two UDP options implemented (ie, where userland caches the information to re-use when transmitting the response) and I'll commit them to FreeBSD soon. The other two options are the "correct" way to do the default methods but it'll take some time to get right.

Ok, so does it work?

I don't have graphs. Mostly because I'm slack. I'll do up some before I present this - likely at BSDCan 2015.

My testing has been done with Intel 1G and 10G NICs on desktop Ivy Bridge 4-core hardware. So yes, server class hardware will behave better.

For incoming TCP workloads (eg a webserver) then yes, there's no lock contention between CPUs in the NIC driver or network stack any longer. The main lock contention between CPUs is the VM and allocator paths. If you're doing disk IO then that'll also show up.

For incoming UDP workloads, I've seen it scale linearly on 10G NICs (ixgbe(4)) from one to four cores. This is with no-defragmentation, 510 byte sized datagrams.

Ie, 1 core reception (ie, all flows to one core) was ~ 250,000 pps into userland with just straight UDP reception and no flow/hash information via recvmsg(); 135,000 pps into userland with UDP reception and flow/hash information via recvmsg().

4 core reception was ~ 1.1 million pps into userland, roughly ~ 255,000 pps per core. There's no contention between CPU cores at all.

Unfortunately what I was sending was markedly different. The driver quite happily received 1.1 million frames on one queue and up to 2.1 million when all four queues were busy. So there's definitely room for improvement.

Now, there is lock contention - it's just not between CPU cores. Now that I'm getting past the between-core contention, we see the within-core contention.

For TCP HTTP request reception and bulk response transmission, most of the contention I'm currently seeing is between the driver transmit paths. So, the following occurs:

  • TCP stack writes some data out;
  • NIC if_transmit() method is called;
  • It tries to grab the queue lock and succeeds;
It then appends the frame to the buf_ring and schedules a transmit out the NIC. This bit is fine.

But then whilst the transmit lock is held, because the driver is taking frames from the buf_ring to push into the NIC TX DMA queue
  • The NIC queue interrupt fires, scheduling the software interrupt thread;
  • This pre-empts the existing running transmit thread;
  • The NIC code tries to grab the transmit lock to handle completed transmissions;
  • .. and it fails, because the code it preempted holds the transmit lock already.
So there's some context switching and thrashing going on there which needs to be addressed.

Ok, what about UDP? It turns out there's some lock contention with the socket receive buffer.

The soreceive_dgram() routine grabs the socket receive buffer (SOCKBUF_LOCK()) to see if there's anything to return. If not, and if it can sleep, it'll call sbwait() that will release the lock and msleep() waiting for the protocol stack to indicate that something has been received. However, since we're receiving packets at such a very high rate, it seems that the receive protocol path contends with the socket buffer lock that is held by the userland code trying to receive a datagram. It pre-empts the user thread, tries to grab the lock and fails - and then goes to sleep until the userland code finishes with the lock. soreceive_dgram() doesn't hold the lock for very long - but I do see upwards of a million context switches a second.

To wrap up - I'm pleased with how things are going. I've found and fixed some issues with the igb(4) and ixgbe(4) drivers that were partly my fault and the traffic is now quite happily and correctly being processed in parallel. There are issues with scaling within a core that are now being exposed and I'm glad to say I'm going to ignore them for now and focus on wrapping up what I've started.

There's a bunch more to talk about and I'm going to do it in follow-up posts.
  • what I'm going to do about UDP transmit in more detail;
  • what about creating outbound connections and how applications can be structured to handle this;
  • handling IP fragments and rehashing packets to be mostly in-order - and what happens when we can't guarantee ordering with the hardware hashing UDP frames to a 4-tuple;
  • CPU hash rebalancing - what if a specific bucket gets too much CPU load for some reason;
  • randomly creating a toeplitz RSS hash key at bootup and how that should be verified;
  • multi-socket CPU and IO domain awareness;
  • .. and whatever else I'm going to stumble across whilst I'm slowly fleshing this stuff out.
I hope to get the UDP transmit side of things completed in the next couple of weeks so I can teach memcached about TCP and UDP RSS. After that, who knows!

Hacking on Receive Side Scaling (RSS) on FreeBSD

RSS is a Microsoft invention that tries to keep a given TCP or UDP flow (and I think IP, but I haven't yet tried that) on a given CPU core. The idea is to try and keep both flow-local data and flow-local locking on a single CPU core, increasing the chances that data is hot in the CPU core cache and reducing the chance of lock overhead.

You can find the RSS overview and programming details here:

RSS and supporting technology has been making its way into FreeBSD for quite some time but it's not in any real shape that application developers can take advantage of.

Firstly, there's "PCBGROUPS", which looks to group PCB (protocol control block) data for a connection local to a CPU. Instead of there being one global PCB table for the system (well, VIMAGE for FreeBSD - each virtual image instance has its own PCB table) with one lock protecting it, there's now multiple PCB tables, one per "thing". Here, the thing is whatever the kernel developer thinks is worth grouping them by.

Now, until the RSS work went in, this code was in FreeBSD but sat unused. A kernel developer could provide the hooks needed to map TCP (and maybe UDP later) flows to a "thing" and have that map to a PCB group table - but it required some glue to stamp incoming connections and outgoing packets with some identifier (which we call a "flowid" in FreeBSD) with something that can map to said "thing". Then whenever a PCB lookup was needed, it would first try the lookup in the table mapped to by the mapping between the "flowid" and "thing" - if it was successful, it wouldn't have to use the global PCB table to do the lookup.

This is only good for established connections - creating and destroying a connection still requires manipulating that global PCB table and the single PCB table lock. I'm going to ignore fixing that for now, as that is a bigger issue.

Then Robert Watson added the RSS work done under contract to Juniper Networks, Inc. RSS provides one kind of mapping between the flowid from the NIC and which CPU to run work on. So that part worked great - but there wasn't any way for the application user to take advantage of it. Additionally, there's no driver awareness of it yet - I'll discuss this shortly.

So I grabbed a bunch of this work whilst at Netflix and tried to make sense of it. It turns out that if you can keep the work local to a CPU, a lot of the lock contention in the networking stack melts away. Here's what's going on:

  • The receive thread(s) in the NIC driver processing packets are typically doing direct dispatch to the network stack - so they're running the receive side of the TCP stack;
  • .. and the receive side of the network stack includes ACKs, which triggers the transmit side of the network stack;
  • There's typically some deferred thread(s) in the NIC driver transmitting packets to each NIC queue;
  • There's also application threads trying to queue data to the TCP socket, which also can dig into the socket and TCP stack state, which involves grabbing locks;
  • And there's also timers firing to update state, and doing this involves grabbing locks.
Without RSS and without lining everything up on CPU cores, all the above can run on different cores. Whenever any of them try running at the same time, lock contention can occur and that particular task can stop. If the lock contention blocks the transmit or receive NIC threads, then not only is that connection affected - the whole NIC processing is affected.

There's still lock contention in the network stack - especially if you're doing a lot of new, short connections. The good folk at Verisign are working on that particular corner of the problem so I'm happy to defer to them.

So, I ended up doing a bunch of little pieces to get this lined up right:
  • The per-CPU timer callwheels can now be optionally pinned to their CPU cores, so timer events running on CPU X actually do run on CPU X (yes, that was amusing to find..);
  • There's support in the TCP stack for per-CPU timers, but it's not enabled by default;
  • ... and it also didn't query RSS, netisr or anything to figure out how to map a flowid to a given CPU to run a timer on;
  • Then to make matters worse, incoming TCP sessions didn't have a flowid assigned to the PCB until after the first data packet was read - which meant that the initial timer work would all assume CPU 0 and any queries on that particular PCB would return flowid=0 - so it would not find it in the right PCBGROUP.
So those are fixed in FreeBSD-HEAD. The per-CPU TCP timer and pinned-CPU timers aren't enabled by default - I'll only flip that on when I'm confident that the RSS stuff is working.

So that lets all the RSS stuff correctly work. But there wasn't a nice way to query the per-connection flowid or RSS information. So I then extended netstat to have 'R' as a flag - it returns the flowid and the flowid type. I'll add RSS information once I have a nice way to extract it out in bulk. It's still a good diagnostic tool to ensure that the IPv4/IPv6 hashing is working correctly.

Then I had to teach a driver about RSS so I could actually test it all out. I have some igb(4) hardware at home, so I did the minimal work required to teach it about the RSS key and assigning things to the correct CPUs. It's still incomplete but it's good enough to get off the ground. I'll go into more details about the driver requirements in a follow-up blogpost.

Finally, how are application developers supposed to use it? I'll cover that particular bit in another follow-up blog post as there's quite a lot to cover there.

Meraki Sparky boards, and constant resetting

There's a Mesh internet project at Sudo Room and they've been doing some great work getting a platform up and running. However, like a lot of volunteer projects, they're working with whatever time and equipment they've been donated.

A few months ago they were donated a few hundred Meraki Sparky boards. They're an Atheros AR2317 SoC based device with an integrated 2GHz 802.11bg radio, 10/100 ethernet and.. well, a hardware watchdog that resets the board after five minutes.

Now, annoyingly, this reset occurs inside of Redboot too - which precludes them from being (fully) flashed before the unit reboots. Once the unit was flashed with OpenWRT, the unit still reboots every five minutes.

So, I started down the path of trying to debug this.

What did I know?

Firstly, the AR2317 watchdog doesn't have a way of resetting things itself - instead, all it can do is post an interrupt. The AR7161 and later SoCs do indeed have a way to do a full hardware reset if the watchdog is tickled.

Secondly, redboot has a few tricksy ways to manipulate the hardware:

  • 'x' can examine registers. Since we need them in KSEG1 (unmapped, uncached) then the reset registers (0x11000xxx becomes 0xb1000xxx.) Since its hardware access, we should do them as DWORDS and not bytes.
  • 'mfill' can be used to write to registers.
Thirdly, there's an Atheros specific command - bdshow - which is surprisingly informative:

RedBoot> bdshow
name:     Meraki Outdoor 1.0
magic:    35333131
cksum:    2a1b
rev:      10
major:    1
minor:    0
pciid:    0013
wlan0:    yes 00:18:0a:50:7b:ae
wlan1:    no  00:00:00:00:00:00
enet0:    yes 00:18:0a:50:7b:ae
enet1:    no  00:00:00:00:00:00
uart0:    yes
sysled:   no, gpio 0
factory:  no, gpio 0
serclk:   internal
cpufreq:  calculated 184000000 Hz
sysfreq:  calculated 92000000 Hz
memcap:   disabled
watchdg:  disabled (WARNING: for debugging only!)

serialNo: Q2AJYS5XMYZ8
Watchdog Gpio pin: 6
secret number: e2f019a200ee517e30ded15cdbd27ba72f9e30c8

.. hm. Watchdog GPIO pin 6? What's that?

Next, I tried manually manipulating the watchdog registers but nothing actually happened.

Then I wondered - what about manipulating the GPIO registers? Maybe there's a hardware reset circuit hooked up to GPIO 6 that needs to be toggled to keep the board from resetting.

Board: ap61
RAM: 0x80000000-0x82000000, [0x8003ddd0-0x80fe1000] available
FLASH: 0xa8000000 - 0xa87e0000, 128 blocks of 0x00010000 bytes each.
== Executing boot script in 2.000 seconds - enter ^C to abort
RedBoot> # set direction of gpio6 to out
RedBoot> mfill -b 0xb1000098 -l 4 -p 0x00000043
RedBoot> x -b 0xb1000098
B1000098: 00 00 00 43 00 00 00 00  00 00 00 00 00 00 00 03  |...C............|
B10000A8: FF EF F7 B9 7D DF 5F FF  00 00 00 00 00 00 00 00  |....}._.........|

RedBoot> # pat gpio6 - set it high, then low.
RedBoot> mfill -b 0xb1000090 -l 4 -p 0x00000042
RedBoot> mfill -b 0xb1000090 -l 4 -p 0x00000002

.. then I manually did this every minute or so.

RedBoot> mfill -b 0xb1000090 -l 4 -p 0x00000042
RedBoot> mfill -b 0xb1000090 -l 4 -p 0x00000002
RedBoot> mfill -b 0xb1000090 -l 4 -p 0x00000042
RedBoot> mfill -b 0xb1000090 -l 4 -p 0x00000002

.. so, the solution here seems to be to "set gpio6 to be output", then "pat it every 60 seconds."

I hope this helps people bring OpenWRT up on this board finally. There seems to be a few of them out there!

Adding chipset powersave support to FreeBSD’s Atheros driver

I've started adding some basic powersave support to the FreeBSD Atheros ath(4) driver. The NICs support putting parts of the device to sleep to conserve power but.. well, it's tricky.

In order to make things consistent, I either need to not do things when the NIC is asleep (for example, doing calibration when the NIC isn't running), but I also need to ensure that I force the NIC awake when the NIC may be asleep. During normal running, the NIC may have put itself into temporary sleep whilst waiting for some packets from the AP to signal that it needs to wake up. So I will also need to force the NIC awake before programming it.

So, before I start down the path of handling the whole dynamic power management stuff, I figured I'd tackle the initial bits - handling powering on the NIC at startup and powering it off when it's not in use. This includes powering it down during device detach and suspend, as well as when all of the VAPs are down.

This is turning out to be slightly more complicated than I'd like it to be.

The first really stupid thing I found was that during the interface down process, the VAP state change from RUN -> INIT would reset the BSS, which included re-programming the slot time. So, I have to wake up the hardware when programming that. It can then go back to sleep when I'm done with it.

Now there's some issues in the suspend path with the NIC being marked as asleep when it is being reset, which is confusing - the NIC should be woken up when ath_reset() is called. So, I'll have to debug these.

The really annoying bit is that if I read a register whilst the silicon is asleep, the reads return 0xDEADBEEF. So if I am storing the register contents anywhere, I'll end up storing and programming a potentially totally invalid value.

There's also some real problems with race conditions. I can put the power state changes behind a lock, but imagine something like this:

* ATH_LOCK; force awake; do something; ATH_UNLOCK .. ATH LOCK; do some more; put back to sleep; ATH_UNLOCK

Now, if a second thread puts the NIC back to sleep in between those two lock sections, the second "do some more" work may occur once the NIC was put to sleep by said second thread. So I have to correctly track if the NIC is being forced awake by refcounting how many times its being forced awake, then when the refcount hits zero and we can put it to sleep, put it back to sleep.

Once this is all done, I can start down the path of supporting proper network sleep - where the NIC stays asleep and wakes up to listen for beacons and received frames from the AP. I then choose to force the NIC awake and do more work. I have to make absolute sure that I don't queue things like transmitted frames or add more frames to the receive queue if it may fall asleep. There's also some mechanisms to have a transmit frame put the NIC to sleep - there's a bit that says "when this frame is transmitted, transition the NIC back to sleep." I have to go and figure out how that works and implement that.

But for now, let's keep it simple and debug just putting the NIC to sleep when it's not in use.

Porting over the AR8327 support

It's been a while since I posted. I'll post about why that is at some point but for now I figure it's time I wrote up the latest little side project - the Atheros AR8327 switch support.

The AR8327 switch is like the previous generation Atheros switches except for a couple of very specific and annoying differences - the register layouts and locations have changed. So it's not just a case of pretending it's an AR8316 except for the hardware setup - there's some significant surgery to do. And no, I did try just ignoring all of that - the switch doesn't come up and pass packets.

So, the first thing was to survey the damage.

The Linux driver (ar8216.c) has a bunch of abstractions that the FreeBSD driver doesn't have, so that's a good starting point. The VLAN operations and VLAN port configuration stuff is all methods in the Linux driver, so that was a good starting point. I stubbed most of the VLAN stuff out (because I really didn't want it to get in the way) - this turned out to be more annoying than I wanted.

Next was the hardware setup path. There's more configurable stuff with the AR8327 - there's two physical ports that I can configure the PHY/MAC parameters on for either external or internal connectivity. I just took the code from Linux (which yes, I have permission to relicence under BSD, thanks to the driver authors!) and I made it use the defaults from OpenWRT for the DB120. The ports didn't properly come up.

I then realised that I was reading total garbage from the PHY register space, so I went looking at the datasheet and ar8216 driver for some inspiration. Sure enough, the AR8327 has the PHY MDIO bus registers in different locations. So after patching the arswitch PHY routines with this knowledge, the PHYs were probed and attached fine. Great. But it still didn't detect port status changes.

So, back to the ar8216 driver. It turns out that there were a few things that weren't methodized - and these were the bits that read the PHY status from the switch. Both drivers didn't just poll the PHYs directly - they read the switch registers which had a summary of the port status. So, I taught the driver about this and voila! Port status changes worked.

But, no traffic.

Well, there's a few reasons for this. It's a switch, so I don't have to setup anything terribly difficult. The trick here is to enable port learning and make sure they're all in the same VLAN group. Now, here's where I screwed up and I found a bug that needed working around.

The port setup code did enable learning and put things into a vlan group.

Firstly, I found this odd behaviour that I got traffic only when I switched the ethernet cable to another port. Then learning worked fine. I then found that the ar8216 driver actually triggers a forwarding table flush upon port status change, so I added that. This fixed that behaviour.

But then it was flooding traffic to all ports. This is kinda stupid. What did I screw up? I put each port in a separate vlangroup, rather than put them in the same vlangroup. Then, I programmed the "which ports can you see?" to include all the other ports. What this meant was:
  • The forwarding table (ie, what addresses were learnt) were linked to the vlangroup the port is in;
  • .. and when the switch did a lookup for a given MAC on another port, it wouldn't find it, as the address in the forwarding table showed it was for another vlangroup;
  • .. so it would do what switches do when faced with not knowing about the MAC (well, and how I had configured it) - it flooded traffic.
The solution was thankfully easy - I just had to change the vlangroup (well, "port vlan" here) to be '1', instead of the port id. Once this was done, all the ports came up perfectly and things worked great.

So, this now works great on the Atheros DB120 reference board. It's not working on other boards - there's likely some timing issues that need to be resolved. But we're making progress!

Finally, I spent a bunch of time porting over the port configuration and LED configuration stuff from OpenWRT so I didn't have the driver just hard-coded to the DB120 board. I'll update the configuration and code when I get my hands on other boards that use the AR8327 but for now this is all I have.


Hacking on Mindwave for fun and .. fun

Allison (and others, like a game developer named Lat) showed interest in these Neurosky Mindwave headsets. They're little wireless (bluetooth, almost!) headsets that ship with a cheap USB dongle and expose their data via a binary protocol.

The protocol is not consistently and well documented. It's out there, if you can craft the right search queries. For the USB widget, you need to implement the basic handshake commands to attempt to connect to a given (or any) headset. Then you also need to implement the data decoding for the raw and processed data.

Now, I don't want to go into the details - you can read the documentation and my very bad, hacked up code.

The USB dongle didn't work with FreeBSD-9.x. It's a cheap chipset (CH341) and it just wouldn't transmit. It works fine on FreeBSD-HEAD though.

So, to explore it, I wrote a simple, hackish library to encapsulate pairing, parsing, data gathering. It needs a lot of improvement but it's there. Then, I (re-)learnt enough SDL and OpenGL to plot some data points. Finally, I grabbed a FFT library to poke at the returned data to see if it makes sense.

A few points thus far.

I still haven't found any correlation with the attention / meditation parameters the firmware returns. For the most part, you just have to stop any kind of muscular movements.

The raw values clip very easily with any kind of muscular movement. I can see how to decode say, "blink" as a muscular action though.

I've only started looking at the raw FFT results. Hopefully with a bit of filtering I'll see things that actually look like basic EEG results, or I'll concede these things are expensive muscular reaction devices.

The code:

And the obligatory screenshot:

Experimenting with zero-copy network IO in FreeBSD-HEAD

Back when I started all of this networking hacking, the "big thing" was the overhead of doing poll() and select(). Various operating systems came up with ways of eliminating these - FreeBSD grew the kqueue infrastructure; linux received epoll, Solaris received an epoll-like device and then ended up with some form of kqueue-like event mechanism. Windows has completion ports/overlapped IO which combined the event mechanism with a zero-copy way of doing network IO.

So the Free/Open operating systems have scalable event notification mechanisms for handling large numbers of concurrent sockets but they don't all have some nice, efficient way of doing zero-copy network IO.

Linux has splice()/tee()/vmsplice(). So yes, it effectively does have a way of doing zero-copy socket reading and writing.

OpenBSD does have a splice style syscall to copy data from a source to a destination TCP socket.

FreeBSD, however, has mostly focused on the "disk to network" path for content serving and thus has a lot of time invested in their sendfile() implementation. This is great if you're doing a lot of file to network sending (which Netflix does), but it has some serious shortcomings. The main one I'll address here is the lack of being able to do general zero-copy socket writes from userland. So it can only send data from disk files to the network. You can't implement a zero-copy intermediary proxy server, nor a memory cache that keeps things in pre-allocated memory regions. You have to use disk files (whether that be a real filesystem on disks, or a memory filesystem) and leverage VM hints to control caching.

Recently there was some new sendfile() work to allow sending from POSIX shared memory segments. This intrigued me - it's not the most effective way of doing zero-copy network IO from userland but it's a start. So I set off to write an updated version of my network library from yesteryear to implement some massively parallel network applications with.

The idea is simple - you allocate a POSIX shared memory segment. You then mmap() that region into memory and treat it as a place to allocate write-side network buffers from. Then you use the shared memory filedescriptor and offset to schedule a sendfile() from the shared memory segment to the destination network socket. It's not as elegant as having a write path that wires the memory down and just populates mbufs from that, but that'll come later.

Here's what I found.

Firstly, there's no asynchronous "I'm done!" notification for the sendfile path. So you have no explicit notification that the underlying memory has been freed so you can reuse it. sendfile() has the SF_SYNC flag which causes it to sleep until the transaction is done - primarily so users can be sure they can change the underlying file contents after the syscall completes. This is used by caches such as Varnish that leverage on-disk files as their cache filesystem space.

So I've been adding that. I have a working prototype that is scaling quite well under load and I'll look to commit it to FreeBSD-HEAD soon. It posts a knote to a kqueue filedescrpitor once a transaction has completed.

Once that was done, I started benchmarking the performance of this setup.

The first real roadblock I hit was massive VM contention on the shared memory segment. It turns out that a single POSIX shared memory segment is represented as a single vm_object and this is protected by a single lock. So when 8 threads are actively doing IO from the same shared memory segment it hits massive lock contention. I fixed this in my test suite by allocating one shared memory segment per thread. It's not elegant but it works well enough for benchmarking.

I next hit issues with contention on the VM page lists. Besides the per-object list, there's also a global per-type list (active, inactive, etc.) There's one lock protecting each of these lists. What I found was the VM was shuffling pages between active/inactive and at the traffic rates I was doing (20+gbit/sec) it was a few hundred thousand pages a second being shuffled around. The solution? mlock() the whole region into memory. This prevented the VM from having the pages change state so often and eliminated that overhead.

The code for doing this sendfile() work with posix shared memory is in my libiapp code - . It's terrible and hacky - I'm just experimenting with things for now. But with some tuning, I can get a good 35Gbit/sec out of 70,000 active TCP sockets. There's still a long way to go - I shouldn't be saturating an 8-core CPU with this traffic level when I'm doing no socket data copies. I'll write another update or two about that soon.

Now, what would I like to see? I did some experiments with physical disk IO using the FreeBSD AIO paths doing the same kinds of IO patterns as I am doing with network socket IO (4KiB to 64KiB random disk reads.) It turns out if you do everything correctly, the FreeBSD AIO code will turn physical disk IO into asynchronous disk buffer transactions by wiring the userland buffer into memory and then using that as the backing buffer memory. The overhead of doing the pmap work for this was not too high. So, I wonder if it's worth writing a new transmit path that uses the pmap code (and not the VM!) to wire in a region of memory and then use that for transmit buffers. Combined with an iovec style array of buffers and the above kqueue notification of the network IO completion, I think we can end up with a much more flexible method of doing network IO from userland without the shortcomings by using POSIX shared memory with sendfile().

Doing arduino development on FreeBSD-HEAD

I'm a sucker for punishment.

Or, I noticed that FreeBSD's pkgng binary package repository ships with a port of the Arduino development environment. It's this java thing that wraps around avr-gcc and avrdude. It's very popular, it's open source, and I figured what the hell.

I plugged in my Arduino Leonardo and .. it was detected as a umodem device. Excellent!

.. and then it wasn't. It went away very quickly and came back as a single interface (OK) with three child interfaces (Hm, okay), but only one uhid (human interface) interface active (Not Ok.) The modem port used to program and talk to the thing wasn't there.

I then went on a bit of a journey. I found that quite some work had already been done to correct issues in the FreeBSD USB stack - however, it still wasn't working. It showed up fine - it identified itself as a generic USB serial port device, and yet umodem didn't bind to it.

Next - the umodem source code. It yes, claimed anything identifying as a USB serial class device - but it only claimed devices that ALSO identified as an AT-class modem. Yes, a serial modem that you speak AT commands to. The Leonardo identifies itself as a USB serial class device but with NO command encoding. umodem didn't like that.

So, to the USB 1.1 standards documention! After reading the relevant bits, I discovered that the rest of the device handling is the same! Ie, it doesn't matter whether the device says "I speak AT commands" or "I speak no commands", it's still serial. This identifier is just for the upper layer application to decide whether to send AT commands or not.

Thus the fix was simple - also claim devices that say "no commands" as well as "AT commands." That fix is in -HEAD and I hope to try and sneak it into 10.0.

And with that - FreeBSD-HEAD is now a viable development environment for the Arduino Leonardo.

So, FreeBSD on the AR9344? What happened?

I committed a bunch of code a while ago to FreeBSD-HEAD to at least start booting on the AR934x SoCs. The AR934x SoC is a MIPS74k core - a dual-issue superscalar 11-stage pipeline MIPS32r2 CPU. It's slightly different to the existing MIPS24k stuff (which is a single 8-stage pipeline.)

So - first step - it booted up a little, then hit a machine check. At that point the FreeBSD MIPS peeps believed there was hilarity in the TLB exception handling code, so we put it to sleep for a while and I went back to real work.

Then a few weeks ago I decided to finish it off. I brought my developer board to Eurobsdcon in Malta and sat down with Warner Losh, who also has said developer board. We spent a bunch of time going over the TLB code and realised that FreeBSD's instruction/execution hazards are all.. just wrong. Then, on a whim, I read up some more about MIPS32r2 and superscalar stuff and discovered that the correct hazard instruction isn't NOPs or SSNOPs - it's EHB (execution hazard barrier.) It's 'SLL $0, $0, 3' in MIPS parlance which on older CPUs is just a NOP (since register 0 is always 'zero'.) So, this fixed the TLB management and the boot proceeded quite a bit further.

Next - bringing up ethernet and the switch PHY. I was seeing totally crappy and invalid register values when reading/writing the attached switch chips. Even probing didn't work reliably - in fact, I got to the point where I was reading the value I'd expect from the previous register read. So, I wondered if this was another out-of-order behaviour from the MIPS74k superscalar architecture.

After digging into the MIPS bus space code, I found two things:

  1. The MIPS driver(s) don't call bus barrier functions at all - so there's no driver enforced access ordering. It was all assuming that the CPU doesn't re-order things; and
  2. The bus barrier code for MIPS was a no-op. It just plainly wasn't defined.
So, I added read/write memory barriers to the MIPS bus barrier routines and I modified the ethernet driver to use barriers. For good measure, I also added barriers to the SPI driver code as that also has a bunch of register accesses that require ordering.

And with that, the switch PHY probe/attached fine, the SPI driver worked fine and the device started booting userland off of SPI connected NOR flash.

Then, it hung. I dug into that a bit and wondered what the hell was going on. Then after a day of poking, I discovered that the interrupt acknowledgement was not working. It's a quirky thing that I should really fix in the atheros platform support - the AR71xx chips don't require the CPU peripheral interrupts to be ack'ed (eg the uart) but later chips do. I added the AR934x to the list of SoCs that need interrupts to be ack'ed and the system kept booting, all the way to userland.

Next - I haven't yet written the AR8327 support but I started fleshing out the AR934x on-board switch support. I got it probing, attaching.. but not passing any traffic. After more digging, I realised my mistake - I was writing some registers incorrectly. I would mask out the right bits to set, but then I'd always set bit 0. Sigh. So, that came up and things worked.

Then I decided to do the wifi part. This was pretty damned simple. The HAL from Qualcomm Atheros already has support for the AR934x in it and I had already modified it to work for the AR933x SoC (which just required me to 'teach' it the FreeBSD way of exposing the calibration/configuration data from on-board flash.) So, all I had to do was this:

  1. Add the device to the kernel configuration;
  2. Add a hint pointing out where the device is mapped in IO space;
  3. Add a hint pointing out where the calibration data is in the NOR flash;
  4. Reboot.
That's it. No weeks of merging code in from Linux or the internal Qualcomm Atheros driver into the FreeBSD driver. No real debugging required. Just enable it, point it at the right place in memory/flash and .. boot it. I think this again vindicates my efforts to open source the Qualcomm Atheros HAL - I just inherit this working code for free. I don't have to try and merge it into anything.

So, I have a port that's dirty and working. There's a lot of infrastructure changes I need to commit before I can commit this port - lots of new clocking options (there's now variations on the clock rate that the MDIO bus (the MII bus connecting the ethernet port(s) to a PHY or switch), there's lots of new configuration options for how the on-chip ethernet port(s) map to external ports and a bunch of other ancillary stuff that's not really worth mentioning. But it's going to show up in FreeBSD-HEAD soon.

The gymnastics required to just do a "HALT" for MIPS..

So, it turns out there's no nice, guaranteed way to implement a HALT style setup for MIPS in the idle thread in the kernel. I'll braindump what happened about two years ago to try and address this.

(And I do hope that the implementation actually works!)

When the kernel isn't running any work, it will schedule the "idle" thread to run. This has the simple(!) task of entering a sleep state, only to wake up when another interrupt fires. This saves power consumption and generates less heat.

However, it's a little trickier than just "enter idle state."

The only way to exit it is to receive an interrupt. Now, this can be device interrupts; this can be timer interrupts (and yes; the timer is a kind of device, tsk..) but without it, the CPU will stay halted. This isn't a big deal - the UNIX kernel tends to be one big event processing "thing" and these include both software and hardware events. If it's a software event that needs to happen "now", the kernel won't go into the idle loop - it'll just run the needed event. If it has to happen in the future, it'll schedule it to occur after some time has elapsed - and this will be driven by a timer interrupt. An interrupt would wake up the system from the HALT state. If the interrupt occured just before the HALT instruction was executed, there'd be a very small window where the interrupt would occur - the interrupt routine would be called, complete, and then the HALT instruction would next execute.

Now, imagine you're an ethernet driver on FreeBSD-10 where the interrupt handler just scheduled some ithread to run in the future. Here's one example of what may happen:

  • The idle thread is executed;
  • An interrupt occurs, say to signal completion of some ethernet frame transmission;
  • The CPU kicks off the interrupt code;
  • The interrupt code doesn't find a fast interrupt handler but finds an ithread; so it schedules the ithread to run;
  • The scheduler goes to choose another thread to run and finds an ithread scheduled at a higher priority, so it schedules that;
  • The ethernet driver code in the ithread runs;
  • The scheduler then exits and re-enters the idle loop.
Everything is fine, right?

But, with the event timer changes that came in during the 9.x time frame, the halt code is now in a critical section.

So now, the following happens:

  • The idle thread is executed;
  • critical_enter() is called;
  • An interrupt occurs, which calls the fast handler, which schedules the ithread to run;
  • .. but critical_enter() has set a flag that says "no preemption", so the ithread doesn't get to run;
  • Control is returned to the idle function;
  • The idle thread gets to the point where it executes "HALT";
  • The idle thread continues to run, executing HALT.
In this instance, the ithread is scheduled but can't run before HALT runs. The ithread only runs after the next interrupt occurs.

I noticed this was happening when doing traffic tests on FreeBSD-HEAD between 802.11 and ethernet interfaces. The atheros 802.11 hardware implemented interrupt moderation so there wouldn't be tens of thousands of interrupts a second being generated. But what I saw was occasionally the receive queue being filled by packets and not drained fast enough. When digging into it, I found that due to interrupt moderation, if the interrupt came in just before WAIT was executed, the 802.11 receive function was taking a long time (sometimes up to milliseconds) to run after the interrupt came in and the ithread was actually scheduled. If I had an interrupt for each received packet, the amount of time between interrupts would have been very small (20,000 packets a second, so around 1/20000 sec per interrupt) and this problem would've been masked. But with moderated interrupts, it would be 750 microseconds or so before the next receive interrupt was generated.

Now, this is messy. There's some hacks in the idle loop code to try and skip the halt bit if the scheduler detects there's something to run. But there's still a small race window there which needs to be closed.

How can this be solved?

Apparently - the two instructions STI;HLT on x86 are atomic. Ie, there's no race window between them. If an interrupt comes through, HLT doesn't run. This doesn't happen for MWAIT or the ACPI sleep states and I am concerned we're still possibly hitting this race window from time to time. The specific behaviour is that the STI causes interrupts to be enabled following the next instruction. So yes, there's no window.

All that's left now is to make sure that interrupts are disabled before you do the scheduler check so no new interrupt processing on that thread can be scheduled. Ie, when entering cpu_idle():

  • Call critical_enter();
  • Disable interrupts;
  • See if the scheduler has anything to do - if so, enable interrupts and skip calling the idle loop
  • If there's nothing to do - and since interrupts are disabled, nothing new will have happened (like an interrupt scheduling an ithread) then just call the idle function
    • Which may or may not enable interrupts before entering the idle loop (you enter ACPI with interrupts disabled, but you enter HLT with interrupts enabled)
  • .. and then call critical_exit(), which will let the kernel continue preempting.
For MIPS however, there's a clever hack. (No, not from me.)

Here's how it works:
  • In the idle loop, it calls mips_wait()
  • mips_wait() is a bit of assembly code that will:
    • disable interrupts
    • see if the scheduler sees anything running
    • .. and if so, it doesn't bother running the WAIT instruction! Just enable interrupts and jump over the WAIT;
  • .. but the bit of code that re-enables interrupts and calls WAIT is aligned to a 16 byte boundary and the address is a symbol (MipsWaitStart).
  • Then in the exception handling code (MipsKernIntr), it sees if the instruction pointer where the exception occured is in the 16 bytes (4 instructions) at MipsWaitStart.
  • .. if it is, it adjusts the return address from the interrupt to be after the WAIT instruction.
It's totally dirty and to be quite honest, I haven't at all tested it. Yes, it should be tested.

Finding low hanging fruit with PMC, or "O(wtf)" ?

I've lately been focusing on performance counter stuff on Sandy Bridge (Xeon and non-Xeon.) Part of this has been fixing some of the counters that were wrong. Part has been digesting the Intel tuning guides and the Intel micro-architecture for Sandy Bridge. It's a little different to the older school pipeline driven architecture that rules the MIPS world.

So, I fired up some of my scripts (at on a live cache pushing a whole lot of live video netflix traffic. The scripts use the PMC framework in global counter mode rather than sampling mode, so it's cheap to do and doesn't affect performance.

What I found:

  1. The pipeline slots per cycle metric is around 16% - so there's a lot of stalling going on.
  2. There's a lot of memory traffic going on - around 50% of clock cycles are spent in LLC_MISS - ie, it wasn't in L1, L2 or L3/LLC (last-level cache) and thus has to be fetched from memory.
So, I started digging into why there were so many memory accesses. It turns out the biggest abuser was the cross-CPU IPI involved in synchronising page mapping tables - there are a few places calling pmap_invalidate_range() as part of sendfile() buffer completion and this was causing issues. I pointed this out, someone else has addressed it internally. (Ideally if the IO path uses unmapped buffers on amd64, there shouldn't be any need to map them in and out of KVA.) I think that saved about 4% of total clock cycles spent being stalled.

Then I found a lot of stalling going on in the mwait and ACPI sleep path. It turns out that these paths seem to involve doing ISA space IO port accesses. These are .. very slow. I've just flipped my testing over to use no mwait and use HLT.

Next - flowtable had been turned on during experimentation and I had noticed that the flowtable expire/flush code would periodically spike up. It spiked up more when more clients and more TCP flows were connected. It showed up in both memory accesses and clock cycles busy PMCs - and the reason made me laugh out loud.

The flowtable uses a bitstring_t - effectively an array of bytes treated as a bitmap, like select() FD_SET's - and would walk this to look for flows to expire.

The expiry code would walk the list looking for flows to expire - it would loop over the entire set, calling ffs() over the whole set to look for the next new flow to check.

.. so looping over looping over the whole set. O(n^2). Right there, in the flow cleaning path. Doing byte offset fetches, rather than 32-bit fetches. Everything about it was ridiculous. As we scaled up to serve more flows the flowcleaner CPU cycle count was spiking really, really hard.

I pointed this out in an email to my coworkers and fell asleep. It was fixed when I awoke - a co-worker fixed it to be correctly O(n) whilst I was sleeping. It's now totally disappeared from the CPU cycle and stall analysis.

So, I've just been chipping away at things here and there. There are some larger scale issues that I really want to address but I'd like to make sure all the odd, silly and remaining low hanging fruit are addressed. Then comes the fun stuff.

Adrian Chadd’s Ramblings 2013-08-19 16:52:00

This blog post is mostly so I don't forget this kind of stuff. mentions "% execution stalled". This is the core i7 document rather than the Sandy Bridge document, but bear with me.

The formula is:


However, there's no UOPS_EXECUTED.CORE_STALL_CYCLES in the PMC documentation, nor is it in the Intel SDM chapter on performance counters.

But wait! It kind of is there. There /is/ UOPS_EXECUTED.THREAD, which is "Counts the total number of uops to be executed per thread each cycle." In the same block, it says that to count stall cycles, set CMASK=1, INV=1. Ok, so how does one do that with PMC?

# pmcstat -S UOPS_EXECUTED.THREAD,inv,cmask=1 -T -w 5

Now, it seems to be showing me the ACPI wait and MWAIT functions as high sample events - which is odd, as I didn't think this particular PMC measured C1 and MWAIT states. I'll chase this up.

For Sandy Bridge it's UOPS_DISPATCHED.THREAD - this counts dispatched micro-operatons per-thread each cycle. CMASK=1,INV=1 counts the number of stall cycles.