Category Archives: FreeBSD

Intel DDIO, LLC cache, buffer alignment, prefetching, shared locks and packet rates.

I've been digging into the low-level behaviour of high throughput packet classification and forwarding for work. The initial suggestion from everyone was "use netmap!", which is cool, but netmap only seems to do fast packet work if you're only ever flipping packets between receive and transmit rings. Once you start actually looking into the payload, you start taking memory misses and things can slow down quite a bit. An L3 miss (ie, a RAM access) on Sandy Bridge is ~50ns. (There's also a cost involved in walking the TLB, but I won't cover that here.)

For background: http://7-cpu.com/cpu/SandyBridge.html .

But! Intel has this magical thing called DDIO. In theory (and there's a lot of theory here), DMA is done via a small (~10%) fraction of LLC (L3) cache, which is shared between all cores. If the data is already in cache when the CPU accesses it, it will be quick. Also, if you then wish to DMA out data from something in cache, it doesn't have to get flushed to memory first - it's just DMAed straight out of cache.

However! When I was doing packet bridge testing (using netmap + bridge, 64 byte payloads), I noticed I was using a significant amount of memory bandwidth. It wasn't quite at the rate of 10G worth of bridged data, but at 64 byte payloads DDIO should be doing almost all of that work for me.

So, to reproduce: run netmap bridge (eg 'bridge -i netmap:ix0 -i netmap:ix1') and run pkt-gen between two nodes.

This is the output of 'pcm-memory.x 1' from the intel-pcm toolkit (which is available as a binary package on FreeBSD.)

---------------------------------------||---------------------------------------
--                   System Read Throughput(MB/s):    300.68                  --
--                  System Write Throughput(MB/s):    970.81                  --
--                 System Memory Throughput(MB/s):   1271.48                  --
---------------------------------------||---------------------------------------

The first theory - the bridging isn't occurring fast enough to service what's in the LLC before it gets flushed out by other packets. So, assume:

  1. The DDIO region is 1/10th of the LLC - 1/10th of an 8 core * 2.5MB per core setup is ~2MB.
  2. 64 byte payloads are being cached.
  3. Perfect (!) LLC use.
That's 32,768 packets at a time. Now, netmap is doing ~1000 packets a batch and it's keeping up with line rate bridging on one core (~14 million packets per second), so it's not likely that.

Ok, so what if it's not perfect LLC usage?

Then I thought back to cache line aliasing and other issues that I've previously written about. If the buffers are all perfectly aligned (say, 2048 byte aligned), cache line aliasing effects should manifest themselves as poor LLC utilisation.

Luckily netmap has a twiddle - 'dev.netmap.buf_size' / 'dev.netmap.priv_buf_size'. They're both .. 2048. So yes, the default buffer sizes are aligned, and there's likely some very poor LLC utilisation going on.

So, I tried 1920 - that's 2048 - (2 * 64) - ie, two cache lines less than 2048.


---------------------------------------||---------------------------------------
--                   System Read Throughput(MB/s):    104.92                  --
--                  System Write Throughput(MB/s):    382.32                  --
--                 System Memory Throughput(MB/s):    487.24                  --
---------------------------------------||---------------------------------------

It's now using significantly less memory bandwidth to do the same thing. I'm guessing this is because I'm now using the LLC much more efficiently.

Ok, so that's nice - but what about when it comes time to actually look at the packet contents to make decisions?

I've modified a copy of bridge to do a few things, mostly inspired by netmap-ipfw:
  • It does batch receive from netmap;
  • but it then looks at the ethernet header to decap that;
  • then it gets the IPv4 src/dst addresses;
  • .. and looks them up in a (very large) traditional hash table.
I also have a modified copy of pkt-gen that will use completely random source and destination IPv4 addresses and ports, so as to elicit some very terrible behaviour.

With an empty hash set, but still dereferencing the ethernet header and IPv4 source/destination, handling a packet at a time, no batching, no prefetching and only using one core/thread to run:

buf_size=2048:
  • Bridges about 6.5 million pps;
  • .. maxes out the CPU core;
  • Memory access: 1000MB/sec read; 423MB/sec write (~1400MB/sec in total).
buf_size=1920:
  • Bridges around 10 million pps;
  • 98% of a CPU core;
  • Memory access: 125MB/sec read, 32MB/sec write, ~ 153MB/sec in total.
So, it's a significant drop in memory throughput and a massive increase in pps for a single core.

Ok, so most of the CPU time is now spent looking at the ethernet header in the demux routine and in the hash table lookup. It's a blank hash table, so it's just the memory access needed to see if the bucket has anything in it. I'm guessing the cost is the CPU pulling the ethernet and IP headers into a local cache line - DDIO lands the data in the LLC, not in the L1/L2 next to the core.

I next added in prefetching of the ethernet header. I don't have that version of the code any more, so I can't report numbers for it at the moment. What I did there was loop over everything in the netmap RX ring, dereference the ethernet header, and then do the per-packet processing. That was interesting, but I wanted to try batching next. So, after some significant refactoring, I arranged the code to look like this:
  1. Pull in up to 1024 entries from the netmap receive ring;
  2. Loop through, up to 16 at a time, and place them in a batch
  3. Then, for each batch:
    1. For each packet in the batch: optional prefetch on the ethernet header
    2. For each packet in the batch: decapsulate ethernet/IP header;
    3. For each packet in the batch: optional prefetch on the hash table bucket head;
    4. For each packet in the batch: do hash table lookup, decide whether to forward/block
    5. For each packet in the batch: forward (ie, ignore the forward/block for now.)
I had things be optional so I could turn on/off prefetching and control the batch size.
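To make the structure concrete, here's a rough C sketch of that batched loop. It's illustrative only - the pkts[] array is assumed to have already been filled from the netmap RX ring, and ip_flow_lookup() / ip_flow_bucket() are hypothetical stand-ins for the real hash table code.

/*
 * Rough sketch of the batched RX path described above -- not the
 * actual bridge code.  ip_flow_lookup() and ip_flow_bucket() are
 * hypothetical stand-ins for the hash table code.
 */
#include <sys/types.h>
#include <stdint.h>
#include <netinet/in.h>
#include <netinet/ip.h>
#include <net/ethernet.h>
#include <arpa/inet.h>

struct pkt {
    char *buf;                    /* start of the ethernet frame */
    uint32_t src_ip, dst_ip;      /* filled in during decap */
};

extern int ip_flow_lookup(uint32_t addr);      /* hypothetical */
extern void *ip_flow_bucket(uint32_t addr);    /* hypothetical */

static void
process_batch(struct pkt *pkts, int n, int do_prefetch_eth,
    int do_prefetch_bucket)
{
    int i;

    /* Phase 1: optional prefetch of the ethernet/IP headers. */
    if (do_prefetch_eth)
        for (i = 0; i < n; i++)
            __builtin_prefetch(pkts[i].buf);

    /* Phase 2: decapsulate the ethernet + IPv4 headers. */
    for (i = 0; i < n; i++) {
        struct ip *ip = (struct ip *)(pkts[i].buf + ETHER_HDR_LEN);
        pkts[i].src_ip = ntohl(ip->ip_src.s_addr);
        pkts[i].dst_ip = ntohl(ip->ip_dst.s_addr);
    }

    /* Phase 3: optional prefetch of the hash bucket heads. */
    if (do_prefetch_bucket)
        for (i = 0; i < n; i++)
            __builtin_prefetch(ip_flow_bucket(pkts[i].dst_ip));

    /* Phase 4: hash table lookups; forward/block decision ignored. */
    for (i = 0; i < n; i++) {
        (void)ip_flow_lookup(pkts[i].src_ip);
        (void)ip_flow_lookup(pkts[i].dst_ip);
    }

    /* Phase 5: forward the whole batch (push it to the TX ring). */
}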

So, with an empty hash table, no prefetching and only changing the batch size, at buf_size=1920:
  • Batch size of 1: 10 million pps;
  • Batch size of 2: 11.1 million pps;
  • Batch size of 4: 11.7 million pps.
Hm, that's cute. What about with prefetching of ethernet header? At buf_size=1920:
  • Batch size of 1: 10 million pps;
  • Batch size of 2: 10.8 million pps;
  • Batch size of 4: 11.5 million pps.
Ok, so that's not that useful. Prefetching on the bucket header here isn't worthwhile, because the buckets are all empty (and thus NULL pointers.)

But, I want to also be doing hash table lookups. I loaded in a reasonably large hash table set (~ 6 million entries), and I absolutely accept that a traditional hash table is not exactly memory or cache footprint happy. I was specifically after what the performance was like for a traditional hash table. Said hash table has 524,288 buckets, and each points to an array of IPv4 addresses to search. So yes, not very optimal by any measure, but it's the kind of thing you'd expect to find in an existing project.
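For reference, that kind of hash table looks roughly like the sketch below - bucket heads pointing at flat arrays of IPv4 addresses that get scanned linearly. The structure, names and hash function here are illustrative, not the actual test code; the bucket head dereference is what the optional prefetch in the batch loop targets.

/*
 * Sketch of the kind of "traditional" hash table described above --
 * bucket heads pointing at flat arrays of IPv4 addresses.  Names and
 * the hash function are illustrative only.
 */
#include <stdint.h>

#define NBUCKETS 524288                 /* power of two */

struct ip_bucket {
    int count;
    uint32_t *addrs;                    /* flat array of IPv4 addresses */
};

static struct ip_bucket buckets[NBUCKETS];

static inline uint32_t
ip_hash(uint32_t addr)
{
    /* stand-in hash function */
    return ((addr * 2654435761U) & (NBUCKETS - 1));
}

/* returns 1 if addr is in the table, 0 otherwise */
static int
ip_lookup(uint32_t addr)
{
    struct ip_bucket *b = &buckets[ip_hash(addr)];
    int i;

    /*
     * At ~6 million entries that's ~11 entries per bucket -- a linear
     * scan over roughly a cache line worth of addresses.
     */
    for (i = 0; i < b->count; i++)
        if (b->addrs[i] == addr)
            return (1);
    return (0);
}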

With no prefetching, and a 6 million entry hash table:

At 2048 byte buffers:
  • Batch size of 1: 3.7 million pps;
  • Batch size of 2: 4.5 million pps;
  • Batch size of 4: 4.8 million pps.
At 1920 byte buffers:
  • Batch size of 1: 5 million pps;
  • Batch size of 2: 5.6 million pps;
  • Batch size of 4: 5.6 million pps.
That's a very inefficient hash table - each bucket is going to have around 11 IPv4 entries in it, so each lookup is linearly checking almost a full cache line worth of IPv4 addresses. Not very nice. But it is all within a single cache line worth of data, so in theory it's not too terrible.

What about with prefetching? All at 1920 byte buffers:
  • Batch size of 4, ethernet prefetching: 5.5 million pps
  • Batch size of 4, hash bucket prefetching: 7.7 million pps
  • Batch size of 4, ethernet + hash bucket prefetching: 7.5 million pps
So in this instance, there's no real benefit from doing prefetching on both.

For one last test, let's bump the bucket count from 524,288 to 2,097,152. These again are all at buf_size=1920:
  • Batch size of 1, no prefetching: 6.1 million pps;
  • Batch size of 2, no prefetching: 7.1 million pps;
  • Batch size of 4, no prefetching: 7.1 million pps;
  • Batch size of 4, hash bucket prefetching: 8.9 million pps.
Now, I didn't quite predict this. I figured that since I was reading in the full cache line anyway, having up to 11 entries in it to linearly check would be cheap. It turns out that no, that's not exactly true.

The difference between the naive way (no prefetching, no batching) and 4-packet batching with hash bucket prefetching is not trivial - it's ~50% faster. Going all the way to the larger bucket count was ~75% faster. Now, this hash implementation is not exactly cache footprint friendly - it's bigger than the LLC, so with random flows and thus no real useful cache behaviour it's going to degrade to quite a few memory accesses.

This has been quite a fun trip down the optimisation peephole. I'm going to spend a bunch of time writing down the hardware performance counters involved in analysing this stuff and I'll look to write a follow-up post with details about that.

One final thing: threads and locking. I wanted to clearly demonstrate the cost of shared read locks on a setup like this. There's been lots of discussions about the right kind of locking and concurrency strategies, so I figured I'd just do a simple test in this setup and explain how terrible it can get.

So, no read-locks between threads on the hash table, batch size of 4, hash bucket prefetching, buf_size=1920:
  • 1 thread: 8.9 million pps;
  • 4 threads: 12 million pps.
But with a read lock on the hash table lookups:
  • 1 thread: 7 million pps;
  • 4 threads: 4.7 million pps.
I'm guessing that as I add more threads, the performance will drop.

Even taking a rwlock as a reader lock in pthreads is expensive - on FreeBSD it's essentially just an atomic increment/decrement, but it's still not free. I'm taking the lock once for two hash table lookups - ie, the source and destination IP lookups are done under one lock acquisition. I'm sure that if I took the lock for a whole batch worth of hash table lookups it'd work out a little better on a small number of CPU cores, but I think this demonstrates my point - read locks aren't going to cut it when you have a frequently accessed thing to protect.
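For clarity, the locking pattern being measured is roughly the following - a single pthread read lock taken around both the source and destination lookups. This is a minimal sketch, with ip_lookup() standing in for the real hash table lookup.

/*
 * Minimal sketch of the read-lock pattern: one rwlock taken as a
 * reader around both lookups.  ip_lookup() is a hypothetical
 * stand-in for the real hash table lookup.
 */
#include <pthread.h>
#include <stdint.h>

static pthread_rwlock_t table_lock = PTHREAD_RWLOCK_INITIALIZER;

extern int ip_lookup(uint32_t addr);    /* hypothetical */

static int
classify_packet(uint32_t src_ip, uint32_t dst_ip)
{
    int permit;

    pthread_rwlock_rdlock(&table_lock);
    /* one lock acquisition covers both the source and destination lookups */
    permit = ip_lookup(src_ip) && ip_lookup(dst_ip);
    pthread_rwlock_unlock(&table_lock);

    return (permit);
}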

The best bit about this post? The prefetching, terrible (large) hash table performance and general cache abuse is not new. Doing batching on superscalar Intel CPUs is not new. Documenting DDIO effectiveness using non-power-of-two-aligned buffer sizes is new, but it's just a rehash of the existing cache aliasing effect. But, I now have a little test bed to experiment with these things without having to try and involve the rest of a kernel.

Yes, I'll publish code soon.

Using the arswitch ethernet switch on FreeBSD

I sat down a few weeks ago to make the AR8327 ethernet switch work, and in doing so I wanted to add per-port and 802.1q VLAN support. It turned out that I .. didn't know as much as I thought I did about the etherswitch support. So, after a whole bunch of trial-and-error, I wrapped my head around things. This post is mostly a braindump, so that if I do forget, I at least have something written down about it - at least until I turn it into a FreeBSD manpage.

There are three modes:
  • default - all ports are in the same VLAN;
  • per-port - each port can be in a VLAN 'group';
  • dot1q - each port can be in multiple VLAN groups, with 802.1q tagging going on.
The per-port VLAN group is for switches that don't have an arbitrary VLAN table - you just assign each port an ID from some low set of values (say, 16), and then the VLAN tag can either be added or not added. I think the RTL8366 switch is like this, but I'd have to check.

The dot1q VLAN is for switches that support multiple VLANs, each can have an arbitrary VLAN ID (0..4095) with optional other VLAN options (like tag-in-tag support.)

The etherswitch configuration side has a few options and they're supported by different hardware:
  • Each port has a port VLAN ID - this is the "native port" for dot1q support. I don't think it has any particular meaning in the per-port VLAN code in arswitch but I could be terribly wrong. I thought it did when I initially did the port, but the documentation is .. lacking.
  • Then there's a set of per-port flags - eg q-in-q, 802.1q tagging, etc.
  • Then there's the vlangroup - each vlangroup has a vlan ID, and then a set of port members. Each port member can be tagged or untagged.
This is where things get odd.

Firstly - the AR934x SoC switch support doesn't include VLANs. I need to add that. I'm not sure which side of the wall this falls on.

The switches previous to the AR8327 support per-port and VLAN configuration, but they don't support per-port-per-VLAN tagging. Ie, you can configure 802.1q VLANs and you can enable tagging on the port - but it tags all packets that aren't in the port's 'VLAN ID'.

The per-port VLAN ID seems ignored by the arswitch code - it's only used by the dot1q support.

So I think (and it hasn't yet been tested) that on the earlier switches, I can use per-port VLANs with tagging by:
  • Configuring per port vlans - "etherswitch config vlan_mode port"
  • Adding vlangroups as appropriate with membership - tag/untag doesn't matter
  • Set the CPU port up to have tagging - "etherswitch port0 addtag"
When configuring dot1q VLANs, the mode is "config vlan_mode dot1q" and the 802.1q VLAN IDs are used, but the above still holds - the port is tagged or untagged.

But on the AR8327, the VLAN map hardware actually supports enabling/disabling tagging on a per-port-per-VLAN basis. Ie, when the VLAN table is programmed with the port membership, it takes a list of both the ports and whether each port is tagged/untagged/open/filtered. So, I don't think per-port VLAN tagging works - only dot1q tagging. Maybe I can make it work, but I haven't really sat down for long enough with the documentation to see what combinations are required. To configure dot1q VLANs on the AR8327:
  • Configure the hardware - "etherswitch config vlan_mode dot1q"
  • Add vlangroups as appropriate, set pvid as appropriate
  • For each vlangroup membership, the port can be tagged or untagged - eg to tag the cpu port 0, you'd use '0t' as the port member. That says "port0 is a member, and it's tagged."
I still have a whole lot more to add - the ingress/egress filters aren't configurable, the per-port vlan stuff needs to be made much more sensible and consistent - and the AR934x SoC switch needs to support VLANs. Oh, and much more documentation. But, hey, I can get the thing spitting out VLAN tags, so when it's time to set up my home network with some VLANs, I'll be sure to document what I did and share it with everyone.

Cache Line Aliasing #2, or "What happens when you page align everything"

After a little more digging into the Intel performance side of things, I discovered one of the big reasons for the performance drop on this particular workload: how Intel CPUs do memory reordering.

The TL;DR is this - there's some hardware inside the Intel CPUs that tracks memory ordering and cache contents - but they don't use all the address bits.

The relevant chapter in the intel optimisation guide is 3.6.8 - Capacity Limits and Aliasing in Caches. The specific thing I was hitting was in 3.6.8.2 - Store Forwarding Aliasing.

Assembly/Compiler Coding Rule 56. (H impact, M generality) Avoid having a store followed by a non-dependent load with addresses that differ by a multiple of 4 KBytes. Also, lay out data or order computation to avoid having cache lines that have linear addresses that are a multiple of 64 KBytes apart in the same working set. Avoid having more than 4 cache lines that are some multiple of 2 KBytes apart in the same first-level cache working set, and avoid having more than 8 cache lines that are some multiple of 4 KBytes apart in the same first-level cache working set.

So, given this, what can be done? In this workload, a bunch of large matrices were allocated via jemalloc, which page aligns large allocations. In the default invocation of the benchmark (where the allocation padding size is 0), the memory access patterns showed a very large number of counter events on "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS" - which is the number of 64k address aliases on the Sandy Bridge Xeon processors I've been testing on. (The same occurs on Westmere, Ivy Bridge and Haswell.) As I vary the padding size, the address aliasing value drops, the memory access counters increase, and the general performance increases.

On the test boxes I have, running 'pmcstat -w 120 -C -p LD_BLOCKS_PARTIAL.ADDRESS_ALIAS ./himenobmtxpa M' gives the following (padding bytes, ADDRESS_ALIAS count, benchmark score):

padding   ADDRESS_ALIAS   score
0         217799413       830.995025
64        18138386        1624.296713
96        8876469         1662.486298
128       19281984        1645.370750
192       18247069        1643.119908
256       18511952        1661.426341
320       19636951        1674.154119
352       19716236        1686.694053
384       19684863        1681.110499
448       18189029        1683.163673
512       19380987        1691.937818

So there's still plenty of aliasing going on at the other padding offsets, but there's a very marked drop between 0 and, well, anything else.
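If you want to see the counter move without the full benchmark, a contrived loop like the one below - a store followed by a non-dependent load whose address differs by exactly 4KB - will make LD_BLOCKS_PARTIAL.ADDRESS_ALIAS climb under pmcstat. This is just an illustration of the effect described in the coding rule above, not part of the benchmark.

/*
 * Contrived demonstration of 4KB store-forwarding aliasing: a store
 * followed by a non-dependent load whose address differs by a
 * multiple of 4KB.  Build it and watch the ADDRESS_ALIAS counter.
 */
#include <stdlib.h>
#include <stdint.h>

int
main(void)
{
    /* two regions whose addresses differ by exactly 4096 bytes */
    uint8_t *buf = malloc(2 * 4096);
    volatile uint8_t *a = buf;
    volatile uint8_t *b = buf + 4096;
    uint64_t i, sum = 0;

    for (i = 0; i < (1ULL << 30); i++) {
        a[i & 63] = (uint8_t)i;   /* store ... */
        sum += b[i & 63];         /* ... then a load sharing address bits [11:0] */
    }
    free(buf);
    return ((int)sum & 1);
}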

It turns out that someone's gone and done a bunch more digging into the effects of various CPU magic under the hood. The last paper in the list (Analysing Contextual Bias..) looks at Aliasing and Cache Effects and the effect of memory layout. There's some cute (and sobering!) analysis of the performance changes due to something as simple as the length of your login name in the UNIX environment. It's worth reading.

The summary? Maybe page alignment of all of your memory accesses isn't the way to go.

For further reading:

cache line aliasing effects, or "why is freebsd slower than linux?"

There were some threads on the FreeBSD/DragonflyBSD mailing lists a few years ago (2012?) which talked about some math benchmarks being much slower on FreeBSD/DragonflyBSD versus Linux.

When the same benchmark is run on FreeBSD/DragonflyBSD using the Linux layer (ie, a linux binary compiled for linux, but run on BSD) it gives the same or better behaviour.

Some digging was done, and it turned out it was due to memory allocation patterns and memory layout. The jemalloc library allocates large chunks at page aligned boundaries, whereas the allocator in glibc under Linux does not.

I've put the code online in the hope that others can test and verify this:

https://github.com/erikarn/himenobmtxpa

The branch 'local/freebsd' has my local change to allow the allocator offset to be specified. The offset compounds on each allocation - so with an 'n' byte offset, the first allocation is 0 bytes offset from the page boundary, the next is 'n' bytes offset from the page boundary, the next is '2n' bytes offset, etc.
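A minimal sketch of that compounding-offset idea is below. This is not the actual change in the branch - it's just to show the mechanism: each successive large allocation gets pushed a further 'offset' bytes past the page boundary.

/*
 * Minimal sketch of the compounding-offset allocation trick -- not
 * the actual change in the local/freebsd branch.  A real version
 * would remember the original pointer so it can be handed back to
 * free().
 */
#include <stdlib.h>
#include <stdint.h>

static size_t alloc_count = 0;

void *
offset_malloc(size_t size, size_t offset)
{
    /* jemalloc page-aligns large allocations; over-allocate by a page
     * and return a pointer shifted (count * offset) bytes into it. */
    size_t shift = (alloc_count++ * offset) % 4096;
    uint8_t *p = malloc(size + 4096);

    if (p == NULL)
        return (NULL);
    return (p + shift);
}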

You can experiment with different values and get completely different behavioural results. It's non-trivial: there's a 100% speedup by using a 127 byte offset for each allocation, versus a 0 byte offset.

I'd like to investigate cache line aliasing effects further. There was work done a few years ago to offset mbuf headers in the FreeBSD kernel so they weren't all page-aligned or 256/512/1024 byte aligned - and apparently this gave a significant performance improvement. But it wasn't folded into FreeBSD. What I'd like to do is come up with some better strategies / profiling guides for identifying when this is actually happening so the underlying objects being accessed can be adjusted.

So - if anyone out there has any tips, hints or suggestions on how to do this, please let me know. I'd like to document and automate this testing.

FreeBSD on the POWER8: it’s alive!

A post to freebsd-ppc from a couple of months ago asked if we had support for POWER8 and offered to provide remote access to anyone interested in working on it. I was sufficiently intrigued that I approached the FreeBSD powerpc hackers to ask about it, and was informed that it'd be nice, but we didn't have hardware.

After a bit of wrangling of hardware logistics and with the FreeBSD Foundation purchasing a box, a Tyan POWER8 evaluation server appeared. Nathan Whitehorn started poking at it and managed to get a basic "hello world" going, but stalled on issues with the Linux KVM virtualisation environment.

Fast forward a few weeks - he's figured out the KVM issues, their lack of support for some mandated hypervisor APIs and other bugs - FreeBSD now boots inside of the hypervisor environment and seems stable enough to do development on.

He then found the existing powerpc pmap (physical memory management) code wasn't very SMP friendly - it works fine on one and two CPU powerpc machines, but this POWER8 evaluation board is a 4-core, 32-thread CPU. So a few days of development went by and he rewrote most of the pmap code to be much more fine-grained in its locking and to scale much, much better than the existing code. (He also found the PS3 hypervisor layer isn't thread-safe.)

What's been done thus far?

  • FreeBSD boots inside the hypervisor environment;
  • Virtualised console, networking and storage all work;
  • (in progress) new, scalable pmap implementation;
  • Initial support for the Vector-Scalar Extension (VSX) that's found on POWER7 and POWER8.
So, I'm impressed. Nathan's done a fantastic job bringing the whole thing up. There's some further work on the new powerpc technology that needs doing (things like the new vector processing units, performance counter support and such) and I'm sure Justin and Nathan will poke powerpc dtrace support into further good shape. I'm going to see if we can fit a chelsio 40G NIC into one of these and work with their developers to fix any endian/busdma issues that creep up, and then do some network stack scaling testing with it. There's also the missing hardware/hypervisor support to run FreeBSD on the bare metal, which would be a fantastic achievement.

Now I kind of want some larger POWER8 hardware.

TDMA (somewhat) working on AR9380 chips

(Wow, I have a lot of posts to write to catch up on things.)

I've just brought up FreeBSD's TDMA support on the AR9380 chipset. Specifically, the AR9331, since I have a Carambola 2 on me today.

It was pretty simple to bring up - I was missing the beacon configuration HAL call that the TDMA code expected. It's only used by the TDMA code - the STA and AP modes rely on the normal HAL beacon methods that date back to the Atheros HAL.

The only problem - it seems something is up with ANI (noise immunity) and sensitivity on at least the AR9331. It doesn't seem to behave well on slightly loaded channels and thus the beacons don't always go out when they're supposed to.

But, if you've been wanting to play with TDMA on the later Atheros chips, now you can!

pkg(8) passes coverity scans

At FOSDEM, phk@ reminded me that we should always be running static analysis of the code, on a regular basis, with all the tools available.

We did, but only on an irregular basis, and we only paid attention to the very critical reports, not all of them.

That is now fixed. I relaunched a few scans via Coverity and I'm happy to say that the latest scan on master claims 0 defects!

Meaning that all known defects have been fixed.

I was also planning to use lint(1) as well, but unfortunately lint on FreeBSD doesn't support C99...

If I'm brave enough I may synchronise lint(1) with NetBSD which seems to have added C99 support to that tool. Or maybe someone will volunteer to do it? :)

Build just one FreeBSD port

I usually build my own packages with poudriere, but that's not fun to do on tiny boxes, so I just do ‘pkg install ‘ on them and use upstream packages. One downside is that those packages are built with the default options. I recently ran into a situation where I wanted to change some options for just a single port.

Now, what is the minimal set of things in /usr/ports/ that I need to checkout to be able to config/build just one port?
Turns out to be:


/usr/ports/Mk
/usr/ports/Keywords
/usr/ports/Templates
/usr/ports/ports-mgmt/dialog4ports

And the port I want to actually build. Now I can ‘make config’, change the options and build/install that port without checking out the entire ports tree.

Fan Speed monitoring

Recently I moved a server into a proper cabinet with doors. After a few days I noticed the fans were spinning up and down. So I started investigating ways to monitor the fan speed. I figured having a graph of them long term would give me a nice way to show changes in the environment, beyond the temperature monitoring I am already doing.

I was not having much luck searching the Internet. Luckily, Darius on IRC pointed me to a project called bsdhwmon by Jeremy Chadwick, a fellow FreeBSD Developer. The server is running an older Supermicro X7SBi motherboard with a Winbond 83627HG chip which is listed on the supported page of bsdhwmon.

It was easy to setup:

  • Install bsdhwmon: pkg install bsdhwmon
  • Load the SMBus Controller driver for my motherboard: kldload ichsmb
  • Load the Generic SMB I/O Device driver: kldload smb

All I had to do from that point was run bsdhwmon:
# bsdhwmon
CPU1 Temperature 46 C
System Temperature 29 C
FAN1 10975 RPM
FAN2 11344 RPM
FAN3 7219 RPM
FAN4 7068 RPM
FAN5 0 RPM
FAN6 11065 RPM
VcoreA 1.122 V
MCH Core 1.508 V
-12V -12.672 V
V_DIMM 1.808 V
+3.3V 3.296 V
+12V 11.904 V
5Vsb 5.046 V
5VDD 4.998 V
P_VTT 1.228 V
Vbat 3.312 V

It is important to remember to add the kernel modules to be loaded at boot. Adding the following to /boot/loader.conf will take care of that:
smb_load="YES"
ichsmb_load="YES"

Note that ichsmb will load smbus, but not the smb kernel driver.

Now that I have the tools, I can monitor it at will.

Qt5 for FreeBSD/Pi

Build an SD card image using crochet-freebsd with the VideoCore option enabled. Mount either the SD card itself or the image on the build host:

mount /dev/mmcsd0s2a /pi

Checkout Qt5 sources and patch them

cd /src
git clone git://gitorious.org/qt/qt5.git qt5
cd qt5
git checkout 5.4.0
MODULES=qtbase,qtdeclarative,qtgraphicaleffects,qtimageformats,qtquick1,qtquickcontrols,qtscript,qtsvg,qtxmlpatterns
./init-repository --module-subset=$MODULES

fetch -q -o - http://people.freebsd.org/~gonzo/arm/rpi/qt5-freebsd-pi.diff | patch -p1

Configure, build and install Qt5 to SD card

./configure -platform unsupported/freebsd-clang -no-openssl -opengl es2 -device freebsd-rasp-pi-clang -device-option CROSS_COMPILE=/usr/armv6-freebsd/usr/bin/ -sysroot /pi/ -no-gcc-sysroot -opensource -confirm-license -optimized-qmake -release -prefix /usr/local/Qt5 -no-pch -nomake tests -nomake examples -plugin-sql-sqlite

gmake -j `sysctl -n hw.ncpu`
sudo gmake install

You need BSD-specific plugins to enable mouse and keyboard input in EGLFS mode

cd /src/
git clone https://github.com/gonzoua/qt5-bsd-input.git
cd qt5-bsd-input
/src/qt5/qtbase/bin/qmake
gmake
sudo gmake install

Build the application you’d like to run and install it. I use one of the examples here:

cd /src/qt5/qtbase/examples/opengl/cube
/src/qt5/qtbase/bin/qmake
gmake
sudo gmake install

Unmount SD card, boot Pi, make sure vchiq is loaded

root@raspberry-pi:~ # kldload vchiq

Start application

root@raspberry-pi:~ # /usr/local/Qt5/examples/opengl/cube/cube -plugin bsdkeyboard -plugin bsdsysmouse

If you see something like this:

EGL Error : Could not create the egl surface: error = 0x3003

Or this:

QOpenGLFramebufferObject: Framebuffer incomplete attachment.

It means you need to increase GPU memory by setting gpu_mem in config.txt. The amount depends on the framebuffer resolution; 128MB works for me on a 1920×1080 display.

The bsdsysmouse plugin uses /dev/sysmouse by default, so you should either have moused running or specify the actual mouse device, e.g.:

root@raspberry-pi:~ # cube -plugin bsdkeyboard -plugin bsdsysmouse:/dev/ums0

bsdkeyboard uses STDIN as input device, so if you’re trying to start app from serial console it should be something like this:

root@raspberry-pi:~ # cube -plugin bsdkeyboard -plugin bsdsysmouse < /dev/ttyv0

On profiling HTTP, or "god damnit people, why are all the open source tools slow?"

Something that's been a challenge at work (and at other things in the past) has been "how do I generate enough traffic to test this thing?"

If you're running some public facing boxes then sure, you can do A/B testing. But what if you're not able to test it in the real world? What if you need to do testing before you ship, and the traffic levels have to be stupid high?

So, what do you do?

I've done this a few times. When doing squid and other reverse proxy development, I would run tools like apachebench, httperf, even web polygraph - but these things scaled poorly. They didn't handle tens of thousands of concurrent connections and scale to both slow and fast clients - their use of poll() and select() just wouldn't work out well.

Something I did at Netflix was to start building TCP testing tools that could handle more than 65,000 concurrent sockets. My aim is much higher than that, but one has to start somewhere. There I was testing the network stack itself rather than specifically doing HTTP testing. Here at my current job, I'm much more interested in real HTTP and all the processing that goes with it.

I looked at what's out there, and it's not very pretty. I need to be able to do 10G of traffic, looking upwards towards 20G and 40G of HTTP in the future. After a little more digging into what was out there - and finding httperf actually reverted my changes to use libevent and went back to poll/select! - I decided it was about time I just started writing something minimal to stress test things and build upon it as the need arose. I want something that eventually ends up like web polygraph - multiple client/server sets with different URL choices from a pool, a variety of client IP addresses, and other things like how often to make the requests and other request pacing.

So, I grabbed libevent, libevhtp from Mark Ellzey and threw them together. It turned out okish - libevent/libevhtp still does a bunch of memcpy()'ing inside the buffer management routines that makes 40G on one box infeasible at the moment, but it's good enough to get a few gigabit of client traffic on one core. There were some hiccups which I'll cover below, but it's good enough to build upon.
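For flavour, the general shape of an event-driven request generator is below. This is a minimal sketch using libevent's built-in evhttp client rather than the libevhtp-based code described here, and the host/port are placeholders - a real load generator keeps thousands of these requests in flight at once.

/*
 * Skeleton of a single event-driven HTTP request using libevent's
 * built-in evhttp client -- not the libevhtp-based tool described in
 * the post.  Host and port are placeholders.
 */
#include <stdio.h>
#include <event2/event.h>
#include <event2/http.h>

static void
request_done(struct evhttp_request *req, void *arg)
{
    struct event_base *base = arg;

    if (req != NULL)
        printf("response code: %d\n", evhttp_request_get_response_code(req));
    event_base_loopexit(base, NULL);
}

int
main(void)
{
    struct event_base *base = event_base_new();
    struct evhttp_connection *conn;
    struct evhttp_request *req;

    conn = evhttp_connection_base_new(base, NULL, "127.0.0.1", 8080);
    req = evhttp_request_new(request_done, base);
    evhttp_add_header(evhttp_request_get_output_headers(req),
        "Host", "127.0.0.1");
    /* a real load generator keeps many of these in flight at once */
    evhttp_make_request(conn, req, EVHTTP_REQ_GET, "/");

    event_base_dispatch(base);
    evhttp_connection_free(conn);
    event_base_free(base);
    return (0);
}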

What did I learn?
  • Well, it turns out the client code in libevhtp was a bit immature. Mark and I talked a bit about it on IRC and then I found there was an outstanding pull request that found and fixed a bunch of these. So, my code has turned into another thing - a libevhtp client and server test suite.
  • The libevhtp threading model is fine for a couple of CPUs, but it's the standardish *NIX model of "one thread does accept, farms work off to other threads." So it's not going to scale well at high request rates to multiple CPUs. That's cool; that's what the FreeBSD-HEAD RSS work is for.
  •  There's memcpy()'ing in the libevhtp body handling code. It's not a big deal at 1G, but at 10G it's definitely noticeable. I've spoken to Mark about it.
But, it's a good starting point. Once the rest of the bugs get shaken out, it'll be a good high throughput HTTP traffic tester.

What would I do next, after the bugs?

  • the server will eventually grow the ability to generate controllable sized responses. That way the client can control how big a response to send and thus can create a mix of requests/replies.
  • .. and HTTP request body testing would be nice.
  • The client side needs to grow the ability to create client pools, like web polygraph, where certain subsets of clients get certain behaviours (like a pool of IPs to use, separate pool of URLs to fetch from, the time between each HTTP request, etc.)
The other trick is how to simulate lots (and I do mean lots) of IP addresses. I don't want to create separate loopback connections for each - that would be crazy. Instead, it'd be good to use the transparent interception support in FreeBSD IPFW that allows both connections from and connections to arbitrary IP addresses. A little trickery with IP routing so we don't need more than 1 ARP entry for each server and voila!
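On the client side, one piece of that puzzle is FreeBSD's IP_BINDANY socket option, which lets a privileged process bind to a source address that isn't configured locally, with ipfw fwd rules steering the return traffic back. A rough sketch, with error handling and the ipfw setup omitted:

/*
 * Sketch of "connections from arbitrary IPs" on FreeBSD using
 * IP_BINDANY.  Needs privilege and matching ipfw rules for the
 * return path; error handling omitted.
 */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>

int
connect_from(const char *src_ip, const char *dst_ip, int dst_port)
{
    int s = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    struct sockaddr_in src, dst;

    /* allow binding to a non-local source address */
    setsockopt(s, IPPROTO_IP, IP_BINDANY, &one, sizeof(one));

    memset(&src, 0, sizeof(src));
    src.sin_family = AF_INET;
    src.sin_addr.s_addr = inet_addr(src_ip);
    bind(s, (struct sockaddr *)&src, sizeof(src));

    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(dst_port);
    dst.sin_addr.s_addr = inet_addr(dst_ip);
    connect(s, (struct sockaddr *)&dst, sizeof(dst));

    return (s);
}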

Oh, and the code?

https://github.com/erikarn/libevhtp-http/

Audio on Raspberry Pi

With a stable VCHIQ driver, the next obvious target was to add VCHIQ-based audio support. So let me introduce to you: vchiq_audio, first take. It's part of the vchiq-freebsd repo, so if you use Crochet to build an SD card image, just enable the VideoCore option in the config file and the module will be included automatically.

From the shell, run kldload vchiq_audio and you're good to go. I believe that audio output is picked up automatically by VideoCore, so if you have HDMI connected it's probably going to be HDMI. I do not have a device to confirm this. Adding a knob to control the audio output (auto, headphones, HDMI) is on my ToDo list.

Quality is not ideal though. From quick tests it seems to work fine on a system with rootfs on NFS, but there are audio drops on an SD-based system while playing mp3s over NFS. I'm going to debug and stress-test it more thoroughly next week.

Short instructions on how to install the mpg321 package on the RPi:

env PACKAGESITE=http://chips.ysv.freebsd.org/packages/11armv6-11armv6/ SIGNATURE_TYPE=none pkg bootstrap

mkdir -p /usr/local/etc/pkg/repos
cd /usr/local/etc/pkg/repos
echo 'FreeBSD: { enabled: no }' > FreeBSD.conf

cat > chips.ysv.conf <<__EOF__
chips.ysv: {
  url: "http://chips.ysv.freebsd.org/packages/freebsd:11:armv6:32:el:eabi:softfp",
  mirror_type: "http",
  signature_type: "none",
  enabled: yes
}
__EOF__

pkg install mpg321

RaspberryPi, FreeBSD and ioquake3

After New Year I got back to hacking on the VCHIQ stuff (thanks to adrian@ for prodding). Since I last touched it, the NetBSD folks got it merged into their main tree, synced with the latest upstream code and fixed some stupid bugs in my codebase. So I partially merged things back, spent some time fixing more bugs introduced by yours truly, and merged userland bits from Broadcom's latest code (fixing some bugs introduced by them). As a result, VCHIQ got stable enough to run ioquake3 on the Raspberry Pi. Well, you can't really play it because there is no sound, no mouse support, and keyboard support is severely crippled - but you can navigate menus and watch demos.

Here is a short summary of how to get it running:

  • Get latest HEAD that includes r276794
  • Get latest crochet-freebsd
  • Create a configuration file for the RaspberryPi, make sure that it's configured for a 2GB SD card and has VideoCore enabled, i.e. it contains:
    option ImageSize 1950mb # for 2 Gigabyte card
    option VideoCore
    
  • Build RPi image and flash to SD card
  • mount FreeBSD partition, e.g. mount /dev/mmcsd0s2a /mnt
  • Copy Quake3 PAK files to /baseq3 directory on SD card
  • Download http://people.freebsd.org/~gonzo/arm/ioquake3.arm.tar.gz and copy the *.so files to /baseq3 and ioquake3.arm to /usr/bin on the SD card
  • Unmount the FreeBSD partition and mount the boot partition, e.g. mount_msdosfs /dev/mmcsd0s1 /mnt
  • Edit config.txt and change gpu_mem value to 64
  • Unmount SD card and boot it on your Pi
  • Load vchiq module: kldload vchiq
  • Start Quake3: ioquake3.arm +set s_initsound 0

Keyboard support is really broken. TAB and ENTER work, so you can navigate menus. But that's pretty much it.

The ioquake3 codebase with my minor changes is located here: https://github.com/gonzoua/quake3
I provide pre-compiled binaries because for some reason ioquake3 built with the xdev tools crashes in qsort (libc incompatibilities?), so I use make buildenv to build it.

And here is a photo of the demo in action (there are a RaspberryPi and a ZedBoard on it too, yay!)

AsiaBSDCon 2014 Videos Posted (6 years of BSDConferences on YouTube)

Sato-san has once again created a playlist of videos from AsiaBSDCon. There were 20 videos from the conference, held March 15-16, 2014, and the papers can be found here. Congrats to the organizers for running another successful conference in Tokyo. A full list of videos is included below. Six years ago, when I first created this channel, videos longer than 10 minutes couldn't normally be uploaded to YouTube and we had to create a special partner channel for the content. It is great to see how the availability of technical video content about FreeBSD has grown in the last six years.

Playlist: http://www.youtube.com/playlist?list=PLnTFqpZk5ebAUvwU75KRZ7zwU5YUaPudL

FreeBSD + Packer = Vagrant

So I recently discovered a tool to build Vagrant images called Packer. It allows you to script key presses over VNC to automate the install of any OS. I am running on a rather fast machine (Core i7, 16GB of RAM, SSD), so I suspect there might be some lurking problems for people on slower machines due to the timing of the commands.

Everything is available from my Github repo: https://github.com/so14k/packer-freebsd

To get started:

  • Install Vagrant and Packer
  • Clone the repo onto your machine
  • Build the Vagrant box: packer build template.json
  • Wait while it builds..
  • Start the Vagrant box: vagrant up
  • Start hacking: vagrant ssh

Give it a spin and let me know what you think!

The Short List #8: fetchmailrc/gmail/ssl … grrr #FreeBSD

Didn’t realize that a fetchmail implementation I was using was actually *not* using SSL for a month.  I had installed security/ca_root_nss but FreeBSD doesn’t assume that you want to use the certificates in this package.  I don’t understand it, but whatever.

So, add this to your fetchmailrc to actually use the certificate authorities in there and really do SSL to your gmail account:

sslcertfile /usr/local/share/certs/ca-root-nss.crt

… mmm emulators.

I occasionally get asked to test out FreeBSD/MIPS patches for people, as they don't have physical hardware present. I can understand that - the hardware is cheap and plentiful, but not everyone wants to have a spare access point around just to test out MIPS changes on.

However QEMU does a pretty good job of emulating MIPS if you're just testing out non-hardware patches. There's even instructions on the FreeBSD wiki for how to do this! So I decided to teach my wifi build system about the various QEMU MIPS emulator targets so it can spit out a kernel and mfsroot to use for QEMU.

Thus:

https://github.com/freebsd/freebsd-wifi-build/wiki/MipsQemuEmulatorImages

It turns out that it wasn't all that hard. The main trick was to use qemu-devel, not qemu. There are bugs in the non-development QEMU branch that mean it works great for Linux but not FreeBSD.

The kernel configurations in FreeBSD had bitrotted a little bit (they were missing the random device, for example) but besides that the build, install and QEMU startup just worked. I now have FreeBSD/MIPS of each variety (32 bit, 64 bit, Little-Endian, Big-Endian) running under QEMU and building FreeBSD-HEAD as a basic test.

Next is figuring out how to build gdb to target each of the above and have it speak to the QEMU GDB stub. That should make it very easy to do MIPS platform debugging.

I also hear rumours about this stuff working somewhat for ARM and PPC, so I'll see how hard it is to run QEMU for those platforms and whether FreeBSD will just boot and run on each.

More RSS UDP tests – this time on a Dell R720

I've recently had the chance to run my RSS UDP test suite up on a pair of Dell R720s. They came with on-board 10G Intel NICs (ixgbe(4) in FreeBSD) so I figured I'd run my test suite up on it.

Thank you to the Enterprise Storage Division at Dell for providing hardware for me to develop on!

The config is like in the previous blog post, but now I have two 8-core Sandy Bridge Xeon CPUs to play with. To simplify things (and to not have to try and solve NUMA-related issues) I'm running this all on the first socket. The Intel NIC is attached to the first CPU socket.

So:


  • CPU: Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz (2000.04-MHz K8-class CPU) x 2
  • RAM: 64GiB
  • HTT disabled
/boot/loader.conf:

# ... until ncpus is tunable, make it use 8 buckets.
net.inet.rss.bits=3
net.isr.maxthreads=8
net.isr.bindthreads=1

This time I want to test with 8 streams, so after some trial and error I found the right IPv4 addresses to use:

  • Server: 10.11.2.1/24
  • Client: 10.11.2.3/24, 10.11.2.2/24, 10.11.2.32/24, 10.11.2.33/24, 10.11.2.64/24, 10.11.2.65/24, 10.11.2.17/24, 10.11.2.18/24
The test was like before - the server ran one rss-udp-srv program that spawns one thread per RSS bucket. The client side runs rss-clt programs to generate traffic - but now there's eight of them instead of four.
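The 'one thread per RSS bucket' structure looks roughly like the sketch below - this is not the actual rss-udp-srv code, just the skeleton: read net.inet.rss.bits, spawn 2^bits workers and pin each one to a CPU. The real program also uses the kernel's bucket-to-CPU mapping and per-bucket listen sockets rather than the naive bucket N -> CPU N mapping shown here.

/*
 * Skeleton of the "one worker thread per RSS bucket" structure --
 * not the actual rss-udp-srv code.
 */
#include <sys/param.h>
#include <sys/cpuset.h>
#include <sys/sysctl.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static void *
worker(void *arg)
{
    int bucket = (int)(intptr_t)arg;
    cpuset_t set;

    /* naive mapping: bucket N -> CPU N */
    CPU_ZERO(&set);
    CPU_SET(bucket, &set);
    cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1,
        sizeof(set), &set);

    /* ... open a UDP socket and loop on recv/send for this bucket ... */
    return (NULL);
}

int
main(void)
{
    int bits, i, nbuckets;
    size_t len = sizeof(bits);
    pthread_t *thr;

    if (sysctlbyname("net.inet.rss.bits", &bits, &len, NULL, 0) != 0) {
        perror("sysctlbyname");
        exit(1);
    }
    nbuckets = 1 << bits;
    thr = calloc(nbuckets, sizeof(*thr));
    for (i = 0; i < nbuckets; i++)
        pthread_create(&thr[i], NULL, worker, (void *)(intptr_t)i);
    for (i = 0; i < nbuckets; i++)
        pthread_join(thr[i], NULL);
    return (0);
}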

The results are what I expected: the contention is in the same place (UDP receive) and it's per-core - it doesn't contend between CPU cores.

Each CPU is transmitting and receiving 215,000 510-byte UDP frames a second. It scales linearly - 1 CPU is 215,000 TX/RX frames a second; 8 CPUs is 215,000 TX/RX frames a second * 8. There's no degradation as the CPU core count increases.

That's 1.72 million packets per second. At 510 bytes per frame, that's about 7 gigabits/sec in and out.

The other 8 cores are idle. Ideally we'd be able to run an application in those cores - so hopefully I can get my network / rss library up and running enough to prototype an RSS-aware memcached and see if it'll handle this particular workload.

It's a far cry from what I think we can likely achieve - but please keep in mind that I know I could do more awesome looking results with netmap, PF_RING or Intel's DPDK software. What I'm trying to do is push the existing kernel networking subsystem to its limits so the issues can be exposed and fixed.

So, where's the CPU going?

In the UDP server program (pid 1620), it looks thus:

# pmcstat -P CPU_CLK_UNHALTED_CORE -T -w 1 -p 1620
PMC: [CPU_CLK_UNHALTED_CORE] Samples: 34298 (100.0%) , 155 unresolved

%SAMP IMAGE      FUNCTION             CALLERS
  8.0 kernel     fget_unlocked        kern_sendit:4.2 kern_recvit:3.9
  7.0 kernel     copyout              soreceive_dgram:5.6 amd64_syscall:0.9
  3.6 kernel     __mtx_unlock_flags   ixgbe_mq_start
  3.5 kernel     copyin               m_uiotombuf:1.8 amd64_syscall:1.2
  3.4 kernel     memcpy               ip_output:2.9 ether_output:0.6
  3.4 kernel     toeplitz_hash        rss_hash_ip4_2tuple
  3.3 kernel     bcopy                rss_hash_ip4_2tuple:1.4 rss_proto_software_hash_v4:0.9
  3.0 kernel     _mtx_lock_spin_cooki pmclog_reserve
  2.7 kernel     udp_send             sosend_dgram
  2.5 kernel     ip_output            udp_send

In the NIC receive / transmit thread(s) (pid 12), it looks thus:

# pmcstat -P CPU_CLK_UNHALTED_CORE -T -w 1 -p 12

PMC: [CPU_CLK_UNHALTED_CORE] Samples: 79319 (100.0%) , 0 unresolved

%SAMP IMAGE      FUNCTION             CALLERS
 10.3 kernel     ixgbe_rxeof          ixgbe_msix_que
  9.3 kernel     __mtx_unlock_flags   ixgbe_rxeof:4.8 netisr_dispatch_src:2.1 in_pcblookup_mbuf:1.3
  8.3 kernel     __mtx_lock_flags     ixgbe_rxeof:2.8 netisr_dispatch_src:2.4 udp_append:1.2 in_pcblookup_mbuf:1.1 knote:0.6
  3.8 kernel     bcmp                 netisr_dispatch_src
  3.6 kernel     uma_zalloc_arg       sbappendaddr_locked_internal:2.0 m_getjcl:1.6
  3.4 kernel     ip_input             netisr_dispatch_src
  3.4 kernel     lock_profile_release __mtx_unlock_flags
  3.4 kernel     in_pcblookup_mbuf    udp_input
  3.0 kernel     ether_nh_input       netisr_dispatch_src
  2.4 kernel     udp_input            ip_input
  2.4 kernel     mb_free_ext          m_freem
  2.2 kernel     lock_profile_obtain_ __mtx_lock_flags
  2.1 kernel     ixgbe_refresh_mbufs  ixgbe_rxeof

It looks like there are some obvious optimisations to poke at (what the heck is fget_unlocked() doing up there?) and yes, copyout/copyin are really terrible but currently unavoidable. The Toeplitz hash and bcopy aren't very nice, but they're occurring in the transmit path because at the moment there's no nice way to efficiently set both the outbound RSS hash and RSS bucket ID and send to a non-connected socket destination (ie, specify the destination IP:port as part of the send.) There's also some lock contention that needs to be addressed.

The output of the netisr queue statistics looks good:

root@abaddon:/home/adrian/git/github/erikarn/freebsd-rss # netstat -Q
Configuration:
Setting                        Current        Limit
Thread count                         8            8
Default queue limit                256        10240
Dispatch policy                 direct          n/a
Threads bound to CPUs          enabled          n/a

Protocols:
Name   Proto QLimit Policy Dispatch Flags
ip         1    256    cpu   hybrid   C--
igmp       2    256 source  default   ---
rtsock     3    256 source  default   ---
arp        4    256 source  default   ---
ether      5    256    cpu   direct   C--
ip6        6    256   flow  default   ---
ip_direct     9    256    cpu   hybrid   C--

Workstreams:
WSID CPU   Name     Len WMark   Disp'd  HDisp'd   QDrops   Queued  Handled
   0   0   ip         0    25        0 839349259        0       49 839349308
   0   0   igmp       0     0        0        0        0        0        0
   0   0   rtsock     0     2        0        0        0       92       92
   0   0   arp        0     0      118        0        0        0      118
   0   0   ether      0     0 839349600        0        0        0 839349600
   0   0   ip6        0     0        0        0        0        0        0
   0   0   ip_direct     0     0        0        0        0        0        0
   1   1   ip         0    20        0 829928186        0      286 829928472
   1   1   igmp       0     0        0        0        0        0        0
   1   1   rtsock     0     0        0        0        0        0        0
   1   1   arp        0     0        0        0        0        0        0
   1   1   ether      0     0 829928672        0        0        0 829928672
   1   1   ip6        0     0        0        0        0        0        0
   1   1   ip_direct     0     0        0        0        0        0        0
   2   2   ip         0     0        0 835558437        0        0 835558437
   2   2   igmp       0     0        0        0        0        0        0
   2   2   rtsock     0     0        0        0        0        0        0
   2   2   arp        0     0        0        0        0        0        0
   2   2   ether      0     0 835558610        0        0        0 835558610
   2   2   ip6        0     0        0        0        0        0        0
   2   2   ip_direct     0     0        0        0        0        0        0
   3   3   ip         0     1        0 850271162        0       23 850271185
   3   3   igmp       0     0        0        0        0        0        0
   3   3   rtsock     0     0        0        0        0        0        0
   3   3   arp        0     0        0        0        0        0        0
   3   3   ether      0     0 850271163        0        0        0 850271163
   3   3   ip6        0     0        0        0        0        0        0
   3   3   ip_direct     0     0        0        0        0        0        0
   4   4   ip         0    23        0 817439448        0      345 817439793
   4   4   igmp       0     0        0        0        0        0        0
   4   4   rtsock     0     0        0        0        0        0        0
   4   4   arp        0     0        0        0        0        0        0
   4   4   ether      0     0 817439625        0        0        0 817439625
   4   4   ip6        0     0        0        0        0        0        0
   4   4   ip_direct     0     0        0        0        0        0        0
   5   5   ip         0    19        0 817862508        0      332 817862840
   5   5   igmp       0     0        0        0        0        0        0
   5   5   rtsock     0     0        0        0        0        0        0
   5   5   arp        0     0        0        0        0        0        0
   5   5   ether      0     0 817862675        0        0        0 817862675
   5   5   ip6        0     0        0        0        0        0        0
   5   5   ip_direct     0     0        0        0        0        0        0
   6   6   ip         0    19        0 817281399        0      457 817281856
   6   6   igmp       0     0        0        0        0        0        0
   6   6   rtsock     0     0        0        0        0        0        0
   6   6   arp        0     0        0        0        0        0        0
   6   6   ether      0     0 817281665        0        0        0 817281665
   6   6   ip6        0     0        0        0        0        0        0
   6   6   ip_direct     0     0        0        0        0        0        0
   7   7   ip         0     0        0 813562616        0        0 813562616
   7   7   igmp       0     0        0        0        0        0        0
   7   7   rtsock     0     0        0        0        0        0        0
   7   7   arp        0     0        0        0        0        0        0
   7   7   ether      0     0 813562620        0        0        0 813562620
   7   7   ip6        0     0        0        0        0        0        0
   7   7   ip_direct     0     0        0        0        0        0        0
root@abaddon:/home/adrian/git/github/erikarn/freebsd-rss # 

It looks like everything is being dispatched correctly; nothing is being queued and/or dropped.

But yes, we're running out of socket buffers because each core is 100% pinned:

root@abaddon:/home/adrian/git/github/erikarn/freebsd-rss # netstat -sp udp
udp:
        6773040390 datagrams received
        0 with incomplete header
        0 with bad data length field
        0 with bad checksum
        0 with no checksum
        17450880 dropped due to no socket
        136 broadcast/multicast datagrams undelivered
        1634117674 dropped due to full socket buffers
        0 not for hashed pcb
        5121471700 delivered
        5121471044 datagrams output
        0 times multicast source filter matched

There's definitely room for improvement.

UDP RSS update: ixgbe(4) turned out to have issues..

I started digging deeper into the RSS performance on my home test platform. Four cores and one (desktop) socket isn't all that much, but it's a good starting point for this.

It turns out that there was some lock contention inside netisr. Which made no sense, as RSS should be keeping all the flows local to each CPU.

After a bunch of digging, I discovered that the NIC was occasionally receiving packets into the wrong ring. Have a look at this:

Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100034:
m=0xfffff80047713d00; flowid=0x21f7db62; rxr->me=3
Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100034:
m=0xfffff8004742e100; flowid=0x21f7db62; rxr->me=3
Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100034:
m=0xfffff800474c2e00; flowid=0x21f7db62; rxr->me=3
Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100034:
m=0xfffff800474c5000; flowid=0x21f7db62; rxr->me=3
Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100034:
m=0xfffff8004742ec00; flowid=0x21f7db62; rxr->me=3
Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100032:
m=0xfffff8004727a700; flowid=0x335a5c03; rxr->me=2
Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100032:
m=0xfffff80006f11600; flowid=0x335a5c03; rxr->me=2
Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100032:
m=0xfffff80047279b00; flowid=0x335a5c03; rxr->me=2
Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100032:
m=0xfffff80006f0b700; flowid=0x335a5c03; rxr->me=2



The RX flowid was correct - I hashed the packets in software too and verified the software hash equaled the hardware hash. But they were turning up on the wrong receive queue. "rxr->me" is the queue id; the hardware should be hashing on the last 7 bits. 0x3 -> ring 3, 0x2 -> ring 2.

It also only happened when I was sending traffic to more than one receive ring. Everything was okay if I just transmitted to a single receive ring.

Luckily for me, some developers from Verisign saw some odd behaviour in their TCP stress testing and had dug in a bit further. They were seeing corrupted frames on the receive side that looked a lot like internal NIC configuration state. They figured out that the ixgbe(4) driver wasn't initialising the flow director and receive units correctly - the FreeBSD driver was not correctly setting up the amount of memory each was allocated on the NIC and they were overlapping. They also found a handful of incorrectly handled errors and double-freed mbufs.

So, with that all fixed, their TCP problem went away and my UDP tests started properly behaving themselves. Now all the flows are ending up on the right CPUs.

The flow director code was also dynamically programming flows into the NIC to try and rebalance traffic. Trouble is, I think it's a bit buggy and it's likely not working well with large receive offload (LRO).

What's it mean for normal people? Well, it's fixed in FreeBSD-HEAD now. I'm hoping I or someone else will backport it to FreeBSD-10 soon. It fixes my UDP tests - now I hit around 1.3 million packets per second transmit and receive on my test rig; the server now has around 10-15% CPU free. It also fixed issues that Verisign were seeing with their high transaction rate TCP tests. I'm hoping that it fixes the odd corner cases that people have seen with Intel 10 gigabit hardware on FreeBSD and makes LRO generally more useful and stable.

Next up - some code refactoring, then finishing off IPv6 RSS!

 
 

Receive side scaling: testing UDP throughput

I think it's about time I shared some more details about the RSS stuff going into FreeBSD and how I'm testing it.

For now I'm focusing on IPv4 + UDP on the Intel 10GE NICs. The TCP side of things is done (and the IPv6 side of things works too!) but enough of the performance walls show up in the IPv4 UDP case that it's worth sticking to it for now.

I'm testing on a pair of 4-core boxes at home. They're not special - and they're very specifically not trying to be server-class hardware. I'd like to see where these bottlenecks are even at low core count.

The test setup in question:

Testing software:

  • http://github.com/erikarn/freebsd-rss
  • It requires libevent2 - an updated copy; previous versions of libevent2 didn't handle FreeBSD-specific errors gracefully and would error out of the IO loop early.

Server:

  • CPU: Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz (3292.59-MHz K8-class CPU)
  • There's no SMT/HTT, but I've disabled it in the BIOS just to be sure
  • 4GB RAM
  • FreeBSD-HEAD, amd64
  • NIC: 82599EB 10-Gigabit SFI/SFP+ Network Connection
  • ix0: 10.11.2.1/24
/etc/sysctl.conf:

# for now redirect processing just makes the lock overhead suck even more.
# disable it.
net.inet.ip.redirect=0
net.inet.icmp.drop_redirect=1

/boot/loader.conf:

hw.ix.num_queues=8

# experiment with deferred dispatch for RSS
net.isr.numthreads=4
net.isr.maxthreads=4
net.isr.bindthreads=1
 

kernel config:

include GENERIC
ident HACKBOX

device netmap
options RSS
options PCBGROUP

# in-system lock profiling
options LOCK_PROFILING

# Flowtable - the rtentry locking is a bit .. slow.
options   FLOWTABLE

# This debugging code has too much overhead to do accurate
# testing with.
nooptions         INVARIANTS
nooptions         INVARIANT_SUPPORT
nooptions         WITNESS
nooptions         WITNESS_SKIPSPIN


The server runs the "rss-udp-srv" process, which behaves like a multi-threaded UDP echo server on port 8080.

Client

The client box is slightly more powerful to compensate for (currently) not using completely affinity-aware RSS UDP transmit code.

  • CPU: Intel(R) Core(TM) i5-4460  CPU @ 3.20GHz (3192.68-MHz K8-class CPU)
  • SMT/HTT: Disabled in BIOS
  • 8GB RAM
  • FreeBSD-HEAD amd64
  • Same kernel config, loader and sysctl config as the server
  • ix0: configured as 10.11.2.2/24, 10.11.2.3/32, 10.11.2.4/32, 10.11.2.32/32, 10.11.2.33/32
The client runs 'udp-clt' programs to source and sink traffic to the server.

Running things

The server-side simply runs the listen server, configured to respond to each frame:

$ rss-udp-srv 1 10.11.2.1

The client-side runs four couples of udp-clt, each from different IP addresses. These are run in parallel (i do it in different screens, so I can quickly see what's going on):

$ ./udp-clt -l 10.11.2.3 -r 10.11.2.1 -p 8080 -n 10000000000 -s 510
$ ./udp-clt -l 10.11.2.4 -r 10.11.2.1 -p 8080 -n 10000000000 -s 510
$ ./udp-clt -l 10.11.2.32 -r 10.11.2.1 -p 8080 -n 10000000000 -s 510
$ ./udp-clt -l 10.11.2.33 -r 10.11.2.1 -p 8080 -n 10000000000 -s 510

The IP addresses are chosen so that the 2-tuple Toeplitz hash, using the default Microsoft key, hashes them to different RSS buckets that live on individual CPUs.
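For the curious, the bucket choice can be predicted in userland with a small 2-tuple Toeplitz hash implementation like the one below, using the standard Microsoft RSS key. Treat it as a sketch - whether it matches the running kernel depends on the configured RSS key and on how the kernel maps hash bits to buckets (here the low bits are simply masked off).

/*
 * Standalone 2-tuple Toeplitz hash using the standard Microsoft RSS
 * key, to predict which bucket a given source/destination pair lands
 * in.  Addresses below are the ones used in this test.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>

static const uint8_t rss_key[40] = {
    0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2,
    0x41, 0x67, 0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0,
    0xd0, 0xca, 0x2b, 0xcb, 0xae, 0x7b, 0x30, 0xb4,
    0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30, 0xf2, 0x0c,
    0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa,
};

static uint32_t
toeplitz_hash(const uint8_t *data, int datalen)
{
    /* 32-bit window over the key, slid left one bit per input bit */
    uint32_t hash = 0;
    uint32_t window = ((uint32_t)rss_key[0] << 24) | (rss_key[1] << 16) |
        (rss_key[2] << 8) | rss_key[3];
    int i, b;

    for (i = 0; i < datalen; i++) {
        for (b = 0; b < 8; b++) {
            if (data[i] & (0x80 >> b))
                hash ^= window;
            window <<= 1;
            if (rss_key[i + 4] & (0x80 >> b))
                window |= 1;
        }
    }
    return (hash);
}

int
main(void)
{
    /* 2-tuple input: source then destination address, network order */
    uint8_t tuple[8];
    uint32_t src = inet_addr("10.11.2.3");   /* client */
    uint32_t dst = inet_addr("10.11.2.1");   /* server */
    uint32_t hash;

    memcpy(tuple, &src, 4);
    memcpy(tuple + 4, &dst, 4);
    hash = toeplitz_hash(tuple, sizeof(tuple));
    /* with 4 RSS buckets (net.inet.rss.bits=2), mask the low two bits */
    printf("hash 0x%08x -> bucket %u\n", hash, hash & 3);
    return (0);
}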

Results: Round one

When the server is responding to each frame, the following occurs. The numbers are: frames generated by the client (netstat); frames received by the server (netstat); frames seen by rss-udp-srv; responses transmitted by rss-udp-srv; frames transmitted by the server (netstat).
  • 1 udp-clt process: 710,000; 710,000; 296,000; 283,000; 281,000
  • 2 udp-clt processes: 1,300,000; 1,300,000; 592,000; 592,000; 575,000
  • 3 udp-clt processes: 1,800,000; 1,800,000; 636,000; 636,000; 600,000
  • 4 udp-clt processes: 2,100,000; 2,100,000; 255,000; 255,000; 255,000
So, it's not actually linear past two cores. The question here is: why?

There are a couple of parts to this.

Firstly - I had left turbo boost on. What this translated to:

  • One core active: ~ 30% increase in clock speed
  • Two cores active: ~ 30% increase in clock speed
  • Three cores active: ~ 25% increase in clock speed
  • Four cores active: ~ 15% increase in clock speed.
Secondly, and more importantly - I had left Ethernet flow control enabled. Turning it off made a world of difference.

The revised results are mostly linear - with more active RSS buckets (and thus CPUs) things seem to get slightly more efficient:
  • 1 udp-clt process: 710,000; 710,000; 266,000; 266,000; 266,000
  • 2 udp-clt processes: 1,300,000; 1,300,000; 512,000; 512,000; 512,000
  • 3 udp-clt processes: 1,800,000; 1,800,000; 810,000; 810,000; 810,000
  • 4 udp-clt processes: 2,100,000; 2,100,000; 1,120,000; 1,120,000; 1,120,000

Finally, let's repeat the process, but only receiving packets instead of also echoing them back to the client:

$ rss-udp-srv 0 10.11.2.1
  • 1 udp-clt process: 710,000; 710,000; 204,000
  • 2 udp-clt processes: 1,300,000; 1,300,000; 378,000
  • 3 udp-clt processes: 1,800,000; 1,800,000; 645,000
  • 4 udp-clt processes: 2,100,000; 2,100,000; 900,000
The receive-only workload actually performs worse than the transmit + receive workload!

What's going on here?

Well, a little digging shows that in both instances there's active lock contention - even with a single udp-clt process running, which means only one CPU on the server side is actually active!

Here's an example dtrace output for measuring lock contention with only one active process, where one CPU is involved (and the other three are idle):

Receive only, 5 seconds:

root@adrian-hackbox:/home/adrian/git/github/erikarn/freebsd-rss # dtrace -n 'lockstat:::adaptive-block { @[stack()] = sum(arg1); }'
dtrace: description 'lockstat:::adaptive-block ' matched 1 probe
^C


              kernel`udp_append+0x11c
              kernel`udp_input+0x8cc
              kernel`ip_input+0x116
              kernel`netisr_dispatch_src+0x1cb
              kernel`ether_demux+0x123
              kernel`ether_nh_input+0x34d
              kernel`netisr_dispatch_src+0x61
              kernel`ether_input+0x26
              kernel`ixgbe_rxeof+0x2f7
              kernel`ixgbe_msix_que+0xb6
              kernel`intr_event_execute_handlers+0x83
              kernel`ithread_loop+0x96
              kernel`fork_exit+0x71
              kernel`0xffffffff80cd19de
         46729281

Transmit + receive, 5 seconds:

dtrace: description 'lockstat:::adaptive-block ' matched 1 probe
^C


              kernel`knote+0x7e
              kernel`sowakeup+0x65
              kernel`udp_append+0x14a
              kernel`udp_input+0x8cc
              kernel`ip_input+0x116
              kernel`netisr_dispatch_src+0x1cb
              kernel`ether_demux+0x123
              kernel`ether_nh_input+0x34d
              kernel`netisr_dispatch_src+0x61
              kernel`ether_input+0x26
              kernel`ixgbe_rxeof+0x2f7
              kernel`ixgbe_msix_que+0xb6
              kernel`intr_event_execute_handlers+0x83
              kernel`ithread_loop+0x96
              kernel`fork_exit+0x71
              kernel`0xffffffff80cd19de
             3793

              kernel`udp_append+0x11c
              kernel`udp_input+0x8cc
              kernel`ip_input+0x116
              kernel`netisr_dispatch_src+0x1cb
              kernel`ether_demux+0x123
              kernel`ether_nh_input+0x34d
              kernel`netisr_dispatch_src+0x61
              kernel`ether_input+0x26
              kernel`ixgbe_rxeof+0x2f7
              kernel`ixgbe_msix_que+0xb6
              kernel`intr_event_execute_handlers+0x83
              kernel`ithread_loop+0x96
              kernel`fork_exit+0x71
              kernel`0xffffffff80cd19de
          3823793

              kernel`ixgbe_msix_que+0xd3
              kernel`intr_event_execute_handlers+0x83
              kernel`ithread_loop+0x96
              kernel`fork_exit+0x71
              kernel`0xffffffff80cd19de
          9918140

Somehow it seems there's less lock contention / blocking going on when both transmit and receive are running!

So then I dug into it using the lock profiling suite. This is for 5 seconds with receive-only traffic on a single RSS bucket / CPU (all other CPUs are idle):

# sysctl debug.lock.prof.enable=1 ; sleep 5 ; sysctl debug.lock.prof.enable=0

root@adrian-hackbox:/home/adrian/git/github/erikarn/freebsd-rss # sysctl debug.lock.prof.enable=1 ; sleep 5 ; sysctl debug.lock.prof.enable=0
debug.lock.prof.enable: 1 -> 1
debug.lock.prof.enable: 1 -> 0

root@adrian-hackbox:/home/adrian/git/github/erikarn/freebsd-rss # sysctl debug.lock.prof.stats | head -2 ; sysctl debug.lock.prof.stats | sort -nk4 | tail -10
debug.lock.prof.stats: 
     max  wait_max       total  wait_total       count    avg wait_avg cnt_hold cnt_lock name
    1496         0       10900           0          28    389      0  0      0 /usr/home/adrian/work/freebsd/head/src/sys/dev/usb/usb_device.c:2755 (sx:USB config SX lock)
debug.lock.prof.stats: 
       0         0          31           1          67      0      0  0      4 /usr/home/adrian/work/freebsd/head/src/sys/kern/sched_ule.c:888 (spin mutex:sched lock 2)
       0         0        2715           1       49740      0      0  0      7 /usr/home/adrian/work/freebsd/head/src/sys/dev/random/random_harvestq.c:294 (spin mutex:entropy harvest mutex)
       1         0          51           1         131      0      0  0      2 /usr/home/adrian/work/freebsd/head/src/sys/kern/sched_ule.c:1179 (spin mutex:sched lock 1)
       0         0          69           2         170      0      0  0      8 /usr/home/adrian/work/freebsd/head/src/sys/kern/sched_ule.c:886 (spin mutex:sched lock 2)
       0         0       40389           2      287649      0      0  0      8 /usr/home/adrian/work/freebsd/head/src/sys/kern/kern_intr.c:1359 (spin mutex:sched lock 2)
       0         2           2           4          12      0      0  0      2 /usr/home/adrian/work/freebsd/head/src/sys/dev/usb/usb_device.c:2762 (sleep mutex:Giant)
      15        20        6556         520        2254      2      0  0    105 /usr/home/adrian/work/freebsd/head/src/sys/dev/acpica/Osd/OsdSynch.c:535 (spin mutex:ACPI lock (0xfffff80002b10f00))
       4         5      195967       65888     3445501      0      0  0  28975 /usr/home/adrian/work/freebsd/head/src/sys/netinet/udp_usrreq.c:369 (sleep mutex:so_rcv)

Notice the lock contention for the so_rcv (socket receive buffer) handling? What's going on here is pretty amusing. Because there's so much receive traffic, the userland process receiving the data is preempted by the NIC receive thread very often - and when that happens, there's a good chance it's within the small window in which the receive socket buffer lock is held. Once that happens, the NIC receive thread processes frames until it gets to one that requires it to grab the same socket buffer lock already held by userland - and it fails - so the NIC thread sleeps. The CPU flips back to the userland thread, which finishes consuming its packet and drops the lock, and then flips back to the NIC thread, which carries on processing packets.

When the userland code is also transmitting frames, it increases the amount of time between socket receives and decreases the probability of hitting the lock contention condition above.

Note there's no contention between CPUs here - this is entirely contention within a single CPU.
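
To convince myself of that probability argument, here's a toy userspace model - two pthreads sharing one mutex. It is emphatically not the kernel code and it doesn't model single-CPU preemption at all; it only illustrates that the more time the consumer spends working outside the lock (the echo case), the less often the producer finds the lock already held:

/*
 * Toy model of the so_rcv contention argument above: a consumer holds a
 * mutex briefly (the copyout window) and optionally does work outside it
 * (the "transmit a reply" case); a producer counts how often it finds
 * the mutex already held.  Purely illustrative - not kernel code.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static pthread_mutex_t so_rcv = PTHREAD_MUTEX_INITIALIZER;
static atomic_int stop;
static atomic_long attempts, collisions;
static int outside_work_us;	/* 0 ~ receive-only; >0 ~ receive + transmit */

static void *
consumer(void *arg)
{
	(void)arg;
	while (!atomic_load(&stop)) {
		pthread_mutex_lock(&so_rcv);
		usleep(1);			/* copyout() with the lock held */
		pthread_mutex_unlock(&so_rcv);
		if (outside_work_us > 0)
			usleep(outside_work_us);	/* reply path, outside the lock */
	}
	return (NULL);
}

static void *
producer(void *arg)
{
	(void)arg;
	while (!atomic_load(&stop)) {
		atomic_fetch_add(&attempts, 1);
		if (pthread_mutex_trylock(&so_rcv) != 0) {
			atomic_fetch_add(&collisions, 1);
			pthread_mutex_lock(&so_rcv);	/* block, like udp_append() */
		}
		pthread_mutex_unlock(&so_rcv);
		usleep(1);			/* inter-frame gap */
	}
	return (NULL);
}

int
main(int argc, char *argv[])
{
	pthread_t c, p;

	outside_work_us = (argc > 1) ? atoi(argv[1]) : 0;
	pthread_create(&c, NULL, consumer, NULL);
	pthread_create(&p, NULL, producer, NULL);
	sleep(5);
	atomic_store(&stop, 1);
	pthread_join(c, NULL);
	pthread_join(p, NULL);
	printf("%d us outside the lock: %ld of %ld lock attempts collided\n",
	    outside_work_us, atomic_load(&collisions), atomic_load(&attempts));
	return (0);
}

Run it once with '0' and once with something like '20' as the argument; the collision fraction should drop noticeably in the second case.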

So for now I'm happy that the UDP IPv4 path is scaling well enough with RSS on a single core. The main performance problem here is the socket receive buffer locking (and, yes, copyin() / copyout().)

Next!