Category Archives: FreeBSD

Build just one FreeBSD port

I usually build my own packages with poudriere but it’s not fun to do on tiny boxes so I just do ‘pkg install ‘ on them and use upstream packages. One downside is, that package is build with default options. I recently ran into a situation where I wanted to change some options for just a single port.

Now, what is the minimal set of things in /usr/ports/ that I need to checkout to be able to config/build just one port?
Turns out to be:


/usr/ports/Mk
/usr/ports/Keywords
/usr/ports/Templates
/usr/ports/ports-mgmt/dialog4ports

And the port I want to actually build. Now I can ‘make config’, change the options and build/install that port without checking out entire ports tree.

Fan Speed monitoring

Recently I moved a server into a proper cabinet with doors. After a few days I noticed the fans were spinning up and down. So I started investigating ways to monitor the fan speed. I figured having a graph of them long term would give me a nice way to show changes in the environment, beyond the temperature monitoring I am already doing.

I was not having much luck searching the Internet. Luckily, Darius on IRC pointed me to a project called bsdhwmon by Jeremy Chadwick, a fellow FreeBSD Developer. The server is running an older Supermicro X7SBi motherboard with a Winbond 83627HG chip which is listed on the supported page of bsdhwmon.

It was easy to setup:

  • Install bsdhwmon: pkg install bsdhwmon
  • Load the SMBus Controller driver for my motherboard: kldload ichsmb
  • Load the Generic SMB I/O Device driver: kldload smb

All I had to do from that point was run bsdhwmon:
# bsdhwmon
CPU1 Temperature 46 C
System Temperature 29 C
FAN1 10975 RPM
FAN2 11344 RPM
FAN3 7219 RPM
FAN4 7068 RPM
FAN5 0 RPM
FAN6 11065 RPM
VcoreA 1.122 V
MCH Core 1.508 V
-12V -12.672 V
V_DIMM 1.808 V
+3.3V 3.296 V
+12V 11.904 V
5Vsb 5.046 V
5VDD 4.998 V
P_VTT 1.228 V
Vbat 3.312 V

It is important to remember to add the kernel modules to be loaded at boot. Adding the following to /boot/loader.conf will take care of that:
smb_load="YES"
ichsmb_load="YES"

Note that ichsmb will load smbus, but not the smb kernel driver.

Now that I have the tools, I can monitor it at will.

Qt5 for FreeBSD/Pi

Build SD card image using crochet-freebsd with option VideoCore enabled. Mount either SD card itself of image to build host

mount /dev/mmcsd0s2a /pi

Checkout Qt5 sources and patch them

cd /src
git clone git://gitorious.org/qt/qt5.git qt5
cd qt5
git checkout 5.4.0
MODULES=qtbase,qtdeclarative,qtgraphicaleffects,qtimageformats,qtquick1,qtquickcontrols,qtscript,qtsvg,qtxmlpatterns
./init-repository --module-subset=$MODULES

fetch -q -o - http://people.freebsd.org/~gonzo/arm/rpi/qt5-freebsd-pi.diff | patch -p1

Configure, build and install Qt5 to SD card

./configure -platform unsupported/freebsd-clang -no-openssl -opengl es2 -device freebsd-rasp-pi-clang -device-option CROSS_COMPILE=/usr/armv6-freebsd/usr/bin/ -sysroot /pi/ -no-gcc-sysroot -opensource -confirm-license -optimized-qmake -release -prefix /usr/local/Qt5 -no-pch -nomake tests -nomake examples -plugin-sql-sqlite

gmake -j `sysctl -n hw.ncpu`
sudo gmake install

You need BSD-specific plugins to enable mouse and keyboard input in EGLFS mode

cd /src/
git clone https://github.com/gonzoua/qt5-bsd-input.git
cd qt5-bsd-input
/src/qt5/qtbase/bin/qmake
gmake
sudo gmake install

Build application you’d like run and install it. I use one of the examples here

cd /src/qt5/qtbase/examples/opengl/cube
/src/qt5/qtbase/bin/qmake
gmake
sudo gmake install

Unmount SD card, boot Pi, make sure vchiq is loaded

root@raspberry-pi:~ # kldload

Start application

root@raspberry-pi:~ # /usr/local/Qt5/examples/opengl/cube/cube -plugin bsdkeyboard -plugin bsdsysmouse

If you see something like this:

EGL Error : Could not create the egl surface: error = 0x3003

Or this:

QOpenGLFramebufferObject: Framebuffer incomplete attachment.

It means you need to increase GPU memory by setting gpu_mem in config.txt. Amount depends on framebuffer resolution. 128Mb works for me on 1920×1080 display.

bsdsysmouse plugin uses /dev/sysmouse by default, so you either should have moused running or specify actual mouse device, e.g.:

root@raspberry-pi:~ # cube -plugin bsdkeyboard -plugin bsdsysmouse:/dev/ums0

bsdkeyboard uses STDIN as input device, so if you’re trying to start app from serial console it should be something like this:

root@raspberry-pi:~ # cube -plugin bsdkeyboard -plugin bsdsysmouse < /dev/ttyv0

On profiling HTTP, or "god damnit people, why are all the open source tools slow?"

Something that's been a challenge at work (and at other things in the past) has been "how do I generate enough traffic to test this thing?"

If you're running some public facing boxes then sure, you can do A/B testing. But what if you're not able to test it in the real world? What if you need to do testing before you ship, and the traffic levels have to be stupid high?

So, what do you do?

I've done this a few times. When doing squid and other reverse proxy development, I would run tools like apachebench, httperf, even web polygraph - but these things scaled poorly. They didn't handle tens of thousands of concurrent connections and scale to both slow and fast clients - their use of poll() and select() just wouldn't work out well.

Something I did at Netflix was to start building TCP testing tools that more than 65,000 concurrent sockets. My aim is much higher, but one has to start somewhere. I was testing out the network stack rather than specifically doing HTTP testing. Here at my current job, I'm much more interested in real HTTP and all processing.

I looked at what's out there, and it's not very pretty. I need to be able to do 10G of traffic, looking upwards towards 20G and 40G of HTTP in the future. After a little more digging into what was out there - and finding httperf actually reverted my changes to use libevent and went back to poll/select! - I decided it was about time I just started writing something minimal to stress test things and build upon it as the need arose. I want something that eventually ends up like web polygraph - multiple client/server sets with different URL choices from a pool, a variety of client IP addresses, and other things like how often to make the requests and other request pacing.

So, I grabbed libevent, libevhtp from Mark Ellzey and threw them together. It turned out okish - libevent/libevhtp still does a bunch of memcpy()'ing inside the buffer management routines that makes 40G on one box infeasible at the moment, but it's good enough to get a few gigabit of client traffic on one core. There were some hiccups which I'll cover below, but it's good enough to build upon.

What did I learn?
  • Well, it turns out the client code in libevhtp was a bit immature. Mark and I talked a bit about it on IRC and then I found there was an outstanding pull request that found and fixed a bunch of these. So, my code has turned into another thing - a libevhtp client and server test suite.
  • The libevhtp threading model is fine for a couple of CPUs, but it's the standardish *NIX model of "one thread does accept, farms work off to other threads." So it's not going to scale well at high request rates to multiple CPUs. That's cool; that's what the FreeBSD-HEAD RSS work is for.
  •  There's memcpy()'ing in the libevhtp body handling code. It's not a big deal at 1G, but at 10G it's definitely noticeable. I've spoken to Mark about it.
But, it's a good starting point. Once the rest of the bugs get shaken out, it'll be a good high throughput HTTP traffic tester.

What would I do next, after the bugs?

  • the server will eventually grow the ability to generate controllable sized responses. That way the client can control how big a response to send and thus can create a mix of requests/replies.
  • .. and HTTP request body testing would be nice.
  • The client side needs to grow the ability to create client pools, like web polygraph, where certain subsets of clients get certain behaviours (like a pool of IPs to use, separate pool of URLs to fetch from, the time between each HTTP request, etc.)
The other trick is how to simulate lots (and I do mean lots) of IP addresses. I don't want to create separate loopback connections for each - that would be crazy. Instead, it'd be good to use the transparent interception support in FreeBSD IPFW that allows both connections from and connections to arbitrary IP addresses. A little trickery with IP routing so we don't need more than 1 ARP entry for each server and voila!

Oh, and the code?

https://github.com/erikarn/libevhtp-http/

Audio on Raspberry Pi

With stable VCHIQ driver next obvious target was to add VCHIQ-based audio support. So let me introduce to you: vchiq_audio, first take. It’s part of vchiq-freebsd repo so if you use Crochet to build SD card image just enable option VideoCore in config file and module will be automatically included.

From shell run kldload vchiq_audio and you’re good to do. I believe that audio output is picked up automatically by VideoCore so if you have HDMI connected it’s probably going to be HDMI. I do not have device to confirm this. Adding knob to control audio output (auto, headphones, HDMI) is on my ToDo list.

Quality is not ideal though. From quick tests it seems to work fine on system with rootfs on NFS but there are audio drops on SD-based system while playing mp3 over NFS. I’m going to debug and stresstest it more thoroughly next week.

Short instruction on how to install mpg321 package on RPi:

env PACKAGESITE=http://chips.ysv.freebsd.org/packages/11armv6-11armv6/ SIGNATURE_TYPE=none pkg bootstrap

mkdir -p /usr/local/etc/pkg/repos
cd /usr/local/etc/pkg/repos
echo 'FreeBSD: { enabled: no }' > FreeBSD.conf

cat > chips.ysv.conf <<__EOF__
chips.ysv: {
  url: "http://chips.ysv.freebsd.org/packages/freebsd:11:armv6:32:el:eabi:softfp",
  mirror_type: "http",
  signature_type: "none",
  enabled: yes
}
__EOF__

pkg install mpg321

RaspberryPi, FreeBSD and ioquake3

After New Year I got back to hacking the VCHIQ stuff (thanks to adrian@ for prodding). Since last time I touched NetBSD folks got it merged to main tree, syncing with latest upstream code and fixing some stupid bugs in my codebase. So I partially merged things back, spent some time on fixing more bugs introduced by yours truly, merged userland bits from latest Broadcom’s bits (and fixing some bugs introduced by them). And as a result VCHIQ got stable enough to run ioquake3d on raspberry pi. Well, you can’t play it because there is no sound and no mouse support and keyboard support is severely crippled but you can navigate menus and watch demoes.

Here is short summary of how to get it running:

  • Get latest HEAD that includes r276794
  • Get latest crochet-freebsd
  • Create configuration file for RasspberryPi, make sure that it’s configured for 2Gb SD card and has VideoCore enabled. i.e. it contains:
    option ImageSize 1950mb # for 2 Gigabyte card
    option VideoCore
    
  • Build RPi image and flash to SD card
  • mount FreeBSD partition, e.g. mount /dev/mmcsd0s2a /mnt
  • Copy Quake3 PAK files to /baseq3 directory on SD card
  • Download http://people.freebsd.org/~gonzo/arm/ioquake3.arm.tar.gz and copy *.so files to /baseq3 and ioqake3.arm to /usr/bin on SD card
  • Unmount FreeBSD partition and mount boot partiotion, e.g. mount_msdosfs /dev/mmcsd0s1 /mnt
  • Edit config.txt and change gpu_mem value to 64
  • Unmount SD card and boot it on your Pi
  • Load vchiq module: kldload vchiq
  • Start Quake3: ioqake3.arm +set s_initsound 0

Keyboard support is really broken. TAB and ENTER works, so you can navigate menus. But that’s pretty much it.

ioquake3 codebase with my minor changes located here: https://github.com/gonzoua/quake3
I provide pre-compiled binaries because for some reason ioquake3 built with xdev tools crash in qsort (libc incompatibilities?) so I use make buildenv to build it.

And here is photo of demo in action (there are RaspberryPi and ZedBoard on it too, yay!)
Photo Jan 07

AsiaBSDCon 2014 Videos Posted (6 years of BSDConferences on YouTube)

Sato-san has once created a playlist of videos from AsiaBSDCon. There were 20 videos from the conference held March 15-16, 2014 and papers can be found here. Congrats to the organizers for running another successful conference in Tokyo. A full list of videos is included below. Six years ago when I first created this channel videos longer than 10 minutes couldn't normally be uploaded to YouTube and we had to create a special partner channel for the content. It is great to see how the availability of technical video content about FreeBSD has grown in the last six years.

[youtube http://www.youtube.com/watch?v=videoseries?list=PLnTFqpZk5ebAUvwU75KRZ7zwU5YUaPudL&w=560&h=315]

FreeBSD + Packer = Vagrant

So I recently discovered a tool to build Vagrant images called Packer. It allows you to script the install via key presses over VNC to automate the install of any OS. I am running on a rather fast machine (Core i7, 16GB of RAM, SSD), so I suspect there might be some lurking problems for people on slower machines due to timing of the commands.

Everything is available from my Github repo: https://github.com/so14k/packer-freebsd

To get started:

  • Install Vagrant and Packer
  • Clone the repo onto your machine
  • Build the Vagrant box: packer build template.json
  • Wait while it builds..
  • Start the Vagrant box: vagrant up
  • Start hacking: vagrant ssh

Give it a spin and let me know what you think!

The Short List #8: fetchmailrc/gmail/ssl … grrr #FreeBSD

Didn’t realize that a fetchmail implementation I was using was actually *not* using SSL for a month.  I had installed security/ca_root_nss but FreeBSD doesn’t assume that you want to use the certificates in this package.  I don’t understand it, but whatever.

So, add this to your fetchmailrc to actually use the certificate authorities in there and really do SSL to your gmail account:

sslcertfile /usr/local/share/certs/ca-root-nss.crt

… mmm emulators.

I occasionally get asked to test out FreeBSD/MIPS patches for people, as they don't have physical hardware present. I can understand that - the hardware is cheap and plentiful, but not everyone wants to have a spare access point around just to test out MIPS changes on.

However QEMU does a pretty good job of emulating MIPS if you're just testing out non-hardware patches. There's even instructions on the FreeBSD wiki for how to do this! So I decided to teach my wifi build system about the various QEMU MIPS emulator targets so it can spit out a kernel and mfsroot to use for QEMU.

Thus:

https://github.com/freebsd/freebsd-wifi-build/wiki/MipsQemuEmulatorImages

It turns out that it wasn't all that hard. The main trick was to use qemu-devel, not qemu. There are bugs in the non-development QEMU branch that mean it works great for Linux but not FreeBSD.

The kernel configurations in FreeBSD had bitrotted a little bit (they were missing the random device, for example) but besides that the build, install and QEMU startup just worked. I now have FreeBSD/MIPS of each variety (32 bit, 64 bit, Little-Endian, Big-Endian) running under QEMU and building FreeBSD-HEAD as a basic test.

Next is figuring out how to build gdb to target each of the above and have it speak to the QEMU GDB stub. That should make it very easy to do MIPS platform debugging.

I also hear rumours about this stuff working somewhat for ARM and PPC, so I'll see how hard it is to run QEMU for those platforms and whether FreeBSD will just boot and run on each.

More RSS UDP tests – this time on a Dell R720

I've recently had the chance to run my RSS UDP test suite up on a pair of Dell R720s. They came with on-board 10G Intel NICs (ixgbe(4) in FreeBSD) so I figured I'd run my test suite up on it.

Thank you to the Enterprise Storage Division at Dell for providing hardware for me to develop on!

The config is like in the previous blog post, but now I have two 8-core Sandy Bridge Xeon CPUs to play with. To simply things (and to not have to try and solve NUMA related issues) I'm running this on the first socket. The Intel NIC is attached to the first CPU socket.

So:


  • CPU: Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz (2000.04-MHz K8-class CPU) x 2
  • RAM: 64GiB
  • HTT disabled
/boot/loader.conf:

# ... until ncpus is tunable, make it use 8 buckets.
net.inet.rss.bits=3
net.isr.maxthreads=8
net.isr.bindthreads=1

This time I want to test with 8 streams, so after some trial and error I found the right IPv4 addresses to use:

  • Server: 10.11.2.1/24
  • Client: 10.11.2.3/24, 10.11.2.2/24, 10.11.2.32/24, 10.11.2.33/24, 10.11.2.64/24, 10.11.2.65/24, 10.11.2.17/24, 10.11.2.18/24
The test was like before - the server ran one rss-udp-srv program that spawns one thread per RSS bucket. The client side runs rss-clt programs to generate traffic - but now there's eight of them instead of four.

The results are what I expected: the contention is in the same place (UDP receive) and it's per-core - it doesn't contend between CPU cores.

Each CPU is transmitting and receiving 215,000 510-byte UDP frames a second. It scales linearly - 1 CPU is 215,000 TX/RX frames a second. 8 CPUs is 215,000 TX/RX frames a second * 8. There's no degrading as the CPU core count increases.

That's 1.72 million packets per second. At 510 bytes frames it's about 7 gigabits/sec in and out.

The other 8 cores are idle. Ideally we'd be able to run an application in those cores - so hopefully I can get my network / rss library up and running enough to prototype an RSS-aware memcached and see if it'll handle this particular workload.

It's a far cry from what I think we can likely achieve - but please keep in mind that I know I could do more awesome looking results with netmap, PF_RING or Intel's DPDK software. What I'm trying to do is push the existing kernel networking subsystem to its limits so the issues can be exposed and fixed.

So, where's the CPU going?

In the UDP server program (pid 1620), it looks thus:

# pmcstat -P CPU_CLK_UNHALTED_CORE -T -w 1 -p 1620
PMC: [CPU_CLK_UNHALTED_CORE] Samples: 34298 (100.0%) , 155 unresolved

%SAMP IMAGE      FUNCTION             CALLERS
  8.0 kernel     fget_unlocked        kern_sendit:4.2 kern_recvit:3.9
  7.0 kernel     copyout              soreceive_dgram:5.6 amd64_syscall:0.9
  3.6 kernel     __mtx_unlock_flags   ixgbe_mq_start
  3.5 kernel     copyin               m_uiotombuf:1.8 amd64_syscall:1.2
  3.4 kernel     memcpy               ip_output:2.9 ether_output:0.6
  3.4 kernel     toeplitz_hash        rss_hash_ip4_2tuple
  3.3 kernel     bcopy                rss_hash_ip4_2tuple:1.4 rss_proto_software_hash_v4:0.9
  3.0 kernel     _mtx_lock_spin_cooki pmclog_reserve
  2.7 kernel     udp_send             sosend_dgram
  2.5 kernel     ip_output            udp_send

In the NIC receive / transmit thread(s) (pid 12), it looks thus:

# pmcstat -P CPU_CLK_UNHALTED_CORE -T -w 1 -p 12

PMC: [CPU_CLK_UNHALTED_CORE] Samples: 79319 (100.0%) , 0 unresolved

%SAMP IMAGE      FUNCTION             CALLERS
 10.3 kernel     ixgbe_rxeof          ixgbe_msix_que
  9.3 kernel     __mtx_unlock_flags   ixgbe_rxeof:4.8 netisr_dispatch_src:2.1 in_pcblookup_mbuf:1.3
  8.3 kernel     __mtx_lock_flags     ixgbe_rxeof:2.8 netisr_dispatch_src:2.4 udp_append:1.2 in_pcblookup_mbuf:1.1 knote:0.6
  3.8 kernel     bcmp                 netisr_dispatch_src
  3.6 kernel     uma_zalloc_arg       sbappendaddr_locked_internal:2.0 m_getjcl:1.6
  3.4 kernel     ip_input             netisr_dispatch_src
  3.4 kernel     lock_profile_release __mtx_unlock_flags
  3.4 kernel     in_pcblookup_mbuf    udp_input
  3.0 kernel     ether_nh_input       netisr_dispatch_src
  2.4 kernel     udp_input            ip_input
  2.4 kernel     mb_free_ext          m_freem
  2.2 kernel     lock_profile_obtain_ __mtx_lock_flags
  2.1 kernel     ixgbe_refresh_mbufs  ixgbe_rxeof

It looks like there's some obvious optimisations to poke at (what the heck is fget_unlocked() doing up there?) and yes, copyout/copyin are really terrible but currently unavoidable. The toeplitz hash and bcopy aren't very nice but they're occuring in the transmit path because at the moment there's no nice way to efficiently set both the outbound RSS hash and RSS bucket ID and send to a non-connected socket destination (ie, specify the destination IP:port as part of the send.) There's also some lock contention that needs to be addressed.

The output of the netisr queue statistics looks good:

root@abaddon:/home/adrian/git/github/erikarn/freebsd-rss # netstat -Q
Configuration:
Setting                        Current        Limit
Thread count                         8            8
Default queue limit                256        10240
Dispatch policy                 direct          n/a
Threads bound to CPUs          enabled          n/a

Protocols:
Name   Proto QLimit Policy Dispatch Flags
ip         1    256    cpu   hybrid   C--
igmp       2    256 source  default   ---
rtsock     3    256 source  default   ---
arp        4    256 source  default   ---
ether      5    256    cpu   direct   C--
ip6        6    256   flow  default   ---
ip_direct     9    256    cpu   hybrid   C--

Workstreams:
WSID CPU   Name     Len WMark   Disp'd  HDisp'd   QDrops   Queued  Handled
   0   0   ip         0    25        0 839349259        0       49 839349308
   0   0   igmp       0     0        0        0        0        0        0
   0   0   rtsock     0     2        0        0        0       92       92
   0   0   arp        0     0      118        0        0        0      118
   0   0   ether      0     0 839349600        0        0        0 839349600
   0   0   ip6        0     0        0        0        0        0        0
   0   0   ip_direct     0     0        0        0        0        0        0
   1   1   ip         0    20        0 829928186        0      286 829928472
   1   1   igmp       0     0        0        0        0        0        0
   1   1   rtsock     0     0        0        0        0        0        0
   1   1   arp        0     0        0        0        0        0        0
   1   1   ether      0     0 829928672        0        0        0 829928672
   1   1   ip6        0     0        0        0        0        0        0
   1   1   ip_direct     0     0        0        0        0        0        0
   2   2   ip         0     0        0 835558437        0        0 835558437
   2   2   igmp       0     0        0        0        0        0        0
   2   2   rtsock     0     0        0        0        0        0        0
   2   2   arp        0     0        0        0        0        0        0
   2   2   ether      0     0 835558610        0        0        0 835558610
   2   2   ip6        0     0        0        0        0        0        0
   2   2   ip_direct     0     0        0        0        0        0        0
   3   3   ip         0     1        0 850271162        0       23 850271185
   3   3   igmp       0     0        0        0        0        0        0
   3   3   rtsock     0     0        0        0        0        0        0
   3   3   arp        0     0        0        0        0        0        0
   3   3   ether      0     0 850271163        0        0        0 850271163
   3   3   ip6        0     0        0        0        0        0        0
   3   3   ip_direct     0     0        0        0        0        0        0
   4   4   ip         0    23        0 817439448        0      345 817439793
   4   4   igmp       0     0        0        0        0        0        0
   4   4   rtsock     0     0        0        0        0        0        0
   4   4   arp        0     0        0        0        0        0        0
   4   4   ether      0     0 817439625        0        0        0 817439625
   4   4   ip6        0     0        0        0        0        0        0
   4   4   ip_direct     0     0        0        0        0        0        0
   5   5   ip         0    19        0 817862508        0      332 817862840
   5   5   igmp       0     0        0        0        0        0        0
   5   5   rtsock     0     0        0        0        0        0        0
   5   5   arp        0     0        0        0        0        0        0
   5   5   ether      0     0 817862675        0        0        0 817862675
   5   5   ip6        0     0        0        0        0        0        0
   5   5   ip_direct     0     0        0        0        0        0        0
   6   6   ip         0    19        0 817281399        0      457 817281856
   6   6   igmp       0     0        0        0        0        0        0
   6   6   rtsock     0     0        0        0        0        0        0
   6   6   arp        0     0        0        0        0        0        0
   6   6   ether      0     0 817281665        0        0        0 817281665
   6   6   ip6        0     0        0        0        0        0        0
   6   6   ip_direct     0     0        0        0        0        0        0
   7   7   ip         0     0        0 813562616        0        0 813562616
   7   7   igmp       0     0        0        0        0        0        0
   7   7   rtsock     0     0        0        0        0        0        0
   7   7   arp        0     0        0        0        0        0        0
   7   7   ether      0     0 813562620        0        0        0 813562620
   7   7   ip6        0     0        0        0        0        0        0
   7   7   ip_direct     0     0        0        0        0        0        0
root@abaddon:/home/adrian/git/github/erikarn/freebsd-rss # 

It looks like everything is being dispatched correctly; nothing is being queued and/or dropped.

But yes, we're running out of socket buffers because each core is 100% pinned:

root@abaddon:/home/adrian/git/github/erikarn/freebsd-rss # netstat -sp udp
udp:
        6773040390 datagrams received
        0 with incomplete header
        0 with bad data length field
        0 with bad checksum
        0 with no checksum
        17450880 dropped due to no socket
        136 broadcast/multicast datagrams undelivered
        1634117674 dropped due to full socket buffers
        0 not for hashed pcb
        5121471700 delivered
        5121471044 datagrams output
        0 times multicast source filter matched

There's definitely room for improvement.

UDP RSS update: ixbge(4) turned out to have issues..

I started digging deeper into the RSS performance on my home test platform. Four cores and one (desktop) socket isn't all that much, but it's a good starting point for this.

It turns out that there was some lock contention inside netisr. Which made no sense, as RSS should be keeping all the flows local to each CPU.

After a bunch of digging, I discovered that the NIC was occasionally receiving packets into the wrong ring. Have a look at tihs:

Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100034:
m=0xfffff80047713d00; flowid=0x21f7db62; rxr->me=3
Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100034:
m=0xfffff8004742e100; flowid=0x21f7db62; rxr->me=3
Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100034:
m=0xfffff800474c2e00; flowid=0x21f7db62; rxr->me=3
Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100034:
m=0xfffff800474c5000; flowid=0x21f7db62; rxr->me=3
Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100034:
m=0xfffff8004742ec00; flowid=0x21f7db62; rxr->me=3
Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100032:
m=0xfffff8004727a700; flowid=0x335a5c03; rxr->me=2
Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100032:
m=0xfffff80006f11600; flowid=0x335a5c03; rxr->me=2
Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100032:
m=0xfffff80047279b00; flowid=0x335a5c03; rxr->me=2
Sep 12 08:04:32 adrian-hackbox kernel: ix0: ixgbe_rxeof: 100032:
m=0xfffff80006f0b700; flowid=0x335a5c03; rxr->me=2



The RX flowid was correct - I hashed the packets in software too and verified the software hash equaled the hardware hash. But they were turning up on the wrong receive queue. "rxr->me" is the queue id; the hardware should be hashing on the last 7 bits. 0x3 -> ring 3, 0x2 -> ring 2.

It also only happened when I was sending traffic to more than one receive ring. Everything was okay if I just transmitted to a single receive ring.

Luckily for me, some developers from Verisign saw some odd behaviour in their TCP stress testing and had dug in a bit further. They were seeing corrupted frames on the receive side that looked a lot like internal NIC configuration state. They figured out that the ixgbe(4) driver wasn't initialising the flow director and receive units correctly - the FreeBSD driver was not correctly setting up the amount of memory each was allocated on the NIC and they were overlapping. They also found a handful of incorrectly handled errors and double-freed mbufs.

So, with that all fixed, their TCP problem went away and my UDP tests started properly behaving themselves. Now all the flows are ending up on the right CPUs.

The flow director code was also dynamically programming flows into the NIC to try and rebalance traffic. Trouble is, I think it's a bit buggy and it's likely not working well with generic receive offload (LRO).

What's it mean for normal people? Well, it's fixed in FreeBSD-HEAD now. I'm hoping I or someone else will backport it to FreeBSD-10 soon. It fixes my UDP tests - now I hit around 1.3 million packets per second transmit and receive on my test rig; the server now has around 10-15% CPU free. It also fixed issues that Verisign were seeing with their high transaction rate TCP tests. I'm hoping that it fixes the odd corner cases that people have seen with Intel 10 gigabit hardware on FreeBSD and makes LRO generally more useful and stable.

Next up - some code refactoring, then finishing off IPv6 RSS!

 
 

Receive side scaling: testing UDP throughput

I think it's about time I shared some more details about the RSS stuff going into FreeBSD and how I'm testing it.

For now I'm focusing on IPv4 + UDP on the Intel 10GE NICs. The TCP side of things is done (and the IPv6 side of things works too!) but enough of the performance walls show up in the IPv4 UDP case that it's worth sticking to it for now.

I'm testing on a pair of 4-core boxes at home. They're not special - and they're very specifically not trying to be server-class hardware. I'd like to see where these bottlenecks are even at low core count.

The test setup in question:

Testing software:

  • http://github.com/erikarn/freebsd-rss
  • It requires libevent2 - an updated copy; previous versions of libevent2 didn't handle FreeBSD specific errors gracefully and would early error out of the IO loop.

Server:

  • CPU: Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz (3292.59-MHz K8-class CPU)
  • There's no SMT/HTT, but I've disabled it in the BIOS just to be sure
  • 4GB RAM
  • FreeBSD-HEAD, amd64
  • NIC:  '82599EB 10-Gigabit SFI/SFP+ Network Connection
  • ix0: 10.11.2.1/24
/etc/sysctl.conf:

# for now redirect processing just makes the lock overhead suck even more.
# disable it.
net.inet.ip.redirect=0
net.inet.icmp.drop_redirect=1

/boot/loader.conf:

hw.ix.num_queues=8

# experiment with deferred dispatch for RSS
net.isr.numthreads=4
net.isr.maxthreads=4
net.isr.bindthreads=1
 

kernel config:

include GENERIC
ident HACKBOX

device netmap
options RSS
options PCBGROUP

# in-system lock profiling
options LOCK_PROFILING

# Flowtable - the rtentry locking is a bit .. slow.
options   FLOWTABLE

# This debugging code has too much overhead to do accurate
# testing with.
nooptions         INVARIANTS
nooptions         INVARIANT_SUPPORT
nooptions         WITNESS
nooptions         WITNESS_SKIPSPIN


The server runs the "rss-udp-srv" process, which behaves like a multi-threaded UDP echo server on port 8080.

Client

The client box is slightly more powerful to compensate for (currently) not using completely affinity-aware RSS UDP transmit code.

  • CPU: Intel(R) Core(TM) i5-4460  CPU @ 3.20GHz (3192.68-MHz K8-class CPU)
  • SMT/HTT: Disabled in BIOS
  • 8GB RAM
  • FreeBSD-HEAD amd64
  • Same kernel config, loader and sysctl config as the server
  • ix0: configured as 10.11.2.2/24, 10.11.2.3/32, 10.11.2.4/32, 10.11.2.32/32, 10.11.2.33/32
The client runs 'udp-clt' programs to source and sink traffic to the server.

Running things

The server-side simply runs the listen server, configured to respond to each frame:

$ rss-udp-srv 1 10.11.2.1

The client-side runs four couples of udp-clt, each from different IP addresses. These are run in parallel (i do it in different screens, so I can quickly see what's going on):

$ ./udp-clt -l 10.11.2.3 -r 10.11.2.1 -p 8080 -n 10000000000 -s 510
$ ./udp-clt -l 10.11.2.4 -r 10.11.2.1 -p 8080 -n 10000000000 -s 510
$ ./udp-clt -l 10.11.2.32 -r 10.11.2.1 -p 8080 -n 10000000000 -s 510
$ ./udp-clt -l 10.11.2.33 -r 10.11.2.1 -p 8080 -n 10000000000 -s 510

The IP addresses are chosen so that the 2-tuple topelitz hash using the default Microsoft key hash to different RSS buckets that live on individual CPUs.

Results: Round one

When the server is responding to each frame, the following occurs. The numbers are "number of frames generated by the client (netstat)", "number of frames received by the server (netstat)", "number of frames seen by udp-rss-srv", "number of responses transmitted from udp-rss-srv", "number of frames seen by the server (netstat)"
  • 1 udp-clt process: 710,000; 710,000; 296,000; 283,000; 281,000
  • 2 udp-clt processes: 1,300,000; 1,300,000; 592,000; 592,000; 575,000
  • 3 udp-clt processes: 1,800,000; 1,800,000; 636,000; 636,000; 600,000
  • 4 udp-clt processes: 2,100,000; 2,100,000; 255,000; 255,000; 255,000
So, it's not actually linear past two cores. The question here is: why?

There are a couple of parts to this.

Firstly - I had left turbo boost on. What this translated to:

  • One core active: ~ 30% increase in clock speed
  • Two cores active: ~ 30% increase in clock speed
  • Three cores active: ~ 25% increase in clock speed
  • Four cores active: ~ 15% increase in clock speed.
Secondly and more importantly - I had left flow control enabled. This made a world of difference.

The revised results are mostly linear - with more active RSS buckets (and thus CPUs) things seem to get slightly more efficient:
  • 1 udp-clt process: 710,000; 710,000; 266,000; 266,000; 266,000
  • 2 udp-clt processes: 1,300,000; 1,300,000; 512,000; 512,000; 512,000
  • 3 udp-clt processes: 1,800,000; 1,800,000; 810,000; 810,000; 810,000
  • 4 udp-clt processes: 2,100,000; 2,100,000; 1,120,000; 1,120,000; 1,120,000

Finally, let's repeat the process but only receiving instead also echoing back the packet to the client:

$ rss-udp-srv 0 10.11.2.1
  • 1 udp-clt process: 710,000; 710,000; 204,000
  • 2 udp-clt processes: 1,300,000; 1,300,000; 378,000
  • 3 udp-clt processes: 1,800,000; 1,800,000; 645,000
  • 4 udp-clt processes: 2,100,000; 2,100,000; 900,000
The receive-only workload is actually worse off versus the transmit + receive workload!

What's going on here?

Well, a little digging shows that in both instances - even with a single udp-clt thread running which means only one CPU on the server side is actually active! - there's active lock contention.

Here's an example dtrace output for measuring lock contention with only one active process, where one CPU is involved (and the other three are idle):

Receive only, 5 seconds:

root@adrian-hackbox:/home/adrian/git/github/erikarn/freebsd-rss # dtrace -n 'lockstat:::adaptive-block { @[stack()] = sum(arg1); }'
dtrace: description 'lockstat:::adaptive-block ' matched 1 probe
^C


              kernel`udp_append+0x11c
              kernel`udp_input+0x8cc
              kernel`ip_input+0x116
              kernel`netisr_dispatch_src+0x1cb
              kernel`ether_demux+0x123
              kernel`ether_nh_input+0x34d
              kernel`netisr_dispatch_src+0x61
              kernel`ether_input+0x26
              kernel`ixgbe_rxeof+0x2f7
              kernel`ixgbe_msix_que+0xb6
              kernel`intr_event_execute_handlers+0x83
              kernel`ithread_loop+0x96
              kernel`fork_exit+0x71
              kernel`0xffffffff80cd19de
         46729281

Transmit + receive, 5 seconds:

dtrace: description 'lockstat:::adaptive-block ' matched 1 probe
^C


              kernel`knote+0x7e
              kernel`sowakeup+0x65
              kernel`udp_append+0x14a
              kernel`udp_input+0x8cc
              kernel`ip_input+0x116
              kernel`netisr_dispatch_src+0x1cb
              kernel`ether_demux+0x123
              kernel`ether_nh_input+0x34d
              kernel`netisr_dispatch_src+0x61
              kernel`ether_input+0x26
              kernel`ixgbe_rxeof+0x2f7
              kernel`ixgbe_msix_que+0xb6
              kernel`intr_event_execute_handlers+0x83
              kernel`ithread_loop+0x96
              kernel`fork_exit+0x71
              kernel`0xffffffff80cd19de
             3793

              kernel`udp_append+0x11c
              kernel`udp_input+0x8cc
              kernel`ip_input+0x116
              kernel`netisr_dispatch_src+0x1cb
              kernel`ether_demux+0x123
              kernel`ether_nh_input+0x34d
              kernel`netisr_dispatch_src+0x61
              kernel`ether_input+0x26
              kernel`ixgbe_rxeof+0x2f7
              kernel`ixgbe_msix_que+0xb6
              kernel`intr_event_execute_handlers+0x83
              kernel`ithread_loop+0x96
              kernel`fork_exit+0x71
              kernel`0xffffffff80cd19de
          3823793

              kernel`ixgbe_msix_que+0xd3
              kernel`intr_event_execute_handlers+0x83
              kernel`ithread_loop+0x96
              kernel`fork_exit+0x71
              kernel`0xffffffff80cd19de
          9918140

Somehow it seems there's less lock contention / blocking going on when both transmit and receive is running!

So then I dug into it using the lock profiling suite. This is for 5 seconds with receive-only traffic on a single RSS bucket / CPU (all other CPUs are idle):

# sysctl debug.lock.prof.enable = 1; sleep 5 ; sysctl debug.lock.prof.enable=0

root@adrian-hackbox:/home/adrian/git/github/erikarn/freebsd-rss # sysctl debug.lock.prof.enable=1 ; sleep 5 ; sysctl debug.lock.prof.enable=0
debug.lock.prof.enable: 1 -> 1
debug.lock.prof.enable: 1 -> 0

root@adrian-hackbox:/home/adrian/git/github/erikarn/freebsd-rss # sysctl debug.lock.prof.stats | head -2 ; sysctl debug.lock.prof.stats | sort -nk4 | tail -10
debug.lock.prof.stats: 
     max  wait_max       total  wait_total       count    avg wait_avg cnt_hold cnt_lock name
    1496         0       10900           0          28    389      0  0      0 /usr/home/adrian/work/freebsd/head/src/sys/dev/usb/usb_device.c:2755 (sx:USB config SX lock)
debug.lock.prof.stats: 
       0         0          31           1          67      0      0  0      4 /usr/home/adrian/work/freebsd/head/src/sys/kern/sched_ule.c:888 (spin mutex:sched lock 2)
       0         0        2715           1       49740      0      0  0      7 /usr/home/adrian/work/freebsd/head/src/sys/dev/random/random_harvestq.c:294 (spin mutex:entropy harvest mutex)
       1         0          51           1         131      0      0  0      2 /usr/home/adrian/work/freebsd/head/src/sys/kern/sched_ule.c:1179 (spin mutex:sched lock 1)
       0         0          69           2         170      0      0  0      8 /usr/home/adrian/work/freebsd/head/src/sys/kern/sched_ule.c:886 (spin mutex:sched lock 2)
       0         0       40389           2      287649      0      0  0      8 /usr/home/adrian/work/freebsd/head/src/sys/kern/kern_intr.c:1359 (spin mutex:sched lock 2)
       0         2           2           4          12      0      0  0      2 /usr/home/adrian/work/freebsd/head/src/sys/dev/usb/usb_device.c:2762 (sleep mutex:Giant)
      15        20        6556         520        2254      2      0  0    105 /usr/home/adrian/work/freebsd/head/src/sys/dev/acpica/Osd/OsdSynch.c:535 (spin mutex:ACPI lock (0xfffff80002b10f00))
       4         5      195967       65888     3445501      0      0  0  28975 /usr/home/adrian/work/freebsd/head/src/sys/netinet/udp_usrreq.c:369 (sleep mutex:so_rcv)

Notice the lock contention for the so_rcv (socket receive buffer) handling? What's going on here is pretty amusing - it turns out that because there's so much receive traffic going on, the userland process receiving the data is being preempted by the NIC receive thread very often - and when this happens, there's a good chance it's going to be within the small window that the receive socket buffer lock is held. Once this happens, the NIC receive thread processes frames until it gets to one that requires it to grab the same sock buffer lock that is already held by userland - and it fails - so the NIC thread sleeps until the userland thread finishes consuming a packet. Then the CPU flips back to the NIC thread and continues processing a packet.

When the userland code is also transmitting frames it's increasing the amount of time in between socket receives and decreasing the probability of hitting the lock contention condition above.

Note there's no contention between CPUs here - this is entirely contention within a single CPU.

So for now I'm happy that the UDP IPv4 path is scaling well enough with RSS on a single core. The main performance problem here is the socket receive buffer locking (and, yes, copyin() / copyout().)

Next!

The ports tree is now stage only

The ports tree is now fully staged (only 2% has been left unstaged, marked as broken and will be removed from the ports tree if no PR to stage them are pending in bugzilla).

I would like to thank every committer and maintainers for their work on staging!
It allowed us to convert more than 23k packages to support stage in only 11 months!

Staging is a very important state, it allows us to right now be able to run quality testing scripts on the packages (which already allowed to fix tons of hidden problems) and it allows use to be able to build packages as a regular user!

It also opens the gates to new features that users have been requesting for many years:

  • flavors
  • multiple packages

Expect those features to happen in the near future.

ZFS support in libvirt

An upcoming release of libvirt, 1.2.8 that should be released early September, will include an initial support of managing ZFS volumes.

That means that it's possible to boot VMs and use ZFS volumes as disks. Additionally, it allows to control volumes using the libvirt API. Currently, supported operations are:

  • list volumes in a pool
  • create and delete volumes
  • upload and download volumes

It's not possible to create and delete pools yet, hope to implement that in the next release.

Defining a pool

Assume we have some pools and want to use one of them in libvirt:

# zpool list
NAME SIZE ALLOC FREE FRAG EXPANDSZ CAP DEDUP HEALTH ALTROOT
filepool 1,98G 56,5K 1,98G 0% - 0% 1.00x ONLINE -
test 186G 7,81G 178G 0% - 4% 1.00x ONLINE -

Let's take filepool and define it with libvirt. This could be done using this virsh command:

virsh # pool-define-as --name zfsfilepool --source-name filepool --type zfs
Pool zfsfilepool defined

virsh # pool-start zfsfilepool
Pool zfsfilepool started

virsh # pool-info zfsfilepool
Name: zfsfilepool
UUID: 5d1a33a9-d8b5-43d8-bebe-c585e9450176
State: running
Persistent: yes
Autostart: no
Capacity: 1,98 GiB
Allocation: 56,50 KiB
Available: 1,98 GiB

virsh #

As you can see, we specify a type of the pool, its source name, such as seen in zpool list output and a name for it in libvirt. We also need to start it using the pool-start command.

Managing volumes

Let's create a couple of volumes in our new pool.


virsh # vol-create-as --pool zfsfilepool --name vol1 --capacity 1G
Vol vol1 created

virsh # vol-create-as --pool zfsfilepool --name vol2 --capacity 700M
Vol vol2 created

virsh # vol-list zfsfilepool
Name Path
------------------------------------------------------------------------------
vol1 /dev/zvol/filepool/vol1
vol2 /dev/zvol/filepool/vol2

virsh #

Dropping a volume is also easy:

virsh # vol-delete --pool zfsfilepool vol2
Vol vol2 deleted

Uploading and downloading data

Let's upload an image to our new volume:

virsh # vol-upload --pool zfsfilepool --vol vol1 --file /home/novel/FreeBSD-10.0-RELEASE-amd64-memstick.img 

... and download

virsh # vol-download --pool zfsfilepool --vol vol1 --file /home/novel/zfsfilepool_vol1.img

Note: if you would check e.g. md5 sum of the downloaded files, the result would be different as downloaded file will be of the same size as a volume. However, if you trim zeros, it'll be the same.

$ md5 FreeBSD-10.0-RELEASE-amd64-memstick.img zfsfilepool_vol1.img 
MD5 (FreeBSD-10.0-RELEASE-amd64-memstick.img) = e8e7cbd41b80457957bd7981452ecf5c
MD5 (zfsfilepool_vol1.img) = a77c3b434b01a57ec091826f81ebbb97
$ truncate -r FreeBSD-10.0-RELEASE-amd64-memstick.img zfsfilepool_vol1.img
$ md5 FreeBSD-10.0-RELEASE-amd64-memstick.img zfsfilepool_vol1.img
MD5 (FreeBSD-10.0-RELEASE-amd64-memstick.img) = e8e7cbd41b80457957bd7981452ecf5c
MD5 (zfsfilepool_vol1.img) = e8e7cbd41b80457957bd7981452ecf5c
$

Booting a VM from volume

Finally got to the most important part. In use a volume as disk device for VM 'devices' section of the domain XML should be updated with something like this:


<disk type='volume' device='disk'>
<source pool='zfsfilepool' volume='vol1'/>
<target dev='vdb' bus='virtio'/>
</disk>

Few notes

Note #1: this code is just a few weeks old, so quite likely there are some rough edges. Feel free to report problems to novel%freebsd.org if you spot any problems.

Note #2: this code is FreeBSD-only for now. However, it should not be hard to make it work on Linux with zfsonlinux.org. Its developers were kind enough to add some useful missing flags in some of the CLI tools. However, these changes are not available in any released version so far. There are some more minor differences between zfs on Linux and FreeBSD, but that should not be hard to address. I was planning to get to it as soon as a new version of zfs on linux with the necessary flags is available. However, if you are interested in that and ready to help with testing -- feel free to poke me so it could be done sooner.


Receive Side Scaling: figuring out how to handle IP fragments

The TL:DR; of this is - IP fragments are annoying.

If everything was awesome and there were never IP fragments, all TCP and UDP frames would always have the TCP/UDP header stamped on them, and the NIC could hash the TCP/UDP header in hardware to calculate the destination queue to receive traffic on.

However, everything isn't awesome and there will be cases where IP frames are fragmented. When this happens, the first frame in the fragment has the IPv4 header and the TCP/UDP header - but the subsequent fragments only have the IPv4 header. That means there's not enough information in the rest of the fragments to hash them to the same hash value and thus hardware queue as the first fragment - only the first has the full IPv4+TCP/UDP information.

The Intel and Chelsio NICs will hash on all packets that are fragmented by only hashing on the IPv4 details. So, if it's a fragmented TCP or UDP frame, it will hash the first fragment the same as the others - it'll ignore the TCP/UDP details and only hash on the IPv4 frame. This means that all the fragments in a given IP datagram will hash to the same value and thus the same queue.

But if there are a mix of fragmented and non-fragmented packets in a given flow - for example, small versus larger UDP frames - then some may be hashed via the IPv4+TCP or IPv4+UDP details and some will just be hashed via the IPv4 details. This means that packets in the same flow will end up being received in different receive queues and thus highly likely be processed out of order.

The Linux intel driver code flipped off IPv4+UDP hashing a while ago - they hash UDP frames by their IPv4 details only and then do whatever other load balancing in the kernel they choose. I found this and updated the FreeBSD drivers to do the same. This should result in less out of order UDP frames for UDP heavy workloads. I'm not sure about the Chelsio driver yet - when I convert it to the RSS framework it'll disable IPv4+UDP hashing if that isn't enabled at boot time. This is a good stop-gap, but it's not the whole story.

TCP is where it gets annoying. People don't want to flip off IPv4+TCP hashing as they're convinced that the TCP MSS negotiation and path-MTU discovery stuff will prevent there from being any IP fragmented TCP frames. But, well, that's not really viable in the real world. There are too many misconfigured networks out there and IP fragmentation does occur. So this is also a problem for TCP. This means that the IPv4 fragmented TCP frames in those sessions will come into another receive queue and CPU and this will show up as out of order data.

So, what's this all have to do with receive side scaling?

With RSS, there's a well defined hash for packets and a configuration for what the operating system and NICs are supposed to be doing. It's entirely possible that we'll configure IPv4+TCP to be hashed and also entirely possible we'll see IP fragments showing up on other CPUs. So in order to have the TCP stack run on the right CPU, the IP fragments need to be assembled on whichever CPU they're received upon and then re-injected into the correct destination queue to run on the correct CPU.

Fortunately the FreeBSD netisr scheme makes this easy.

So what I'm doing in my branch (and what will soon show up in -HEAD) is thus:


  • UDP is still hashed as IPv4-only frames for now. I'll change that later to hash on IPv4+UDP and have things reinjected on the correct destination RSS bucket / netisr queue / CPU.
  • I create one netisr thread, pinned to a CPU, for each RSS CPU that's defined.
    • Ideally I'd create one netisr thread for each RSS bucket and pin that, but that'll come later.
  • IP fragments will be hashed to whatever the IPv4 hash calculates, so fragment reassembly will occur on some CPU;
    • .. and it's the same CPU for all frames in a fragmented datagram.
  • Then when the fragment is reassembled, a software hash is calculated for the newly reassembled frame.
    • If RSS is configured to hash for IPv4 only, then it'll see that the hash on the reassembled datagram matches the configured hash for that packet type and reuse it.
    • So, if it's UDP right now, it'll see that UDP is only hashing on IPv4 details and reuse it.
    • .. but if IPv4+UDP hashing is configured, it'll software hash the packet and assign the new flow type and RSS hash.
  • Then, it'll reinject the frame into netisr to be requeued and reprocessed.
  • .. this uses the nh_m2cpuid function to calculate the destination CPU for the given RSS hash.
    • If it's handled on the same destination CPU then it'll be handled.
    • If it's handled on a different destination CPU then it'll be queued to that netisr and dispatched appropriately.
This works. It's not great, and I'd rather the IP fragment reassembly code was much more efficient, but it's correct. I'm going for correctness here to begin with.

Now, before you ask - yes, IPv6 has fragments and yes, I have to do the same thing for IPv6 flows. Most of the code is written.

Finally - the same thing applies to things like IPv4 tunnels, IPv6-in-IPv4 tunnels, IPSEC tunnels and the like. The NIC hashes the packets on the IPv4 header details but once the packet is de-encapsulated, it needs to be reinjected back into the correct CPU for further processing.

Using the xdev target with qemu-user-static on #FreeBSD

I’ve been playing with building ports for ARM on an AMD64 machine via a bunch of tools.  The duct tape and bailing wire is a bit thick with this method, but if you keep at it, this should work.

1. build armv6 chroot:
make buildworld TARGET=arm TARGET_ARCH=armv6
make installworld TARGET=arm TARGET_ARCH=armv6 DESTDIR=/armv6
make distribution TARGET=arm TARGET_ARCH=armv6 DESTDIR=/armv6

2. build xdev
make xdev TARGET=arm TARGET_ARCH=armv6 NOSHARED=y

3. move xdev into chroot
mv /usr/armv6-freebsd /armv6/usr/

4. add toolchain to make.conf:
CFLAGS+=-integrated-as
CC=/usr/armv6-freebsd/usr/bin/cc
CPP=/usr/armv6-freebsd/usr/bin/cpp
CXX=/usr/armv6-freebsd/usr/bin/c++
AS=/usr/armv6-freebsd/usr/bin/as
NM=/usr/armv6-freebsd/usr/bin/nm
LD=/usr/armv6-freebsd/usr/bin/ld
OBJCOPY=/usr/armv6-freebsd/usr/bin/objcopy
SIZE=/usr/armv6-freebsd/usr/bin/size
STRIPBIN=/usr/armv6-freebsd/usr/bin/strip
5. Install qemu-static-user from ports and copy into jail:
pkg instlal qemu-static-user
mkdir -p /armv6/usr/local/bin
cp /usr/local/bin/qemu-arm /armv6/usr/local/bin/

6. setup binmiscctl to handle armv6 translations:
binmiscctl add armv6 –interpreter “/usr/local/bin/qemu-arm” –magic “x7fx45x4cx46x01x01x01x00x00x00x00x00x00x00x00x00x02x00x28x00″ –mask “xffxffxffxffxffxffxffx00xffxffxffxffxffxffxffxffxfexffxffxff” –size 20 –set-enabled

7. mount devfs and ports if needed
mount -t devfs devfs /armv6/dev
mount -t nullfs /usr/ports /armv6/usr/ports

8. chroot
chroot /armv6

pkg 1.3.0 out!

Hi all,

I’m very please to announce the release of pkg 1.3.0
This version is the result of almost 9 month of hard work

Here are the statistics for the version:
- 373 files changed, 66973 insertions(+), 38512 deletions(-)
- 29 different contributors

Please not that for the first time I’m not the main contributor, and I would
like to particularly thanks Vsevold Stakhov for all the hard work he has done to
allow us to get this release out. I would like also to give a special thanks to
Andrej Zverev for the tons of hours spending on testing and cleaning the bug
tracker!

So much has happened that it is hard to summarize so I’ll try to highlight the
major points:
- New solver, now pkg has a real SAT solver able to automatically handle
conflicts and dynamically discover them. (yes pkg set -o is deprecated now)
- pkg install now able to install local files as well and resolve their
dependencies from the remote repositories
- Lots of parts of the code has been sandboxed
- Lots of rework to improve portability
- Package installation process has been reworked to be safer and handle properly
the schg flags
- Important modification of the locking system for finer grain locks
- Massive usage of libucl
- Simplification of the API
- Lots of improvements on the UI to provide a better user experience.
- Lots of improvements in multi repository mode
- pkg audit code has been moved into the library
- pkg -o A=B that will overwrite configuration file from cli
- The ui now support long options
- The unicity of a package is not anymore origin
- Tons of bug fixes
- Tons of behaviours fixes
- Way more!

Thank you to all contributors:
Alberto Villa, Alexandre Perrin, Andrej Zverev, Antoine Brodin, Brad Davis,
Bryan Drewery, Dag-Erling Smørgrav, Dmitry Marakasov, Elvira Khabirova, Jamie
Landeg Jones, Jilles Tjoelker, John Marino, Julien Laffaye, Mathieu Arnold,
Matthew Seaman, Maximilian Gaß, Michael Gehring, Michael Gmelin, Nicolas Szalay,
Rodrigo Osorio, Roman Naumann, Rui Paulo, Sean Channel, Stanislav E. Putrya,
Vsevolod Stakhov, Xin Li, coctic

Regards,
Bapt on behalf of the pkg@

Application awareness of receive side scaling (RSS) on FreeBSD

Part of testing this receive side scaling work is designing a set of APIs that allow for some kind of affinity awareness. It's not easy - the general case is difficult and highly varying. But something has to be tested! So, where should it begin?

The main tricky part of this is the difference between incoming, outgoing and listening sockets.

For incoming traffic, the NIC has already calculated the RSS hash value and there's already a map between RSS hash and destination CPU. Well, destination queue to be much more precise; then there's a CPU for that queue.

For outgoing traffic, the thread(s) in question can be scheduled on any CPU core and as you have more cores, it's increasingly unlikely to be the right one. In FreeBSD, the default is to direct dispatch transmit related socket and protocol work in the thread that started it, save a handful of places like TCP timers. Once the driver if_transmit() method is called to transmit a frame it can check the mbuf to see what the flowid is and map that to a destination transmit queue. Before RSS, that's typically done to keep packets vaguely in some semblance of in-order behaviour - ie, for a given traffic flow between two endpoints (say, IP, or TCP, or UDP) the packets should be transmitted in-order. It wasn't really done for CPU affinity reasons.

Before RSS, there was no real consistency with how drivers hashed traffic upon receive, nor any rules on how it should select an outbound transmit queue for a given buffer. Most multi-queue drivers got it "mostly right". They definitely didn't try to make any CPU affinity choices - it was all done to preserve the in-order behaviour of traffic flows.

For an incoming socket, all the information about the destination CPU can be calculated from the RSS hash provided during frame reception. So, for TCP, the RSS hash for the received ACK during the three way handshake goes into the inpcb entry. For UDP it's not so simple (and the inpcb doesn't get a hash entry for UDP - I'll explain why below.)

For an outgoing socket, all the information about the eventual destination CPU isn't necessarily available. If the application knows the source/destination IP and source/destination port then it (or the kernel) can calculate the RSS hash that the hardware would calculate upon frame reception and use that to populate the inpcb. However this isn't typically known - frequently the source IP and port won't be explicitly defined and it'll be up to the kernel to choose them for the application. So, during socket creation, the destination CPU can't be known.

So to make it simple (and to make it simple for me to ensure the driver and protocol stack parts are working right) my focus has been on incoming sockets and incoming packets, rather than trying to handle outgoing sockets. I can handle outbound sockets easily enough - I just need to do a software hash calculation once all of the required information is available (ie, the source IP and port is selected) and populate the inpcb with that particular value. But I decided to not have to try and debug that at the same time as I debugged the driver side and the protocol stack side, so it's a "later" task.

For TCP, traffic for a given connection will use the same source/destination IP and source/destination port values. So for a given socket, it'll always hash to the same value. However, for UDP, it's quite possible to get UDP traffic from a variety of different source IP/ports and respond from a variety of different source/IP ports. This means that the RSS hash value that we can store in the inpcb isn't at all guaranteed to be the same for all subsequent socket writes.

Ok, so given all of that above information, how exactly is this supposed to work?

Well, the slightly more interesting and pressing problem is how to break out incoming requests/packets to multiple receive threads. In traditional UNIX socket setups, there are a couple of common design patterns for farming off incoming requests to multiple worker threads:

  • There's one thread that just does accept() (for TCP) or recv() (for UDP) and it then farms off new connections to userland worker threads; or
  • There are multiple userland worker threads which all wait on a single socket for accept() or recv() - and hope that the OS will only wake up one thread to hand work to.
It turns out that the OS may wake up one thread at a time for accept() or recv() but then userland threads will sit in a loop trying to accept connections / packets - and then you tend to find they get called a lot only to find another worker thread that was running stole the workload. Oops.

I decided this wasn't really acceptable for the RSS work. I needed a way to redirect traffic to a thread that's also pinned to the same CPU as the receive RSS bucket. I decided the cheapest way would be to allow multiple PCB entries for the same socket details (eg, multiple TCP sockets listening on *:80). Since the PCBGROUPS code in this instance has one PCB hash per RSS bucket, all I had to do was to teach the stack that wildcard listen PCB entries (eg, *:80) could also exist in each PCB hash bucket and to use those in preference to the global PCB hash.

The idea behind this decision is pretty simple - Robert Watson already did all this great work in setting up and debugging PCBGROUPS and then made the RSS work leverage that. All I'd have to do is to have one userland thread in each RSS bucket and have the listen socket for that thread be in the RSS bucket. Then any incoming packet would first check the PCBGROUP that matched the RSS bucket indicated by the RSS hash from the hardware - and it'd find the "right" PCB entry in the "right" PCBGROUP PCB has table for the "right" RSS bucket.

That's what I did for both TCP and UDP.

So the programming model is thus:

  • First, query the RSS sysctl (net.inet.rss) for the RSS configuration - this gives the number of RSS buckets and the RSS bucket -> CPU mapping.
  • Then create one worker thread per RSS bucket..
  • .. and pin each thread to the indicated CPU.
  • Next, each worker thread creates one listen socket..
  • .. sets the IP_BINDANY or IP6_BINDANY option to indicate that there'll be multiple RSS entries bound to the given listen details (eg, binding to *:80);
  • .. then IP_RSS_LISTEN_BUCKET to set which RSS bucket the incoming socket should live in;
  • Then for UDP - call bind()
  • Or for TCP - call bind(), then call listen()
Each worker thread will then receive TCP connections / UDP frames that are local to that CPU. Writing data out the TCP socket will also stay local to that CPU. Writing UDP frames out doesn't - and I'm about to cover that.

Yes, it's annoying because now you're not just able to choose an IO model that's convenient for your application / coding style. Oops.

Ok, so what's up with UDP?

The problem with UDP is that outbound responses may be to an arbitrary destination setup and thus may actually be considered "local" to another CPU. Most common services don't do this - they'll send the UDP response to the same remote IP and port that it was sent from.

My plan for UDP (and TCP in some instances, see below!) is four-fold:

  • When receiving UDP frames, optionally mark them with RSS hash and flowid information.
  • When transmitting UDP frames, allow userspace to inform the kernel about a pre-calculated RSS hash / flow information.
  • For the fully-connected setup (ie, where a single socket is connect() ed to a given UDP remote IP:port and frame exchange only occurs between the fixed IP and port details) - cache the RSS flow information in the inpcb;
  • .. and for all other situations (if it's not connected, if there's no hint from userland, if it's going to a destination that isn't in the inpcb) - just do a software hash calculation on the outgoing details.
I mostly have the the first two UDP options implemented (ie, where userland caches the information to re-use when transmitting the response) and I'll commit them to FreeBSD soon. The other two options are the "correct" way to do the default methods but it'll take some time to get right.

Ok, so does it work?

I don't have graphs. Mostly because I'm slack. I'll do up some before I present this - likely at BSDCan 2015.

My testing has been done with Intel 1G and 10G NICs on desktop Ivy Bridge 4-core hardware. So yes, server class hardware will behave better.

For incoming TCP workloads (eg a webserver) then yes, there's no lock contention between CPUs in the NIC driver or network stack any longer. The main lock contention between CPUs is the VM and allocator paths. If you're doing disk IO then that'll also show up.

For incoming UDP workloads, I've seen it scale linearly on 10G NICs (ixgbe(4)) from one to four cores. This is with no-defragmentation, 510 byte sized datagrams.

Ie, 1 core reception (ie, all flows to one core) was ~ 250,000 pps into userland with just straight UDP reception and no flow/hash information via recvmsg(); 135,000 pps into userland with UDP reception and flow/hash information via recvmsg().

4 core reception was ~ 1.1 million pps into userland, roughly ~ 255,000 pps per core. There's no contention between CPU cores at all.

Unfortunately what I was sending was markedly different. The driver quite happily received 1.1 million frames on one queue and up to 2.1 million when all four queues were busy. So there's definitely room for improvement.

Now, there is lock contention - it's just not between CPU cores. Now that I'm getting past the between-core contention, we see the within-core contention.

For TCP HTTP request reception and bulk response transmission, most of the contention I'm currently seeing is between the driver transmit paths. So, the following occurs:

  • TCP stack writes some data out;
  • NIC if_transmit() method is called;
  • It tries to grab the queue lock and succeeds;
It then appends the frame to the buf_ring and schedules a transmit out the NIC. This bit is fine.

But then whilst the transmit lock is held, because the driver is taking frames from the buf_ring to push into the NIC TX DMA queue
  • The NIC queue interrupt fires, scheduling the software interrupt thread;
  • This pre-empts the existing running transmit thread;
  • The NIC code tries to grab the transmit lock to handle completed transmissions;
  • .. and it fails, because the code it preempted holds the transmit lock already.
So there's some context switching and thrashing going on there which needs to be addressed.

Ok, what about UDP? It turns out there's some lock contention with the socket receive buffer.

The soreceive_dgram() routine grabs the socket receive buffer (SOCKBUF_LOCK()) to see if there's anything to return. If not, and if it can sleep, it'll call sbwait() that will release the lock and msleep() waiting for the protocol stack to indicate that something has been received. However, since we're receiving packets at such a very high rate, it seems that the receive protocol path contends with the socket buffer lock that is held by the userland code trying to receive a datagram. It pre-empts the user thread, tries to grab the lock and fails - and then goes to sleep until the userland code finishes with the lock. soreceive_dgram() doesn't hold the lock for very long - but I do see upwards of a million context switches a second.

To wrap up - I'm pleased with how things are going. I've found and fixed some issues with the igb(4) and ixgbe(4) drivers that were partly my fault and the traffic is now quite happily and correctly being processed in parallel. There are issues with scaling within a core that are now being exposed and I'm glad to say I'm going to ignore them for now and focus on wrapping up what I've started.

There's a bunch more to talk about and I'm going to do it in follow-up posts.
  • what I'm going to do about UDP transmit in more detail;
  • what about creating outbound connections and how applications can be structured to handle this;
  • handling IP fragments and rehashing packets to be mostly in-order - and what happens when we can't guarantee ordering with the hardware hashing UDP frames to a 4-tuple;
  • CPU hash rebalancing - what if a specific bucket gets too much CPU load for some reason;
  • randomly creating a toeplitz RSS hash key at bootup and how that should be verified;
  • multi-socket CPU and IO domain awareness;
  • .. and whatever else I'm going to stumble across whilst I'm slowly fleshing this stuff out.
I hope to get the UDP transmit side of things completed in the next couple of weeks so I can teach memcached about TCP and UDP RSS. After that, who knows!

Using qemu-user to chroot and bootstrap other architectures on #FreeBSD

My last post spawned enough feedback that I thought I would dump some notes here for those interested in building a chroot on FreeBSD that allows you to test and prototype architectures, e.g. ARMv6 on AMD64.

The FreeBSD buildsys has many targets used for many things, the two we care about here are buildworld and distribution.  We will also be changing the output architecture through the use of TARGET and TARGET_ARCH command line variables.  I’ll assume csh is your shell here, just for simplicity.  You’ll need 10stable or 11current to do this, as it requires the binary activator via binmiscctl(8) which has not appeared in a release version of FreeBSD yet.

Checkout the FreeBSD source tree somewhere, your home directory will be fine and start a buildworld.  This will take a while, so get a cup of tea and relax.

make -s -j <number of cpus on your machine> buildworld TARGET=mips TARGET_ARCH=mips64 MAKEOBJDIRPREFIX=/var/tmp

Some valid combinations of TARGET/TARGET_ARCH are:

mips:mips

mips:mip64

arm:armv6

sparc64:sparc64

powerpc:powerpc

powerpc:powerpc64

i386:i386

amd64:amd64

Once this is done, you have an installable tree in /var/tmp.  You need to be root for the next few steps, su now and execute these steps:

make -s installworld TARGET=mips TARGET_ARCH=mips64 MAKEOBJDIRPREFIX=/var/tmp DESTDIR=/opt/test

DESTDIR is where you intend on placing the installed FreeBSD system.  I chose /opt/test here only because I wanted to be FAR away from anything in my running system.  Just to be clear here, this will crush and destroy your host computer without DESTDIR set.

Next, there are some tweaks that have to be done by the buildsys, so run this command as root:

make -s distribution TARGET=mips TARGET_ARCH=mips64 MAKEOBJDIRPREFIX=/var/tmp DESTDIR=/opt/test

Now we need to install the emulator tools (QEMU) to allow us to use the chroot on our system.  I suggest using emulators/qemu-user-static for this as Juergen Lock has set it up for exactly this purpose.  It will install only the tools you need here.

Once that is installed, via pkg or ports, setup your binary activator module for the architecture of your chroot.  Use the listed options on the QEMU user mode wiki page for the architecture you want.  I know the arguments are not straight forward, but there should be examples for the target that you are looking for.

For this mips/mips64 example:

binmiscctl add mips64elf –interpreter “/usr/local/bin/qemu-mips64-static”
–magic “x7fx45x4cx46x02x02x01x00x00x00x00x00x00x00x00x00x00x02x00x08″
–mask “xffxffxffxffxffxffxffx00xffxffxffxffxffxffxffxffxffxfexffxff”
–size 20 –set-enabled

Copy the binary qemu that you setup in this step *into* the chroot environment:

mkdir -p /opt/tmp/usr/local/bin

cp /usr/local/bin/qemu-mips64-static /opt/tmp/usr/local/bin/

Mount devfs into the chroot:

mount -t devfs devfs /opt/tmp/dev

Want to try building ports in your chroot?  Mount the ports tree in via nullfs:

mkdir /opt/tmp/usr/ports

mount -t nullfs /usr/ports /opt/tmp/usr/ports

And now, through the QEMU and FreeBSD, you can simply chroot into the environment:

chroot /opt/tmp

Hopefully, you can now “do” things as though you were running on a MIPS64 or whatever architecture machine you have as a target.

arm:armv6, mips:mips, mips:mips64 are working at about %80-90 functionality.  powerpc:powerpc64 and powerpc:powerpc are still a work in progress and need more work.  sparc64:sparc64 immediately aborts and probably needs someone with an eye familiar with the architecture to give QEMU a look.  If you are interested in further development of the qemu-user targets, please see my github repo and clone away.

If you are looking to see what needs to be done, Stacey Son has kept an excellent log of open item on the FreeBSD Wiki