Farewell beloved Canadian

Last month, our beloved Canadian Thomas Abthorpe decided to step
down from his portmgr-secretary position. While I suspect this is secretly
related to his pool of Canadian jokes having dried up, the official reason is
that Thomas wants to focus more on his private and professional lives for the
moment. Needless to say, the whole ports community is in mourning.

BSDCan Trip Report: Baptiste Daroussin

The next trip report is from Baptiste Daroussin:

Thanks to the FreeBSD Foundation I was able to attend BSDCan 2014.

I arrived in Ottawa on Tuesday evening and went directly to the Royal Oak where I met other FreeBSD developers.

On Wednesday, the DevSummit started with the FreeBSD future plans where I was mainly interested in pushing subjects like packaging base, dma(8) integration, improvements in kqueue, and status of the toolchain.

The afternoon was mainly spent meeting with many other developers to talk face to face on subjects which usually take a while to resolve via mail.

Thursday started with the ports and package session where I talked about the status of the package distribution: from building packages to distributing packages on the FreeBSD cluster. I gave a brief status about pkg(8). We talked about the pkg_tools decomission. We had a long and interesting discussion about the future of the ports tree. The other subjects we talked about were packaging-base, continuous integration of the ports tree, cross building packages, and the license framework.

Like the previous day, I spent the afternoon discussing pkg(8) with other developers, as well as phabricator, and discussing with clusteradm about different possibilities for distributed "extra" packages repositories.

On Friday and Saturday the main conference took place. There were plenty of different interesting talks I went to.

The main interesting one for me was " The architecture of the new solver in pkg" by Vsevolod Stakhov as it gave me more details about his wonderful work on pkg during GSoC 2013!

This conference has been really succesful for me. It was the first time we were able to get 4 pkg developers together: Vsevolod Stakhov (vsevolod@), Bryan Drewery (bdrewery@), Matthew Seaman (matthew@), and myself. I found it really productive to exchange ideas, share problems, and simply have discussion.

This conference also allows me to talk with clusteradm people, in particular Glen Barber (gjb), Peter Wemm (peter@), and Sean Bruno (sbruno@)

There was also the opportunity for 4 portmgrs, a future portmgr, and a former portmgr to have an informal meeting which was really great!

July/August Issue of The FreeBSD Journal Now Available

The fourth issue of the online FreeBSD Journal is now available! The issue is all about FreeBSD and Virtualization and includes topics such as FreeBSD on Amazon's EC2, and FreeBSD's own native virtualization system, bhyve. Plus, you'll find pieces on Xen, the USE Method, and more. The FreeBSD Journal is available at the Apple, Google, and Kindle stores at $19.99/year for six (6) issues or $6.99 for a single issue. Not a subscriber? Find out more and subscribe today!

BSDCan Trip Report: Mark Linimon

The next trip report is from Mark Linimon:

The first day, Tuesday, was an unoffficial day, spent socializing.

The Developer's Summit began Wednesday.  My main interest was to attend the "FreeBSD future plans" session.   Of particular interest was the discussion about how Release Engineering should look in the future. The ports team has done a great deal of work to decouple ports releases from src releases.  This required both changes in the way packages were built, as well as a substantial amount of new hardware to be able to build multiple package sets simultaneously.  (Much of this hardware was purchased by the Foundation).  This was the first change that many of the src and docs people had been brought up to speed on these developments.

Thursday, of course, my main interest was the Ports and Packages session.

In the evening, I was invited to an informal meeting with the various Ports Management Team (portmgrs) who were in attendance.  (I had previously served for several years on this team.)  Somehow, I was "volunteered" to rejoin the Ports Management team with an "advisor" status.  Clearly, peer pressure works.

Friday the conference itself started.  I spent some of the day trying to catch up on rest from the hectic first two days, and then socialized in the evening.

On Saturday, the most interesting session was the FreeNAS development talk.  While it was informative, there was also an opportunity to heckle John Hixson.

Perhaps the most important task that I accomplished during the conference was to sit down with Bryan Drewery and discuss future software improvements to the Ports Monitoring System (portsmon), which I wrote.

portsmon has survived many changes in FreeBSD.  The first was from CVS to SVN.  More recently, the ports build farm has been switched over from the old portbuild codebase to a completely rewritten system.  Our discussion dealt with the changes that I needed to make to port over to the new system; what the future changes to the new system would be; and changes that I requested that would make portsmon's job easier.  These changes have now been incorporated.  The next task is to catch up with the change from GNATS to Bugzilla; by that point, all of the inputs to portsmon will have been switched over from their initial codebase.

Receive Side Scaling: figuring out how to handle IP fragments

The TL:DR; of this is - IP fragments are annoying.

If everything was awesome and there were never IP fragments, all TCP and UDP frames would always have the TCP/UDP header stamped on them, and the NIC could hash the TCP/UDP header in hardware to calculate the destination queue to receive traffic on.

However, everything isn't awesome and there will be cases where IP frames are fragmented. When this happens, the first frame in the fragment has the IPv4 header and the TCP/UDP header - but the subsequent fragments only have the IPv4 header. That means there's not enough information in the rest of the fragments to hash them to the same hash value and thus hardware queue as the first fragment - only the first has the full IPv4+TCP/UDP information.

The Intel and Chelsio NICs will hash on all packets that are fragmented by only hashing on the IPv4 details. So, if it's a fragmented TCP or UDP frame, it will hash the first fragment the same as the others - it'll ignore the TCP/UDP details and only hash on the IPv4 frame. This means that all the fragments in a given IP datagram will hash to the same value and thus the same queue.

But if there are a mix of fragmented and non-fragmented packets in a given flow - for example, small versus larger UDP frames - then some may be hashed via the IPv4+TCP or IPv4+UDP details and some will just be hashed via the IPv4 details. This means that packets in the same flow will end up being received in different receive queues and thus highly likely be processed out of order.

The Linux intel driver code flipped off IPv4+UDP hashing a while ago - they hash UDP frames by their IPv4 details only and then do whatever other load balancing in the kernel they choose. I found this and updated the FreeBSD drivers to do the same. This should result in less out of order UDP frames for UDP heavy workloads. I'm not sure about the Chelsio driver yet - when I convert it to the RSS framework it'll disable IPv4+UDP hashing if that isn't enabled at boot time. This is a good stop-gap, but it's not the whole story.

TCP is where it gets annoying. People don't want to flip off IPv4+TCP hashing as they're convinced that the TCP MSS negotiation and path-MTU discovery stuff will prevent there from being any IP fragmented TCP frames. But, well, that's not really viable in the real world. There are too many misconfigured networks out there and IP fragmentation does occur. So this is also a problem for TCP. This means that the IPv4 fragmented TCP frames in those sessions will come into another receive queue and CPU and this will show up as out of order data.

So, what's this all have to do with receive side scaling?

With RSS, there's a well defined hash for packets and a configuration for what the operating system and NICs are supposed to be doing. It's entirely possible that we'll configure IPv4+TCP to be hashed and also entirely possible we'll see IP fragments showing up on other CPUs. So in order to have the TCP stack run on the right CPU, the IP fragments need to be assembled on whichever CPU they're received upon and then re-injected into the correct destination queue to run on the correct CPU.

Fortunately the FreeBSD netisr scheme makes this easy.

So what I'm doing in my branch (and what will soon show up in -HEAD) is thus:

  • UDP is still hashed as IPv4-only frames for now. I'll change that later to hash on IPv4+UDP and have things reinjected on the correct destination RSS bucket / netisr queue / CPU.
  • I create one netisr thread, pinned to a CPU, for each RSS CPU that's defined.
    • Ideally I'd create one netisr thread for each RSS bucket and pin that, but that'll come later.
  • IP fragments will be hashed to whatever the IPv4 hash calculates, so fragment reassembly will occur on some CPU;
    • .. and it's the same CPU for all frames in a fragmented datagram.
  • Then when the fragment is reassembled, a software hash is calculated for the newly reassembled frame.
    • If RSS is configured to hash for IPv4 only, then it'll see that the hash on the reassembled datagram matches the configured hash for that packet type and reuse it.
    • So, if it's UDP right now, it'll see that UDP is only hashing on IPv4 details and reuse it.
    • .. but if IPv4+UDP hashing is configured, it'll software hash the packet and assign the new flow type and RSS hash.
  • Then, it'll reinject the frame into netisr to be requeued and reprocessed.
  • .. this uses the nh_m2cpuid function to calculate the destination CPU for the given RSS hash.
    • If it's handled on the same destination CPU then it'll be handled.
    • If it's handled on a different destination CPU then it'll be queued to that netisr and dispatched appropriately.
This works. It's not great, and I'd rather the IP fragment reassembly code was much more efficient, but it's correct. I'm going for correctness here to begin with.

Now, before you ask - yes, IPv6 has fragments and yes, I have to do the same thing for IPv6 flows. Most of the code is written.

Finally - the same thing applies to things like IPv4 tunnels, IPv6-in-IPv4 tunnels, IPSEC tunnels and the like. The NIC hashes the packets on the IPv4 header details but once the packet is de-encapsulated, it needs to be reinjected back into the correct CPU for further processing.

Using the xdev target with qemu-user-static on #FreeBSD

I’ve been playing with building ports for ARM on an AMD64 machine via a bunch of tools.  The duct tape and bailing wire is a bit thick with this method, but if you keep at it, this should work.

1. build armv6 chroot:
make buildworld TARGET=arm TARGET_ARCH=armv6
make installworld TARGET=arm TARGET_ARCH=armv6 DESTDIR=/armv6
make distribution TARGET=arm TARGET_ARCH=armv6 DESTDIR=/armv6

2. build xdev
make xdev TARGET=arm TARGET_ARCH=armv6 NOSHARED=y

3. move xdev into chroot
mv /usr/armv6-freebsd /armv6/usr/

4. add toolchain to make.conf:
5. Install qemu-static-user from ports and copy into jail:
pkg instlal qemu-static-user
mkdir -p /armv6/usr/local/bin
cp /usr/local/bin/qemu-arm /armv6/usr/local/bin/

6. setup binmiscctl to handle armv6 translations:
binmiscctl add armv6 –interpreter “/usr/local/bin/qemu-arm” –magic “x7fx45x4cx46x01x01x01x00x00x00x00x00x00x00x00x00x02x00x28x00″ –mask “xffxffxffxffxffxffxffx00xffxffxffxffxffxffxffxffxfexffxffxff” –size 20 –set-enabled

7. mount devfs and ports if needed
mount -t devfs devfs /armv6/dev
mount -t nullfs /usr/ports /armv6/usr/ports

8. chroot
chroot /armv6

Foundation is Accepting Travel Grant Applications for EuroBSDCon 2014

Calling all FreeBSD developers needing assistance with travel expenses to EuroBSDCon 2014.

The FreeBSD Foundation will be providing a limited number of travel grants to individuals requesting assistance. Please fill out and submit  the Travel Grant Request Application at http://www.freebsdfoundation.org/documents/TravelRequestForm.pdf by August 15th, 2014 to apply for this grant.

This program is open to FreeBSD developers of all sorts (kernel hackers,  documentation authors, bugbusters, system administrators, etc).  In some cases we are also able to fund non-developers, such as active community members and FreeBSD advocates.

If you are a speaker at the conference, we expect the conference to cover your travel costs, and will most likely not approve your direct request to us.

There's some flexibility in the mechanism, so talk to us if something about the model doesn't quite work for you or if you have any questions. The travel grant program is one of the most effective ways we can spend money to help support the FreeBSD Project, as it helps developers get together in the same place at the same time, and helps advertise and advocate FreeBSD in the larger community.

bsdtalk243 – mandoc with Ingo Schwarze

Interview about mandoc with Ingo Schwarze.  The project webpage describes mandoc as "a suite of tools compiling mdoc, the roff macro language of choice for BSD manual pages, and man, the predominant historical language for UNIX manuals."

Recorded at BSDCan 2014.

File Info: 16Min, 8MB.

Ogg Link: http://cis01.uma.edu/~wbackman/bsdtalk/bsdtalk243.ogg

pkg 1.3.0 out!

Hi all,

I’m very please to announce the release of pkg 1.3.0
This version is the result of almost 9 month of hard work

Here are the statistics for the version:
- 373 files changed, 66973 insertions(+), 38512 deletions(-)
- 29 different contributors

Please not that for the first time I’m not the main contributor, and I would
like to particularly thanks Vsevold Stakhov for all the hard work he has done to
allow us to get this release out. I would like also to give a special thanks to
Andrej Zverev for the tons of hours spending on testing and cleaning the bug

So much has happened that it is hard to summarize so I’ll try to highlight the
major points:
- New solver, now pkg has a real SAT solver able to automatically handle
conflicts and dynamically discover them. (yes pkg set -o is deprecated now)
- pkg install now able to install local files as well and resolve their
dependencies from the remote repositories
- Lots of parts of the code has been sandboxed
- Lots of rework to improve portability
- Package installation process has been reworked to be safer and handle properly
the schg flags
- Important modification of the locking system for finer grain locks
- Massive usage of libucl
- Simplification of the API
- Lots of improvements on the UI to provide a better user experience.
- Lots of improvements in multi repository mode
- pkg audit code has been moved into the library
- pkg -o A=B that will overwrite configuration file from cli
- The ui now support long options
- The unicity of a package is not anymore origin
- Tons of bug fixes
- Tons of behaviours fixes
- Way more!

Thank you to all contributors:
Alberto Villa, Alexandre Perrin, Andrej Zverev, Antoine Brodin, Brad Davis,
Bryan Drewery, Dag-Erling Smørgrav, Dmitry Marakasov, Elvira Khabirova, Jamie
Landeg Jones, Jilles Tjoelker, John Marino, Julien Laffaye, Mathieu Arnold,
Matthew Seaman, Maximilian Gaß, Michael Gehring, Michael Gmelin, Nicolas Szalay,
Rodrigo Osorio, Roman Naumann, Rui Paulo, Sean Channel, Stanislav E. Putrya,
Vsevolod Stakhov, Xin Li, coctic

Bapt on behalf of the pkg@

vc4 driver month 1

I've just pushed the vc4-sim-validate branch to my Mesa tree. It's the culmination of the last week's worth pondering and false starts since I got my first texture sampling in simulation last Wednesday.

Handling texturing on vc4 safely is a pain. The pointer to texture contents doesn't appear in the normal command stream, and instead it's in the uniform stream. Which uniform happens to contain the pointer depends on how many uniforms have been loaded by the time you get to the QPU_W_TMU[01]_[STRB] writes. Since there's no iommu, I can't trust userspace to tell me where the uniform is, otherwise I'd be allowing them to just lie and put in physical addresses and read arbitrary system memory.

This meant I had to write a shader parser for the kernel, have that spit out a collection of references to texture samples, switch the uniform data from living in BOs in the user -> kernel ABI and instead be passed in as normal system memory that gets copied to the temporary exec bo, and then do relocations on that.

Instead of trying to write this in the kernel, with a ~10 minute turnaround time per test run, I copied my kernel code into Mesa with a little bit of wrapper code to give a kernel-like API environment, and did my development on that. When I'm looking at possibly 100s of iterations to get all the validation code working, it was well worth the day spent to build that infrastructure so that I could get my testing turnaround time down to about 15 sec.

I haven't done actual validation to make sure that the texture samples don't access outside of the bounds of the texture yet (though I at least have the infrastructure necessary now), just like I haven't done that validation for so many other pointers (vertex fetch, tile load/stores, etc.). I also need to copy the code back out to the kernel driver, and it really deserves some cleanups to add sanity to the many different addresses involved (unvalidated vaddr, validated vaddr, and validated paddr of the data for each of render, bin, shader recs, uniforms). But hopefully once I do that, I can soon start bringing up glamor on the Pi (though I've got some major issue with tile allocation BO memory management before anything's stable on the Pi).

Application awareness of receive side scaling (RSS) on FreeBSD

Part of testing this receive side scaling work is designing a set of APIs that allow for some kind of affinity awareness. It's not easy - the general case is difficult and highly varying. But something has to be tested! So, where should it begin?

The main tricky part of this is the difference between incoming, outgoing and listening sockets.

For incoming traffic, the NIC has already calculated the RSS hash value and there's already a map between RSS hash and destination CPU. Well, destination queue to be much more precise; then there's a CPU for that queue.

For outgoing traffic, the thread(s) in question can be scheduled on any CPU core and as you have more cores, it's increasingly unlikely to be the right one. In FreeBSD, the default is to direct dispatch transmit related socket and protocol work in the thread that started it, save a handful of places like TCP timers. Once the driver if_transmit() method is called to transmit a frame it can check the mbuf to see what the flowid is and map that to a destination transmit queue. Before RSS, that's typically done to keep packets vaguely in some semblance of in-order behaviour - ie, for a given traffic flow between two endpoints (say, IP, or TCP, or UDP) the packets should be transmitted in-order. It wasn't really done for CPU affinity reasons.

Before RSS, there was no real consistency with how drivers hashed traffic upon receive, nor any rules on how it should select an outbound transmit queue for a given buffer. Most multi-queue drivers got it "mostly right". They definitely didn't try to make any CPU affinity choices - it was all done to preserve the in-order behaviour of traffic flows.

For an incoming socket, all the information about the destination CPU can be calculated from the RSS hash provided during frame reception. So, for TCP, the RSS hash for the received ACK during the three way handshake goes into the inpcb entry. For UDP it's not so simple (and the inpcb doesn't get a hash entry for UDP - I'll explain why below.)

For an outgoing socket, all the information about the eventual destination CPU isn't necessarily available. If the application knows the source/destination IP and source/destination port then it (or the kernel) can calculate the RSS hash that the hardware would calculate upon frame reception and use that to populate the inpcb. However this isn't typically known - frequently the source IP and port won't be explicitly defined and it'll be up to the kernel to choose them for the application. So, during socket creation, the destination CPU can't be known.

So to make it simple (and to make it simple for me to ensure the driver and protocol stack parts are working right) my focus has been on incoming sockets and incoming packets, rather than trying to handle outgoing sockets. I can handle outbound sockets easily enough - I just need to do a software hash calculation once all of the required information is available (ie, the source IP and port is selected) and populate the inpcb with that particular value. But I decided to not have to try and debug that at the same time as I debugged the driver side and the protocol stack side, so it's a "later" task.

For TCP, traffic for a given connection will use the same source/destination IP and source/destination port values. So for a given socket, it'll always hash to the same value. However, for UDP, it's quite possible to get UDP traffic from a variety of different source IP/ports and respond from a variety of different source/IP ports. This means that the RSS hash value that we can store in the inpcb isn't at all guaranteed to be the same for all subsequent socket writes.

Ok, so given all of that above information, how exactly is this supposed to work?

Well, the slightly more interesting and pressing problem is how to break out incoming requests/packets to multiple receive threads. In traditional UNIX socket setups, there are a couple of common design patterns for farming off incoming requests to multiple worker threads:

  • There's one thread that just does accept() (for TCP) or recv() (for UDP) and it then farms off new connections to userland worker threads; or
  • There are multiple userland worker threads which all wait on a single socket for accept() or recv() - and hope that the OS will only wake up one thread to hand work to.
It turns out that the OS may wake up one thread at a time for accept() or recv() but then userland threads will sit in a loop trying to accept connections / packets - and then you tend to find they get called a lot only to find another worker thread that was running stole the workload. Oops.

I decided this wasn't really acceptable for the RSS work. I needed a way to redirect traffic to a thread that's also pinned to the same CPU as the receive RSS bucket. I decided the cheapest way would be to allow multiple PCB entries for the same socket details (eg, multiple TCP sockets listening on *:80). Since the PCBGROUPS code in this instance has one PCB hash per RSS bucket, all I had to do was to teach the stack that wildcard listen PCB entries (eg, *:80) could also exist in each PCB hash bucket and to use those in preference to the global PCB hash.

The idea behind this decision is pretty simple - Robert Watson already did all this great work in setting up and debugging PCBGROUPS and then made the RSS work leverage that. All I'd have to do is to have one userland thread in each RSS bucket and have the listen socket for that thread be in the RSS bucket. Then any incoming packet would first check the PCBGROUP that matched the RSS bucket indicated by the RSS hash from the hardware - and it'd find the "right" PCB entry in the "right" PCBGROUP PCB has table for the "right" RSS bucket.

That's what I did for both TCP and UDP.

So the programming model is thus:

  • First, query the RSS sysctl (net.inet.rss) for the RSS configuration - this gives the number of RSS buckets and the RSS bucket -> CPU mapping.
  • Then create one worker thread per RSS bucket..
  • .. and pin each thread to the indicated CPU.
  • Next, each worker thread creates one listen socket..
  • .. sets the IP_BINDANY or IP6_BINDANY option to indicate that there'll be multiple RSS entries bound to the given listen details (eg, binding to *:80);
  • .. then IP_RSS_LISTEN_BUCKET to set which RSS bucket the incoming socket should live in;
  • Then for UDP - call bind()
  • Or for TCP - call bind(), then call listen()
Each worker thread will then receive TCP connections / UDP frames that are local to that CPU. Writing data out the TCP socket will also stay local to that CPU. Writing UDP frames out doesn't - and I'm about to cover that.

Yes, it's annoying because now you're not just able to choose an IO model that's convenient for your application / coding style. Oops.

Ok, so what's up with UDP?

The problem with UDP is that outbound responses may be to an arbitrary destination setup and thus may actually be considered "local" to another CPU. Most common services don't do this - they'll send the UDP response to the same remote IP and port that it was sent from.

My plan for UDP (and TCP in some instances, see below!) is four-fold:

  • When receiving UDP frames, optionally mark them with RSS hash and flowid information.
  • When transmitting UDP frames, allow userspace to inform the kernel about a pre-calculated RSS hash / flow information.
  • For the fully-connected setup (ie, where a single socket is connect() ed to a given UDP remote IP:port and frame exchange only occurs between the fixed IP and port details) - cache the RSS flow information in the inpcb;
  • .. and for all other situations (if it's not connected, if there's no hint from userland, if it's going to a destination that isn't in the inpcb) - just do a software hash calculation on the outgoing details.
I mostly have the the first two UDP options implemented (ie, where userland caches the information to re-use when transmitting the response) and I'll commit them to FreeBSD soon. The other two options are the "correct" way to do the default methods but it'll take some time to get right.

Ok, so does it work?

I don't have graphs. Mostly because I'm slack. I'll do up some before I present this - likely at BSDCan 2015.

My testing has been done with Intel 1G and 10G NICs on desktop Ivy Bridge 4-core hardware. So yes, server class hardware will behave better.

For incoming TCP workloads (eg a webserver) then yes, there's no lock contention between CPUs in the NIC driver or network stack any longer. The main lock contention between CPUs is the VM and allocator paths. If you're doing disk IO then that'll also show up.

For incoming UDP workloads, I've seen it scale linearly on 10G NICs (ixgbe(4)) from one to four cores. This is with no-defragmentation, 510 byte sized datagrams.

Ie, 1 core reception (ie, all flows to one core) was ~ 250,000 pps into userland with just straight UDP reception and no flow/hash information via recvmsg(); 135,000 pps into userland with UDP reception and flow/hash information via recvmsg().

4 core reception was ~ 1.1 million pps into userland, roughly ~ 255,000 pps per core. There's no contention between CPU cores at all.

Unfortunately what I was sending was markedly different. The driver quite happily received 1.1 million frames on one queue and up to 2.1 million when all four queues were busy. So there's definitely room for improvement.

Now, there is lock contention - it's just not between CPU cores. Now that I'm getting past the between-core contention, we see the within-core contention.

For TCP HTTP request reception and bulk response transmission, most of the contention I'm currently seeing is between the driver transmit paths. So, the following occurs:

  • TCP stack writes some data out;
  • NIC if_transmit() method is called;
  • It tries to grab the queue lock and succeeds;
It then appends the frame to the buf_ring and schedules a transmit out the NIC. This bit is fine.

But then whilst the transmit lock is held, because the driver is taking frames from the buf_ring to push into the NIC TX DMA queue
  • The NIC queue interrupt fires, scheduling the software interrupt thread;
  • This pre-empts the existing running transmit thread;
  • The NIC code tries to grab the transmit lock to handle completed transmissions;
  • .. and it fails, because the code it preempted holds the transmit lock already.
So there's some context switching and thrashing going on there which needs to be addressed.

Ok, what about UDP? It turns out there's some lock contention with the socket receive buffer.

The soreceive_dgram() routine grabs the socket receive buffer (SOCKBUF_LOCK()) to see if there's anything to return. If not, and if it can sleep, it'll call sbwait() that will release the lock and msleep() waiting for the protocol stack to indicate that something has been received. However, since we're receiving packets at such a very high rate, it seems that the receive protocol path contends with the socket buffer lock that is held by the userland code trying to receive a datagram. It pre-empts the user thread, tries to grab the lock and fails - and then goes to sleep until the userland code finishes with the lock. soreceive_dgram() doesn't hold the lock for very long - but I do see upwards of a million context switches a second.

To wrap up - I'm pleased with how things are going. I've found and fixed some issues with the igb(4) and ixgbe(4) drivers that were partly my fault and the traffic is now quite happily and correctly being processed in parallel. There are issues with scaling within a core that are now being exposed and I'm glad to say I'm going to ignore them for now and focus on wrapping up what I've started.

There's a bunch more to talk about and I'm going to do it in follow-up posts.
  • what I'm going to do about UDP transmit in more detail;
  • what about creating outbound connections and how applications can be structured to handle this;
  • handling IP fragments and rehashing packets to be mostly in-order - and what happens when we can't guarantee ordering with the hardware hashing UDP frames to a 4-tuple;
  • CPU hash rebalancing - what if a specific bucket gets too much CPU load for some reason;
  • randomly creating a toeplitz RSS hash key at bootup and how that should be verified;
  • multi-socket CPU and IO domain awareness;
  • .. and whatever else I'm going to stumble across whilst I'm slowly fleshing this stuff out.
I hope to get the UDP transmit side of things completed in the next couple of weeks so I can teach memcached about TCP and UDP RSS. After that, who knows!

Using qemu-user to chroot and bootstrap other architectures on #FreeBSD

My last post spawned enough feedback that I thought I would dump some notes here for those interested in building a chroot on FreeBSD that allows you to test and prototype architectures, e.g. ARMv6 on AMD64.

The FreeBSD buildsys has many targets used for many things, the two we care about here are buildworld and distribution.  We will also be changing the output architecture through the use of TARGET and TARGET_ARCH command line variables.  I’ll assume csh is your shell here, just for simplicity.  You’ll need 10stable or 11current to do this, as it requires the binary activator via binmiscctl(8) which has not appeared in a release version of FreeBSD yet.

Checkout the FreeBSD source tree somewhere, your home directory will be fine and start a buildworld.  This will take a while, so get a cup of tea and relax.

make -s -j <number of cpus on your machine> buildworld TARGET=mips TARGET_ARCH=mips64 MAKEOBJDIRPREFIX=/var/tmp

Some valid combinations of TARGET/TARGET_ARCH are:









Once this is done, you have an installable tree in /var/tmp.  You need to be root for the next few steps, su now and execute these steps:

make -s installworld TARGET=mips TARGET_ARCH=mips64 MAKEOBJDIRPREFIX=/var/tmp DESTDIR=/opt/test

DESTDIR is where you intend on placing the installed FreeBSD system.  I chose /opt/test here only because I wanted to be FAR away from anything in my running system.  Just to be clear here, this will crush and destroy your host computer without DESTDIR set.

Next, there are some tweaks that have to be done by the buildsys, so run this command as root:

make -s distribution TARGET=mips TARGET_ARCH=mips64 MAKEOBJDIRPREFIX=/var/tmp DESTDIR=/opt/test

Now we need to install the emulator tools (QEMU) to allow us to use the chroot on our system.  I suggest using emulators/qemu-user-static for this as Juergen Lock has set it up for exactly this purpose.  It will install only the tools you need here.

Once that is installed, via pkg or ports, setup your binary activator module for the architecture of your chroot.  Use the listed options on the QEMU user mode wiki page for the architecture you want.  I know the arguments are not straight forward, but there should be examples for the target that you are looking for.

For this mips/mips64 example:

binmiscctl add mips64elf –interpreter “/usr/local/bin/qemu-mips64-static”
–magic “x7fx45x4cx46x02x02x01x00x00x00x00x00x00x00x00x00x00x02x00x08″
–mask “xffxffxffxffxffxffxffx00xffxffxffxffxffxffxffxffxffxfexffxff”
–size 20 –set-enabled

Copy the binary qemu that you setup in this step *into* the chroot environment:

mkdir -p /opt/tmp/usr/local/bin

cp /usr/local/bin/qemu-mips64-static /opt/tmp/usr/local/bin/

Mount devfs into the chroot:

mount -t devfs devfs /opt/tmp/dev

Want to try building ports in your chroot?  Mount the ports tree in via nullfs:

mkdir /opt/tmp/usr/ports

mount -t nullfs /usr/ports /opt/tmp/usr/ports

And now, through the QEMU and FreeBSD, you can simply chroot into the environment:

chroot /opt/tmp

Hopefully, you can now “do” things as though you were running on a MIPS64 or whatever architecture machine you have as a target.

arm:armv6, mips:mips, mips:mips64 are working at about %80-90 functionality.  powerpc:powerpc64 and powerpc:powerpc are still a work in progress and need more work.  sparc64:sparc64 immediately aborts and probably needs someone with an eye familiar with the architecture to give QEMU a look.  If you are interested in further development of the qemu-user targets, please see my github repo and clone away.

If you are looking to see what needs to be done, Stacey Son has kept an excellent log of open item on the FreeBSD Wiki