Category Archives: netflix

Experimenting with zero-copy network IO in FreeBSD-HEAD

Back when I started all of this networking hacking, the "big thing" was the overhead of doing poll() and select(). Various operating systems came up with ways of eliminating these - FreeBSD grew the kqueue infrastructure; Linux received epoll; Solaris received an epoll-like device and then ended up with some form of kqueue-like event mechanism. Windows has completion ports/overlapped IO, which combine the event mechanism with a zero-copy way of doing network IO.

So the Free/Open operating systems have scalable event notification mechanisms for handling large numbers of concurrent sockets but they don't all have some nice, efficient way of doing zero-copy network IO.

Linux has splice()/tee()/vmsplice(). So yes, it effectively does have a way of doing zero-copy socket reading and writing.

OpenBSD does have a splice style syscall to copy data from a source to a destination TCP socket.

FreeBSD, however, has mostly focused on the "disk to network" path for content serving and thus has a lot of time invested in their sendfile() implementation. This is great if you're doing a lot of file to network sending (which Netflix does), but it has some serious shortcomings. The main one I'll address here is the lack of being able to do general zero-copy socket writes from userland. So it can only send data from disk files to the network. You can't implement a zero-copy intermediary proxy server, nor a memory cache that keeps things in pre-allocated memory regions. You have to use disk files (whether that be a real filesystem on disks, or a memory filesystem) and leverage VM hints to control caching.

Recently there was some new sendfile() work to allow sending from POSIX shared memory segments. This intrigued me - it's not the most effective way of doing zero-copy network IO from userland but it's a start. So I set off to write an updated version of my network library from yesteryear to implement some massively parallel network applications with.

The idea is simple - you allocate a POSIX shared memory segment. You then mmap() that region into memory and treat it as a place to allocate write-side network buffers from. Then you use the shared memory filedescriptor and offset to schedule a sendfile() from the shared memory segment to the destination network socket. It's not as elegant as having a write path that wires the memory down and just populates mbufs from that, but that'll come later.
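
To make that concrete, here's a rough sketch of the flow (untested, error handling omitted): an anonymous POSIX shared memory segment via FreeBSD's SHM_ANON, mmap()ed so userland can fill buffers, then transmitted by handing the shm file descriptor and an offset to sendfile(). The region and buffer sizes are arbitrary placeholders, and the single-buffer "allocator" is nothing like what a real library would do.

    /*
     * Sketch: allocate an anonymous POSIX shm segment (FreeBSD SHM_ANON),
     * map it, fill a buffer in it, and transmit that buffer with
     * sendfile() using the shm fd + offset. Sizes are arbitrary and
     * error handling is omitted.
     */
    #include <sys/types.h>
    #include <sys/mman.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    #define REGION_SIZE (16 * 1024 * 1024)  /* write-side buffer region */
    #define BUF_SIZE    (64 * 1024)         /* one network write buffer */

    static int
    send_one_buffer(int sock)
    {
        char *base;
        off_t sent = 0;
        int shm_fd;

        /* Anonymous POSIX shared memory segment, sized for our buffers. */
        shm_fd = shm_open(SHM_ANON, O_RDWR, 0600);
        ftruncate(shm_fd, REGION_SIZE);

        /* Map it in; buffers are carved out of this region in userland. */
        base = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
            shm_fd, 0);

        /* Fill the first buffer - in a real server this is the payload. */
        memset(base, 'A', BUF_SIZE);

        /* Transmit it: sendfile() from the shm fd at offset 0. */
        sendfile(shm_fd, sock, 0, BUF_SIZE, NULL, &sent, 0);

        munmap(base, REGION_SIZE);
        close(shm_fd);
        return (sent == BUF_SIZE ? 0 : -1);
    }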

Here's what I found.

Firstly, there's no asynchronous "I'm done!" notification for the sendfile path. So you have no explicit notification that the underlying memory has been freed so you can reuse it. sendfile() has the SF_SYNC flag which causes it to sleep until the transaction is done - primarily so users can be sure they can change the underlying file contents after the syscall completes. This is used by caches such as Varnish that leverage on-disk files as their cache filesystem space.

So I've been adding that. I have a working prototype that is scaling quite well under load and I'll look to commit it to FreeBSD-HEAD soon. It posts a knote to a kqueue file descriptor once a transaction has completed.
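
From the application side, consuming those knotes would look something like the sketch below. The EVFILT_SENDFILE filter name, the offset-as-identifier convention and buffer_release() are placeholder assumptions for illustration - this isn't a committed FreeBSD API and may not match the prototype exactly.

    /*
     * Sketch: reap sendfile completion knotes from a kqueue and recycle
     * the buffers they refer to. EVFILT_SENDFILE is a placeholder name,
     * not a stock FreeBSD filter, and buffer_release() is hypothetical.
     */
    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>

    #define NEVENTS 64

    void buffer_release(off_t offset);  /* hypothetical: return buffer to pool */

    static void
    reap_completions(int kq)
    {
        struct kevent ev[NEVENTS];
        int i, n;

        n = kevent(kq, NULL, 0, ev, NEVENTS, NULL);
        for (i = 0; i < n; i++) {
            /*
             * Identify the completed transaction by the shm offset it was
             * queued with; once the knote fires, no mbufs reference those
             * pages any more and the buffer can be reused.
             */
            buffer_release((off_t)ev[i].ident);
        }
    }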

Once that was done, I started benchmarking the performance of this setup.

The first real roadblock I hit was massive VM contention on the shared memory segment. It turns out that a single POSIX shared memory segment is represented as a single vm_object and this is protected by a single lock. So when 8 threads are actively doing IO from the same shared memory segment it hits massive lock contention. I fixed this in my test suite by allocating one shared memory segment per thread. It's not elegant but it works well enough for benchmarking.

I next hit issues with contention on the VM page lists. Besides the per-object list, there's also a global per-type list (active, inactive, etc.) There's one lock protecting each of these lists. What I found was the VM was shuffling pages between active/inactive and at the traffic rates I was doing (20+gbit/sec) it was a few hundred thousand pages a second being shuffled around. The solution? mlock() the whole region into memory. This prevented the VM from having the pages change state so often and eliminated that overhead.
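
The change itself is tiny - something like the helper below, applied to the mmap()ed region from the earlier sketch (a large region may need RLIMIT_MEMLOCK raised first):

    #include <sys/types.h>
    #include <sys/mman.h>

    /*
     * Wire the shm-backed buffer region so the pagedaemon stops moving
     * its pages between the active/inactive queues. 'base' and 'len'
     * are the mmap()ed region from the earlier sketch.
     */
    static int
    wire_region(void *base, size_t len)
    {
        return (mlock(base, len));
    }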

The code for doing this sendfile() work with posix shared memory is in my libiapp code - http://github.com/erikarn/libiapp . It's terrible and hacky - I'm just experimenting with things for now. But with some tuning, I can get a good 35Gbit/sec out of 70,000 active TCP sockets. There's still a long way to go - I shouldn't be saturating an 8-core CPU with this traffic level when I'm doing no socket data copies. I'll write another update or two about that soon.

Now, what would I like to see? I did some experiments with physical disk IO using the FreeBSD AIO paths doing the same kinds of IO patterns as I am doing with network socket IO (4KiB to 64KiB random disk reads.) It turns out if you do everything correctly, the FreeBSD AIO code will turn physical disk IO into asynchronous disk buffer transactions by wiring the userland buffer into memory and then using that as the backing buffer memory. The overhead of doing the pmap work for this was not too high. So, I wonder if it's worth writing a new transmit path that uses the pmap code (and not the VM!) to wire in a region of memory and then use that for transmit buffers. Combined with an iovec style array of buffers and the above kqueue notification of the network IO completion, I think we can end up with a much more flexible method of doing network IO from userland without the shortcomings by using POSIX shared memory with sendfile().

Finding low hanging fruit with PMC, or "O(wtf)" ?

I've lately been focusing on performance counter stuff on Sandy Bridge (Xeon and non-Xeon.) Part of this has been fixing some of the counters that were wrong. Part has been digesting the Intel tuning guides and the Intel micro-architecture for Sandy Bridge. It's a little different to the older school pipeline driven architecture that rules the MIPS world.

So, I fired up some of my scripts (at http://github.com/erikarn/hwpmc) on a live cache pushing a whole lot of live video netflix traffic. The scripts use the PMC framework in global counter mode rather than sampling mode, so it's cheap to do and doesn't affect performance.

What I found:

  1. The pipeline slots per cycle metric is around 16% - so there's a lot of stalling going on.
  2. There's a lot of memory traffic going on - around 50% of clock cycles are spent in LLC_MISS - ie, it wasn't in L1, L2 or L3/LLC (last-level cache) and thus has to be fetched from memory.

So, I started digging into why there were so many memory accesses. It turns out the biggest abuser was the cross-CPU IPI involved in synchronising page mapping tables - there are a few places calling pmap_invalidate_range() as part of sendfile() buffer completion, and this was causing issues. I pointed this out and someone else has since addressed it internally. (Ideally, if the IO path uses unmapped buffers on amd64, there shouldn't be any need to map them in and out of KVA.) I think that saved about 4% of the total clock cycles spent stalled.

Then I found a lot of stalling going on in the mwait and ACPI sleep path. It turns out that these paths seem to involve doing ISA space IO port accesses. These are .. very slow. I've just flipped my testing over to use no mwait and use HLT.

Next - flowtable had been turned on during experimentation and I had noticed that the flowtable expire/flush code would periodically spike up. It spiked up more when more clients and more TCP flows were connected. It showed up in both memory accesses and clock cycles busy PMCs - and the reason made me laugh out loud.

The flowtable uses a bitstring_t - effectively an array of bytes treated as a bitmap, like select() FD_SET's - and would walk this to look for flows to expire.

The expiry code would walk the list looking for flows to expire - it would loop over the entire set, calling ffs() over the whole set to look for the next new flow to check.

.. so it's a loop over the set, where each iteration loops over the whole set again. O(n^2). Right there, in the flow cleaning path. Doing byte offset fetches, rather than 32-bit fetches. Everything about it was ridiculous. As we scaled up to serve more flows, the flowcleaner CPU cycle count was spiking really, really hard.

I pointed this out in an email to my coworkers and fell asleep. It was fixed when I awoke - a co-worker fixed it to be correctly O(n) whilst I was sleeping. It's now totally disappeared from the CPU cycle and stall analysis.
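
For illustration only - this is not the actual flowtable code, just the shape of the problem in simplified byte-at-a-time form, with expire_flow() standing in for the real expiry work:

    /*
     * Not the flowtable code - just the shape of the problem, byte-at-a-time
     * for simplicity. expire_flow() stands in for the real expiry work.
     */
    #include <strings.h>        /* ffs() */

    void expire_flow(int flow); /* hypothetical per-flow expiry hook */

    /* O(n^2): every expired flow triggers a rescan of the whole bitmap. */
    static void
    scan_quadratic(unsigned char *bits, int nbits)
    {
        int i, bit;

        for (i = 0; i < nbits; i++) {
            for (bit = 0; bit < nbits; bit++)   /* full rescan each time */
                if (bits[bit / 8] & (1 << (bit % 8)))
                    break;
            if (bit == nbits)
                break;
            bits[bit / 8] &= ~(1 << (bit % 8));
            expire_flow(bit);
        }
    }

    /* O(n): one pass over the bitmap, handling every set bit as we go. */
    static void
    scan_linear(unsigned char *bits, int nbits)
    {
        int byte, bit;

        for (byte = 0; byte < (nbits + 7) / 8; byte++) {
            while (bits[byte] != 0) {
                bit = ffs(bits[byte]) - 1;      /* lowest set bit */
                bits[byte] &= ~(1 << bit);
                expire_flow(byte * 8 + bit);
            }
        }
    }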

So, I've just been chipping away at things here and there. There are some larger scale issues that I really want to address but I'd like to make sure all the odd, silly and remaining low hanging fruit are addressed. Then comes the fun stuff.

Adrian Chadd’s Ramblings 2013-08-19 16:52:00

This blog post is mostly so I don't forget this kind of stuff.

http://software.intel.com/sites/default/files/m/a/d/2/2/e/15529-Intel_VTune_Using.pdf mentions "% execution stalled". This is the core i7 document rather than the Sandy Bridge document, but bear with me.

The formula is:

(UOPS_EXECUTED.CORE_STALL_CYCLES / (UOPS_EXECUTED.CORE_ACTIVE_CYCLES + UOPS_EXECUTED.CORE_STALL_CYCLES)) * 100

However, there's no UOPS_EXECUTED.CORE_STALL_CYCLES in the PMC documentation, nor is it in the Intel SDM chapter on performance counters.

But wait! It kind of is there. There /is/ UOPS_EXECUTED.THREAD, which is "Counts the total number of uops to be executed per thread each cycle." In the same block, it says that to count stall cycles, set CMASK=1, INV=1. Ok, so how does one do that with PMC?

# pmcstat -S UOPS_EXECUTED.THREAD,inv,cmask=1 -T -w 5

Now, it seems to be showing me the ACPI wait and MWAIT functions as high sample events - which is odd, as I didn't think this particular PMC measured C1 and MWAIT states. I'll chase this up.

For Sandy Bridge it's UOPS_DISPATCHED.THREAD - this counts dispatched micro-operations per-thread each cycle. CMASK=1,INV=1 counts the number of stall cycles.
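
So, presumably, the Sandy Bridge analogue of the command above (carrying over the same modifier syntax) is:

# pmcstat -S UOPS_DISPATCHED.THREAD,inv,cmask=1 -T -w 5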

Profiling on superscalar architectures, or "no, instruction counts don’t necessarily matter nowadays.."

I could spend a lot of time brain dumping various things to do with profiling on anything more recent than Pentium 1 (or, well, anything remotely to do with MIPS of that era, but I digress.) In any case, there's plenty of useful writings about it so I'm not going to - just jump over to https://wiki.freebsd.org/AdrianChadd/ProfilingNotes .

However, I thought I'd provide a basic example of where "instructions" doesn't actually matter, as well as a shortcoming of the current FreeBSD tools.

My network testing stack does a whole lot of read() and write() syscalls to achieve its goal. For those who know what's going on, I hope you know where I'm taking this. Anyway..

Firstly, the standard. "pmcstat -S instructions -T". This prints a "top" like output counting instructions retired.
Figure 1. # pmcstat -S instructions -T -w 5

This looks like the contention is in the mutexes protecting socket receive and the TCP output path. Sure, but why is it contending?

The problem with doing it based on instructions retired is that it hides any issues to do with stalls. There are a bunch of sources of potential stalls - memory reads, memory writes, stuff not being in cache but being needed by instructions that are running. They're generally side-effects of operations not being able to complete in time (eg if you have a whole lot of completed operations that need to push stuff out to memory before they can continue, but there's no free bandwidth to queue memory writes), but sometimes it's just straight bulk memory copies.

If you're interested about the Intel microarchitecture and how all of these pieces fit together to process an instruction stream in parallel, with all of the memory controller paths coming in and out, have a read of this: http://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf .

Ok, so let's look at general stalls. There's a bunch of L1, L2, LLC (last level cache, think "L3" here) operations that can be looked at, as well as stuff that FreeBSD's PMC framework doesn't support  - notably some of the stuff on the "uncore" - the shared cache and pipelines between all cores on a socket. It supports the events implemented using MSRs, but not events implemented using the PCIe configuration space.

So, without further ado:

Figure 2. # pmcstat -S RESOURCE_STALLS.ANY -T -w 5

Yup. This looks much more like what I'd expect. The CPU is stalled doing copyout(). This is a mostly-read() workload, so that's what I'd expect. mb_free_ext() is interesting; I'll go look at that.

Now, copyout() is doing a bulk copy. So, yes - I'd expect that to be hurting. mb_free_ext() shouldn't be doing very much work though - I'll do some digging to see what's going on there.

The final output is from the Intel performance tuning overview tools. You can find them here - http://software.intel.com/en-us/articles/intel-performance-counter-monitor-a-better-way-to-measure-cpu-utilization . There's a nice overview tool (pcm.x) which will output the basic system overview. I like this; it gives a very simple overview of how efficient things are.
Figure 3. "pcm.x 1" running on FreeBSD-10.

Now, this isn't a stock version of pcm.x - I've hacked it up to look slightly saner when doing live reporting - but it still provides exactly the same output in that format. Note the instructions per CPU cycle and the amount of cache misses. It's .. very inefficient. Tsk.

So in summary - don't just do instruction count based profiling. You want to establish whether there are any memory and cache bottlenecks. If you're doing HPC, you want to also check to see if you're hitting SSE, FPU, divider unit and other kinds of math processing stalls.

Now - what would I like to see in FreeBSD?

  • The hwpmc framework needs to grow support for more of the socket and system events - right now it's very "core" focused.
  • Some better top-level tools to provide a system summary like Intel's pcm.x tool would be nice.
  • Some better documentation (read: better than just this wiki page!) looking at how to actually profile and understand system behaviour would be desirable.
  • .. and tying this into dtrace would be great too.
  • Once we get some (more) NUMA awareness, it would be great to have the uncore stuff reporting on things like QPI traffic, cache line and memory accesses from remote sockets/cores, and other stuff to do with NUMA allocation and awareness.
Mostly, however, I'd like to get this stuff into the hands of developers sooner rather than later so they can start running this as a normal day to day thing.

Why, oh why am I seeing RST frames from FreeBSD when I have a high connection rate?

I started seeing something odd in my testing. I was only getting around 120-odd new connections a second being accepted by the test server. I know FreeBSD needs some tuning to make it perform at high request rates, but .. hell. The odd thing? The other requests were sometimes getting RST frames (and the client would error out with a "connection reset by peer"; sometimes not.)

After doing some digging, I discovered that.. I was doing something a little odd in my testing framework and it (surprise!) elicited some very negative behaviour from FreeBSD. Said behaviour is actually valid - it's to avoid denial of service attacks. But it's worth talking about.

My test client was bursting 'n' connections per thread each second. So, I would do a test of say, 128 new connections back to back, each second, from each thread. This is definitely odd (but easy to implement!)

Here's what the server was doing.

Firstly - there's a "syncache". The syncache handles incoming embryonic requests (ie, the SYN from a remote peer.) It's separate from the rest of the TCP stack so a large flood of new connections (valid or otherwise) doesn't need to grab TCP stack locks in order to process these frames, or waste RAM with PCB (protocol control block) entries for these embryonic requests. It also makes it easier to time out half-completed requests - the PCB will only have completed or closing connections.

If the handshake succeeds but there's a failure in allocating a new PCB or socket for the connection, the TCP stack can return an RST to the peer.

If the syncache fills up, it should be sending syncookies. (google "SYN cookies" for more information.) The point of using SYN cookies is that it doesn't fill the syncache up with embryonic connections - there's a cookie that the client will reflect back to the server that validates the connection.

If the syncookie exchange succeeds but the application can't create new sockets fast enough (ie, it isn't servicing the accept() socket queue quickly enough), the TCP stack will throw an RST back at the client.

Now, for the fun bits.

  • The RST responses back to the client are rate limited - via net.inet.icmp.icmplim. Yes, it's not just for rate limiting ICMP responses.
  • So the client would see some connections hit an RST and fail immediately; others just wouldn't get the ACK and would try again, so..
  • .. over time, there'd be a burst of new connections every second from the client (causing the issue) as well as the connection retransmits for embryonic-but-not-yet-finished connections.
When I staggered the new connections into smaller, quicker bursts (so instead of 128 connections a second per thread, I'd do 12 connections every 100ms), the problem went away. This is better behaviour (I can connect thousands of new connections a second here!) but I still expect to see this problem in the real world. As I approach my intended TCP connection rate (100,000 connections a second - which isn't specifically a Netflix requirement, but an "Adrian proxy load" requirement!), I'm going to start seeing microbursts of new connections that will temporarily look like back-to-back new connections, thus triggering this behaviour.

So, to work around this for now, one just has to bump up the accept queue depth (the kern.ipc.somaxconn sysctl) to something much higher than the default of 128.
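
Remember that the sysctl is only an upper bound - the server also has to ask for the deeper queue in its listen() call, since the effective depth is the smaller of the two. A rough sketch (make_listener() is a hypothetical helper; IPv4 only, minimal error handling):

    /*
     * Sketch: a listener that asks for a deep accept queue. The effective
     * depth is min(backlog, kern.ipc.somaxconn), so both the sysctl and
     * the listen() argument need raising.
     */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <err.h>
    #include <stdint.h>
    #include <string.h>

    static int
    make_listener(uint16_t port, int backlog)
    {
        struct sockaddr_in sin;
        int s, one = 1;

        s = socket(AF_INET, SOCK_STREAM, 0);
        if (s < 0)
            err(1, "socket");
        setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_port = htons(port);
        sin.sin_addr.s_addr = htonl(INADDR_ANY);
        if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) != 0)
            err(1, "bind");

        /* eg backlog = 4096 once kern.ipc.somaxconn has been bumped */
        if (listen(s, backlog) != 0)
            err(1, "listen");
        return (s);
    }

The sysctl side is just kern.ipc.somaxconn=4096 (or similar) in /etc/sysctl.conf.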

Now - why is this happening? My theory is this:
  • We're getting this burst of frames coming in the NIC;
  • The syncache / cookie code is being run in the NIC RX path;
  • The new connection path gets run and quickly overflows the syncache and new connection queue handling in the TCP stack, as the userland code doesn't get a notification in time
  • .. so the accept queue overflows before userland gets a chance to run, and we start sending rate limited RSTs.

Hacking on the Intel 10GE driver (ixgbe) for fun and .. not-crashing

My job at Netflix can be summed up as "make the Open Connect Platform better." This involves, among other things, improving FreeBSD in ways that make it more useful for content delivery.

I interpret that as including "find ways to break things."

So, I've written a crappy little multi-threaded network library (http://github.com/erikarn/libiapp) which is absolutely, positively crappy and FreeBSD specific. Right now all it does is TCP and UDP network smashing using read() / write() for TCP, and recvfrom() / sendto() for UDP.

The aim with this is to stress test things and find where they break. So, the first thing I've written is a very simple TCP client/server - the client connects to the server and just write()s a lot of data.

.. except, that the clients are light-weight, in C, and multi-threaded.

So, I end up with 'n' threads, with 'm' TCP sockets each, all doing write(). Right now I'm watching 4 threads with 12,288 sockets each sending data.
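
The per-thread loop is nothing clever - roughly the sketch below, with blocking sockets and a fixed write size for brevity (the real libiapp code is kqueue-driven and non-blocking), and connect_to_server() standing in for the usual socket()/connect() boilerplate:

    /*
     * Per-thread sender sketch: a pile of TCP sockets, round-robin write()s.
     * Blocking IO and an arbitrary write size for brevity.
     */
    #include <stdlib.h>
    #include <unistd.h>

    #define SOCKS_PER_THREAD    12288
    #define WRITE_SIZE          16384   /* arbitrary */

    int connect_to_server(void);        /* hypothetical: returns a connected TCP socket */

    static void *
    sender_thread(void *arg)
    {
        static char buf[WRITE_SIZE];    /* contents don't matter */
        int *socks;
        int i;

        (void)arg;
        socks = calloc(SOCKS_PER_THREAD, sizeof(int));
        for (i = 0; i < SOCKS_PER_THREAD; i++)
            socks[i] = connect_to_server();

        /* Smash: keep every socket's send buffer as full as possible. */
        for (;;) {
            for (i = 0; i < SOCKS_PER_THREAD; i++)
                (void)write(socks[i], buf, sizeof(buf));
        }
        return (NULL);                  /* not reached */
    }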

The test hardware is a pair of 1ru supermicro boxes with Intel E3-1260L CPUs in them, 32GB of RAM and dual-port Intel 82599EB 10GE NICs. The NICs are channel-bonded (using LACP) through a Cisco ASR9k switch.

I initially tested this on FreeBSD-9. I was rudely reminded of how utterly crappy the default mbuf sizing is. I constantly ran out of mbufs. Then, since FreeBSD-10 is on the cards, I just updated everything to the latest development branch and ran with it.

The result? The test ran for about 90 seconds before things got plainly pissed. The client (sender) would immediately hang. I'd get short packet errors, the LACP session would get unstable... everything was just plain screwed. The server (receiver) never saw any issues. I also saw lots of RX stalls, where one ring would seemingly fill up - and the whole RX path just ground to a halt. In addition, I'd also see a whole lot of out of order TCP segments on the server (receiver) side. Grr.

So, cue some driver hacking to see what was going on, reading the Intel 82599EB datasheet (that's freely available, by the way!) as well as discussions with Intel, Verisign and a few other companies that are using Intel 10GE hardware quite heavily, and here's what was discovered.

There's a feature called "RX_COPY" where small packets that are received are copied into a small, new mbuf - and the existing receive buffer is left in the RX ring. This improves performance - there's less churn of the mbuf allocator for those larger buffers. However, there were some dangling pointers around the management of that, leading to some stuff being DMAed where it shouldn't be .. and since ACKs and LACP frames are "small", they would trigger this. Since the sender (client) is sending lots of segments, it's going to be receiving a lot of ACKs, and this explains why the receiver (server) didn't hit this bug.

Next, the RX stalls. By default, if one of the RX rings fills up, the whole RX engine stalls. This is apparently configurable (read the data sheet!) but the non-stalling behaviour isn't on by default in FreeBSD/Linux. One of the Verisign guys found the problem - in the general MSIX interrupt handler path, the driver was acknowledging all of the interrupts that were currently pending, rather than only the ones that were activated. The TX/RX interrupts are routed to other MSIX messages and thus should be handled by those interrupt threads. So, under sufficient load - and if you had any link status flaps - you may hit a situation where the non-packet MSIX interrupt thread runs, ACKs all the interrupts, and you immediately end up filling up the RX ring. You won't generate a subsequent interrupt as you've already hit the limit and the hardware won't generate anything further .. so you're stuck. That's been fixed. The annoying bit? It was fixed in the Linux driver but not the FreeBSD driver. Growl.

So, now the driver behaves much, much better. I can smash it with 20 gigabits a second of TCP traffic across 50,000 odd sockets and nary a crash/hang. But what bugs me is the out-of-order TCP packets on the receiver side of things.

The reason - it's highly likely due to the driver architecture. The driver will schedule deferred packet processing using the taskqueue if the interrupt handler ends up with too many packets to work with. Now, this taskqueue is totally separate to the interrupt thread - which means, you can have both of them running at the same time, and on separate CPUs.

So I'm going to hack the driver up to not schedule the taskqueue and instead just poke the hardware to post another interrupt to do further processing. I hope this will resolve the out of order TCP frames being received.

FreeBSD, Netflix, CDN

The big news this week is the Netflix Openconnect platform, which was just announced. It uses off the shelf hardware - and FreeBSD + nginx.


The question is how you could spin it.


You could say "Netflix chose FreeBSD because they can keep their changes proprietary." Sure, they could. But they're not making appliances that they're selling - they're owning the infrastructure and servers. It's unclear whether they'd have to contribute back any Linux changes if they ran Linux on their open connect platform. They're making a conscious, public decision to distribute their changes back to FreeBSD - even though they don't have to.


You could say "Netflix chose FreeBSD because the people inside the company knew FreeBSD." Sure, they may have. The same thing could be said about why start-ups and tech companies choose Linux. A lot of the time it's because they're chasing enterprise support from Redhat. But technology startups using Ubuntu or Debian tend not to be paying support fees - they hire smart people who know the technology. So, yes - "using what they know."


According to the Netflix Openconnect website:


"This was selected for its balance of stability and features, a strong development community and staff expertise. We will contribute changes we make as part of our project to the community through the FreeBSD committers on our team."



Let's pull this apart a little.

  1. "Balance of stability and features." FreeBSD has long been derided for how slowly it moves in some areas. The FreeBSD developers tend to be a conservative bunch, trying to find the balance between new feature development and maintaining both stability and backwards compatibility.
  2. "Strong community." FreeBSD has a strong technical development community and Netflix finds this very important. They're also willing to join and participate in the community like many other companies do.
  3. "Staff expertise." So yes, their staff are familiar with FreeBSD. They're also familiar with Linux. They chose a platform which they have the expertise to develop, use and improve. They didn't just choose an unfamiliar platform because of marketing brochures or sales promises. I don't see any negatives here. I'm sure that Google engineers chose Linux to begin with because they were familiar with Linux.
  4. "Contribute changes we make as part of our project to the community." Netflix has committed to push improvements and fixes back to the upstream project. They contributed some bug fixes in the 10GE Intel driver and IPv6 stack this week. This is collaborative open source working the way it should.
Why would Netflix push back changes and improvements into a project when they're not required to? That's something you should likely ask them. But the same good practice arguments hold for both Linux AND BSD projects:
  • The project is a constantly moving target. If you don't push your changes back upstream, you risk carrying around increasingly larger changes as your project and your BSD upstream project diverge. This will just make things more difficult in the long run.
  • By pushing your changes upstream, you make it easier to move with the project - including adopting improvements and new features. If you keep large changes to yourself, you will likely find it increasingly difficult to update your software to the newer upstream versions. And that upstream project is likely adding bug fixes, improvements and new features - which at some point you may wish to leverage. By pushing your changes upstream, you make it a lot easier to move to future versions of the upstream project, allowing you to leverage all those fixes and improvements without too much engineering time.
  • By participating, you encourage others to adopt your technology. By pushing your changes and improvements upstream, you decrease the amount of software you have to maintain yourself (and keep patching as the upstream project moves along.) But you also start to foster technology adoption. The FreeBSD jail project started out of the desire by a hosting company to support virtualisation. Since then, the Jail infrastructure has been adopted by many other companies and individuals.
  • When others use your technology, they also find and fix bugs in your technology; they may even improve it. The FreeBSD jail support has been extended to include IPv6 support, shared memory support and integrates into the VIMAGE (virtualised networking) stack (which, by the way, came from Ironport/Cisco.) As a company, you may find that the community will do quite a lot of the work that you would normally have to hire engineers to do yourself. This saves time and saves money.
  • When companies contribute upstream, it encourages other companies to also contribute upstream. A common issue is "reinventing the wheel", where companies end up having to reinvent the same technology privately because no-one has contributed it upstream. They solve the same problems, they implement the same new features .. and they all spend engineering time and resources to do so.
  • And when companies contribute upstream, it encourages (private) developers to contribute. Open source developers love to see their code out there in the wild, in places they never quite thought of. It's encouraging to see companies build products with their code and contribute back bug fixes and improvements. It fosters a sense of community and participation, of "give and take", rather than just "take". This is exactly the kind of thing that keeps developers coming back to contribute more - and it attracts new developers. Honestly, who wouldn't want to say that some popular device is running code that they wrote in their spare time?
So, you could rant and rave about the conspiracy side of Linux versus FreeBSD. You could rant on about GPL versus BSD. Or, you could see the more useful side of things. You could see a large company who didn't have to participate at all, agreeing to contribute back their improvements to an open source operating system. You could see that by doing so, the entire open source ecosystem benefits - not just FreeBSD. There's nothing stopping Linux or other BSD projects from keeping an eye on the improvements made by Netflix and incorporating those improvements into their own project. And it's another case of a company participating and engaging the open source community - and having that community engage them right back.

Good show, Netflix. Good show.