Category Archives: ethernet

Porting over the AR8327 support

It's been a while since I posted. I'll post about why that is at some point but for now I figure it's time I wrote up the latest little side project - the Atheros AR8327 switch support.

The AR8327 switch is like the previous generation Atheros switches except for a couple of very specific and annoying differences - the register layouts and locations have changed. So it's not just a case of pretending it's an AR8316 except for the hardware setup - there's some significant surgery to do. And no, I did try just ignoring all of that - the switch doesn't come up and pass packets.

So, the first thing was to survey the damage.

The Linux driver (ar8216.c) has a bunch of abstractions that the FreeBSD driver doesn't have, so that's a good starting point. The VLAN operations and VLAN port configuration stuff is all methods in the Linux driver, so that was a good starting point. I stubbed most of the VLAN stuff out (because I really didn't want it to get in the way) - this turned out to be more annoying than I wanted.

Next was the hardware setup path. There's more configurable stuff with the AR8327 - there's two physical ports that I can configure the PHY/MAC parameters on for either external or internal connectivity. I just took the code from Linux (which yes, I have permission to relicence under BSD, thanks to the driver authors!) and I made it use the defaults from OpenWRT for the DB120. The ports didn't properly come up.

I then realised that I was reading total garbage from the PHY register space, so I went looking at the datasheet and ar8216 driver for some inspiration. Sure enough, the AR8327 has the PHY MDIO bus registers in different locations. So after patching the arswitch PHY routines with this knowledge, the PHYs were probed and attached fine. Great. But it still didn't detect port status changes.

So, back to the ar8216 driver. It turns out that there were a few things that weren't methodized - and these were the bits that read the PHY status from the switch. Both drivers didn't just poll the PHYs directly - they read the switch registers which had a summary of the port status. So, I taught the driver about this and voila! Port status changes worked.

But, no traffic.

Well, there's a few reasons for this. It's a switch, so I don't have to setup anything terribly difficult. The trick here is to enable port learning and make sure they're all in the same VLAN group. Now, here's where I screwed up and I found a bug that needed working around.

The port setup code did enable learning and put things into a vlan group.

Firstly, I found this odd behaviour that I got traffic only when I switched the ethernet cable to another port. Then learning worked fine. I then found that the ar8216 driver actually triggers a forwarding table flush upon port status change, so I added that. This fixed that behaviour.

But then it was flooding traffic to all ports. This is kinda stupid. What did I screw up? I put each port in a separate vlangroup, rather than put them in the same vlangroup. Then, I programmed the "which ports can you see?" to include all the other ports. What this meant was:
  • The forwarding table (ie, what addresses were learnt) were linked to the vlangroup the port is in;
  • .. and when the switch did a lookup for a given MAC on another port, it wouldn't find it, as the address in the forwarding table showed it was for another vlangroup;
  • .. so it would do what switches do when faced with not knowing about the MAC (well, and how I had configured it) - it flooded traffic.
The solution was thankfully easy - I just had to change the vlangroup (well, "port vlan" here) to be '1', instead of the port id. Once this was done, all the ports came up perfectly and things worked great.

So, this now works great on the Atheros DB120 reference board. It's not working on other boards - there's likely some timing issues that need to be resolved. But we're making progress!

Finally, I spent a bunch of time porting over the port configuration and LED configuration stuff from OpenWRT so I didn't have the driver just hard-coded to the DB120 board. I'll update the configuration and code when I get my hands on other boards that use the AR8327 but for now this is all I have.

Enjoy!

Hacking on the Intel 10GE driver (ixgbe) for fun and .. not-crashing

My job at Netflix can be summed up as "make the Open Connect Platform better." This involves, among other things, improving FreeBSD in ways that make it more useful for content delivery.

I interpret that as including "find ways to break things."

So, I've written a crappy little multi-threaded network library (http://github.com/erikarn/libiapp) which is absolutely, positively crappy and FreeBSD specific. Right now all it does is TCP and UDP network smashing using read() / write() for TCP, and recvfrom() / sendto() for UDP.

The aim with this is to stress test things and find where they break. So, the first thing I've written is a very simple TCP client/server - the client connects to the server and just write()s a lot of data.

.. except, that the clients are light-weight, in C, and multi-threaded.

So, I end up 'n' threads, with 'm' TCP sockets each, all doing write(). Right now I'm watching 4 threads with 12,288 sockets each sending data.

The test hardware is a pair of 1ru supermicro boxes with Intel E3-1260L CPUs in them, 32GB of RAM and dual-port Intel 82599EB 10GE NICs. The NICs are channel-bonded (using LACP) through a Cisco ASR9k switch.

I initially tested this on FreeBSD-9. I was rudely reminded of how utterly crappy the default mbuf sizing is. I constantly ran out of mbufs. Then, since FreeBSD-10 is on the cards, I just updated everything to the latest development branch and ran with it.

The result? The test ran for about 90 seconds before things got plainly pissed. The client (sender) would immediately hang. I'd get short packet errors, the LACP session would get unstable... everything was just plain screwed. The server (receiver) never saw any issues. I also saw lots of RX stalls, where one ring would seemingly fill up - and the whole RX path just ground to a halt. In addition, I'd also see a whole lot of out of order TCP segments on the server (receiver) side. Grr.

So, cue some driver hacking to see what was going on, reading the Intel 82599EB datasheet (that's freely available, by the way!) as well as discussions with Intel, Verisign and a few other companies that are using Intel 10GE hardware quite heavily, and here's what was discovered.

There's a feature called "RX_COPY" where small packets that are received are copied into a small, new mbuf - and the existing receive buffer is left in the RX ring. This improves performance - there's less churn of the mbuf allocator for those larger buffers. However. there was some dangling pointers around the management of that, leading so some stuff being DMAed where it shouldn't .. which, since ACKs and LACP frames are "small", would be triggered by this. Since the sender (client) is sending lots of segments, it's going to be receiving a lot of ACKs and this explains why the receiver (server) didn't hit this bug.

Next, the RX stalls. By default, if one of the RX rings fills up, the whole RX engine stalls. This is apparently configurable (read the data sheet!) but it's not on by default in FreeBSD/Linux. One of the verisign guys found the problem - in the general MSIX interrupt handler path, it was acknowledging all of the interrupts that were currently pending, rather than only the ones that were activated. The TX/RX interrupts are routed to other MSIX messages and thus should be handled by those interrupt threads. So, under sufficient load - and if you had any link status flaps - you may hit a situation where the non-packet MSIX interrupt thread runs, ACKs all the interrupts, and you immediately end up filling up the RX ring. You won't generate a subsequent interrupt as you've already hit the limit and the hardware won't generate anything further.. so you're stuck. That's been fixed. The annoying bit? It was fixed in the Linux driver but not the FreeBSD driver. Growl.

So, now the driver behaves much, much better. I can smash it with 20 gigabits a second of TCP traffic across 50,000 odd sockets and nary a crash/hang. But what bugs me is the out-of-order TCP packets on the receiver side of things.

The reason - it's highly likely due to the driver architecture. The driver will schedule deferred packet processing using the taskqueue if the interrupt handler ends up with too many packets to work with. Now, this taskqueue is totally separate to the interrupt thread - which means, you can have both of them running at the same time, and on separate CPUs.

So I'm going to hack the driver up to not schedule the taskqueue and instead just poke the hardware to post another interrupt to do further processing. I hope this will resolve the out of order TCP frames being received.