Category Archives: atheros

TDMA (somewhat) working on AR9380 chips

(Wow, I have a lot of posts to write to catch up on things.)

I've just brought up FreeBSD's TDMA support on the AR9380 family of chips. Specifically, the AR9331, since I have a Carambola 2 on me today.

It was pretty simple to bring up - I was missing the beacon configuration HAL call that the TDMA code expected. It's only used by the TDMA code - the STA and AP modes rely on the normal HAL beacon methods that date back to the Atheros HAL.

The only problem - it seems something is up with ANI (noise immunity) and sensitivity on at least the AR9331. It doesn't seem to behave well on slightly loaded channels and thus the beacons don't always go out when they're supposed to.

But, if you've been wanting to play with TDMA on the later Atheros chips, now you can!

Adding chipset powersave support to FreeBSD’s Atheros driver

I've started adding some basic powersave support to the FreeBSD Atheros ath(4) driver. The NICs support putting parts of the device to sleep to conserve power but.. well, it's tricky.

In order to make things consistent, I need to avoid doing things when the NIC is asleep (for example, running calibration when the NIC isn't up), and I also need to ensure that I force the NIC awake whenever it may be asleep. During normal running, the NIC may have put itself into temporary sleep whilst waiting for some packets from the AP to signal that it needs to wake up. So I will also need to force the NIC awake before programming it.

So, before I start down the path of handling the whole dynamic power management stuff, I figured I'd tackle the initial bits - handling powering on the NIC at startup and powering it off when it's not in use. This includes powering it down during device detach and suspend, as well as when all of the VAPs are down.

This is turning out to be slightly more complicated than I'd like it to be.

The first really stupid thing I found was that during the interface down process, the VAP state change from RUN -> INIT would reset the BSS, which included re-programming the slot time. So, I have to wake up the hardware when programming that. It can then go back to sleep when I'm done with it.

Now there are some issues in the suspend path with the NIC being marked as asleep while it is being reset, which is confusing - the NIC should be woken up when ath_reset() is called. So, I'll have to debug those.

The really annoying bit is that if I read a register whilst the silicon is asleep, the reads return 0xDEADBEEF. So if I am storing the register contents anywhere, I'll end up storing and programming a potentially totally invalid value.

There's also some real problems with race conditions. I can put the power state changes behind a lock, but imagine something like this:

* ATH_LOCK; force awake; do something; ATH_UNLOCK .. ATH_LOCK; do some more; put back to sleep; ATH_UNLOCK

Now, if a second thread puts the NIC back to sleep in between those two lock sections, the second "do some more" work may occur after the NIC has been put to sleep by said second thread. So I have to track whether the NIC is being forced awake by refcounting how many times it's been forced awake; then, when the refcount hits zero, the NIC can be put back to sleep.
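
Here's a minimal sketch of that refcounting idea. The names (example_softc, sc_powersave_refcnt and so on) are illustrative, not the actual ath(4) fields:

/*
 * Sketch: refcounted "force awake".  Callers bracket register access with
 * these; the chip only really sleeps once every caller has finished.
 */
static void
example_power_set_awake(struct example_softc *sc)
{
        EXAMPLE_LOCK_ASSERT(sc);
        if (sc->sc_powersave_refcnt == 0)
                example_setpower(sc, EXAMPLE_PM_AWAKE); /* actually wake the chip */
        sc->sc_powersave_refcnt++;
}

static void
example_power_restore(struct example_softc *sc)
{
        EXAMPLE_LOCK_ASSERT(sc);
        sc->sc_powersave_refcnt--;
        if (sc->sc_powersave_refcnt == 0)
                example_setpower(sc, EXAMPLE_PM_SLEEP); /* last caller out: sleep */
}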

Once this is all done, I can start down the path of supporting proper network sleep - where the NIC stays asleep and wakes up to listen for beacons and receive frames from the AP. I can then choose to force the NIC awake and do more work. I have to make absolutely sure that I don't queue things like transmitted frames, or add more frames to the receive queue, if it may fall asleep. There's also a mechanism to have a transmitted frame put the NIC to sleep - there's a bit that says "when this frame is transmitted, transition the NIC back to sleep." I have to go and figure out how that works and implement it.

But for now, let's keep it simple and debug just putting the NIC to sleep when it's not in use.

Porting over the AR8327 support

It's been a while since I posted. I'll post about why that is at some point but for now I figure it's time I wrote up the latest little side project - the Atheros AR8327 switch support.

The AR8327 switch is like the previous generation Atheros switches, except for a couple of very specific and annoying differences - the register layouts and locations have changed. So it's not just a case of pretending it's an AR8316 except for the hardware setup - there's some significant surgery to do. And no, you can't just ignore all of that - I tried, and the switch doesn't come up and pass packets.

So, the first thing was to survey the damage.

The Linux driver (ar8216.c) has a bunch of abstractions that the FreeBSD driver doesn't have, so that's a good starting point. The VLAN operations and VLAN port configuration stuff are all methods in the Linux driver, which made them straightforward to map over. I stubbed most of the VLAN stuff out (because I really didn't want it to get in the way) - this turned out to be more annoying than I wanted.

Next was the hardware setup path. There's more configurable stuff with the AR8327 - there are two physical ports that I can configure the PHY/MAC parameters on, for either external or internal connectivity. I just took the code from Linux (which, yes, I have permission to relicense under BSD, thanks to the driver authors!) and made it use the defaults from OpenWRT for the DB120. The ports didn't properly come up.

I then realised that I was reading total garbage from the PHY register space, so I went looking at the datasheet and ar8216 driver for some inspiration. Sure enough, the AR8327 has the PHY MDIO bus registers in different locations. So after patching the arswitch PHY routines with this knowledge, the PHYs were probed and attached fine. Great. But it still didn't detect port status changes.

So, back to the ar8216 driver. It turns out that there were a few things that weren't methodized - and these were the bits that read the PHY status from the switch. Neither driver polls the PHYs directly - they read the switch registers, which hold a summary of the port status. So, I taught the driver about this and voila! Port status changes worked.

But, no traffic.

Well, there's a few reasons for this. It's a switch, so I don't have to set up anything terribly difficult. The trick here is to enable port learning and make sure they're all in the same VLAN group. Now, here's where I screwed up and found a bug that needed working around.

The port setup code did enable learning and put things into a vlan group.

Firstly, I found this odd behaviour that I got traffic only when I switched the ethernet cable to another port. Then learning worked fine. I then found that the ar8216 driver actually triggers a forwarding table flush upon port status change, so I added that. This fixed that behaviour.

But then it was flooding traffic to all ports. This is kinda stupid. What did I screw up? I put each port in a separate vlangroup, rather than put them in the same vlangroup. Then, I programmed the "which ports can you see?" to include all the other ports. What this meant was:
  • The forwarding table (ie, what addresses were learnt) was linked to the vlangroup the port is in;
  • .. and when the switch did a lookup for a given MAC on another port, it wouldn't find it, as the address in the forwarding table showed it was for another vlangroup;
  • .. so it would do what switches do when faced with not knowing about the MAC (well, and how I had configured it) - it flooded traffic.
The solution was thankfully easy - I just had to change the vlangroup (well, "port vlan" here) to be '1', instead of the port id. Once this was done, all the ports came up perfectly and things worked great.
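
For what it's worth, the fix boils down to something like this - the helper names are made up for illustration; the real arswitch code looks different:

/* Every port shares port-vlan 1; the member mask still lists the other
 * ports, so learned MAC addresses all land in one forwarding group. */
for (port = 0; port < EXAMPLE_NUM_PORTS; port++) {
        example_set_port_vlan(sc, port, 1);            /* was: the port id */
        example_set_port_members(sc, port, EXAMPLE_ALL_PORTS & ~(1 << port));
}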

So, this now works great on the Atheros DB120 reference board. It's not working on other boards - there's likely some timing issues that need to be resolved. But we're making progress!

Finally, I spent a bunch of time porting over the port configuration and LED configuration stuff from OpenWRT so I didn't have the driver just hard-coded to the DB120 board. I'll update the configuration and code when I get my hands on other boards that use the AR8327 but for now this is all I have.

Enjoy!

So, FreeBSD on the AR9344? What happened?

I committed a bunch of code a while ago to FreeBSD-HEAD to at least start booting on the AR934x SoCs. The AR934x SoC is a MIPS74k core - a dual-issue superscalar 11-stage pipeline MIPS32r2 CPU. It's slightly different to the existing MIPS24k stuff (which is a single-issue, 8-stage pipeline.)

So - first step - it booted up a little, then hit a machine check. At that point the FreeBSD MIPS peeps believed there was hilarity in the TLB exception handling code, so we put it to sleep for a while and I went back to real work.

Then a few weeks ago I decided to finish it off. I brought my developer board to Eurobsdcon in Malta and sat down with Warner Losh, who also has said developer board. We spent a bunch of time going over the TLB code and realised that FreeBSD's instruction/execution hazards are all.. just wrong. Then, on a whim, I read up some more about MIPS32r2 and superscalar stuff and discovered that the correct hazard instruction isn't NOPs or SSNOPs - it's EHB (execution hazard barrier.) It's 'SLL $0, $0, 3' in MIPS parlance which on older CPUs is just a NOP (since register 0 is always 'zero'.) So, this fixed the TLB management and the boot proceeded quite a bit further.
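
For reference, here's roughly what that barrier looks like once wrapped up for use from C code - the macro name is made up, but the instruction encoding is the one described above:

/* EHB - execution hazard barrier.  Encoded as "sll $0, $0, 3", which
 * pre-MIPS32r2 cores simply execute as a NOP. */
#define EXAMPLE_EHB()   __asm __volatile("sll $0, $0, 3" ::: "memory")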

Next - bringing up ethernet and the switch PHY. I was seeing totally crappy and invalid register values when reading/writing the attached switch chips. Even probing didn't work reliably - in fact, I got to the point where I was reading the value I'd expect from the previous register read. So, I wondered if this was another out-of-order behaviour from the MIPS74k superscalar architecture.

After digging into the MIPS bus space code, I found two things:


  1. The MIPS driver(s) don't call bus barrier functions at all - so there's no driver enforced access ordering. It was all assuming that the CPU doesn't re-order things; and
  2. The bus barrier code for MIPS was a no-op. It just plainly wasn't defined.
So, I added read/write memory barriers to the MIPS bus barrier routines and I modified the ethernet driver to use barriers. For good measure, I also added barriers to the SPI driver code as that also has a bunch of register accesses that require ordering.
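
As an illustration, the register access paths now need to do something along these lines - a sketch using the standard bus_space API, with the register offset and softc fields as placeholders:

/* Push the write out before any subsequent access is issued. */
bus_space_write_4(sc->sc_bst, sc->sc_bsh, EXAMPLE_REG, val);
bus_space_barrier(sc->sc_bst, sc->sc_bsh, EXAMPLE_REG, 4,
    BUS_SPACE_BARRIER_WRITE);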

And with that, the switch PHYs probed and attached fine, the SPI driver worked fine and the device started booting userland off of SPI-connected NOR flash.

Then, it hung. I dug into that a bit and wondered what the hell was going on. Then after a day of poking, I discovered that the interrupt acknowledgement was not working. It's a quirky thing that I should really fix in the atheros platform support - the AR71xx chips don't require the CPU peripheral interrupts to be ack'ed (eg the uart) but later chips do. I added the AR934x to the list of SoCs that need interrupts to be ack'ed and the system kept booting, all the way to userland.

Next - I haven't yet written the AR8327 support but I started fleshing out the AR934x on-board switch support. I got it probing, attaching.. but not passing any traffic. After more digging, I realised my mistake - I was writing some registers incorrectly. I would mask out the right bits to set, but then I'd always set bit 0. Sigh. So, that came up and things worked.

Then I decided to do the wifi part. This was pretty damned simple. The HAL from Qualcomm Atheros already has support for the AR934x in it and I had already modified it to work for the AR933x SoC (which just required me to 'teach' it the FreeBSD way of exposing the calibration/configuration data from on-board flash.) So, all I had to do was this:

  1. Add the device to the kernel configuration;
  2. Add a hint pointing out where the device is mapped in IO space;
  3. Add a hint pointing out where the calibration data is in the NOR flash;
  4. Reboot.
That's it. No weeks of merging code in from Linux or the internal Qualcomm Atheros driver into the FreeBSD driver. No real debugging required. Just enable it, point it at the right place in memory/flash and .. boot it. I think this again vindicates my efforts to open source the Qualcomm Atheros HAL - I just inherit this working code for free. I don't have to try and merge it into anything.
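
For the curious, the hints end up looking something like this - treat the names and values below as illustrative placeholders; the real ones live in the freebsd-wifi-build configuration files:

# Illustrative only - addresses, sizes and IRQ are placeholders.
hint.ath.0.at="nexus0"
hint.ath.0.maddr=0x18100000
hint.ath.0.msize=0x20000
hint.ath.0.irq=0
# Where in the NOR flash the calibration data lives:
hint.ath.0.eepromaddr=0x1fff1000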

So, I have a port that's dirty and working. There's a lot of infrastructure changes I need to commit before I can commit this port - lots of new clocking options (there are now variations on the clock rate that the MDIO bus - the MII bus connecting the ethernet port(s) to a PHY or switch - runs at), lots of new configuration options for how the on-chip ethernet port(s) map to external ports, and a bunch of other ancillary stuff that's not really worth mentioning. But it's going to show up in FreeBSD-HEAD soon.

Doing traffic with the Carambola 2..


Now that the port is working, I've started doing some traffic with the Carambola 2 board on FreeBSD.

So far, so good:


# athstats
546236       data frames received
509242       data frames transmit
155          tx frames with an alternate rate
14818        short on-chip tx retries
13617        long on-chip tx retries
645          tx failed 'cuz too many retries
MCS7         current transmit rate
2            recv eol interrupts
9            tx frames with no ack marked
506786       tx frames with short preamble
1414         rx failed 'cuz of bad CRC
1543         rx failed 'cuz of PHY err
    12           OFDM restart
    1531         CCK restart
20610        beacons transmitted
71           periodic calibrations
-0/+0        TDMA slot adjust (usecs, smoothed)
24           rssi of last ack
25           avg recv rssi
-96          rx noise floor
2447         tx frames through raw api
39730        A-MPDU sub-frames received
494045       Half-GI frames received
5967         40MHz frames received
8037         CRC errors for non-last A-MPDU subframes
2            CRC errors for last subframe in an A-MPDU
498972       Frames transmitted with HT Protection
3            TX Timeout
177          Number of frames retransmitted in software
15717        A-MPDU sub-frame TX attempt success
177          A-MPDU sub-frame TX attempt failures
1            spur immunity level
4            first step level
128          OFDM weak signal detect
9            CCK weak signal threshold
108          ANI increased spur immunity
105          ANI decrease spur immunity
108          ANI increased first step level
105          ANI decreased first step level
943666       cumulative OFDM phy error count
108574       cumulative CCK phy error count
2            ANI parameters zero'd for non-STA operation
44           ANI forced listen time to zero
44           ANI calculated listen time < 0
13603        missing ACK's
14996        RTS without CTS
504970       successful RTS
34928        bad FCS
Antenna profile:
[0] tx   496835 rx        0
[2] tx        0 rx   546236

Making the AR9330 SoC wifi, or "how it feels doing things right.."

Well, "doing it right" is subjective. Sure. I'll grant you that.

I brought up the AR9330/AR9331 SoC support a couple of months ago. Unfortunately the Atheros reference board (AP121) comes with 16MB of RAM and 4MB of flash - which is just painful to do FreeBSD-HEAD development in.

Yes, I know. 16MB of RAM is tons of space... for FreeBSD-4. Anyway. That is a rant for another day.

So I managed to bring up the basic SoC support (which took longer than I thought - I had to learn how to write a FreeBSD uart driver!) but I decided to put wifi on hold until I found a board with more RAM and flash.

Along comes the Carambola 2 from 8devices.com (http://8devices.com/carabola-2). It's an AR9330, but with 64MB RAM, 16MB flash and a full-featured uboot. This is perfect for .. well, anything. And it's 30 Euros in quantities of one. Wait, it's cheap, it's fully-featured and it's available online? No way. What's the catch?

The catch - it wasn't running FreeBSD.

So I finally decided to bring up wifi support on FreeBSD.

The AR9300 HAL from Qualcomm Atheros includes the AR9330/AR9331 SoC wifi support. So I had to make it compile and make it work. How hard could it be?

Firstly - I wasn't compiling it in by default, as it's only really useful for the SoC and not for normal PCIe NIC support. So, I needed to add that in - namely, defining AH_SUPPORT_HORNET in the source. Cool.

Next - the bus glue. The SoC internal bus isn't PCIe, it's what they call AHB, or "Atheros Host Bus." It's a derivative of a standard on-chip peripheral interconnect bus. The FreeBSD ath_ahb driver only supported AR9130, so I had to extend it to support non-AR9130 devices. That got it probing and attaching, but it wasn't finding the calibration / configuration space.

Next - gluing in the calibration data. It's on-board in the system flash, rather than on-chip (OTP) or an external EEPROM. The EEPROM space is 16KiB in size, rather than the 4KiB space used by the AR9xxx series SoCs. Also, the AR9300 HAL already seeks into the EEPROM space to grab the data at offset 0x1000, so I don't have to do that like I do with the AR9130 and related chips.

Finally - I had to teach ar9300_attach() that it needed to copy the EEPROM data I was giving it from ath_ahb into the copy it uses when setting things up.

And... that was it. After that, it booted and came up correctly. I was shocked.

You can find the boot log and dmesg at https://code.google.com/p/freebsd-wifi-build/wiki/Carambola2 .

I haven't yet tested 802.11s (mesh) on this stuff, nor have I made TDMA work with this series of chips. But it's my eventual goal to make this board one of the "gold standard" boards for people wishing to enable their projects with wifi mesh. I bet it'll work out of the box as it stands, so if you're up for a bit of tinkering, buy a handful and set it up!

Enjoy! It's the best 30 euro you'll spend!

Working on Bluetooth Coexistence

I decided to bite the bullet and start hacking on bluetooth coexistence on these Atheros NICs. It's a bit of a rabbit hole.

I'll write up a bit more documentation on this when I'm not overly tired, but the general overview is pretty simple: "It's all done in software."

The bluetooth and wifi stacks need to speak to each other to know when is an appropriate time to prefer wifi traffic or bluetooth traffic. When pairing, bluetooth should be preferred. When scanning, associating, authenticating and rekeying, wifi should be preferred. When different profiles are active (eg A2DP audio), the bluetooth traffic should be periodically given preference so the A2DP frames can go out reliably. This has to be controlled in software.

So to make this work well on FreeBSD, I'll have to teach the wifi and bluetooth stacks to interface with each other somehow so this can be synchronised.

I have basic (static) coexistence working with the AR9285+AR3011 combo NIC. That's now in -HEAD.

I'm working on basic (static) coexistence on the AR9485+AR3012 combo NIC, however my NIC has an older BT part which requires quite a bit of dancing to make work. I'll have to teach ath3kfw how to load the config and firmware image for the required NIC. It's going to take some time but it'll be worth it.

I was hoping that FreeBSD would have basic A2DP support, but it currently doesn't. I'd love to see that happen, as it'd simplify a lot of my development/testing - I could then do audio stream testing, both playing and recording audio, and stream that over wifi.

Oh well. Another day of hacking!

So long, and thanks for all the fish!

After 18 months at Qualcomm Atheros, I decided I needed a bit of a change.

This is what I sent out to the open source community:

Hi all,

This Friday will be my last day at Qualcomm Atheros. I've enjoyed working with the extremely bright and driven engineers and designers that make the wireless chips and SoCs that people everywhere take for granted. I've achieved a bunch of goals both with their internal product development and open source. But now it's time to move onto different things.

I'd especially like to thank Luis Rodriguez for introducing me to the QCA folk and helping me get access to the Atheros open source project, as well as the follow-up discussions that led to me being hired. The open source wireless community has been driving innovation in a lot of areas for a number of years. I'd like to hope that I've had a small, positive effect on that. I wish you all the best of luck in pushing forward and continuing to innovate.

Now, I'm still NDA-enabled and I quite like hacking on this wireless stuff so I won't be quitting hacking on things. I will just have other things on my mind.

Good luck to you all!

Now, this generated a flurry of private emails asking me what happened and where I'm going to.

So, the summary - I accepted a job at Netflix, as part of their OpenConnect CDN team.

They've built a world-wide CDN using FreeBSD and they're looking to continue growing and improving it. They've committed to improving FreeBSD's network, storage and VM layer to facilitate moving tens of gigabits of Netflix video traffic per server. And, they're going to open source the bulk of it. They realise that the best benefit from open source comes from working with open source - and that's exactly what they've done. They've contributed back their improvements and fixes.

I've enjoyed my time at Qualcomm Atheros. The people are brilliant, the hardware is excellent and it was a great learning experience. I got to experience what it was like working at a silicon company during chip design, validation and bring-up - both the good and the bad bits. But when it came down to it, I couldn't contribute to and improve the process in any meaningful way. I was one engineer in a very large, diverse organisation - and like large organisations, things move slowly.

So, I hope to continue to maintain close ties with the hardware and software people at Qualcomm Atheros. I hope to continue hacking on the FreeBSD wireless stack in my spare time, as I have been to date. I wish I could've contributed more positively to their evolving hardware and software strategy. But there's only so much an engineer in an established company can do, and that engineer wasn't going to be me.

Today’s Journey: Making AP mode power-save work better

I've been working on improving the net80211 and ath driver support for AP mode power save.

There's a few parts to it:

  • A station can tell an access point it's going to sleep by setting the power mgmt bit to 1 in a TXed frame;
  • The AP will then update the TIM entry in the beacon frames it sends out to reflect whether that station has any traffic queued;
  • A station can signal an AP that it's awake by sending a data frame with the power mgmt bit set to 0;
  • .. or it can request a frame at a time by using PS-POLL;
  • There's also the uAPSD stuff which I haven't yet implemented and won't likely do so for a while.
Now, it shouldn't be that difficult. Except that it is.

If an AP has a bunch of frames queued to a station that has gone to sleep, it will keep trying to transmit those frames. That wastes air-time and results in annoying levels of packet loss.

When you're doing 802.11n, there's a whole lot more traffic going on and a lot more room to cause massive traffic issues if you drop frames. But you don't want to keep failing to transmit those frames or you'll end up spending a lot of time transmitting BAR frames to the station.

If the driver maintains a queue of frames (for say, software retransmit) then it also needs to ensure that the TIM bit is set correctly. Otherwise the AP may set the TIM bit to 0 because the net80211 stack has no queued frames to that node; but the driver itself has some frames. Thus, the station won't wake up and you'll see increased packet latency.
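
The idea is simple enough to sketch - when deciding whether a station's TIM bit should be set, count frames buffered in the driver as well as in net80211. The structure and field names here are illustrative assumptions, not the real ath(4)/net80211 interfaces:

static int
example_node_has_buffered_traffic(struct example_node *an)
{
        /* Frames net80211 has queued for the sleeping station... */
        int n = an->an_net80211_psq_len;

        /* ...plus frames still sitting in the driver's software TX queue. */
        n += an->an_driver_swq_len;

        return (n > 0);         /* TIM bit should be set if non-zero */
}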

When PS-POLL is received, frames need to first be leaked from the driver queue BEFORE it starts leaking frames from the net80211 power save queue. The last thing you want is the wrong set of frames to go out.

So, I've spent the last few months extending the driver and network stack to make this feasible. There's new net80211 driver methods for tying into the TIM update process, the node power save status and the PS-POLL handling. The filtered frames handling in the ath driver is another precursor to this - it means that frames can be failed out very quickly and retried when appropriate.

(No, I'm not implementing software retransmit for non-11n traffic just yet. I will eventually. Just not yet.)

The final bits that I've been working on have been tricky.

When a node goes to sleep, you want to pause the driver transmission to the node - otherwise it will keep trying to transmit whatever is in the driver queue. For 11n this is terrible; it means that frames will keep failing to be transmitted and with enough failures, the traffic will stop whilst a BAR frame is sent. Grr.

Next was figuring out how to send frames whilst the node is "paused". I introduced a per-node "leak" counter which tells the driver transmit path that even though the node is asleep, a single frame should be scheduled. If one isn't available, the next frame sent will be scheduled. This handles the PS-POLL "null" response - ie, if there's nothing in the queue, the net80211 stack will queue a null data response with the MORE bit clear. That way the station will know there's currently nothing to receive.
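
A rough sketch of the leak counter logic in the transmit scheduler - the field names are made up for illustration:

static int
example_node_may_transmit(struct example_node *an)
{
        if (!an->an_is_asleep)
                return (1);
        if (an->an_leak_count > 0) {
                an->an_leak_count--;    /* PS-POLL: release exactly one frame */
                return (1);
        }
        return (0);             /* otherwise stay paused while the node sleeps */
}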

But then, something odd started happening. Devices would disassociate and re-associate, but they'd still be marked as "asleep". So no traffic would occur. After digging into it a bit, I discovered that the only time a station transitions back to awake is when it receives a DATA frame with the power mgmt bit set to 0. Seeing management/control traffic from the station isn't enough. So for now, I just always transmit management/control frames regardless if the station is asleep or awake - except BAR frames. Those get software queued if the node is asleep. Now that management/control frames are transmitted directly, a station can re-associate and be marked as 'awake.'

Then I found that once a station re-associates, it should have all of its current association state reset. It may have had a bunch of aggregate frames queued to the hardware and those need to finish transmitting before we can start transmitting new data to the re-associated station. It may even have been in the middle of receiving a BAR frame! So, I have to gently (well, "gently") reset the association state to allow for currently queued frames to be cleaned up, but reset things like filtered frame state and BAR TX. Ew, but it needs to be done.

Also, if there's data queued to an asleep station and a BAR frame needs to go out, the BAR frame needs to go into the head of the software queue, not the tail. Otherwise it will have to wait for the queue to be transmitted - which, if there's a gap in the transmit block-ack window (hence needing the BAR), no further transmission will occur. Oops!

I then found that a sufficiently chatty node could end up filling the software queue full of buffers destined to it. This is a general problem in the ath driver which I'll eventually fix, but it became a huge problem with power save enabled. So, I've introduced a per-node maximum queue depth when it's asleep. That should limit the amount of pain that a single sleeping node can cause. I'll eventually introduce a limit for how many buffers an individual node can consume whether it's awake or asleep but that's for another day.

There's likely lots more corner cases that need to be addressed before I can merge this into -HEAD. I'm still seeing my macbook pro occasionally disassociate and not automatically re-associate and I'm not sure why. But things are behaving much, much better with sleeping devices.

Hey, look, it’s lots of atheros NICs in one laptop

So after many months of evenings and a whole lot of work internally to get the AR9380 HAL release vetted by legal, I bring you: a single, unified ath(4) and ath_hal(4) driver which works on all chipsets.

Now, the only chipsets I can fit _in_ this laptop:

[100309] ath0: mem 0xebf00000-0xebf0ffff irq 17 at device 0.0 on pci3
ath0: AR9280 mac 128.2 RF5133 phy 13.0
[100309] ath1: mem 0xedf00000-0xedf1ffff irq 18 at device 0.0 on pci4
ath1: AR9380 mac 448.3 RF5110 phy 0.0
[100309] ath2: mem 0xe4310000-0xe431ffff irq 16 at device 0.0 on cardbus0
ath2: AR5212 mac 5.9 RF5112 phy 4.3

.. that's an AR9280, AR5212 and AR9380 in the same laptop.

And, that's a 3x3 AR9380:

static_rix (-1) ratemask 0xffffffff
[ 250] cur rate 20 MCS since switch: packets 1 ticks 2647581
[ 250] last sample (6  Mb) cur sample (0 ) packets sent 9
[ 250] packets since sample 9 sample tt 0
[1600] cur rate 22 MCS since switch: packets 15 ticks 2647530

[1600] last sample (21 MCS) cur sample (0 ) packets sent 6049
[1600] packets since sample 0 sample tt 532
   TX Rate     TXTOTAL:TXOK       EWMA          T/   F     avg last xmit

[ 6  Mb: 250]        4:4        (100.0%)        4/   0   760uS 2640242
[20 MCS: 250]        9:9        (100.0%)        9/   0   440uS 2647581
[20 MCS:1600]      969:969      (100.0%)       57/   0   572uS 2647445
[21 MCS:1600]     1517:1517     (100.0%)       74/   0   613uS 2647557
[22 MCS:1600]     1990:1990     (100.0%)       92/   0   529uS 2647557
[23 MCS:1600]    73986:73462    ( 99.5%)     5661/   0   755uS 2647538

Now, I'm sure the AR5210 will work with an AR9280 and an AR9380 in the same laptop - it's just that the hardware form factor won't let me fit them all at the same time.


AR9380 support on FreeBSD; why it’s taken so long..

There's now public, open source support for the AR9380 and later chips for FreeBSD.

It's not yet in the -HEAD tree - I'll get to that.

Let me take you on a bit of a journey.

I started a little side project late last year - I wanted to see if I could make the AR9380 HAL from the Qualcomm Atheros mainline driver (10.x branch) work on FreeBSD. I was hoping that the HAL API hadn't drifted all that much over the years.

Why do this? Two reasons:
  • I wanted to see if I could open source the HAL and have it work with FreeBSD; and
  • I didn't want to take on a similar project to what ath9k had to do - which is to take the existing HAL, convert it into something Linux-upstream-compatible, then push THAT into open source.
There's only one of me, and I don't want to spend all of my evenings trying to figure out which changes to the internal driver HAL need merging into "my" version of the HAL. I want to leverage all of the development and debugging that we do internally for the HAL. The ath9k team (both public and internally) need to do a lot of manual inspection and coding in order to pick up fixes and features from the internal driver. Since there's one of me, I'd rather optimise my time (read: get some sleep at some point.)

Then there's the third point that I didn't mention above:
  • I want to see how feasible it is to do snapshots from our internal codebase and push those out, rather than having to maintain a separate driver tree (sometimes based on the internal driver tree, sometimes re-implemented) and all the associated complication there.
This bit is pretty important. There's plenty of code I didn't want to open up. The bulk of the AR9300 HAL is already open sourced via the ath9k driver in Linux. So for the most part I'm open sourcing what we already have open sourced. However, I want to try and streamline the process for taking internally developed code and push it open.

This involves a few things.

Firstly, how much of the internal driver code is written with the idea that it's going to appear in the public eye? It depends what you think of as public - are your company developers "public" ? Are your customers with source code "public" ? It may not necessarily be "the general community." When you're writing code that's eventually going to be open sourced, you may need to make some decisions about how you structure your code.

For me, it was (mostly) easy. A very large amount of the "stuff that shouldn't be released" was already wrapped up in #ifdef's - stuff like emulation code, for example. So the public HAL snapshot is actually missing a lot of code that our internal version has. All I did (heh!) was pass it through 'unifdef'.
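
To give a flavour, it's the kind of invocation below - the symbol name is a placeholder, since the real internal ones obviously aren't public:

# Strip everything inside the internal-only #ifdef blocks before publishing.
unifdef -UATH_SUPPORT_EXAMPLE_EMULATION ar9300_internal.c > ar9300_public.c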

Next is whether the code is nice to look at. Is it formatted well? Is it well designed? Does it compile without warnings? Even on clang? These should be thought about whether or not your target audience is the general public. It's just good design. Companies may be worried about exposing the code, as if it will reflect badly on them. Well, yes, maybe it will. But hey - we the open community would rather you release the code and take constructive criticism instead of keeping it closed. Who knows, it may actually help you!

The Linux upstream push is actually good here - the Linux system maintainers don't take "bad code". They hold the developers to a higher standard and this is forcing companies to think a bit more about how they develop things. Now, whether companies view this as a cost-centre or a benefit is not something I wish to discuss here. The point is that by working in the Linux upstream community, companies are being forced to tidy up their game a little.

Ok, enough of the back-story. How'd it actually all happen?
The short version - there was API drift, yes. There was a bunch of driver layer stuff that needed to happen. But it wasn't terribly painful. It required me to clean up the driver a bit and implement some nicer tools.

The long version:
  • There was an internal attempt to partly convert the HAL code internally over to a format that is Linux-upstream compatible. This involved a variety of formatting changes - function names and indentation changed. It also involved a variety of variable / method changes - eg halMciSupport became hal_mci_support. The boolean type changed - HAL_BOOL and AH_TRUE/AH_FALSE became bool, true & false. These needed to be renamed back to the HAL style before I could make it compile.
  • FreeBSD stripped out the HAL_CHANNEL stuff from its HAL, replacing it with a direct reference to the net80211 type (struct ieee80211_channel.) This made things slightly tidier but it did put an external dependency on the HAL. I may end up going through the FreeBSD HAL and undoing this at some point; but it's a big job.
  • A variety of APIs changed over time. Although the bulk of the APIs stayed the same, they grew parameters (eg 11n TX and RX antenna and chain configuration); the TX descriptor APIs now take a list of TX buffers rather than a single TX buffer; and various other random things changed.
So, what was I going to do?

My first cut was to just take a snapshot of the HAL and rename / shuffle things around enough to make it compile.

The first thing I did was to create a set of HAL stub functions. All the stub functions did was print out their method name and return. This way I wasn't surprised by a NULL pointer dereference when the HAL or driver called an unimplemented method - I'd get told which method was being called.
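
A sketch of what one of those stubs looks like - the method name is illustrative, but the types and ath_hal_printf() follow the usual HAL style:

static HAL_BOOL
ar9300_stub_example_method(struct ath_hal *ah)
{
        /* Announce which unimplemented method got called, then bail. */
        ath_hal_printf(ah, "%s: called\n", __func__);
        return (AH_FALSE);
}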

I started with the bare minimum code needed to support probe and attach - which required a surprising amount of code to be converted over. But it was mostly mechanical work. And it worked - enough to get things probing and attaching. I didn't bother with frame transmission and reception just yet - getting probe/attach was enough.

Then I realised that I wanted to do this in a git branch, so I could import future versions of the HAL into master and then merge them into my branch. That's what I did. The HAL from 10.x was in master, and my FreeBSD port lived in 'local/freebsd'.

Next was figuring out whether to rename/fix API functions, or to use glue functions in order to deal with API differences. I've fixed some API differences (eg the reset path), but I ended up using a lot of wrapper functions to get the APIs to line up.

The important bits to bring up (in rough order) in order to see whether things are working:
  • Probe/attach/detach;
  • The reset path;
  • The initial calibration path (ADC calibration, IQ calibration, NF/AGC calibration);
  • The radio configuration path (ie, programming the analog section with the right frequencies, channel width, filter setup and such);
  • Interrupt handling;
  • ANI support;
  • RX path.
The RX path was the important bit. Once frame RX was working, I could do things like run the NIC in monitor mode and verify that HT20 and HT40 were working. And yes, that's pretty much what I did.

But at this point, the RX path exposed the first major API change - the whole FIFO setup that the AR9380 and later required. They don't support the list-based TX and RX that previous NICs supported. (Well, they _kind_ of do on the TX side, see below.)

The major change here required in the driver is that the RX descriptor is actually in the same memory area as the RX buffer. Ie, the first 'x' bytes of the passed in buffer is where the NIC DMAs the RX completion information to. Previous NICs have two areas for each RX frame - a RX descriptor area and an RX buffer area. Descriptors are in non-cachable memory, so I had to teach the FIFO RX path to support descriptors in cachable memory. I also had to teach the RX path to "skip" the 'x' bytes in order to hand the start of the data payload up to the net80211 stack. Finally, there's two RX FIFOs - one for high priority frames (beacons, uAPSD frames, PS-POLL frames, etc) and low-priority frames (everything else.) I had to teach the stack about this.
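
The "skip the first 'x' bytes" part is pretty simple at the point the frame is handed up - something like this, with the length macro as a placeholder:

/* The RX status words live at the front of the DMA buffer; strip them
 * before handing the 802.11 payload to net80211. */
m_adj(m, EXAMPLE_RX_STATUS_LEN);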

So, you can see the changes to the RX code - there's now a set of methods that implement RX - stop, start, flush, descriptor processing. The legacy routines stayed where they were. The new routines just overrode those methods.

And with that, RX came to life.

Next was TX. TX is a bit more special. There's only 8 TX FIFO entries per hardware queue (QCU 0..9); so I can't just push all the frames I want into the list. I also have four TX data buffer pointers per descriptor, rather than one per descriptor in the past. Finally, the TX status FIFO is completely separate from the TX FIFO itself - legacy chips would put the TX status at the end of the final descriptor in a frame.

This required some pretty significant refactoring of the TX path in order to expose the correct hooks to do this all properly. I won't go into the details here - suffice to say that I'm still working on it.

The next problem with TX was figuring out exactly what TX descriptor flags I was setting incorrectly. I eventually gave in and wrote some ALQ based logging which dumps the TX and RX descriptors into an ALQ log which I can then read from userland. This made it very, very easy to inspect what was going on - I was even finding bugs with the earlier chipset code!

Initially I used this to discover I wasn't correctly filling out all four buffer pointers in each TX descriptor. I can't leave any NULL if there are more descriptors for a given frame.

Then I used it to discover whether I was setting up the general flags right - TX chainmask, TX rate, duration, etc. I (re) discovered a hardware limitation with the AR9380 - I need to pad aggregate frames that use RTS with a little more pad delimiters or the transmission underruns. I was able to take these text dumps and give them to the Qualcomm Atheros MAC/PHY team for assistance and they were very impressed by the sophistication of my debugging tools.

Now I have the TX and RX side working. I pushed all of the driver side code into the public FreeBSD repository. I promised people that I would eventually open up the HAL side of things, but I figured that keeping the driver side of this closed was just plain silly. It also meant that if I did stop working on things (for whatever reason), the driver side was done - all that would need porting was the HAL.

Then I began the internal process to get the HAL opened up. I won't go into this in too much detail - suffice to say it took some engineering and legal review to get approval for this. The approval came in about two weeks ago and I pushed the repository into github shortly afterward.

Shortly after that, people started testing it and filing bugs. This part made me happy - there's a few small bugs that are actually in our 10.x mainline tree. I'll be pushing fixes back into the internal driver tree soon.

So, what's next?

I need to push the repository into a vendor branch in FreeBSD, then merge it into the kernel tree so it can be compiled by default.

I then need to get an updated version of the HAL approved by legal/engineering and push that update into the public git repository. Once that's done, I'll do a git merge into my branch and fix up whatever merge issues there are. This updated HAL includes some fixes for TX power and the AR95xx embedded SoC that we've just released. I hope to try and do a HAL update every month or two based on what bugs and features are introduced into the internal mainline driver.

I still have a bunch of driver work to finish up - notably I need to finish optimising the TX FIFO path in the driver and I need to implement MCI support. But the driver is now usable for me at least and I hope it'll become increasingly usable by others.

This has been a long and interesting trip.

Why PCI latency timers matter..

My latest "are you serious?" moment recently was trying to figure out the root cause of this performance issue with the AR5416 cardbus NIC on some of my test laptops.

Now, the AR5416 is Atheros' first 802.11n NIC, so it has some rough edges. But I was seeing some ridiculously bad transmission failures and I couldn't pinpoint them.

Not only that, I was seeing great performance (~ 130mbit TCP) on a specific laptop (Lenovo T41p) but the Lenovo T60 and T400 both performed extremely poorly.

To make matters weirder - the NIC performed great when speaking to another NIC in the same laptop. Just not to another physically separate device.

So, after much digging, here's what I discovered.

Firstly - I used my athalq packet descriptor logging and inspection tool (that's in FreeBSD-HEAD - no custom closed source code here!) to investigate the TX frames being sent to the hardware. What I found was troubling - large numbers of frames had TX data and TX delimiter underruns.

I then discovered that my code for counting TX data / delimiter underruns was totally incorrect - it's possible to see both a data/delimiter underrun error _with_ a valid transmitting frame. What was going on was cute - the hardware would start transmitting an aggregate frame but the DMA wouldn't keep up during said transmission and half way through the frame it would underrun. This only happened at higher MCS rates.

So making shorter aggregate frames fixed it, as well as increasing the delimiter count between frames. Both had the effect of reducing the likelihood of the NIC failing to transmit a longer aggregate. But they weren't solutions.

So I went digging. What I found was pretty simple in theory: the PCI latency timer on the NIC was being set to something appropriate (0xa8) but the PCI latency timer on the cardbus PCI bridge itself was not (0x20.) So any other bus activity would cause the NIC to not get the bus and it'd miss its DMA window.

Once I manually fixed the PCI bridge latency timer to be 0xa8, everything returned to normal.
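
If you want to poke at this on your own machine, pciconf(8) can read and write the latency timer byte at config offset 0x0d - the selector below is just an example, not the actual bridge in this laptop:

# Read the current latency timer on the bridge:
pciconf -r -b pci0:3:0:0 0x0d
# .. and bump it to 0xa8:
pciconf -w -b pci0:3:0:0 0x0d 0xa8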

However - there's only one thing on this PCI bridge - the cardbus interface itself. That's why it's so kooky. I would've thought that I'd have to up the value on the rest of the PCI bridges up to the root complex. There's no latency timer for PCIe, so it's not a problem there. So there's likely some very subtle timing involved that's just plain broken by default on how the BIOS initialises this cardbus slot and FreeBSD is not overriding it.

Now, if you see crappy performance on the PCI/cardbus 802.11n NICs in FreeBSD, you can check the output of 'athstats' to see if you do see TX underruns of any sort. If you are, the hardware isn't meeting the DMA deadlines it needs to DMA out frames and you need to do some further digging into your system to see why.

Be careful of adding debugging, as microseconds count..

.. after tinkering with the TDMA code a bit more, I discovered why I was seeing larger swings in the TDMA slot timings.

Two words: Debug Code.

Well, to be more specific - I added some debugging code that by default didn't do anything. But it was still there; it checked a debug flag and didn't log anything if it was disabled. But that would take time to execute. Since that debugging code sat _between_ the routines doing math with the RX timestamp and the nexttbtt register, it would calculate a slightly larger TSF offset.

Once I moved the debug code out from where it is and grouped all that register access and math together, the slot timing swings dropped by a few microseconds and everything went back to smooth.

Tsk. I should've known better.

At least now the TDMA code is working well on the 802.11n chips. Yes, it's still only 802.11abg rates, but it works. I've also found the PCU MISC_MODE bit to enforce packets don't transmit outside of the burst window and that is working quite fine with TDMA.

So, I think I can say "mission accomplished." I'll tidy up a few more things and make sure TX only occurs in one data queue (as mentioned in my previous post, they all burst independently at the moment..) and then patiently wait for someone to implement 802.11n adhoc negotiation so 802.11n MCS rates and aggregation magically begins to work. Once that's done, 802.11n TDMA will become a reality.

Getting TDMA working on 802.11n chipsets

A few years ago, a bunch of clever people figured out how to implement TDMA using the Atheros 802.11abg NICs. Sam Leffler has a great write-up here. He finished that particular paper with some comments about the (then) upcoming 802.11n chipsets from Atheros and how they would be better suited to the kinds of tricks he pulled with the Atheros MAC.

But, if you tried bringing up TDMA on the Atheros 802.11n chips, it plain just didn't work. Lots of people gnashed teeth about it. I was knee deep in TX aggregation work at the time so I just pushed TDMA to the back of my mind.

How it works is pretty cute in itself. To setup a TX "slot", the beacon timer is used to gate the TX queues to be able to start transmitting. Then a "channel ready time" burst length is configured, which is the period of time the TX queue can transmit. Once that timer expires, no new TX is allowed to begin. Sam then slides the slave TX window along based on when it sees a beacon from the master, as everything is synchronised against that.

Luckily, someone did some initial investigation and discovered that a couple of things were very very wrong.

Firstly, when fetching the next target beacon transmission time ("TBTT"), the AR5212 era NICs returned it in TU, but the AR5416 and later returned it in TSF.

Secondly, the TSF from each RX frame on the AR5212 is only 15 bits; on the AR5416 and later it's 32 bits. The wrong logic was used when extending the AR5416 RX frame timestamp from 32 bits to 64 bits, and it was causing the TSF to jump all over the place.
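
For the record, the correct extension is just a wrap-aware merge of the 32-bit RX timestamp into the current 64-bit TSF - a sketch of the logic:

static uint64_t
example_extend_tsf32(uint32_t rstamp, uint64_t tsf)
{
        uint64_t result;

        /* Glue the current TSF epoch onto the 32-bit timestamp. */
        result = (tsf & ~0xffffffffULL) | rstamp;
        /* The frame arrived at or before "tsf"; if the low 32 bits have
         * already wrapped past it, it belongs to the previous epoch. */
        if (result > tsf)
                result -= 0x100000000ULL;
        return (result);
}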

So with that in place, he managed to stop the NICs from spewing stuck beacons everywhere (a classic "whoa, who setup the timers wrong!" symptom) and got two 11n NICs configured in a TDMA setup. But he reported the traffic was very unstable, so he had to stop.

Fast-forward about 12 months. I've finished the TX aggregation and BAR handling; I've debugged a bunch of AP power save handling and I'm about to reimplement some things to allow me to finish off AP power save handling (legacy/ps-poll and uapsd) in a sane, correct fashion. I decide, "hey, TDMA shouldn't be that hard to fix. Hopefully there are no chip bugs, right?" So, I plug in a pair of AR5413 (pre-11n NICs) and get it up and running. Easy. Then I plug in an AR5416 as the slave node, and .. it worked. Ok, so why was he reporting such bad results?

Firstly, Sam exposed a bunch of useful TDMA stats from "athstats". Specifically, if you start tinkering with TDMA, do this:

$ athstats -i ath0 -o tdma 1


   input   output  bexmit tdmau   tdmadj crcerr  phyerr  TOR rssi noise  rate
  619817   877907   25152 25152    -4/+6    142     143    1   74   -96   24M
     492      712      20    20    -0/+7      0       0    0   74   -96   24M
     496      720      20    20    -2/+6      0       0    0   74   -96   24M
     500      723      21    21    -6/+4      0       0    0   75   -96   24M

When I was debugging the initial AR5416 TDMA stuff, the tdma adjust figures bounced everywhere between 0 and 1000uS off. That was obviously not stable.

So, I looked at what debugging was in the driver itself. There was some (check if_ath_debug.h for the TDMA and TDMA timer flags), and after a bit of digging I realised that every time the TSF was just about to converge, it would be bumped out 1000uS. Then it would slowly drift back to converge, then it'd fall out 1000uS. This kept repeating. It made no sense; every time it calculated the delta between the expected and real TSF, it would "bump" the TSF by that much. That way the TSF would actually be correct. It shouldn't be out by almost as much on the next RX'ed frame.

I did some initial testing to ensure the TSF was running at the expected 1uS interval (it was) and the master side was also running at the expected 1uS interval (it also was), so it wasn't out of sync clocks. The TSF bump must not be "right".

Enter the next bug - on the AR5416 and later, the TSF writes must be done as a 64 bit write. Ie, you write TSF_L32 first, then TSF_U32. At that point it gets internally updated and everything is consistent. If you don't do that, it doesn't latch.
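
In other words, something along these lines - the register names follow the usual HAL convention, but treat the snippet as a sketch:

/* On AR5416 and later the TSF only latches when written as a 64-bit
 * quantity: low word first, then the high word commits the value. */
OS_REG_WRITE(ah, AR_TSF_L32, tsf64 & 0xffffffffULL);
OS_REG_WRITE(ah, AR_TSF_U32, tsf64 >> 32);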

Ok, so that fixed the intial drift. But after about 60 seconds, the TSF adjust parameters started varying ridiculously wildly. Ok, so 60 seconds equaled around 65,535 TU (where a TU is 1.024 milliseconds) so I began to wonder if I was seeing something wrap at that point.

Enter the next bug. The math involved in calculating the expected slot time was based on the 64 bit TSF and it was converted down to a 16 bit TU value from 0 .. 65535 TU. On the AR5212 era chips, the nexttbtt timer had a 16 bit resolution. When the nexttbtt value was read from that register, it was already 16 bits. So the "TSF delta" between the expected and real slot time was calculated between these two 16 bit values. However, on the AR5416 and later, the nexttbtt value was a 32 bit TSF (microsecond) value. Even when converted to a TU (1.024 millisecond) value, it would wrap at a value much greater than 65,535 TU. So the comparison would soon be between a value from 0..65,535 TU and 0 .. much-bigger-than-65,535 TU. The tsfdelta would become very, very negative.. and things would go nuts.

Ok, so that fixed another behavioural issue. Things were looking good. The slot time sync was stable. So I started passing traffic. Everything looked good.. for about 60 seconds. Then everything went slightly nuts again. But only with traffic. The timing calculations went way, way out.

Here's an example of the beacons coming in. Note that the expected beacon interval here is 49,152uS.

[34759308] [100933] BEACON: RX TSF=67127545 Beacon TSF=3722387514 (49152)
[34759357] [100933] BEACON: RX TSF=67176714 Beacon TSF=3722436670 (49156)
[34759442] [100933] BEACON: RX TSF=67262432 Beacon TSF=3722521354 (84684)
[34759454] [100933] BEACON: RX TSF=67275216 Beacon TSF=3722533850 (12496)
[34759504] [100933] BEACON: RX TSF=67325995 Beacon TSF=3722583802 (49952)
[34759552] [100933] BEACON: RX TSF=67374479 Beacon TSF=3722632108 (48306)
[34759602] [100933] BEACON: RX TSF=67424546 Beacon TSF=3722681282 (49174)
[34759652] [100933] BEACON: RX TSF=67475842 Beacon TSF=3722731578 (50296)
[34759701] [100933] BEACON: RX TSF=67525900 Beacon TSF=3722780730 (49152)

The master beacons were not coming in stable in any way. The main reason this would happen is if the air was busy at the master target beacon transmission time. So it would delay transmitting the beacon until the air was free.

This is where I decided it was about time I inserted some tracing into the TDMA code. I had introduced some ALQ based tracing in the ath(4) driver recently, specifically to trace TX and RX descriptors. I decided to add TDMA trace points. That way I could look at the TDMA recalculation along with the TX and RX from the driver.

What I found was very .. grr-y. After about 60 seconds (surprise), the TX would burst FAR past the 2.5 milliseconds it was supposed to. Why the heck was that happening?

After a bunch of staring-at-documentation and talking with some people well-versed in how the Atheros MAC worked, we realised the only real explanation is that the beacon timer was firing after the burst time, retriggering the timer. But why would it be? I stared at the debugging output a little more, and look at what I saw:

[34759258] [100933] BEACON: RX TSF=67077388 Beacon TSF=3722338362 (49152)
[34759258] [100933] SLOTCALC: NEXTTBTT=67081216 nextslot=67081224 tsfdelta=8 avg (5/8)
[34759258] [100933] TIMERSET: bt_intval=8388616 nexttbtt=65510 nextdba=524078 nextswba=524070 nextatim=65511 flags=0x0 tdmadbaprep=2 tdmaswbaprep=10
[34759259] [100933] TSFADJUST: TSF64 was 67077561, adj=1016, now 67078577

.. everything here is fine. We're programming nexttbtt in TU, not TSF (because the HAL API specifies it in TU for the older, pre-11n chips). Ok. Suspiciously close to the 65,535 TU boundary.

Then:

[34759308] [100933] BEACON: RX TSF=67127545 Beacon TSF=3722387514 (49152)
[34759308] [100933] SLOTCALC: NEXTTBTT=22528 nextslot=67131381 tsfdelta=-11 avg (5/7)
[34759308] [100933] TSFADJUST: TSF64 was 67127704, adj=11, now 67127715

Ok, but it's just a TSF adjust, no biggie. But, then this happened:

[34759357] [100933] BEACON: RX TSF=67176714 Beacon TSF=3722436670 (49156)
[34759357] [100933] SLOTCALC: NEXTTBTT=71680 nextslot=67180550 tsfdelta=6 avg (5/7)
[34759357] [100933] TIMERSET: bt_intval=8388616 nexttbtt=71 nextdba=566 nextswba=558 nextatim=72 flags=0x0 tdmadbaprep=2 tdmaswbaprep=10
[34759357] [100933] TSFADJUST: TSF64 was 67176888, adj=1018, now 67177906

At this point, it was clear. nexttbtt was very, very small - 71 TU is very, very much before the current TSF of somewhere around 67,127,545. At this point the Next TBTT timer would just keep continuously firing. And this would keep re-gating the TX queue, allowing it to just plain keep bursting. That explains why everything was going crazy during traffic.

This again was another example of the code assuming it was an AR5212 era NIC. The nexttbtt value was being trimmed to be between 0 and 65,535 TU. After I fixed that and fixed up the math a bit, nexttbtt was being correctly programmed and suddenly everything started working. And quite well.

So, now the basics are working. I'll audit the math to ensure everything wraps consistently at the 32-bit TSF boundary (ie, 4 billion microseconds, give or take) as that doesn't take too long to occur. But the 11n chips now behave the same as the 11a chips do when doing TDMA.

So what's next?
  • The "tx time" calculation needs to be aware of the 11n rate configuration, so it can calculate the guard time correctly. Right now it uses the non-11n aware rate -> duration HAL function;
  • The TX path has to be rejiggled a bit to ensure _all_ traffic gets stuffed into one TX queue (well, besides beacons.) Management and higher priority traffic has to do this too. If not, then multiple TX queues can burst and they'll burst separately, blowing out the TX slot timing;
  • Someone needs to get 11n adhoc working, so that 11n rates are negotiated during adhoc peer establishment. Then aggregation can just magically work at that point (the TDMA code reuses a lot of adhoc mode vap behaviour code);
  • 802.11e / 802.11n delayed block-ACK support needs to be implemented;
  • Then when doing TDMA, we can just burst out an aggregate or two inside the given slot time, then wait for a delayed block ACK to come back from the remote peer in the next slot time! Yes, I'd like to try and reuse the standard stuff for doing delayed block-ack rather than implementing something specific for 802.11n aggregation + TDMA.
  • .. and yes, it'd be nice for this to support >2 slave terminals, but that's a bigger project.
Right now I think I'll tackle #1 and then make sure the 11n NICs can be configured in a static MCS rate, without aggregation. The rest will have to be up to someone else in the community. My plate is full.

So, TDMA on the 802.11n NICs is now working. Go forth and hack!

Making the AR5210 NIC work in the office..

I'm quite happy that FreeBSD's ath(4) driver supports almost all of the PCI and PCIe devices that Atheros has made. Once I find a way to open source this AR9380 HAL I've constructed, we'll actually support them all. However, there are a few little niggling things that have been bugging me. Today I addressed one of those.

The AR5210. It's their first 11a-only NIC. It does up to 54MBit OFDM 802.11a; it doesn't do QoS/WME (as it only has one data queue); it "may" go up to 72MBit if I hack on some magic extensions. And in open mode, it works great.

But it didn't work in the office or at home. All of which are 802.11n APs with WPA2 authentication and AES-CCMP encryption.

Now, the AR5210 only does open and WEP encryption. It doesn't do TKIP or AES-CCMP. So the encryption has to happen in software. The NIC was associating fine, but when wpa_supplicant went to program in the AES-CCMP encryption keys, the HAL simply refused.

What I discovered was this.

The driver keycache code was trying to allocate keycache slots for the AR5210, which only supports the four WEP key slots. This is a big no-no. So once I mapped them all to slot 0, I made a little progress.
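For illustration, here's a minimal sketch of the idea, shaped like net80211's key_alloc method - it's not the exact committed code. Software-crypto keys get pinned to slot 0 and the hardware is never programmed for them; only the four global WEP keys map to real AR5210 slots.

/* assumes <net80211/ieee80211_var.h> */
static int
ar5210_key_alloc(struct ieee80211vap *vap, struct ieee80211_key *k,
    ieee80211_keyix *keyix, ieee80211_keyix *rxkeyix)
{
        if (k->wk_flags & IEEE80211_KEY_SWCRYPT) {
                /* Done in software: no hardware keycache slot required. */
                *keyix = *rxkeyix = 0;
                return (1);
        }
        if (k >= &vap->iv_nw_keys[0] &&
            k < &vap->iv_nw_keys[IEEE80211_WEP_NKID]) {
                /* Global WEP key: use the matching hardware slot (0..3). */
                *keyix = *rxkeyix = k - vap->iv_nw_keys;
                return (1);
        }
        /* Anything else (per-station hardware keys) isn't supported. */
        return (0);
}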

The net80211 layer was trying to program in an AES-CCMP key, which the driver was dutifully passing to the HAL. The AR5210 HAL doesn't support anything but WEP or open, so the encryption key type was "clear". Now, "clear" means "for this MAC address, don't try decrypting anything." But the AR5210 HAL code rejected it - as I said, it doesn't do that.

Ok, so I ignored that entirely. I mapped all of the software-encrypted key entries to slot 0 and just didn't program the hardware. So now the HAL didn't reject things. But it wasn't working. The received frames were being corrupted somehow and failed the CCMP MIC integrity check. I took a look at the frames being received (which should've been passed up "in the clear") versus what was going on in the air - luckily, this laptop has an AR9280 inside, so I could put it into monitor mode and sniff things. The packets just didn't add up. I was confused.

Then after discussing this with my flatmate, I idly wondered if the hardware was decrypting the traffic anyway. And, well, it was. Encrypted frames have the protected (WEP) bit set in the 802.11 header - whether they're WEP, TKIP or AES-CCMP. The AR5210 didn't know it wasn't WEP, so it tried decrypting the frames itself. And corrupting them.

So after finding a PCU control register (hi AR_DIAG_SW) that lets me disable encryption/decryption, I was able to pass through the encrypted traffic fine and everything just plain worked. It's odd seeing an 11a, non-QoS station on my 11n AP, but that just goes to show that backwards interoperability is still useful.
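The idea looks roughly like this - note the register offset and bit names below are hypothetical placeholders, not copied from the AR5210 register header; only OS_REG_READ()/OS_REG_WRITE() are the usual HAL accessors:

#define AR_DIAG_SW              0x8048          /* example offset only */
#define AR_DIAG_ENCRYPT_DIS     0x00000008      /* hypothetical bit */
#define AR_DIAG_DECRYPT_DIS     0x00000010      /* hypothetical bit */

static void
ar5210_disable_hw_crypto(struct ath_hal *ah)
{
        /* Stop the PCU encrypting on TX and attempting decryption on RX. */
        OS_REG_WRITE(ah, AR_DIAG_SW,
            OS_REG_READ(ah, AR_DIAG_SW) |
            AR_DIAG_ENCRYPT_DIS | AR_DIAG_DECRYPT_DIS);
}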

And yes, I did take the AR5210 into the office and I did sit in a meeting with it and use it to work from. It let me onto the corporate wireless just fine, thank you.

So now the FreeBSD AR5210 support doesn't do any hardware encryption. You can turn it on again if you'd like. Why? Because I don't want the headache of someone coming to me and asking why a dual-VAP AP with WEP and CCMP is failing. The hardware can only do _either_ WEP/open with hardware encryption, _or_ it can do everything without hardware encryption. So I decided to just disable it for now.

There's also a problem with how encryption is specified to net80211. It's done at startup time, when the driver attaches. Anything that isn't specified as being done in hardware is done in software. There is currently no clean way to dynamically change that configuration. So, if I have WEP encryption in hardware but CCMP/TKIP in software, I have to dynamically flip the hardware encryption on/off _AND_ I have to enforce that WEP and CCMP don't get configured at the same time.
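For reference, this is roughly how that attach-time declaration looks, assuming net80211's ic_cryptocaps field (anything not advertised there is done in software); the helper itself is illustrative:

static void
ar5210_setup_cryptocaps(struct ieee80211com *ic, int use_hw_wep)
{
        if (use_hw_wep) {
                /* Hardware WEP only; CCMP/TKIP would still be software. */
                ic->ic_cryptocaps = IEEE80211_CRYPTO_WEP;
        } else {
                /* Advertise nothing: net80211 does all crypto in software. */
                ic->ic_cryptocaps = 0;
        }
}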

The cleaner solution would be to:
  • Create a new driver attribute, which indicates the hardware can do WEP and CCMP at the same time - make sure it's off for the AR5210;
  • Add a HAL call to enable/disable hardware encryption;
  • If a user wants to do WEP or open - enable hardware encryption;
  • If a user wants to do CCMP/TKIP/etc - disable hardware encryption;
  • Complain if the user wants to create a VAP with CCMP/TKIP and WEP.
 If someone wants a mini-project - and they have an AR5210 - I'm all for it. But at the moment, this'll just have to do.

Power save, CABQ, multicast frames, EAPOL frames and sequence numbers (or why does my Macbook Pro keep disassociating?)

I do lots of traffic tests when I commit changes to the FreeBSD Atheros HAL or driver. And I hadn't noticed any problems until very recently (when I was doing filtered frames work.) I noticed that my macbook pro would keep disassociating after a while. I had no idea why - it would happen with or without any iperf traffic. Very, very odd.

So I went digging into it a bit further (and it took quite a few evenings to narrow down the cause.) Here's the story.

Firstly - hostapd kept kicking off my station. Ok, so I had to figure out why. It turns out that the group rekey would occasionally fail. When it's time to do a group rekey, hostapd will send a unicast EAPOL frame to each associated station with the new key and each station must send back another EAPOL frame, acknowledging the fact. This wasn't happening so hostapd would just disconnect my laptop.

Ok, so then I went digging to see why. After adding lots of debugging code I found that the EAPOL frames were actually making it to my Macbook Pro _AND_ it was ACKing them at the 802.11 layer. Ok, so the frame made it out there. But why the hell was it being dropped?

Now that I knew it was making it to the end node, I could eliminate a bunch of possibilities. What was left:


  • Sequence number is out of order;
  • CCMP IV replay counter is out of order;
  • Invalid/garbled EAPOL frame contents.
I quickly ruled out the EAPOL frame contents. The sequence number and CCMP IV were allocated correctly and in order (and never out of sequence from each other.) Good. So what was going on?

Then I realised - ok, all the traffic is in TID 16 (the non-QoS TID.) That means it isn't a QoS frame but it still has a sequence number; so it is allocated one from TID 16. There's only one CCMP IV number for a transmitter (the receiver tracks a per-TID CCMP IV replay counter, but the transmitter only has one global counter.) So that immediately rings alarm bells - what if the CCMP IV sequence number isn't being allocated in a correctly locked fashion?

Ok. So I should really fix that bug. Actually, let me go and file a bug right now.

There. Bug filed. PR 172338.

Now, why didn't this occur back in Perth? Why is it occurring here? Why doesn't it occur under high throughput iperf (150Mbps+) but it does when the iperf tests are capped at 100Mbps ethernet speeds? Why doesn't it drop my FreeBSD STAs?

Right. So what else is in TID 16? Guess what I found? All the multicast and broadcast traffic, like ARPs, is in TID 16.

Then I discovered what was really going on. The pieces fell into place.

  • My mac does go in and out of powersave - especially when it does a background scan.
  • When the mac is doing 150Mbps+ of test traffic, it doesn't do background scans.
  • When it's doing 100Mbps of traffic, the stack sneaks in a background scan here and there.
  • Whenever it goes into background scan, it sends a "power save" indication to the AP..
  • .. and the AP puts all multicast traffic into the CABQ instead of sending it to the destination hardware queue.
  • Now, when this occurred, the EAPOL frames would go into the software queue for TID 16 and the ARP/multicast/etc traffic would go into the CABQ
  • .. but the CABQ has higher priority, so it'll be transmitted just after the beacon frame goes out, before the EAPOL frames in the software queue.
Now, given the above set of conditions, the ARP/multicast traffic (which there's more of in my new place, thanks to a DSL modem that constantly scans the local DHCP range for rogue/disconnected devices) would be assigned sequence numbers AFTER the EAPOL frames that were still sitting in the TID 16 software queue. The Mac would receive those CABQ frames with later sequence numbers, THEN my EAPOL frame - which would be rejected for being out of sequence.

The solution? Complicated.

The temporary solution? TID 16 traffic is now assigned to a higher priority hardware queue, so it goes out first. Yes, I should mark EAPOL frames that way too. I'll go through and tidy this up soon. I just needed to fix this problem before others started reporting the instability.
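The interim change amounts to something like this - a minimal sketch assuming net80211's real M_EAPOL mbuf flag, IEEE80211_NONQOS_TID and the WME access category names; the function itself is illustrative, not the exact driver code:

static int
pick_tx_ac(struct mbuf *m, int tid)
{
        /*
         * Steer EAPOL and non-QoS (TID 16) traffic at the voice queue
         * so it isn't beaten to the air by CABQ-deferred multicast
         * carrying later sequence numbers.
         */
        if ((m->m_flags & M_EAPOL) || tid == IEEE80211_NONQOS_TID)
                return (WME_AC_VO);
        return (TID_TO_WME_AC(tid));
}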

The real solution is complicated. It's complicated because in power save mode, there's both unicast and multicast traffic going into the same TID(s) but different hardware queues. Given this, it's quite possible that the traffic in the CABQ will burst out before the unicast packets with the same TID make it out via another hardware queue.

I'm still thinking of the best way to fix this.

Lessons learnt from fiddling with the rate control code..

(Note before I begin: a lot of these ideas about rate control are stuff I came up with before I began working at my current employer.)

Once I had implemented filtered frames and did a little digging, I found that the rate control code was doing some relatively silly things. Lots of rates were failing quite quickly and the rate control was bouncing all over the place.

The first bug I found was that I was checking the TX descriptor completion status before I had copied it over - and so I was randomly marking TX as failed when it hadn't failed. Oops.

Next, don't call the rate control code on filtered frames. They've been filtered, not transmitted. My code wasn't doing that - I'm just pointing it out to anyone else who is implementing this.

Then I looked at what was going on with rate control. I noticed that whenever the higher transmission rates failed, it took a long time for the rate control code to try and sample them again. I went and did some digging - and found it was due to a coding decision I had made about 18 months ago. I treated higher rates with a low EWMA success rate as having had successive failures. The ath_rate_sample code treats "successive failures" as "don't try to probe this for ten seconds." Now, there are a few things you need to know about 802.11n:


  • The higher rates fail, often;
  • The channel state changes, often;
  • Don't be afraid to occasionally try those higher rates; it may actually work out better for you even under higher error rates.
So given that, I modified the rate control code a bit:

  • Only randomly sample a few rates lower than the current one; don't try sampling all 6, 14 or 22 rates below the high MCS rates;
  • Don't treat low EWMA as "successive failures"; just let the rate control code oscillate a bit;
  • Drop the EWMA decay time a bit to let the oscillation swing a little more.
Now the rate control code behaves much better and recovers much quicker during unstable channel conditions (eg - adrian walking around a house whilst doing iperf tests.)
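To make the EWMA bookkeeping concrete, here's a minimal sketch; the percentage scaling and decay factor are illustrative, not the exact ath_rate_sample values:

static int
ewma_update(int old_pct, int sample_pct, int decay_pct)
{
        /*
         * new = old * decay + sample * (1 - decay), in percent units.
         * A smaller decay weight lets the estimate swing faster, which
         * is the "drop the EWMA decay time a bit" change above.
         */
        return ((old_pct * decay_pct + sample_pct * (100 - decay_pct)) / 100);
}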

Given this, what could I do better? I decided to start reading up on the current state of play in 802.11n-aware rate control and rapidly came to the conclusion that - wow, we likely could do it better. The Linux minstrel_ht algorithm is also based on John Bicket's SampleRate code, but instead of using an EWMA and minimising packet transmission time, it uses the EWMA to calculate a theoretical throughput and maximises that. So, that sounds good, right?

Except that the research shows that 802.11n channels can vary very frequently and very often, especially at the higher MCS rates. The higher MCS rates can become better and worse within a window of a second or two. So, do you want to try and squeeze the last of throughput out of that, or not?

Secondly, using "throughput" as a metric is fine if your air time is .. well, cheap. But what if you have many, many clients on an AP? Your choice of maximising throughput based on what the error rate predicts your data throughput is doesn't take airtime into account. In fact, if you choose a higher MCS rate with a higher error rate but higher throughput, you may actually be wasting more air with those retransmissions. Great for a single station, but perhaps not so great when you have lots.

So what's the solution? The open source rate control stuff doesn't take the idea of "air utilisation" into account. There's enough data available to create an air time model, but no-one is using it yet. Patches are gratefully accepted. :-)
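To make that concrete, here's a minimal sketch of an airtime-aware metric, under the assumption that we already track per-rate TX time and an EWMA success percentage; all the names are illustrative:

#include <stdint.h>

static uint32_t
expected_air_per_byte(uint32_t tx_time_us, uint32_t payload_bytes,
    int ewma_success_pct)
{
        if (ewma_success_pct == 0 || payload_bytes == 0)
                return (UINT32_MAX);    /* effectively unusable rate */
        /*
         * Expected attempts = 100 / success%; air consumed per delivered
         * byte = tx_time * attempts / payload.  Scaled to avoid floats.
         */
        return ((tx_time_us * 100) / (ewma_success_pct * payload_bytes));
}

A scheduler could then prefer the rate (and the node) with the cheapest expected air per delivered byte, rather than the highest predicted throughput.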

Finally, the current packet scheduler is pretty simple and stupid (and does break in a lot of scenarios, sigh.) It's just a FIFO, servicing nodes/TIDs with traffic in said FIFO mechanism. But that's not very fair - both from a "who is next" standpoint and "what's the most efficient use of the air" view. In addition, the decision about which node/TID to schedule next is done totally separate to the rate control decision. Rate control occurs rather late in the packet transmission process (ie, once we've committed to queuing it to the hardware.) Wouldn't it be better to have the packet scheduler and rate control code joined at the hip, where the scheduler doesn't aggressively schedule traffic to a slow/lossy end node?

Lots of things to think about..

Filtered frames support, or how not to spam the air with useless transmission attempts

I haven't updated this blog in a while - but that doesn't mean I haven't been a busy little beaver in the world of FreeBSD Atheros support.

A small corner part of correct and efficient software retransmission handling is what Atheros hardware calls "filtered frames", or what actually happens - "how to make sure the hardware doesn't just spam the air with transmission attempts to a dead remote node."

So here's the run-down.

You feed the Atheros hardware a set of frames to transmit. There's a linked list (or FIFO for the more recent hardware) of TX descriptors which represent the fragments of each frame you're transmitting. The hardware will attempt each one in turn and then return a TX completion status explaining what happened. For frames without ACKs the TX status is "did I get a chance to squeeze this out into the air" - there's no ACK, so there's no way to know if it made it out there. But for frames with ACKs, there's a response from the remote end which the transmitter uses, and it'll attempt retransmission in hardware before either succeeding or giving up. Either way, it tells you what happened.

The reality is a little more complicated - there are multiple TX queues with varying TX priorities (implementing the 802.11 "WME" QoS mechanism). There's all kinds of stuff going on behind the scenes to figure out which queue wins arbitration, gets access to the air to transmit, etc, etc. For this particular discussion it doesn't matter.

Now, say you then queue 10 frames to a remote node. The hardware will walk the TX queue (or queues) and transmit those frames one at a time. Some may fail, some may not. If you don't care about software retransmission then you're fine.

But say you want to handle software retransmission. Say that retransmission is for legacy, non-aggregation / non-blockack sessions. You transmit one frame at a time. Each traffic identifier (TID) has its own sequence number space, as well as the "Non-QoS" traffic identifier (ie, non-QoS traffic that does have a sequence number.) By design, each frame is transmitted in order, with incrementing sequence numbers. The receiver throws away frames that are out of sequence. That way packets are delivered in order. They may not be reliably received, but the sequence number ordering is enforced.
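As an aside, the per-TID sequence number allocation really is as simple as it sounds - a minimal sketch, using net80211's real ni_txseqs[] array and the 12-bit sequence space (the helper itself is illustrative):

static uint16_t
seqno_assign(struct ieee80211_node *ni, int tid)
{
        uint16_t seqno = ni->ni_txseqs[tid];

        /* Each TID (including the non-QoS one) counts independently. */
        ni->ni_txseqs[tid] = (seqno + 1) % IEEE80211_SEQ_RANGE;
        return (seqno);
}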

So, you now want to handle software retransmission of some frames. You get some frames that are transmitted and some frames that weren't. Can you retransmit the failed ones? Well, the answer is "sure", but there are implications. Specifically, those frames may now be out of sequence, so when you retransmit them the receiver will simply drop them. You could choose to reassign them new sequence numbers so the receiver doesn't reject them out of hand, but now the receiver is seeing out-of-order frames. It doesn't see out of sequence frames, but the underlying payloads are all jumbled. This makes various protocols (like TCP) very unhappy. If you're running some older PPTP VPN sessions, the end point may simply drop your now out-of-order frames. So it's very important that you actually maintain the order of frames to a station.

Given that, how can you actually handle software retransmission? The key lies in this idea of "filtered frames." The Atheros MAC has what it calls a "keycache", where it stuffs encryption keys for each destination node (and multicast keys, and WEP keys..) So upon reception of a frame, it'll look in the keycache for that particular station MAC, then attempt to decrypt the data with that key. Transmission is similar - there's keycache entries for each destination station and the TX descriptor has a TX Key ID.

Now, I don't know if the filtered frame bit is stored in the keycache entry or elsewhere (I should check, honest) but I'm pretty sure it is.

So the MAC has a bit for each station in the keycache (I think!) and when a TX fails, it sets that bit. That bit controls whether any further TX attempts to that station will actually occur. So when the first frame TX to that station fails, any subsequent attempts are automatically failed by the MAC, with the TX status "TX destination filtered." Thus, anything already in the hardware TX queue(s) for that destination get automatically filtered for you. The software can then grab those frames in the order you tried them (which is in sequence number order, right?) and since _all_ frames are filtered, you don't have to worry about some intermediary frames being transmitted. You just queue them in a software queue, wait until the node is alive again, and then set the "clear destination mask (CLRDMASK)" bit in the next TX attempt.
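Here's a rough sketch of what the TX completion path does with these, assuming the real HAL_TXERR_FILT status bit and HAL_TXDESC_CLRDMASK descriptor flag; the software-queue helpers are placeholders for the driver's own per-node/per-TID queue handling:

static void
handle_tx_comp(struct ath_softc *sc, struct ath_buf *bf,
    const struct ath_tx_status *ts)
{
        if (ts->ts_status & HAL_TXERR_FILT) {
                /*
                 * The MAC short-circuited this frame because an earlier
                 * TX to the same destination failed.  Park it, in order,
                 * on the software queue; the next attempt to this node
                 * must set HAL_TXDESC_CLRDMASK to clear the filter bit.
                 */
                ath_tx_filtered_requeue(sc, bf);        /* placeholder helper */
                return;
        }
        ath_tx_normal_comp(sc, bf, ts);                 /* placeholder helper */
}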

This gives you three main benefits:
  • All the frames are filtered at the point the first fails, so you get all the subsequent attempted frames back, in the order you queued them. This makes life much easier when retransmitting them;
  • The MAC now doesn't waste lots of time trying to transmit to destinations that aren't available. So if you have a bunch of UDP traffic being pushed to a dead or far away node, the airtime won't be spent trying to transmit all those frames. The first failure will filter the rest, freeing up the TX queue (and air) to transmit frames to other destinations;
  • When stations go into power save mode, you may have frames already in the hardware queue for said station. You can't cancel them (easily/cleanly/fast), so instead they'll just fail to transmit (as the receiver is asleep.) Now you just get them filtered; you store them away until the station wakes up and then you retransmit them. A little extra latency (which is ok for some things and not others!) but much, much lower packet loss.
This is all nice and easy. However, there are a few gotchas here.

Firstly, it filters all frames to that destination. For all TIDs, QoS or otherwise. That's not a huge deal; if however you're me and you have per-TID queues you need to requeue the frames into the correct queues to retry. No biggie.

Secondly, if a station is just far away or under interference, you'll end up filtering a lot of traffic to it. So a lot of frames will cycle through the filtered frames handling code. Right now in FreeBSD I'm treating them the same as normal software retransmissions and dropping them after 10 attempts. I have a feeling I need to fix that logic a bit as under heavy dropping conditions, the traffic is being prematurely filtered and prematurely dropped (especially when the node is going off-channel to do things like background scans.) So the retransmission and frame expiry is tricky. You can't just keep trying them forever as you'll just end up wasting TX air time and CPU time transmitting lots of frames that will just end up being immediately filtered. Yes, tricky.

Thirdly, there may be traffic that needs to go out to that node that can't be filtered. If that's the case, you may actually end up with some subsequent frames being transmitted. You then can't just naively requeue all of your failed/filtered frames or you'll transmit some frames out of sequence. You then _have_ to drop anything with a sequence number or that was queued _before_ the successfully transmitted frame.

Luckily for 802.11n aggregation sessions the third point isn't a big deal - you already can transmit out of sequence (just within the block-ack window or BAW), so you can just retransmit filtered frames with sequence numbers in any particular sequence. Life is good there.

So what have I done in FreeBSD?

I have filtered frames support working for 802.11n aggregation sessions. I haven't yet implemented software retransmission (and all of the hairy logic to handle point #3 above) for non-aggregate traffic, so I can't do filtered frames there. But I'm going to have to do it in order to nicely support AP power save operation.

Reading rate control information from userland..

I've begun working on exporting the ath(4) driver rate control information. The "current" way to extract this information has been to issue a sysctl to the rate control module, which dumps debugging state to the kernel console. This is quite suboptimal. Instead, I'm now packaging that up via a driver ioctl() call. It's temporary (hah, famous last words in the software industry :) and it's driver specific for now. The thirty second summary:

  • There's a new ioctl to ath(4) to query the rate control module for a single associated MAC address (or the BSS MAC when running as a STA);
  • Since the rate control is currently done at the driver level rather than at the VAP level, the call is to the driver rather than via the VAP (wlanX) interface;
  • There's no easy way to get "all" station details whilst maintaining correct locking.

The last point deserves a little more explanation. I've introduced (well, I'm now _using_) a per-node lock when doing rate control updates. I acquire this lock when copying the rate control data out, so the snapshot is consistent.

So to fetch the state for a node, the following occurs:


  • Call the net80211 layer to find an ieee80211_node for the given mac address - that involves locking the node table and getting a reference for the node (if found); 
  • Then locking the ath_node associated with it;
  • Copy the data out;
  • Unlock the ath_node;
  • Decrement the ieee80211_node reference counter (which requires the node table lock.)


Now, the node table lock is only held whilst fetching the node reference; it isn't held whilst doing the actual rate control manipulation. Compare that to what I'd do if I wanted to walk the node table: the net80211 API for doing this holds the node table lock whilst walking the node list. That means I'd end up holding the node table lock whilst acquiring the ath_node lock. Now, that's fine - however, if I then decide somewhere else to try and do any ieee80211 operation whilst holding the ath_node lock, I may find myself with a lock ordering problem.

So for now the API will just support doing a single lookup for a given MAC, rather than trying to pull all of the rate control table entries down at once.
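A minimal sketch of that lookup path, assuming the real ieee80211_find_node()/ieee80211_free_node() and ATH_NODE() helpers; the per-node lock macros and the stats structure/field names are illustrative:

static int
ath_ratestats_fetch(struct ieee80211com *ic,
    const uint8_t mac[IEEE80211_ADDR_LEN], struct ath_rateioctl_stats *dst)
{
        struct ieee80211_node *ni;
        struct ath_node *an;

        /* Takes the node table lock internally; returns a referenced node. */
        ni = ieee80211_find_node(&ic->ic_sta, mac);
        if (ni == NULL)
                return (ENOENT);

        an = ATH_NODE(ni);
        ATH_NODE_LOCK(an);              /* per-node rate control lock */
        memcpy(dst, &an->an_rc_stats, sizeof(*dst));    /* illustrative field */
        ATH_NODE_UNLOCK(an);

        /* Drop the node table reference (takes the table lock again). */
        ieee80211_free_node(ni);
        return (0);
}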

Here's an example output from the command:


adrian@marilyn:~/work/freebsd/ath/head/src/tools/tools/ath/athratestats]> ./athratestats -i ath1 -m 06:16:16:03:40:d0
static_rix (-1) ratemask 0xf
[ 250] cur rate 5  Mb since switch: packets 1 ticks 43028655
[ 250] last sample (11  Mb) cur sample (0 ) packets sent 10708
[ 250] packets since sample 16 sample tt 6275
[1600] cur rate 11  Mb since switch: packets 15 ticks 43025720
[1600] last sample (5  Mb) cur sample (0 ) packets sent 2423
[1600] packets since sample 7 sample tt 12713
[ 2  Mb: 250]        9:9        (100%) (EWMA 100.0%) T       11 F    0 avg  2803 last 42176930
[ 5  Mb: 250]     3139:3139     (100%) (EWMA 100.0%) T     3273 F    0 avg  1433 last 43028656
[ 5  Mb:1600]       29:29       (100%) (EWMA 100.0%) T       39 F    0 avg  5303 last 42192044
[11  Mb: 250]     7560:7560     (100%) (EWMA 100.0%) T     7838 F    0 avg  1857 last 43026094
[11  Mb:1600]     2394:2394     (100%) (EWMA 100.0%) T     2581 F    0 avg  2919 last 43026411