- A station can tell an access point it's going to sleep by setting the power mgmt bit to 1 in a TXed frame;
- The AP will then update the TIM entry in the beacon frames it sends out to reflect whether that station has any traffic queued;
- A station can signal an AP that it's awake by sending a data frame with the power mgmt bit set to 0;
- .. or it can request a frame at a time by using PS-POLL;
- There's also the uAPSD stuff, which I haven't implemented yet and likely won't for a while.
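To make the first and third points concrete, here's a minimal sketch of reading the power management bit out of an 802.11 frame header. The Frame Control bit position is from the 802.11 header layout; the helper names and the `ps_state` dictionary are hypothetical and only stand in for whatever per-station state an AP keeps.

```python
import struct

# Power Management flag: bit 12 of the little-endian 16-bit Frame Control
# field (i.e. 0x10 in the second Frame Control byte).
FC_PWR_MGT = 0x1000


def station_is_sleeping(frame: bytes) -> bool:
    """Return True if the frame's Power Management bit is set to 1."""
    (frame_control,) = struct.unpack_from("<H", frame, 0)
    return bool(frame_control & FC_PWR_MGT)


def update_ps_state(ps_state: dict, station: str, frame: bytes) -> None:
    """Hypothetical AP-side bookkeeping: remember which stations are asleep."""
    ps_state[station] = station_is_sleeping(frame)


# Null-data frame headers. 0x48 is type=data, subtype=null; the second byte
# carries ToDS (0x01) and, when sleeping, Power Management (0x10).
going_to_sleep = bytes([0x48, 0x11]) + bytes(22)
waking_up = bytes([0x48, 0x01]) + bytes(22)
```

An AP would then use that per-station state to decide whether to transmit immediately or to buffer the frame and flag the station in the TIM.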
Hey, look, it’s lots of atheros NICs in one laptop
ath0: AR9280 mac 128.2 RF5133 phy 13.0
[100309] ath1:
ath1: AR9380 mac 448.3 RF5110 phy 0.0
[100309] ath2:
ath2: AR5212 mac 5.9 RF5112 phy 4.3
[ 250] cur rate 20 MCS since switch: packets 1 ticks 2647581
[ 250] last sample (6 Mb) cur sample (0 ) packets sent 9
[ 250] packets since sample 9 sample tt 0
[1600] cur rate 22 MCS since switch: packets 15 ticks 2647530
[1600] last sample (21 MCS) cur sample (0 ) packets sent 6049
[1600] packets since sample 0 sample tt 532
TX Rate TXTOTAL:TXOK EWMA T/ F avg last xmit
[ 6 Mb: 250] 4:4 (100.0%) 4/ 0 760uS 2640242
[20 MCS: 250] 9:9 (100.0%) 9/ 0 440uS 2647581
[20 MCS:1600] 969:969 (100.0%) 57/ 0 572uS 2647445
[21 MCS:1600] 1517:1517 (100.0%) 74/ 0 613uS 2647557
[22 MCS:1600] 1990:1990 (100.0%) 92/ 0 529uS 2647557
[23 MCS:1600] 73986:73462 ( 99.5%) 5661/ 0 755uS 2647538
AR9380 support on FreeBSD; why it’s taken so long..
It's not yet in the -HEAD tree - I'll get to that.
Let me take you on a bit of a journey.
I started a little side project late last year - I wanted to see if I could make the AR9380 HAL from the Qualcomm Atheros mainline driver (10.x branch) work on FreeBSD. I was hoping that the HAL API hadn't drifted all that much over the years.
Why do this? Two reasons:
- I wanted to see if I could open source the HAL and have it work with FreeBSD; and
- I didn't want to take on a similar project to what ath9k had to do - which is to take the existing HAL, convert it into something Linux-upstream-compatible, then push THAT into open source.
Then there's the third point that I didn't mention above:
- I want to see how feasible it is to do snapshots from our internal codebase and push those out, rather than having to maintain a separate driver tree (sometimes based on the internal driver tree, sometimes re-implemented) and all the associated complication there.
The long version:
- There was an internal attempt to partly convert the HAL code over to a format that is Linux-upstream compatible. This involved a variety of formatting changes - function names and indentation changed. It also involved a variety of variable / method changes - eg halMciSupport became hal_mci_support. The boolean type changed - HAL_BOOL and AH_TRUE/AH_FALSE became bool, true & false. These needed to be renamed back to the HAL style before I could make it compile.
- FreeBSD stripped out the HAL_CHANNEL stuff from its HAL, replacing it with a direct reference to the net80211 type (struct ieee80211_channel.) This made things slightly tidier but it did put an external dependency on the HAL. I may end up going through the FreeBSD HAL and undoing this at some point; but it's a big job.
- A variety of APIs changed over time. Although the bulk of the APIs stayed the same, they grew parameters (eg 11n TX and RX antenna and chain configuration); the TX descriptor APIs now take a list of TX buffers rather than a single TX buffer, and other random things.
Then I realised that I wanted to do this in a git branch, so I could import future versions of the HAL into master and then merge it into my branch. That's what I did. The HAL from 10.x was in master, and my FreeBSD port lived in 'local/freebsd'.
- Probe/attach/detach;
- The reset path;
- The initial calibration path (ADC calibration, IQ calibration, NF/AGC calibration);
- The radio configuration path (ie, programming the analog section with the right frequencies, channel width, filter setup and such);
- Interrupt handling;
- ANI support;
- RX path.
The major change here required in the driver is that the RX descriptor is actually in the same memory area as the RX buffer. Ie, the first 'x' bytes of the passed in buffer is where the NIC DMAs the RX completion information to. Previous NICs have two areas for each RX frame - an RX descriptor area and an RX buffer area. Descriptors are in non-cacheable memory, so I had to teach the FIFO RX path to support descriptors in cacheable memory. I also had to teach the RX path to "skip" the 'x' bytes in order to hand the start of the data payload up to the net80211 stack. Finally, there are two RX FIFOs - one for high-priority frames (beacons, uAPSD frames, PS-POLL frames, etc) and one for low-priority frames (everything else.) I had to teach the stack about this.
Why PCI latency timers matter..
Now, the AR5416 is Atheros' first 802.11n NIC, so it has some rough edges. But I was seeing some ridiculously bad transmission failures and I couldn't pinpoint them.
Not only that, I was seeing great performance (~ 130mbit TCP) on a specific laptop (Lenovo T41p) but the Lenovo T60 and T400 both performed extremely poorly.
To make matters weirder - the NIC performed great when speaking to another NIC in the same laptop. Just not to another physically separate device.
So, after much digging, here's what I discovered.
Firstly - I used my athalq packet descriptor logging and inspection tool (that's in FreeBSD-HEAD - no custom closed source code here!) to investigate the TX frames being sent to the hardware. What I found was troubling - large numbers of frames had TX data and TX delimiter underruns.
I then discovered that my code for counting TX data / delimiter underruns was totally incorrect - it's possible to see both a data/delimiter underrun error _with_ a valid transmitting frame. What was going on was cute - the hardware would start transmitting an aggregate frame but the DMA wouldn't keep up during said transmission and half way through the frame it would underrun. This only happened at higher MCS rates.
So making shorter aggregate frames fixed it, as well as increasing the delimiter count between frames. Both had the effect of reducing the likelihood of the NIC failing to transmit a longer aggregate. But they weren't solutions.
So I went digging. What I found was pretty simple in theory: the PCI latency timer on the NIC was being set to something appropriate (0xa8) but the PCI latency timer on the cardbus PCI bridge itself was not (0x20.) So any other bus activity would cause the NIC to not get the bus and it'd miss its DMA window.
Once I manually fixed the PCI bridge latency timer to be 0xa8, everything returned to normal.
However - there's only one thing on this PCI bridge - the cardbus interface itself. That's why it's so kooky. I would've thought that I'd have to up the value on the rest of the PCI bridges up to the root complex. There's no latency timer for PCIe, so it's not a problem there. So there's likely some very subtle timing involved that's just plain broken by default in how the BIOS initialises this cardbus slot, and FreeBSD is not overriding it.
Now, if you see crappy performance on the PCI/cardbus 802.11n NICs in FreeBSD, you can check the output of 'athstats' to see if you have TX underruns of any sort. If you do, the hardware isn't meeting the DMA deadlines it needs to DMA out frames and you need to do some further digging into your system to see why.
Be careful of adding debugging, as microseconds count..
Two words: Debug Code.
Well, to be more specific - I added some debugging code that by default didn't do anything. But it was still there; it checked a debug flag and didn't log anything if it was disabled. But that would take time to execute. Since that debugging code sat _between_ the routines doing math with the RX timestamp and the nexttbtt register, it would calculate a slightly larger TSF offset.
Once I moved the debug code out from where it is and grouped all that register access and math together, the slot timing swings dropped by a few microseconds and everything went back to smooth.
Tsk. I should've known better.
At least now the TDMA code is working well on the 802.11n chips. Yes, it's still only 802.11abg rates, but it works. I've also found the PCU MISC_MODE bit that enforces that packets don't transmit outside of the burst window, and that is working quite fine with TDMA.
So, I think I can say "mission accomplished." I'll tidy up a few more things and make sure TX only occurs in one data queue (as mentioned in my previous post, they all burst independently at the moment..) and then patiently wait for someone to implement 802.11n adhoc negotiation so 802.11n MCS rates and aggregation magically begin to work. Once that's done, 802.11n TDMA will become a reality.
Getting TDMA working on 802.11n chipsets
But, if you tried bringing up TDMA on the Atheros 802.11n chips, it plain just didn't work. Lots of people gnashed teeth about it. I was knee deep in TX aggregation work at the time so I just pushed TDMA to the back of my mind.
How it works is pretty cute in itself. To set up a TX "slot", the beacon timer is used to gate the TX queues to be able to start transmitting. Then a "channel ready time" burst length is configured, which is the period of time the TX queue can transmit. Once that timer expires, no new TX is allowed to begin. Sam then slides the slave TX window along based on when it sees a beacon from the master, as everything is synchronised against that.
Luckily, someone did some initial investigation and discovered that a couple of things were very very wrong.
Firstly, when fetching the next target beacon transmission time ("TBTT"), the AR5212 era NICs returned it in TU, but the AR5416 and later returned it in TSF.
Secondly, the TSF from each RX frame on the AR5212 is only 15 bits; on the AR5416 and later it's 32 bits. The wrong logic was used when extending the RX frame timestamp from the AR5416 from 32 bits to 64 bits, and it was causing the TSF to jump all over the place.
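Both problems come down to a little arithmetic. Here's a minimal sketch, not the driver's actual code: it assumes the full TSF is read shortly after the frame was received, so the only correction needed is for the low bits having wrapped in between. The function names are made up for illustration. The one hard fact used is that a TU is 1024 microseconds and the TSF counts microseconds.

```python
TU_USEC = 1024  # one 802.11 time unit (TU) is 1024 microseconds


def tu_to_tsf(tu: int) -> int:
    """Convert a TBTT expressed in TU into TSF units (microseconds)."""
    return tu * TU_USEC


def extend_tsf(rstamp: int, tsf: int, bits: int) -> int:
    """Extend a `bits`-wide RX timestamp to a full TSF value.

    `tsf` is the current full TSF, assumed to be read *after* the frame
    arrived and less than one wrap period (2**bits microseconds) later.
    """
    mask = (1 << bits) - 1
    if (tsf & mask) < rstamp:
        # The low bits wrapped between frame reception and the TSF read,
        # so the frame belongs to the previous wrap period.
        tsf -= 1 << bits
    return (tsf & ~mask) | rstamp


# 15-bit timestamp (AR5212 style), with and without a wrap in between.
assert extend_tsf(0x0003, 0x18005, bits=15) == 0x18003
assert extend_tsf(0x7FF0, 0x18005, bits=15) == 0x17FF0
# 32-bit timestamp (AR5416 and later), with a wrap in between.
assert extend_tsf(0xFFFFFFF0, 0x200000010, bits=32) == 0x1FFFFFFF0
```

Using the 15-bit mask on a 32-bit timestamp (or the reverse) produces values that are off by whole wrap periods, which is the kind of jumping around described above.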
So with that in place, he managed to stop the NICs from spewing stuck beacons everywhere (a classic "whoa, who set up the timers wrong!" symptom) and got two 11n NICs configured in a TDMA setup. But he reported the traffic was very unstable, so he had to stop.
Fast-forward about 12 months. I've finished the TX aggregation and BAR handling; I've debugged a bunch of AP power save handling and I'm about to reimplement some things to allow me to finish off AP power save handling (legacy/ps-poll and uapsd) in a sane, correct fashion. I decide, "hey, TDMA shouldn't be that hard to fix. Hopefully there are no chip bugs, right?" So, I plug in a pair of AR5413 (pre-11n NICs) and get it up and running. Easy. Then I plug in an AR5416 as the slave node, and .. it worked. Ok, so why was he reporting such bad results?
Firstly, Sam exposed a bunch of useful TDMA stats from "athstats". Specifically, if you start tinkering with TDMA, do this:
$ athstats -i ath0 -o tdma 1
   input  output  bexmit   tdmau  tdmadj  crcerr  phyerr     TOR    rssi   noise    rate
  619817  877907   25152   25152   -4/+6     142     143       1      74     -96     24M
     492     712      20      20   -0/+7       0       0       0      74     -96     24M
     496     720      20      20   -2/+6       0       0       0      74     -96     24M
     500     723      21      21   -6/+4       0       0       0      75     -96     24M
Then:
- The "tx time" calculation needs to be aware of the 11n rate configuration, so it can calculate the guard time correctly. Right now it uses the non-11n aware rate -> duration HAL function;
- The TX path has to be rejiggled a bit to ensure _all_ traffic gets stuffed into one TX queue (well, besides beacons.) Management and higher priority traffic has to do this too. If not, then multiple TX queues can burst and they'll burst separately, blowing out the TX slot timing;
- Someone needs to get 11n adhoc working, so that 11n rates are negotiated during adhoc peer establishment. Then aggregation can just magically work at that point (the TDMA code reuses a lot of adhoc mode vap behaviour code);
- 802.11e / 802.11n delayed block-ACK support needs to be implemented;
- Then when doing TDMA, we can just burst out an aggregate or two inside the given slot time, then wait for a delayed block ACK to come back from the remote peer in the next slot time! Yes, I'd like to try and reuse the standard stuff for doing delayed block-ack rather than implementing something specific for 802.11n aggregation + TDMA.
- .. and yes, it'd be nice for this to support >2 slave terminals, but that's a bigger project.
Power save, CABQ, multicast frames, EAPOL frames and sequence numbers (or why does my Macbook Pro keep disassociating?)
So I went digging into it a bit further (and it took quite a few evenings to narrow down the cause.) Here's the story.
Firstly - hostapd kept kicking off my station. Ok, so I had to figure out why. It turns out that the group rekey would occasionally fail. When it's time to do a group rekey, hostapd will send a unicast EAPOL frame to each associated station with the new key and each station must send back another EAPOL frame, acknowledging the fact. This wasn't happening so hostapd would just disconnect my laptop.
Ok, so then I went digging to see why. After adding lots of debugging code I found that the EAPOL frames were actually making it to my Macbook Pro _AND_ it was ACKing them at the 802.11 layer. Ok, so the frame made it out there. But why the hell was it being dropped?
Now that I knew it was making it to the end node, I could eliminate a bunch of possibilities. What was left:
- Sequence number is out of order;
- CCMP IV replay counter is out of order;
- Invalid/garbled EAPOL frame contents.
Then I realised - ok, all the traffic is in TID 16 (the non-QoS TID.) That means it isn't a QoS frame but it still has a sequence number; so it is allocated one from TID 16. There's only one CCMP IV number for a transmitter (the receiver tracks a per-TID CCMP IV replay counter, but the transmitter only has one global counter.) So that immediately rings alarm bells - what if the CCMP IV sequence number isn't being allocated in a correctly locked fashion?
- My mac does go in and out of powersave - especially when it does a background scan.
- When the mac is doing 150Mbps+ of test traffic, it doesn't do background scans.
- When it's doing 100Mbps of traffic, the stack sneaks in a background scan here and there.
- Whenever it goes into background scan, it sends a "power save" to the AP..
- .. and the AP puts all multicast traffic into the CABQ instead of sending it to the destination hardware queue.
- Now, when this occurred, the EAPOL frames would go into the software queue for TID 16 and the ARP/multicast/etc traffic would go into the CABQ;
- .. but the CABQ has higher priority, so it'll be transmitted just after the beacon frame goes out, before the EAPOL frames in the software queue.
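The reordering in the last few points can be shown with a toy model. This is a sketch under two assumptions that are not spelled out above: that sequence numbers are taken from one shared non-QoS counter at the moment a frame is queued, and that everything in the CABQ goes out before anything in the software queue. The function name and the starting sequence number are arbitrary.

```python
from itertools import count


def on_air_order(arrivals):
    """Return (seqno, label) pairs in the order they would hit the air.

    `arrivals` is a list of (label, is_multicast) in the order the stack
    hands frames to the driver while the station is in power save.
    """
    seqno = count(100)          # shared non-QoS (TID 16) sequence counter
    software_q, cabq = [], []
    for label, is_multicast in arrivals:
        frame = (next(seqno), label)   # number assigned at queue time
        (cabq if is_multicast else software_q).append(frame)
    # The CABQ is sent right after the beacon, ahead of the software queue.
    return cabq + software_q


order = on_air_order([("EAPOL", False), ("ARP", True), ("mcast", True)])
# order == [(101, "ARP"), (102, "mcast"), (100, "EAPOL")]
```

The EAPOL frame was queued first and got the lowest number, but it is transmitted last, so the numbers seen on air go backwards.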
Lessons learnt from fiddling with the rate control code..
Once I had implemented filtered frames and did a little digging, I found that the rate control code was doing some relatively silly things. Lots of rates were failing quite quickly and the rate control was bouncing all over the place.
The first bug I found was that I was checking the TX descriptor completion before I had copied it over - and so I was randomly failing TX when it didn't fail. Oops.
Next, don't call the rate control code on filtered frames. They've been filtered, not transmitted. My code wasn't doing that - I'm just pointing it out to anyone else who is implementing this.
Then I looked at what was going on with rate control. I noticed that whenever the higher transmission rates failed, it took a long time for the rate control code to try and sample them again. I went and did some digging - and found it was due to a coding decision I had made about 18 months ago. I treated higher rate failures with a low EWMA success rate as successive failures. The ath_rate_sample code treats "successive failures" as "don't try to probe this for ten seconds." Now, there's a few things you need to know about 802.11n:
- The higher rates fail, often;
- The channel state changes, often;
- Don't be afraid to occasionally try those higher rates; it may actually work out better for you even under higher error rates.
- Only randomly sample a few rates lower than the current one; don't try sampling all 6, 14 or 22 rates below the high MCS rates;
- Don't treat low EWMA as "successive failures"; just let the rate control code oscillate a bit;
- Drop the EWMA decay time a bit to let the oscillation swing a little more.
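To show what the EWMA weighting does, here's a minimal sketch of a percentage-based exponentially weighted moving average. It is not the ath_rate_sample code; the function name, the `ewma_level` parameter and the values used for it are assumptions picked for illustration.

```python
def ewma_update(prev_pct, attempts, successes, ewma_level=75):
    """Blend the latest success percentage into the running average.

    `ewma_level` is the weight (0-100) given to history. A lower value
    makes the average react faster to recent results.
    """
    if attempts == 0:
        return prev_pct
    sample_pct = 100.0 * successes / attempts
    return (prev_pct * ewma_level + sample_pct * (100 - ewma_level)) / 100.0


# One bad burst (5 of 10 frames OK) against a previously perfect rate:
slow = ewma_update(100.0, attempts=10, successes=5, ewma_level=75)  # 87.5
fast = ewma_update(100.0, attempts=10, successes=5, ewma_level=50)  # 75.0
```

With less weight on history the average swings further on each update, which is the "let it oscillate a bit" behaviour the last two points ask for. It also recovers faster once the higher rate starts working again.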
Filtered frames support, or how not to spam the air with useless transmission attempts
- All the frames are filtered at the point the first fails, so you get all the subsequent attempted frames back, in the order you queued them. This makes life much easier when retransmitting them;
- The MAC now doesn't waste lots of time trying to transmit to destinations that aren't available. So if you have a bunch of UDP traffic being pushed to a dead or far away node, the airtime won't be spent trying to transmit all those frames. The first failure will filter the rest, freeing up the TX queue (and air) to transmit frames to other destinations;
- When stations go into power save mode, you may have frames already in the hardware queue for said station. You can't cancel them (easily/cleanly/fast), so instead they'll just fail to transmit (as the receiver is asleep.) Now you just get them filtered; you store them away until the station wakes up and then you retransmit them. A little extra latency (which is ok for some things and not others!) but much, much lower packet loss.
Secondly, if a station is just far away or under interference, you'll end up filtering a lot of traffic to it. So a lot of frames will cycle through the filtered frames handling code. Right now in FreeBSD I'm treating them the same as normal software retransmissions and dropping them after 10 attempts. I have a feeling I need to fix that logic a bit as under heavy dropping conditions, the traffic is being prematurely filtered and prematurely dropped (especially when the node is going off-channel to do things like background scans.) So the retransmission and frame expiry is tricky. You can't just keep trying them forever as you'll just end up wasting TX air time and CPU time transmitting lots of frames that will just end up being immediately filtered. Yes, tricky.
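Here's a minimal sketch of that retry-and-expire logic. The cap of 10 attempts comes from the text; the `Frame` class, the holding queue and the function name are hypothetical, and the real driver's data structures will differ.

```python
from collections import deque

MAX_ATTEMPTS = 10  # frames are dropped after this many attempts


class Frame:
    def __init__(self, seqno, attempts=0):
        self.seqno = seqno
        self.attempts = attempts


def handle_filtered(filtered, holding_q):
    """Requeue filtered frames, dropping those that ran out of attempts.

    `filtered` holds the frames the hardware handed back, in the order
    they were originally queued. Survivors go to the *front* of
    `holding_q`, still in order, so they are retransmitted before any
    newer frames. Returns the list of dropped frames.
    """
    kept, dropped = [], []
    for frame in filtered:
        frame.attempts += 1
        (dropped if frame.attempts >= MAX_ATTEMPTS else kept).append(frame)
    # extendleft() reverses its input, so reverse first to keep the order.
    holding_q.extendleft(reversed(kept))
    return dropped


holding = deque([Frame(30)])
gone = handle_filtered([Frame(10), Frame(11, attempts=9), Frame(12)], holding)
# holding now holds seqnos 10, 12, 30; frame 11 was dropped.
```

Counting a filtered frame as a full attempt is exactly the part that may be too aggressive, since a frame filtered during a background scan never really had a chance to be received.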
Reading rate control information from userland..
- There's a new ioctl to ath(4) to query the rate control module for a single associated MAC address (or the BSS MAC when running as a STA);
- Since the rate control is currently done at the driver level rather than at the VAP level, the call is to the driver rather than via the VAP (wlanX) interface;
- There's no easy way to get "all" station details whilst maintaining correct locking.
The last point deserves a little more explanation. I've introduced (well, _using_ now) a per-node lock when doing rate control updates. I acquire this lock when copying the rate control data out, so the snapshot is consistent.
So to fetch the state for a node, the following occurs:
- Call the net80211 layer to find an ieee80211_node for the given mac address - that involves locking the node table and getting a reference for the node (if found);
- Then locking the ath_node associated with it;
- Copy the data out;
- Unlock the ath_node;
- Decrement the ieee80211_node reference counter (which requires the node table lock.)
Now, the node table lock is only held whilst fetching the node reference. It isn't held whilst doing the actual rate control manipulation. Compare to what I'd do if I wanted to walk the node table. The net80211 API for doing this holds the node table lock whilst walking the node list. This means that I'll end up holding the node table lock whilst acquiring the ath_node lock. Now, that's fine - however, if I then decide somewhere else to try and do any ieee80211 operation whilst holding the ath_node lock, I may find myself with a lock ordering problem.
So for now the API will just support doing a single lookup for a given MAC, rather than trying to pull all of the rate control table entries down at once.
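The single-lookup sequence can be sketched in userland terms like this. It's a model of the locking pattern only: the class and function names are made up, `threading.Lock` stands in for the kernel mutexes, and none of this is the net80211 or ath(4) API.

```python
import threading


class Node:
    def __init__(self, mac, rate_stats):
        self.mac = mac
        self.refcnt = 0
        self.lock = threading.Lock()        # stands in for the ath_node lock
        self.rate_stats = dict(rate_stats)


class NodeTable:
    def __init__(self):
        self.lock = threading.Lock()        # stands in for the node table lock
        self.nodes = {}

    def add(self, node):
        with self.lock:
            self.nodes[node.mac] = node

    def find_and_ref(self, mac):
        """Look up a node and take a reference, under the table lock."""
        with self.lock:
            node = self.nodes.get(mac)
            if node is not None:
                node.refcnt += 1
            return node

    def unref(self, node):
        with self.lock:
            node.refcnt -= 1


def fetch_rate_stats(table, mac):
    """Copy out one node's rate control state, or None if it isn't found."""
    node = table.find_and_ref(mac)          # table lock taken and released
    if node is None:
        return None
    try:
        with node.lock:                     # table lock is NOT held here
            return dict(node.rate_stats)    # consistent snapshot
    finally:
        table.unref(node)                   # table lock again, node lock dropped
```

The two locks are never held at the same time, so this path can't take part in a lock order reversal. A walk over the whole table would have to nest the node lock inside the table lock, which is where the ordering concern comes from.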
Here's an example output from the command:
adrian@marilyn:~/work/freebsd/ath/head/src/tools/tools/ath/athratestats]> ./athratestats -i ath1 -m 06:16:16:03:40:d0
static_rix (-1) ratemask 0xf
[ 250] cur rate 5 Mb since switch: packets 1 ticks 43028655
[ 250] last sample (11 Mb) cur sample (0 ) packets sent 10708
[ 250] packets since sample 16 sample tt 6275
[1600] cur rate 11 Mb since switch: packets 15 ticks 43025720
[1600] last sample (5 Mb) cur sample (0 ) packets sent 2423
[1600] packets since sample 7 sample tt 12713
[ 2 Mb: 250] 9:9 (100%) (EWMA 100.0%) T 11 F 0 avg 2803 last 42176930
[ 5 Mb: 250] 3139:3139 (100%) (EWMA 100.0%) T 3273 F 0 avg 1433 last 43028656
[ 5 Mb:1600] 29:29 (100%) (EWMA 100.0%) T 39 F 0 avg 5303 last 42192044
[11 Mb: 250] 7560:7560 (100%) (EWMA 100.0%) T 7838 F 0 avg 1857 last 43026094
[11 Mb:1600] 2394:2394 (100%) (EWMA 100.0%) T 2581 F 0 avg 2919 last 43026411
Don’t let anyone tell you that FreeBSD doesn’t "do" 802.11n:
This is from my FreeBSD-HEAD 802.11n access point, currently doing ~ 130MBit/s TCP:
# athstats -i ath0
41838297 data frames received
31028383 data frames transmit
78260 short on-chip tx retries
3672 long on-chip tx retries
197 tx failed 'cuz too many retries
MCS13 current transmit rate
8834 tx failed 'cuz destination filtered
477 tx frames with no ack marked
239517 rx failed 'cuz of bad CRC
10 rx failed 'cuz of PHY err
10 OFDM restart
42043 beacons transmitted
143 periodic calibrations
-0/+0 TDMA slot adjust (usecs, smoothed)
45 rssi of last ack
51 avg recv rssi
-96 rx noise floor
812 tx frames through raw api
41664029 A-MPDU sub-frames received
42075948 Half-GI frames received
42075981 40MHz frames received
13191 CRC errors for non-last A-MPDU subframes
129 CRC errors for last subframe in an A-MPDU
2645042 Frames transmitted with HT Protection
351457 Number of frames retransmitted in software
23299 Number of frames exceeding software retry
30674735 A-MPDU sub-frame TX attempt success
374408 A-MPDU sub-frame TX attempt failures
8676 A-MPDU TX frame failures
443 listen time
6435 cumulative OFDM phy error count
161 ANI forced listen time to zero
3672 missing ACK's
78260 RTS without CTS
1469003 successful RTS
239605 bad FCS
2 average rssi (beacons only)
Antenna profile:
[0] tx 1466665 rx 1
[1] tx 0 rx 41838296
A tale of two sequence numbers, or "when QoS seqno and CCMP PN don’t match up"..
The 30 second explanation:
Imagine you allocated four frames - sequence numbers 1, 2, 3 and 4. They have to be added to the block-ack window in precisely that order. Ie:
- Starting condition: Window is at 0:63 (64 frame window, starting at 0, so ending at 63)
- Add 1: Window is now at 0:63, starting at 1
- Add 2: Window is now at 0:63, starting at 2
- Add 3: Window is now at 0:63, starting at 3.
- Allocating a sequence number, then it was dropped; then
- Adding it to the software queue.
- Starting condition: Window is at 0:63
- Add 2: Window is now 0:63, starting at 2.
- Add 1: Window is 0:63, starting at 2; 1 is outside of the BAW (it's treated as a 'wraparound', so imagine it's 4095 seqno's away) so TX stalls.
- When I was queueing frames to the software queue, I'd check how deep the hardware queue was. If the hardware queue was shallow/empty, I'd direct dispatch up to two frames to the hardware to get things 'busy'. That will (hopefully) let further frames come along in the meantime and be aggregated. However, I was queueing the new frame to the hardware rather than queueing the new frame to the tail of the queue, and queueing the head frame of the queue to the hardware. That led to some out of order behaviour.
- ath_tx_xmit_aggr() would check if the sequence number was within the block-ack window and if it wasn't, it'd queue the frame to the tail of the queue. This meant that any new frames that came along would be queued to the end of the queue, even if they had been dequeued from the head of the queue. This led to frames on the software queue being out of order.
- Frames on the software queue don't have to be in-sequence (as retries are prepended to the beginning of the list, and new frames are appended to the end) however they have to be in-order. If they end up being out of order, the BAW logic fails.
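The "1 is outside of the BAW" case falls straight out of the modulo arithmetic. Here's a minimal sketch of a block-ack window membership test, assuming the 12-bit (0..4095) 802.11 sequence number space; the function name is made up for illustration.

```python
SEQNO_SPACE = 4096  # 802.11 sequence numbers are 12 bits wide


def baw_within(start, size, seqno):
    """True if `seqno` falls inside the window [start, start + size)."""
    return ((seqno - start) % SEQNO_SPACE) < size


# In-order case: window starts at 0, frame 1 fits.
assert baw_within(0, 64, 1)

# Out-of-order case: window has already moved to 2, then frame 1 shows up.
# (1 - 2) mod 4096 is 4095, so it looks like a frame 4095 seqnos ahead.
assert (1 - 2) % SEQNO_SPACE == 4095
assert not baw_within(2, 64, 1)

# A genuine wraparound near the end of the sequence space still works.
assert baw_within(4090, 64, 5)
```

Because "behind the window" and "far ahead of the window" are indistinguishable in modulo arithmetic, a frame that arrives after the window has moved past it can never be added, and TX for that TID stalls.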
Fixing BAR handling and handle corner cases of things..
This exposed a very annoying problem - what if the driver runs out of ath_buf entries to schedule TX frames? Or, what if the network stack runs out of mbufs? If we need to allocate an ath_buf/mbuf to send a BAR frame, but they're all allocated and unavailable, the driver/wireless stack will come to a grinding halt. Typically these allocated ath_buf's are allocated in the software queue, waiting for the BAR TX (or power-save wakeup) to send a frame.
So, I haven't fixed this. It's on my (very) short term to-do list. But it did expose some issues in how the net80211 BAR send code (ieee80211_send_bar()) works. In short - it didn't handle resource allocation failures at all. It worked fine if the driver send method (ic->ic_raw_xmit()) succeeded and just failed to TX the frame. But if it couldn't allocate an mbuf, or if the driver send method failed.. things just stopped. And when the BAR TX just stopped, the ath(4) software TX queue would just keep buffering frames, right until all the TX ath_buf entries were consumed.
This is obviously .. sub-optimal.
But this raises an interesting point - how much of your kernel and/or userland application handles resource shortages correctly? I've seen plenty of userland software just not check whether malloc() returned NULL and I've seen some that specifically terminate (non-gracefully) if malloc()/calloc() fails - Squid does this. But what about your network stack? How's it handle mbuf shortages? What about the driver stack? What about net80211 (ew) ? What if the kernel malloc() API has to sleep because there's no free memory available?
I don't (currently) have an answer - it's a difficult, cross-discipline problem. What I -can- do though (at least in my corner of the FreeBSD world - net80211 and ath(4)) is to start testing some of these corner cases, where I force some resource shortages and ensure that the wireless stack and driver(s) recover somewhat gracefully. 802.11n is very unforgiving if you start dropping frames involved in an active aggregation session. So it's best I try and address these sooner rather than later.
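As a starting point for that kind of testing, here's a small sketch of a send path where every failure resolves to an explicit decision instead of silently stopping. It is not how ieee80211_send_bar() works; the function, its callbacks and the return values are all hypothetical.

```python
def try_send_bar(alloc_buf, raw_xmit, attempts_left):
    """Attempt one BAR transmission.

    Returns "sent", "retry" (the caller should re-arm a timer and try
    again later) or "give_up" (the caller should tear the session down).
    `alloc_buf` returns a buffer or None; `raw_xmit` returns True on success.
    """
    if attempts_left <= 0:
        return "give_up"
    buf = alloc_buf()
    if buf is None:
        return "retry"      # allocation failure is not fatal, try later
    if not raw_xmit(buf):
        return "retry"      # driver refused the frame, try later
    return "sent"


# Forcing a resource shortage is just a matter of swapping the allocator.
assert try_send_bar(lambda: None, lambda b: True, 3) == "retry"
assert try_send_bar(lambda: b"bar", lambda b: False, 3) == "retry"
assert try_send_bar(lambda: b"bar", lambda b: True, 3) == "sent"
assert try_send_bar(lambda: b"bar", lambda b: True, 0) == "give_up"
```

The useful property is that the caller always gets an answer it can act on, so the software TX queue is either unblocked or torn down rather than left buffering forever.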
And the winner of the most committing committer to src/sys over the last 12 months is ..
(Source: http://people.freebsd.org/~peter/commits.html)
.. but I wouldn't call myself the most important committer. Or the most active. What I'd call myself is the "most active fixing a sorely needed corner of the codebase."
What I _could_ have done is simply do all my work in a branch and then merge it back into -HEAD when I was done. And, for about 6 months, this is what I did. The "if_ath_tx" branch is where I did most of the initial TX aggregation work.
But as time goes on, your branch diverges more and more from the master branch (-HEAD in FreeBSD) and you are faced with some uncomfortable decisions.
If you stay on the same branch point and never merge in anything from your master branch, you _may_ have a stable snapshot of code, but who knows how stable (or relevant) your work will be when you merge it back into master.
You have no idea if your work will break anything in master and you have no idea if changes in master have broken your work.
As time goes on, the delta between your branch point and the master branch increases, making it even more difficult to do that final merge back. It also has the side effect of making it increasingly likely that problems will occur with the merge (your code breaking master, master breaking your code, etc.)
So as uncomfortable as it was - and as much as I wanted things to stay stable - I did press through with relatively frequent merging. This means:
- I would pick specific development targets to work towards, at which point I'd stop developing and go into a code review/tidyup/testing phase;
- I'd do frequent merges from master back into my branch during active development - I wouldn't leave this until I was ready to merge my work back into master;
- Once I reached my development target and had done sufficient testing - including integrating changes from master back into my branch - I'd kick off a semi-formal review (read: email freebsd-wireless@) and call for testers/review;
- Only _then_ would I merge what was suitable back into master.
I wouldn't merge everything from my branch into master. In my instance, there were some debugging extensions that were easy to maintain (read: lots of device_printf() calls) but weren't suitable for FreeBSD-HEAD. But I merged the majority of my work each time.
But that doesn't always work. I managed to merge a bunch of ath(4), ath_hal(4) and net80211 fixes back into -HEAD as appropriate. But the TX aggregation code was .. well, rather large. So I attempted to break up my commit into as many small, self-contained functional changes as possible. Yes, there was one big "here's software TX queue and aggregation" commit at the end, but I managed to peel off more than 30% of that in the lead-up commits.
Why bother doing that?
Two words - version bisection. Once I started having users report issues, they would report something like "FreeBSD-HEAD revision X worked, revision Y didn't." (If I were lucky, of course.) Or, they'd note that a certain snapshot from a certain day worked, but the next day had a regression. If I had committed everything as one enormous commit after having spent 6+ months on the branch, I'd be in for a whole lot of annoying line-by-line debugging of issues. Instead, I was able to narrow down most of the regressions by trying all the different commits.
Now that 802.11n ath(4) TX aggregation and general 802.11n support is in the tree, I only use branches for larger scale changes that take a couple of weeks. For example, when fixing up the reset path to not drop any TX/RX frames. I do most of the bugfixing in FreeBSD-HEAD. I could do it in a branch and then do monthly merges, but I then have the same problems I've listed above.
In summary: don't underestimate how helpful it is to break down your commits into little, piecemeal, self-contained functional changes. It has the side effect of making you look really good in the committer statistics.
The initial introduction into "it’s the NIC, stupid!"
In my case, an IBM/Lenovo Thinkpad T60 has been modified (not by me) to take an Atheros AR9280 NIC. Unfortunately, the NIC was proving to be very unstable when doing 802.11n throughput. The investigations did show I was doing something slightly incorrect with TX descriptors (and I've since fixed that) but the stability issues remained.
The Atheros NICs can expose some host interface error conditions via the AR_INTR_SYNC_CAUSE register. These include PCI(e) transaction timeouts, illegal chip access (eg whilst the MAC is asleep), parity errors, and other rather nice things. FreeBSD's HAL and Linux ath9k do have the register definitions for what the bits do - but unfortunately we don't keep statistics.
In my particular case:
- I'd see AR_INTR_SYNC_LOCAL_TIMEOUT occur. This is because a PCI(e) transaction didn't complete in time. I can tune these timeouts via a local register but that's not the point - I was seeing these errors when receiving only beacons from the access point. That's a bit silly.
- I'd also see AR_INTR_SYNC_RADM_CPL_DLLP_ABORT, which is an indication that the PCIe layer isn't behaving well.
I swapped it out with another AR9280 based NIC and suddenly all the instabilities have gone away. No TX hangs, no missed TX interrupts. Everything looks great.
So as an open source developer, I want to try and put some tools into the hands of the community to be able to debug what's going on - or, if that's not possible, at least get some indication that things are going wrong. Right now the only thing people see is "I see TX timeouts, it must be the driver/chip fault." There's too much going on to be able to conclude that.
My game plan is this:
- Implement statistics keeping for each of the SYNC interrupts and expose those via a diagnostic interface. Ben Greear has done something similar for Linux ath9k after a private email discussion. He's also seeing MAC sleep accesses, so it's quite likely we'll start finding/squishing these.
- Take the offending laptop/NIC to the office and attach it to a very expensive and fancy looking PCIe analyser. I'm hoping we'll find something really silly occurring - like lots of sleep state transitions, or a high number of parity errors.
- Try documenting this a lot better so users are able to understand what's going on when their NIC is misbehaving.
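As a rough idea of what "statistics keeping for each of the SYNC interrupts" could look like, here's a small sketch that counts how often each cause bit is seen. The two names come from the register described above, but the bit values here are placeholders I've made up for illustration (the real ones live in the HAL / ath9k register headers), and the class itself is hypothetical rather than what the driver does.

```python
from collections import Counter

# Placeholder bit assignments for illustration only - NOT the real
# hardware values.
SYNC_CAUSE_BITS = {
    0x1: "AR_INTR_SYNC_LOCAL_TIMEOUT",
    0x2: "AR_INTR_SYNC_RADM_CPL_DLLP_ABORT",
}
KNOWN_MASK = 0x1 | 0x2


class SyncCauseStats:
    """Keeps a running count of each sync cause bit that has been seen."""

    def __init__(self):
        self.counts = Counter()
        self.unknown = 0   # reads that had bits set which we can't name

    def record(self, cause):
        """Account for one read of the sync cause register."""
        for bit, name in SYNC_CAUSE_BITS.items():
            if cause & bit:
                self.counts[name] += 1
        if cause & ~KNOWN_MASK:
            self.unknown += 1


stats = SyncCauseStats()
for cause in (0x1, 0x3, 0x0, 0x8):   # made-up register reads
    stats.record(cause)
```

The counters could then be dumped via whatever diagnostic interface is handy, so a user reporting "TX timeouts" can also report whether the host interface was throwing errors at the same time.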
Concurrency in the TX path and when it all falls down..
FreeBSD on the TP-Link TL-WR1043nd!
It supports hostap mode (which is what I bet most of you want to use) and I'm currently using it at home alongside my Ubiquiti Routerstation Pro based hostap (which is what I use to test out all the other pre-11n and 11n NICs that I currently own.)
I currently get around 50mbit TCP throughput - but I leave full FreeBSD-HEAD debugging on. I'm sure I can push the unit closer to 100mbit. (Compare to the Routerstation Pro + AR9160 hostap - where I routinely get 160mbit of TCP throughput.)
What works (read: what I've tested):
- Ethernet (at least the WAN port);
- Wireless - 802.11bgn - 20/40mhz operation as well as legacy operation (and both, if that's what you need);
- Serial console - if you've soldered in one.
The firmware image stores the configuration in a 64k flash partition which is read upon boot. You can modify files in /etc and then save these to flash via "cfg_save".
What doesn't work:
- The onboard switch - so I believe the only port available at the present moment is the WAN port;
- The GPIO lines aren't being configured, so the WLAN, status and USB/QSS buttons don't function.
Further details about the hardware and how to build the software for yourself can be found here in my FreeBSD wifi development project wiki.
No, I won't (yet) be putting up firmware images for people to test. Things are changing quite rapidly and there's no easy way to reflash a unit once you've placed FreeBSD on it - you'll need to have added a serial console to the device.
FreeBSD 802.11n update: 27 November 2011.
Frames are still dropped during things like channel/operation mode changes and channel scanning (which does do a channel change.) I'll have to look into that at a later stage. If you're using this in station mode you will likely need to disable background scanning or your aggregation sessions may occasionally drop. You'll have random messages logged when frames are dropping during a flush or reset, so just check your system dmesg log for anything from the ath driver.
I'll be next working on correctly handling failed/filtered frames and then adding some transition stuff to net80211 so the TIM/ATIM bitmaps can be kept correctly up to date. This should fix some of the power saving issues that I'm sure exist.
Unfortunately transmitting BAR frames is still quite a bit off. There's a lot more tidying up that I'd like to do before I start down the path of handling BAR TX, including trying to figure out how to better handle packet transmission and reception when the NIC is off-channel (eg when doing a background channel scan.)
I also have a long list of things I'd like to do to the rate control code and all the surrounding code which sets up rates and creates aggregates. The code I ported/wrote is a little too verbose and duplicate-y for me. That likely will occur after the Christmas break.
Enjoy!
FreeBSD is now doing (even more) 802.11n..
So, it's in there, bugs and all, supporting both station and hostap mode. No, wds, adhoc, mesh and TDMA aren't currently supported (I have enough bugs to worry about for the time being, without trying to debug the other operating modes. But I'd like to.)
What works:
- TX and RX aggregation!
- The rest of the 802.11n negotiation stuff, mostly thanks to Bernhard Schmidt who fixed up a lot of the net80211 quirks.
- Lots of ANI changes which hopefully make noisy environments more stable.
What doesn't yet work:
- Interface resets cause frames to be dropped from the RX and TX queues. This messes up aggregation and causes sessions to hang. I'm fixing that up in a git branch at the moment.
- BAR TX - I'll implement BAR TX soon - it's just tricky to get right.
- Filtered frames - ie, TX failed frames from the hardware. Instead of the current method of "always try", the hardware supports failing the current and subsequent frames in a set. That way a hostap seeing a station going into power saving mode can quickly abort all TX frames to said station and then only retransmit them when the station indicates it's again awake. If I don't do this then the hardware will constantly fail a lot of frames, causing BAR frames to be TXed when they likely shouldn't be.
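The filtered frames idea in the last item can be modelled roughly like this. It's a toy sketch of the intended behaviour only, not the ath(4) code: the class, its method names and the "transmit" callback are all hypothetical, and real hardware hands the filtered frames back via TX completion rather than as a list.

```python
from collections import deque


class StationTxQueue:
    """Toy model of a hostap's per-station TX queue with power save."""

    def __init__(self):
        self.asleep = False
        self.pending = deque()   # frames held until the station wakes up

    def send(self, frame, transmit):
        """Transmit now if the station is awake, otherwise hold the frame."""
        if self.asleep:
            self.pending.append(frame)
            return False
        transmit(frame)
        return True

    def sleep(self, filtered):
        """Station went to sleep; 'filtered' are the frames the hardware
        gave back unsent. Requeue them ahead of newer frames, in order."""
        self.asleep = True
        self.pending.extendleft(reversed(filtered))

    def wake(self, transmit):
        """Station is awake again; send everything we were holding."""
        self.asleep = False
        while self.pending:
            transmit(self.pending.popleft())


sent = []
q = StationTxQueue()
q.send("f1", sent.append)    # station awake: goes straight out
q.sleep(["f2", "f3"])        # hardware filtered f2 and f3
q.send("f4", sent.append)    # held, station is asleep
q.wake(sent.append)          # f2, f3, f4 go out in their original order
```

The point is that nothing gets retried over the air while the station is asleep, so the frames neither burn retries nor trigger BAR frames, and ordering is preserved when they're eventually sent.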
But it's enough to try. So if you have an AR5416, AR9160, AR9220, AR9280, AR9285, AR9227 or AR9287, give it a whirl. If you have a pre-11n NIC then please, give it a go too. I'd like to ensure that the hardware support for earlier chipsets hasn't broken.
If you'd like to use this in production on a hostap then please keep in mind that power saving support isn't yet fully functional or complete, so stations which frequently go into power saving mode may have some performance issues. I'll tinker with this some more soon.
Finally, thank you very much to Hobnob, Inc. for sponsoring this work and Qualcomm Atheros for providing me source code, documentation and assistance in understanding how all of this works.