PC-BSD and 4K — Oh my!

I had recently seen some chatter on IRC about 4K monitor support. One of the people discussing stated that PC-BSD didn’t support 4K monitors, which was curious to me, since I hadn’t tried it yet. This week I did something about that, and ended up getting two 4K monitors and a new video card (With triple DisplayPort) to do some testing.

The Hardware

2x — ASUS PB287Q 28-Inch Screen LED-Lit 4K Monitor (Amazon)

Zotac GeForce GTX 960 2GB GDDR5 PCI Express 3.0 HDMI DVI DisplayPort (Amazon)

The Setup

The setup process was thus:

1. Remove the old monitors and video card

2. Connect the new monitors and video card

3. Go into monitor menus — Enable DisplayPort 1.2 mode

4. Boot up, selecting the “Run Display Wizard” option at the boot-loader.

5. Click “Accept”

That’s it! The display wizard properly detected both 4k monitors, and set them up in a proper dual-head configuration. Logging into KDE resulted in it detecting the new monitors, and automatically re-configuring the display to take advantage of the new resolution. At this point the system was fully usable, font-scaling even worked properly for Firefox and such.

Having 4K setup I did find a few minor issues, which are already being fixed in time for PC-BSD 10.1.2 next week, such as changing the mouse cursor to something which supports larger sizes /  resolutions better. The only real hiccup was figuring out that the monitor ships with DisplayPort set to 1.1 mode, only giving 30hz mode out of box. Changing that in the monitors menus to DisplayPort 1.2 mode, enabled 60hz, giving a silky-smooth desktop experience at 4K resolutions. Not sure if that is something typical of all 4K monitors, or just this model in particular, but if you can’t go beyond 30hz, you may want to check that in your monitor documentation.

Anyway, I hope this information is of use to anybody else who is thinking of taking the 4K plunge. Of course I would love additional feedback from other users who are running 4K, to see if there are other tweaks we can make for optimal support out of box. Let us know on the bug tracker, mailing lists or new forums!

Intel DDIO, LLC cache, buffer alignment, prefetching, shared locks and packet rates.

I've been digging into the low level behaviour of high throughput packet classification and pushing for my job. The initial suggestions from everyone was "use netmap!" Which was cool, but it only seems to to fast packet work if you're only ever really flipping packets between receive and transmit rings. Once you start actually looking into the payload, you start having to take memory misses and things can slow down quite a bit. An L3 miss (ie, RAM access) on Sandybridge is ~50ns. (There's also costs involved in walking the TLB, but I won't cover that here.)

For background: http://7-cpu.com/cpu/SandyBridge.html .

But! Intel has this magical thing called DDIO. In theory (and there's a lot of theory here), DMA is done via a small (~10%) fraction of LLC (L3) cache, which is shared between all cores. If the data is already in cache when the CPU accesses it, it will be quick. Also, if you then wish to DMA out data from something in cache, it doesn't have to get flushed to memory first - it's just DMAed straight out of cache.

However! When I was doing packet bridge testing (using netmap + bridge, 64 byte payloads), I noticed that I was doing a significant amount of memory bandwidth. It wasn't quite at the rate of 10G worth of bridged data, but DDIO should be doing almost all of that work for me at 64 byte payloads.

So, to reproduce: run netmap bridge (eg 'bridge -i netmap:ix0 -i netmap:ix1') and run pkt-gen between two nodes.

This is the output of 'pcm-memory.x 1' from the intel-pcm toolkit (which is available as a binary package on FreeBSD.)

--                   System Read Throughput(MB/s):    300.68                  --
--                  System Write Throughput(MB/s):    970.81                  --
--                 System Memory Throughput(MB/s):   1271.48                  --

The first theory - the bridging isn't occuring fast enough to service what's in LLC before it gets flushed out by other packets. So, assume:

  1. It's 1/10th of the LLC - which is 1/10th of an 8 core * 2.5MB per core setup, is ~ 2MB.
  2. 64 byte payloads are being cached.
  3. Perfect (!) LLC use.
That's 32,768 packets at a time. Now, netmap is doing ~ 1000 packets a batch and it's keeping up line rate bridging on one core (~14 million packets per second), so it's not likely that.

Ok, so what if it's not perfect LLC usage?

Then I thought back to cache line aliasing and other issues that I've previously written about. What if the buffers are perfectly aligned (say, 2048 byte aligned) - the cache line aliasing effects should also manifest themselves as low LLC utilisation.

Luckily netmap has a twiddle - 'dev.netmap.buf_size' / 'dev.netmap.priv_buf_size'. They're both .. 2048. So yes, the default buffer sizes are aligned, and there's likely some very poor LLC utilisation going on.

So, I tried 1920 - that's 2048 - (2 * 64) - ie, two cache lines less than 2048.

--                   System Read Throughput(MB/s):    104.92                  --
--                  System Write Throughput(MB/s):    382.32                  --
--                 System Memory Throughput(MB/s):    487.24                  --

It's now using significantly less memory bandwidth to do the same thing. I'm guessing this is because I'm now using the LLC much more efficiently.

Ok, so that's nice - but what about when it comes time to actually look at the packet contents to make decisions?

I've modified a copy of bridge to do a few things, mostly inspired by netmap-ipfw:
  • It does batch receive from netmap;
  • but it then looks at the ethernet header do decap that;
  • then it gets the IPv4 src/dst addresses;
  • .. and looks them up in a (very large) traditional hash table.
I also have a modified copy of pkt-gen that will use completely random source and destination IPv4 addresses and ports, so as to elicit some very terrible behaviour.

With an empty hash set, but still dereferencing the ethernet header and IPv4 source/destination, handling a packet at a time, no batching, no prefetching and only using one core/thread to run:

  • Bridges about 6.5 million pps;
  • .. maxes out the CPU core;
  • Memory access: 1000MB/sec read; 423MB/sec write (~1400MB/sec in total).
  • Bridges around 10 million pps;
  • 98% of a CPU core;
  • Memory access: 125MB/sec read, 32MB/sec write, ~ 153MB/sec in total.
So, it's a significant drop in memory throughput and a massive increase in pps for a single core.

Ok, so most of the CPU time is now spent looking at the ethernet header in the demux routine and in the hash table lookup. It's a blank hash table, so it's just the memory access needed to see if the bucket has anything in it. I'm guessing it's because the CPU is loading in the ethernet and IP header into a cache line, so it's not already there from DDIO.

I next added in prefetching the ethernet header. I don't have the code to do that, so I can't report numbers at the moment. But what I did there was I looped over everything in the netmap RX ring, dereferenced the ethernet header, and then did per-packet processing. This was interesting, but I wanted to try batching out next. So, after some significant refactoring, I arranged the code to look like this:
  1. Pull in up to 1024 entries from the netmap receive ring;
  2. Loop through, up to 16 at a time, and place them in a batch
  3. For each packet in a batch do:
    1. For each packet in the batch: optional prefetch on the ethernet header
    2. For each packet in the batch: decapsulate ethernet/IP header;
    3. For each packet in the batch: optional prefetch on the hash table bucket head;
    4. For each packet in the batch: do hash table lookup, decide whether to forward/block
    5. For each packet in the batch: forward (ie, ignore the forward/block for now.)
I had things be optional so I could turn on/off prefetching and control the batch size.

So, with an empty hash table, no prefetching and only changing the batch size, at buf_size=1920:
  • Batch size of 1: 10 million pps;
  • Batch size of 2: 11.1 million pps;
  • Batch size of 4: 11.7 million pps.
Hm, that's cute. What about with prefetching of ethernet header? At buf_size=1920:
  • Batch size of 1: 10 million pps;
  • Batch size of 2: 10.8 million pps;
  • Batch size of 4: 11.5 million pps.
Ok, so that's not that useful. Prefetching on the bucket header here isn't worthwhile, because the buckets are all empty (and thus NULL pointers.)

But, I want to also be doing hash table lookups. I loaded in a reasonably large hash table set (~ 6 million entries), and I absolutely accept that a traditional hash table is not exactly memory or cache footprint happy. I was specifically after what the performance was like for a traditional hash table. Said hash table has 524,288 buckets, and each points to an array of IPv4 addresses to search. So yes, not very optimal by any measure, but it's the kind of thing you'd expect to find in an existing project.

With no prefetching, and a 6 million entry hash table:

At 2048 byte buffers:
  • Batch size of 1: 3.7 million pps;
  • Batch size of 2: 4.5 million pps;
  • Batch size of 4: 4.8 million pps.
At 1920 byte buffers:
  • Batch size of 1: 5 million pps;
  • Batch size of 2: 5.6 million pps;
  • Batch size of 4: 5.6 million pps.
That's a very inefficient hash table - each bucket is going to have around 11 IPv4 entries in it, and that's checking almost a cache line worth of IPv4 addresses in it. Not very nice. But, it's within a cache line worth of data, so in theory it's not too terrible.

What about with prefetching? All at 1920 byte buffers:
  • Batch size of 4, ethernet prefetching: 5.5 million pps
  • Batch size of 4, hash bucket prefetching: 7.7 million pps
  • Batch size of 4, ethernet + hash bucket prefetching: 7.5 million pps
So in this instance, there's no real benefit from doing prefetching on both.

For one last test, let's bump the bucket count from 524,288 to 2,097,152. These again are all at buf_size=1920:
  • Batch size of 1, no prefetching: 6.1 million pps;
  • Batch size of 2, no prefetching: 7.1 million pps;
  • Batch size of 4, no prefetching: 7.1 million pps;
  • Batch size of 4, hash bucket prefetching: 8.9 million pps.
Now, I didn't quite predict this. I figured that since I was reading in the full cache line anyway, having up to 11 entries in it to linearly check would be cheap. It turns out that no, that's not exactly true.

The difference between the naive way (no prefetching, no batching) to 4-packet batching, hash bucket prefetching is not trivial - it's ~ 50% faster. Going all the way to a larger hash bucket was ~75% faster. Now, this hash implementation is not exactly cache footprint friendly - it's bigger than the LLC, so with random flows and thus no real useful cache behaviour it's going to degrade to quite a few memory accesses.

This has been quite a fun trip down the optimisation peephole. I'm going to spend a bunch of time writing down the hardware performance counters involved in analysing this stuff and I'll look to write a follow-up post with details about that.

One final things: threads and locking. I wanted to clearly demonstrate the cost of shared read locks on a setup like this. There's been lots of discussions about the right kind of locking and concurrency strategies, so I figured I'd just do a simple test in this setup and explain how terrible it can get.

So, no read-locks between threads on the hash table, batch size of 4, hash bucket prefetching, buf_size=1920:
  • 1 thread: 8.9 million pps;
  • 4 threads: 12 million pps.
But with a read lock on the hash table lookups:
  • 1 thread: 7 million pps;
  • 4 threads: 4.7 million pps.
I'm guessing that as I add more threads, the performance will drop.

Even taking a rwlock as a reader lock in pthreads is expensive - it's purely just an atomic increment/decrement in FreeBSD, but it's still not free. I'm getting the lock once for two hash table lookups - ie, the source and destination IP hash table lookups are done under one lock. I'm sure if I took the lock for the whole batch hash table lookup it'd work out a little better on a small number of CPU cores, but I think this demonstrates my point - read locks aren't going to cut it when you have a frequently accessed thing to protect.

The best bit about this post? The prefetching, terrible (large) hash table performance and general cache abuse is not new. Doing batching on superscalar Intel CPUs is not new. Documenting DDIO effectiveness using non-power-of-two-aligned buffer sizes is new, but it's just a rehash of the existing cache aliasing effect. But, I now have a little test bed to experiment with these things without having to try and involve the rest of a kernel.

Yes, I'll publish code soon.

ICC profiles and “startx” on FreeBSD

Beyond photography, I find that a calibrated screen is more comfortable to use, especially if you have two or more monitors with uneven panels and colors rendering.

You can buy a professional colorimeter for a not-so-small amount of money, but fortunately, there’s an open-source alternative. Richard Hughes developed ColorHug, a free open-source colorimeter that you can build yourself or order from him. I bought one myself shortly after it was released. It gives very good results, comparable to professional devices.

To calibrate the screen, FreeBSD lacks the appropriate applications in the Ports tree. The ColorHug is provided with a modified Fedora live CD with all the required programs installed and configured. Just boot your computer using this CD, plug the device and the wizard starts automatically, offerring you to calibrate all connected monitors. The process is quite straightforward and takes from a few minutes to 20-25 minutes, depending on the accuracy you want. When it’s finished, you have your shiny new profile(s) in ~/.local/share/icc. Profiles are named after monitor’s name and the date of the calibration. You need to copy those files to the same location on your FreeBSD desktop.

Now, you need to configure your computer to use the generated profile(s). GNOME, KDE and any other modern desktop environment come with display configuration tools which are easy to use. For those preferring a more lightweight environment and startx, things are a bit more complicated. Here is what I had to do.

  1. ICC profiles are managed by colord (graphics/colord). This daemon is started by dbus automatically. However, it seems that, after a fresh install, the service fails to start as reported in bug 197946 by several users of darktable. Running the service manually once “fixed” the problem for me:


  2. colord finds ICC profiles located in ~/.local/share/icc without further configuration. However, monitors are probed and managed by the X.Org server so another tool is required to declare all monitors in colord during session startup. I used xiccd, compiled from source (I need to make a port for it).

  3. Now, you need to start xiccd but PolKit and ConsoleKit get in the way as soon as xiccd tries to communicate with colord. Debugging this is where I spent most of the time… So here is the solution:

    • ConsoleKit is started by a login manager such as XDM, GDM, KDM and so on. If you don’t use one, you need to enable ConsoleKit in /etc/pam.d/login:

      # ConsoleKit.
      session  optional  /usr/local/lib/pam_ck_connector.so  nox11

    • To test if ConsoleKit works, log out and log in again on the console and run ck-list-sessions. You should get something like:

      	unix-user = '...'
      	realname = '...'
      	seat = 'Seat1'
      	session-type = ''
      	active = TRUE
      	x11-display = ''
      	x11-display-device = ''
      	display-device = '/dev/ttyv0'
      	remote-host-name = ''
      	is-local = TRUE
      	on-since = '2015-04-06T08:12:57.167178Z'
      	login-session-id = ''
      	idle-since-hint = '2015-04-06T08:13:27.293605Z'

      The “active = TRUE” line is the important one.

    • You need to start your X session with ConsoleKit. Modify your ~/.xinitrc to start your session with ck-launch-session:

      exec ck-launch-session dbus-launch --exit-with-session ~/.xstart

      dbus-launch starts dbus under the ConsoleKit session and ~/.xstart (of course, choose the name you want) is a script running all applications requiring a working dbus and ConsoleKit environment.

  4. My .xstart launches xiccd, loads ICC profiles and starts a window manager:

    set -e
    if [ -x $HOME/Projects/graphics/xiccd/install/bin/xiccd ]; then
            # xiccd, compiled from source, is present, let's start it.
            $HOME/Projects/graphics/xiccd/install/bin/xiccd &
            # It takes some time to declare all screens in colord. Wait here
            # until colord knows about the laptop's monitor.
            # To know the name of the monitors, use "colormgr get-devices".
            while ! colormgr get-devices | grep -q xrandr_LVDS_(...); do 
                    sleep 1 
            # Load the ICC profile for the laptop's monitor.
            # To list the ICC profile IDs, use "colormgr get-profiles".
            colormgr device-add-profile 
            # Do this again for each possible external screens.
            if colormgr get-devices | grep -q xrandr_EA244UHD_(...); then 
                    # NEC EA244UHD. 
                    xrandr --output LVDS --off 
                    xrandr --output HDMI-0 --mode 1920x1080 --rate 60 
                    colormgr device-add-profile 
            if colormgr get-devices | grep -q xrandr_SONY_TV_(...); then 
                    # Sony Bravia KDL-40WE5. 
                    xrandr --output LVDS --off
                    colormgr device-add-profile 
    # Start window manager.
    exec awesome

Now, after startx(1), all monitors display the same colors!

With a single monitor to manage, oyranos (graphics/oyranos) was working for me. There is no dependency to colord, ConsoleKit or PolKit and is way simpler to use. However, it didn’t want to manage my external monitor (or two monitors at a time, I don’t know). This made an opportunity to play with colord. I just didn’t expect ConsoleKit or PolKit to join the party :)

Huge Announcement for PC-BSD

Hey PC-BSDers!

Some of you may have heard the speculation surrounding the last couple weeks of development here at the PC-BSD project, and I’d like to go ahead and address those rumors in today’s post.  We know there has been a large amount of misinformation circulating so we recommend that all of our users please read the following explanation so the information is clear.

For the last couple of months we have been beating our heads against the wall trying to find a way to put us on par with the best Linux distros out there such as Ubuntu.  During a phone meeting on Monday between Kris, Dru, Ken and myself (Josh), we began tossing around ideas on how we can make that happen.  One of the ideas that was presented was making Lumina DE the only supported desktop environment in PC-BSD and by doing so focus all of our development time exclusively on it.  Ken Moore argued that if we could create a desktop environment that everyone loved, and “Unify” the user experience, that no one would have any reason ever to use anything else because it was the best.  Although we did toss the idea around for a while we thought that might be a little bit of an issue with a few of our users that like to use other crappy desktop environments.

After an hour of spirited back and forth debate, many good ideas were presented.  As we all took a moment during our video call to think about the ideas that were just discussed, I interrupted the awkward silence and told the rest of the group about an idea I had been thinking about for a few days.  “What if we could do something even better than Unify the user experience…What if we could make the system boot extremely fast”. I explained that if we could hard code a “boot shim” so to speak into the kernel, that we could be the fastest booting unix-like distribution out there.  We all looked at each other and started to realize this was the “holy grail” so to speak that we had been looking for.  After perusing the internet for about 3 and a half minutes we found out there was already a Linux solution available.

We hammered out the details and now want to present you with our new “Everything Manager” the new SystemBS-D.  Most of you are familiar with how we can port different types of Linux software and run them through our emulation layer.  Using the same emulation magic we have taken what one kid that lives on my street called “the best piece of software ever. The end.”, and ported it over to PC-BSD.  SystemBS-D not only makes your system boot faster, but it can basically manage everything on your system as well.  Sure it still crashes a lot and it has trouble displaying log files, but we’re working on that.  In a discussion with Kris Moore he stated “SystemBS-D may be unstable, but it DID make the system boot super fast.  I feel like that’s a pretty good trade.”

Taking cues from other popular software companies we have also thought about integrating a really cool storefront into PC-BSD and making people look at it before they can go to their desktop.  We could also consider locking down the user’s system with “grub-lock” so no other operating systems could be installed on it… oops I meant “secure” the user’s system.

We look forward to hearing your feedback on these new developments.

P.S. if you thought any of this was real look at the date on your computer.  Happy 1st!

Best Regards,



Lumina Desktop 0.8.3 Released!

The next version of the Lumina Desktop Environment has just been released!

This is mainly a bugfix release to correct an urgent issue with the system tray on FreeBSD 11, but there are a number of other slight improvements/updates included as well. The full list of changes is included at the bottom of this announcement, but the notable changes are as follows:

  • New Panel Plugin: “Application Launcher“
    • This allows the user to pin the shortcut for an application directly to a panel.
  • New Utility: “lumina-xconfig“
    • This utility allows the user to easily enable/disable additional monitors/screens within the desktop session.
  • Fix the issue with transparent system tray icons on FreeBSD 11
  • Add support for the XDG autostart specifications.


The FreeBSD port has already been updated and this version will be included in the next set of PC-BSD package updates (“Edge” packages being created now) as well as included in the next PC-BSD 11 image for April (coming soon).


Reminder: The Lumina desktop environment is still considered to be “beta-quality”, so if you find things that either don’t work or don’t work well, please report them on the PC-BSD bug tracker so that they can get fixed as soon as possible. Feel free to also post tickets for any feature requests or improvements that you think might be useful!




Changes Since 0.8.2

FreeDesktop Standards Compliance:

  •  A number of bugs related to detecting/using XDG mimetypes were fixed.
  • Support for the XDG autostart specifications was added (usage only  — more work is still necessary to convert the current Lumina autostart spec over)
  •  Add some additional fallback routines to account for possible errors in *.desktop files.


New Utility: “lumina-xconfig”

  •  This utility is a graphical front-end to xrandr, and allows the user to easily add/remove screens from the current X session.
  • Shortcuts to this utility are available in the user button plugin, and the settings menu plugin


Insight File Manager:

  •  Add support for creating new (empty) files.
  •  Add an option for enabling/disabling the use of image thumbnails (useful if you have massive directories of pictures — just be sure you disable thumbnails *before* loading the directory).
  •  Add initial drag-and-drop support for moving files/dirs within a directory.
  •  Load the specific icon for any application shortcuts.
  •  Add the ability to view checksums of files.
  •  Add some additional checks/excludes for copy/move operations in the background to prevent the user from performing illegal operations (such as moving a directory into itself).
  •  Add support for listing statistics about the current directory on the window (number of files, total size of files, percent of the filesystem which is used).
  •  Streamline the frequency of the background directory checker — now it runs much less often.


Desktop Changes:

  •  Disable the shutdown/restart options on PC-BSD if the system is in the middle of performing updates. The system may still be shutdown/restart from within PCDM — this just adds an extra layer of safety for users.
  •  Have the shutdown/restart options use the “-o” option on FreeBSD/PC-BSD so that the system performs the action much faster.
  •  Add support for thumbnails, increasing/decreasing icon sizes, removing files, and  cut/copy files to the “desktopview” desktop plugin (the plugin which provides traditional desktop icons).
  •  Add support for increasing/decreasing the icon size for the application launcher desktop plugin.
  •  Update the icon used for the “favorites” system in the user button and the file manager.
  •  Add the ability to display alternate timezones in the system clock. This does *not* change the system time at the moment, it is just a setting for the visual clocks/plugins.
  •  Add a new panel plugin for pinning application shortcuts directly to the panel. (just like the “applauncher” desktop plugin, but on the panel).
  •  Perform the initial search for applications on the system within the session initialization. This ensure that buttons/plugins are responsive as soon as the desktop becomes visible.
  •  Fix an issue with transparent system tray icons on FreeBSD 11, and convert the system tray embed/unembed routines to use the XCB library instead of XLib.

EuroBSDCon 2015

EuroBSDCon 2015 (http://2015.eurobsdcon.org/), Stockholm University, Stockholm, Sweden 1 - 4 October, 2015. EuroBSDcon is the premier European conference on the open source BSD operating systems attracting about 250 highly skilled engineering professionals, software developers, computer science students and professors, and users from all over Europe and other parts of the world. The goal of EuroBSDcon is to exchange knowledge about the BSD operating systems, facilitate coordination and cooperation among users and developers. The dates for EuroBSDCon 2015 in Stockholm have been set to October 1-2nd for tutorials and October 3-4th for the main conference.

Using the arswitch ethernet switch on FreeBSD

I sat down a few weeks ago to make the AR8327 ethernet switch work and in doing so I wanted to add per-port and 802.1q VLAN support. It turned out that I .. didn't know as much I thought I did about the etherswitch support. So, after a whole bunch of trial-and-error, I wrapped my head around things. This post is mostly a braindump so if I do forget I have something written down about it - at least until I turn it into a FreeBSD manpage.

There's three modes:
  • default - all ports are in the same VLAN;
  • per-port - each port can be in a VLAN 'group';
  • dot1q - each port can be in multiple VLAN groups, with 802.1q tagging going on.
The per-port VLAN group is for switches that don't have an arbitrary VLAN table - you just assign each port an ID from some low set of values (say, 16), and then the VLAN tag can either be added or not added. I think the RTL8366 switch is like this, but I'd have to check.

The dot1q VLAN is for switches that support multiple VLANs, each can have an arbitrary VLAN ID (0..4095) with optional other VLAN options (like tag-in-tag support.)

The etherswitch configuration side has a few options and they're supported by different hardware:
  • Each port has a port VLAN ID - this is the "native port" for dot1q support. I don't think it has any particular meaning in the per-port VLAN code in arswitch but I could be terribly wrong. I thought it did when I initially did the port, but the documentation is .. lacking.
  • Then there's a set of per-port flags - eg q-in-q, 802.1q tagging, etc.
  • Then there's the vlangroup - each vlangroup has a vlan ID, and then a set of port members. Each port member can be tagged or untagged.
This is where things get odd.

Firstly - the AR934x SoC switch support doesn't include VLANs. I need to add that. I'm not sure which side of the wall this falls.

The switches previous to the AR8327 support per-port and VLAN configuration, but they don't support per-port-per-VLAN tagging. Ie, you can configure 802.1q VLANs, and you can enable tagging on the port - but it tags all packets that aren't the port 'VLAN ID'.

The per-port VLAN ID seems ignored by the arswitch code - it's only used by the dot1q support.

So I think (and it hasn't yet been tested) that on the earlier switches, I can use per-port VLANs with tagging by:
  • Configuring per port vlans - "etherswitch config vlan_mode port"
  • Adding vlangroups as appropriate with membership - tag/untag doesn't matter
  • Set the CPU port up to have tagging - "etherswitch port0 addtag"
When configuring dot1q VLANs, the mode is "config vlan_mode dot1q" and the 802.1q VLAN IDs are used, but the above still holds - the port is tagged or untagged.

But on the AR8327, the VLAN map hardware actually supports enabling/disabling tagging on a per-port-per-VLAN basis. Ie, when the VLAN table is programmed with the port membership, it takes a list of both the ports and whether the ports are tagged/untagged/open/filtered. So, I don't think per-port VLAN tagging works - only dot1q tagging. Maybe I can make it work, but I haven't really sat down for long enough with the documentation to see what combinations are required.
  • Configure the hardware - "etherswitch config vlan_mode dot1q"
  • Add vlangroups as appropriate, set pvid as appropriate
  • For each vlangroup membership, the port can be tagged or untagged - eg to tag the cpu port 0, you'd use '0t' as the port member. That says "port0 is a member, and it's tagged."
I still have a whole lot more to add - the ingress/egress filters aren't configurable, the per-port vlan stuff needs to be made much more sensible and consistent - and the AR934x SoC switch needs to support VLANs. Oh, and much more documentation. But, hey, I can get the thing spitting out VLAN tags, so when it's time to setup my home network with some VLANs, i'll be sure to document what I did and share it with everyone.

bsdtalk252 – devio.us with Brian Callahan

This episode has been brought to you by jot, the utility for printing sequential or random data.  Jot first appeared in 4.2 BSD.

An interview with devio.us admin Brian Callahan.  http://devio.us is a free shell provider that runs on OpenBSD.

File Info: 18Min, 8MB.

Ogg Link: https://archive.org/download/bsdtalk252/bsdtalk252.ogg

Cache Line Aliasing #2, or "What happens when you page align everything"

After a little more digging into the Intel performance side of things, I discovered one of the big reasons for the performance drop on this particular workload: how Intel CPUs do memory reordering.

The TL;DR is this - there's some hardware inside the Intel CPUs that tracks memory ordering and cache contents - but they don't use all the address bits.

The relevant chapter in the intel optimisation guide is 3.6.8 - Capacity Limits and Aliasing in Caches. The specific thing I was hitting was in - Store Forwarding Aliasing.

Assembly/Compiler Coding Rule 56. (H impact, M generality) Avoid having a store followed by a non-dependent load with addresses that differ by a multiple of 4 KBytes. Also, lay out data or order computation to avoid having cache lines that have linear addresses that are a multiple of 64 KBytes apart in the same working set. Avoid having more than 4 cache lines that are some multiple of 2 KBytes apart in the same first-level cache working set, and avoid having more than 8 cache lines that are some multiple of 4 KBytes apart in the same first-level cache working set.

So, given this, what can be done? In this workload, a bunch of large matrices were allocated via jemalloc, which page aligns large allocations. In the default invocation of the benchmark (where the allocation padding size is 0), the memory access patterns showed a very large number of counter events on "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS" - which is the number of 64k address aliases on the Sandy Bridge Xeon processors I've been testing on. (The same occurs on Westmere, Ivy Bridge and Haswell.) As I vary the padding size, the address aliasing value drops, the memory access counters increase, and the general performance increases.

On the test boxes I have (running pmcstat -w 120 -C -p LD_BLOCKS_PARTIAL.ADDRESS_ALIAS ./himenobmtxpa M )

0 217799413 830.995025
64 18138386 1624.296713
96 8876469 1662.486298
128 19281984 1645.370750
192 18247069 1643.119908
256 18511952 1661.426341
320 19636951 1674.154119
352 19716236 1686.694053
384 19684863 1681.110499
448 18189029 1683.163673
512 19380987 1691.937818

So there's still plenty of aliasing going on at different padding offsets, however it's a very marked drop between 0 and, well, anything.

It turns out that someone's gone and done a bunch more digging into the effects of various CPU magic under the hood. The last paper in the list (Analysing Contextual Bias..) looks at Aliasing and Cache Effects and the effect of memory layout. There's some cute (and sobering!) analysis of the performance changes due to something as simple as the length of your login name in the UNIX environment. It's worth reading.

The summary? Maybe page alignment of all of your memory accesses isn't the way to go.

For further reading:

Mesa 10.x and i915 on FreeBSD 9.x

Currently, the i915 driver on FreeBSD 9.3 has no support for hardware context. This means we are stuck with Mesa 9.1 and xserver 1.14.

To solve this, the plan is to provide the i915 kernel driver as a port. xf86-video-intel is modified to automatically load this driver, instead of the i915kms.ko module shipped with FreeBSD 9.3-RELEASE.

A patch is available: we would like users running FreeBSD 9.3 on an i915-based desktop to test it with their daily workload.

When this works, we can get rid of Mesa 9.1 and move forward with Mesa 10.x and xserver 1.17+.

Unifying Mesa ports’ configure

To minimize the time to build each Mesa port and limit the dependencies, we always configured Mesa with different flags, depending on the port being compiled. For instance, graphics/libEGL was built without Gallium, so it didn’t pull LLVM.

We discovered this strategy produces a non-working Mesa:

  • libEGL is built without the “drm” platform because it is enabled only if Gallium drivers are enabled too. This prevented us from activating glamor support in the X.Org server. We thought it was something too Linux-centric and it made the server crash on FreeBSD but we were wrong: once libEGL is built with the same flags as graphics/dri, everything works out-of-the-box.
  • Likewise with Clover, the OpenCL implementation. Without Gallium drivers, pipeloader modules are not compiled. Therefore, libOpenCL.so is installed but it is non-functional.

Therefore, we are unifying all Mesa ports so they all use the same configure flags and dependencies. This means:

  • The GALLIUM option will be removed and the support for Gallium will turned on for everybody. We do this because building half of Mesa with Gallium and not the other half leads to hard to debug situations.
  • All Mesa ports will depend on LLVM, at least on x86. Fortunately graphics/dri already depends on it with default options and Mesa is useless without it.

This greatly simplifies Mesa ports. Furthermore, as said above, this will allow us to enable glamor in the X.Org server. Glamor is the only 2D acceleration API supported by recent Radeon cards where EXA was not implemented.