Category Archives: FreeBSD

The ports tree is now stage only

The ports tree is now fully staged. Only 2% has been left unstaged; those ports are marked as broken and will be removed from the ports tree if no PR to stage them is pending in Bugzilla.

I would like to thank every committer and maintainer for their work on staging!
It allowed us to convert more than 23k packages to support staging in only 11 months!

Staging is a very important step: it allows us, right now, to run quality-testing scripts on the packages (which has already allowed us to fix tons of hidden problems), and it allows us to build packages as a regular user!

It also opens the gates to new features that users have been requesting for many years:

  • flavors
  • multiple packages

Expect those features to happen in the near future.

ZFS support in libvirt

An upcoming release of libvirt, 1.2.8, which should be out in early September, will include initial support for managing ZFS volumes.

That means it's possible to boot VMs that use ZFS volumes as disks. Additionally, it allows volumes to be controlled using the libvirt API. Currently, the supported operations are:

  • list volumes in a pool
  • create and delete volumes
  • upload and download volumes

It's not possible to create and delete pools yet; I hope to implement that in the next release.

Defining a pool

Assume we have some pools and want to use one of them in libvirt:

# zpool list
NAME       SIZE  ALLOC   FREE  FRAG  EXPANDSZ  CAP  DEDUP  HEALTH  ALTROOT
filepool  1,98G  56,5K  1,98G    0%         -   0%  1.00x  ONLINE  -
test       186G  7,81G   178G    0%         -   4%  1.00x  ONLINE  -

Let's take filepool and define it in libvirt. This can be done with the following virsh command:

virsh # pool-define-as --name zfsfilepool --source-name filepool --type zfs
Pool zfsfilepool defined

virsh # pool-start zfsfilepool
Pool zfsfilepool started

virsh # pool-info zfsfilepool
Name: zfsfilepool
UUID: 5d1a33a9-d8b5-43d8-bebe-c585e9450176
State: running
Persistent: yes
Autostart: no
Capacity: 1,98 GiB
Allocation: 56,50 KiB
Available: 1,98 GiB

virsh #

As you can see, we specify the type of the pool, its source name (as seen in the zpool list output), and a name for it in libvirt. We also need to start it using the pool-start command.
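For reference, the same pool can be described in XML and loaded with pool-define; this is a sketch based on libvirt's usual storage pool schema, and I haven't verified the exact form the 1.2.8 release will accept:

<pool type='zfs'>
  <name>zfsfilepool</name>
  <source>
    <name>filepool</name>
  </source>
</pool>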

Managing volumes

Let's create a couple of volumes in our new pool.


virsh # vol-create-as --pool zfsfilepool --name vol1 --capacity 1G
Vol vol1 created

virsh # vol-create-as --pool zfsfilepool --name vol2 --capacity 700M
Vol vol2 created

virsh # vol-list zfsfilepool
 Name                 Path
------------------------------------------------------------------------------
 vol1                 /dev/zvol/filepool/vol1
 vol2                 /dev/zvol/filepool/vol2

virsh #

Dropping a volume is also easy:

virsh # vol-delete --pool zfsfilepool vol2
Vol vol2 deleted

Uploading and downloading data

Let's upload an image to our new volume:

virsh # vol-upload --pool zfsfilepool --vol vol1 --file /home/novel/FreeBSD-10.0-RELEASE-amd64-memstick.img 

... and download it back:

virsh # vol-download --pool zfsfilepool --vol vol1 --file /home/novel/zfsfilepool_vol1.img

Note: if you check e.g. the MD5 sums of the two files, they will differ, because the downloaded file will be the same size as the volume. However, if you trim the trailing zeros, they'll match:

$ md5 FreeBSD-10.0-RELEASE-amd64-memstick.img zfsfilepool_vol1.img 
MD5 (FreeBSD-10.0-RELEASE-amd64-memstick.img) = e8e7cbd41b80457957bd7981452ecf5c
MD5 (zfsfilepool_vol1.img) = a77c3b434b01a57ec091826f81ebbb97
$ truncate -r FreeBSD-10.0-RELEASE-amd64-memstick.img zfsfilepool_vol1.img
$ md5 FreeBSD-10.0-RELEASE-amd64-memstick.img zfsfilepool_vol1.img
MD5 (FreeBSD-10.0-RELEASE-amd64-memstick.img) = e8e7cbd41b80457957bd7981452ecf5c
MD5 (zfsfilepool_vol1.img) = e8e7cbd41b80457957bd7981452ecf5c
$

Booting a VM from a volume

Finally, we get to the most important part. To use a volume as a disk device for a VM, the 'devices' section of the domain XML should be updated with something like this:


<disk type='volume' device='disk'>
  <source pool='zfsfilepool' volume='vol1'/>
  <target dev='vdb' bus='virtio'/>
</disk>
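Assuming the snippet is saved to a file and the guest is named vm1 (both the file name and guest name here are hypothetical), it can be attached with virsh:

virsh # attach-device vm1 /path/to/disk-vol1.xml --config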

A few notes

Note #1: this code is just a few weeks old, so quite likely there are some rough edges. Feel free to report to novel%freebsd.org if you spot any problems.

Note #2: this code is FreeBSD-only for now. However, it should not be hard to make it work on Linux with zfsonlinux.org. Its developers were kind enough to add some useful missing flags to some of the CLI tools; however, these changes are not available in any released version so far. There are some more minor differences between ZFS on Linux and FreeBSD, but they should not be hard to address. I plan to get to it as soon as a new version of ZFS on Linux with the necessary flags is available. However, if you are interested in that and ready to help with testing, feel free to poke me so it can be done sooner.


Receive Side Scaling: figuring out how to handle IP fragments

The TL;DR of this is: IP fragments are annoying.

If everything was awesome and there were never IP fragments, all TCP and UDP frames would always have the TCP/UDP header stamped on them, and the NIC could hash the TCP/UDP header in hardware to calculate the destination queue to receive traffic on.
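For background, the hash in question is Microsoft's Toeplitz hash: the NIC is programmed with a secret key, and each set bit of the input tuple XORs a sliding 32-bit window of that key into the result. Here's a minimal software version as an illustration (my sketch, not any driver's actual code):

#include <stddef.h>
#include <stdint.h>

/*
 * Toeplitz hash sketch: "key" is the RSS secret key (typically 40
 * bytes), "data" is the tuple to hash (eg src/dst IP + src/dst port).
 */
static uint32_t
toeplitz_hash(const uint8_t *key, size_t keylen,
    const uint8_t *data, size_t datalen)
{
	/* The window initially covers key bits 0..31. */
	uint32_t v = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
	    ((uint32_t)key[2] << 8) | key[3];
	uint32_t hash = 0;
	size_t i, b, kidx = 4;

	for (i = 0; i < datalen; i++, kidx++) {
		for (b = 0; b < 8; b++) {
			/* XOR the window in for every set input bit. */
			if (data[i] & (0x80 >> b))
				hash ^= v;
			/* Slide the window one bit along the key. */
			v <<= 1;
			if (kidx < keylen && (key[kidx] & (0x80 >> b)))
				v |= 1;
		}
	}
	return (hash);
}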

However, everything isn't awesome and there will be cases where IP frames are fragmented. When this happens, the first fragment has the IPv4 header and the TCP/UDP header, but the subsequent fragments only have the IPv4 header. That means there's not enough information in the later fragments to hash them to the same hash value (and thus hardware queue) as the first fragment - only the first has the full IPv4+TCP/UDP information.

The Intel and Chelsio NICs handle fragmented packets by hashing on the IPv4 details only. So, if it's a fragmented TCP or UDP frame, the first fragment is hashed the same way as the others - the TCP/UDP details are ignored and only the IPv4 header is hashed. This means that all the fragments in a given IP datagram will hash to the same value and thus the same queue.

But if there are a mix of fragmented and non-fragmented packets in a given flow - for example, small versus larger UDP frames - then some may be hashed via the IPv4+TCP or IPv4+UDP details and some will just be hashed via the IPv4 details. This means that packets in the same flow will end up being received in different receive queues and thus highly likely be processed out of order.

The Linux Intel driver code flipped off IPv4+UDP hashing a while ago - they hash UDP frames by their IPv4 details only and then do whatever other load balancing in the kernel they choose. I found this and updated the FreeBSD drivers to do the same. This should result in fewer out-of-order UDP frames for UDP-heavy workloads. I'm not sure about the Chelsio driver yet - when I convert it to the RSS framework it'll disable IPv4+UDP hashing if that isn't enabled at boot time. This is a good stop-gap, but it's not the whole story.

TCP is where it gets annoying. People don't want to flip off IPv4+TCP hashing, as they're convinced that TCP MSS negotiation and path-MTU discovery will prevent there being any IP-fragmented TCP frames. But, well, that's not really viable in the real world. There are too many misconfigured networks out there and IP fragmentation does occur. So this is also a problem for TCP. It means that the IPv4-fragmented TCP frames in those sessions will come in on another receive queue and CPU, and this will show up as out-of-order data.

So, what's this all have to do with receive side scaling?

With RSS, there's a well defined hash for packets and a configuration for what the operating system and NICs are supposed to be doing. It's entirely possible that we'll configure IPv4+TCP to be hashed and also entirely possible we'll see IP fragments showing up on other CPUs. So in order to have the TCP stack run on the right CPU, the IP fragments need to be assembled on whichever CPU they're received upon and then re-injected into the correct destination queue to run on the correct CPU.

Fortunately the FreeBSD netisr scheme makes this easy.

So what I'm doing in my branch (and what will soon show up in -HEAD) is this:


  • UDP is still hashed as IPv4-only frames for now. I'll change that later to hash on IPv4+UDP and have things reinjected on the correct destination RSS bucket / netisr queue / CPU.
  • I create one netisr thread, pinned to a CPU, for each RSS CPU that's defined.
    • Ideally I'd create one netisr thread for each RSS bucket and pin that, but that'll come later.
  • IP fragments will be hashed to whatever the IPv4 hash calculates, so fragment reassembly will occur on some CPU;
    • .. and it's the same CPU for all frames in a fragmented datagram.
  • Then when the fragment is reassembled, a software hash is calculated for the newly reassembled frame.
    • If RSS is configured to hash for IPv4 only, then it'll see that the hash on the reassembled datagram matches the configured hash for that packet type and reuse it.
    • So, if it's UDP right now, it'll see that UDP is only hashing on IPv4 details and reuse it.
    • .. but if IPv4+UDP hashing is configured, it'll software hash the packet and assign the new flow type and RSS hash.
  • Then, it'll reinject the frame into netisr to be requeued and reprocessed.
  • .. this uses the nh_m2cpuid function to calculate the destination CPU for the given RSS hash.
    • If it's handled on the same destination CPU then it'll be handled.
    • If it's handled on a different destination CPU then it'll be queued to that netisr and dispatched appropriately.

This works. It's not great, and I'd rather the IP fragment reassembly code were much more efficient, but it's correct. I'm going for correctness here to begin with.
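To make the reinjection step above concrete, here's a rough sketch against the FreeBSD RSS and netisr KPIs; treat it as an illustration of the flow, not the code that will land in -HEAD:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/mbuf.h>
#include <net/netisr.h>
#include <net/rss_config.h>
#include <netinet/in_rss.h>

/*
 * After IP reassembly completes, recompute the software RSS hash over
 * the now-complete headers, restamp the mbuf, and requeue it via
 * netisr so it's processed on the CPU owning its RSS bucket.
 */
static int
rss_reinject_reassembled(struct mbuf *m)
{
	uint32_t hash, hashtype;

	if (rss_mbuf_software_hash_v4(m, RSS_HASH_PKT_INGRESS,
	    &hash, &hashtype) < 0)
		return (EINVAL);

	/* Stamp the recalculated flow information on the mbuf. */
	m->m_pkthdr.flowid = hash;
	M_HASHTYPE_SET(m, hashtype);

	/* netisr maps the hash to a destination CPU (nh_m2cpuid). */
	return (netisr_queue(NETISR_IP, m));
}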

Now, before you ask - yes, IPv6 has fragments and yes, I have to do the same thing for IPv6 flows. Most of the code is written.

Finally - the same thing applies to things like IPv4 tunnels, IPv6-in-IPv4 tunnels, IPSEC tunnels and the like. The NIC hashes the packets on the IPv4 header details but once the packet is de-encapsulated, it needs to be reinjected back into the correct CPU for further processing.

Using the xdev target with qemu-user-static on #FreeBSD

I’ve been playing with building ports for ARM on an AMD64 machine via a bunch of tools.  The duct tape and baling wire is a bit thick with this method, but if you keep at it, this should work.

1. build armv6 chroot:
make buildworld TARGET=arm TARGET_ARCH=armv6
make installworld TARGET=arm TARGET_ARCH=armv6 DESTDIR=/armv6
make distribution TARGET=arm TARGET_ARCH=armv6 DESTDIR=/armv6

2. build xdev
make xdev TARGET=arm TARGET_ARCH=armv6 NOSHARED=y

3. move xdev into chroot
mv /usr/armv6-freebsd /armv6/usr/

4. add toolchain to make.conf:
CFLAGS+=-integrated-as
CC=/usr/armv6-freebsd/usr/bin/cc
CPP=/usr/armv6-freebsd/usr/bin/cpp
CXX=/usr/armv6-freebsd/usr/bin/c++
AS=/usr/armv6-freebsd/usr/bin/as
NM=/usr/armv6-freebsd/usr/bin/nm
LD=/usr/armv6-freebsd/usr/bin/ld
OBJCOPY=/usr/armv6-freebsd/usr/bin/objcopy
SIZE=/usr/armv6-freebsd/usr/bin/size
STRIPBIN=/usr/armv6-freebsd/usr/bin/strip
5. Install qemu-user-static from ports and copy it into the jail:
pkg install qemu-user-static
mkdir -p /armv6/usr/local/bin
cp /usr/local/bin/qemu-arm /armv6/usr/local/bin/

6. set up binmiscctl to handle armv6 translations:
binmiscctl add armv6 --interpreter "/usr/local/bin/qemu-arm" --magic "\x7f\x45\x4c\x46\x01\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x28\x00" --mask "\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff\xff" --size 20 --set-enabled
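You can confirm the handler is registered (or remove it again later) with binmiscctl's other subcommands:

binmiscctl list
binmiscctl remove armv6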

7. mount devfs and ports if needed
mount -t devfs devfs /armv6/dev
mount -t nullfs /usr/ports /armv6/usr/ports

8. chroot
chroot /armv6

pkg 1.3.0 out!

Hi all,

I’m very pleased to announce the release of pkg 1.3.0.
This version is the result of almost 9 months of hard work.

Here are the statistics for the version:
- 373 files changed, 66973 insertions(+), 38512 deletions(-)
- 29 different contributors

Please note that for the first time I’m not the main contributor, and I would
like to particularly thank Vsevolod Stakhov for all the hard work he has done to
allow us to get this release out. I would also like to give a special thanks to
Andrej Zverev for the tons of hours spent testing and cleaning the bug
tracker!

So much has happened that it is hard to summarize, so I’ll try to highlight the
major points:
- New solver: pkg now has a real SAT solver able to automatically handle
conflicts and discover them dynamically (yes, pkg set -o is deprecated now)
- pkg install is now able to install local files as well and resolve their
dependencies from the remote repositories
- Lots of parts of the code have been sandboxed
- Lots of rework to improve portability
- The package installation process has been reworked to be safer and to handle
the schg flags properly
- Important modifications of the locking system for finer-grained locks
- Massive usage of libucl
- Simplification of the API
- Lots of improvements to the UI to provide a better user experience
- Lots of improvements in multi-repository mode
- The pkg audit code has been moved into the library
- pkg -o A=B overrides a configuration option from the command line (see the
example after this list)
- The UI now supports long options
- The uniqueness of a package is no longer its origin
- Tons of bug fixes
- Tons of behaviour fixes
- Way more!
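As mentioned in the list above, -o overrides a configuration option for a single invocation; for example (the option shown is just an illustration):

pkg -o ASSUME_ALWAYS_YES=true upgrade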

Thank you to all contributors:
Alberto Villa, Alexandre Perrin, Andrej Zverev, Antoine Brodin, Brad Davis,
Bryan Drewery, Dag-Erling Smørgrav, Dmitry Marakasov, Elvira Khabirova, Jamie
Landeg Jones, Jilles Tjoelker, John Marino, Julien Laffaye, Mathieu Arnold,
Matthew Seaman, Maximilian Gaß, Michael Gehring, Michael Gmelin, Nicolas Szalay,
Rodrigo Osorio, Roman Naumann, Rui Paulo, Sean Channel, Stanislav E. Putrya,
Vsevolod Stakhov, Xin Li, coctic

Regards,
Bapt on behalf of the pkg@

Application awareness of receive side scaling (RSS) on FreeBSD

Part of testing this receive side scaling work is designing a set of APIs that allow for some kind of affinity awareness. It's not easy - the general case is difficult and highly varying. But something has to be tested! So, where should it begin?

The main tricky part of this is the difference between incoming, outgoing and listening sockets.

For incoming traffic, the NIC has already calculated the RSS hash value and there's already a map between RSS hash and destination CPU. Well, destination queue to be much more precise; then there's a CPU for that queue.

For outgoing traffic, the thread(s) in question can be scheduled on any CPU core, and as you have more cores it's increasingly unlikely to be the right one. In FreeBSD, the default is to direct dispatch transmit-related socket and protocol work in the thread that started it, save a handful of places like TCP timers. Once the driver's if_transmit() method is called to transmit a frame, it can check the mbuf to see what the flowid is and map that to a destination transmit queue. Before RSS, that was typically done to keep packets vaguely in some semblance of in-order behaviour - ie, for a given traffic flow between two endpoints (say, IP, or TCP, or UDP) the packets should be transmitted in-order. It wasn't really done for CPU affinity reasons.

Before RSS, there was no real consistency with how drivers hashed traffic upon receive, nor any rules on how it should select an outbound transmit queue for a given buffer. Most multi-queue drivers got it "mostly right". They definitely didn't try to make any CPU affinity choices - it was all done to preserve the in-order behaviour of traffic flows.

For an incoming socket, all the information about the destination CPU can be calculated from the RSS hash provided during frame reception. So, for TCP, the RSS hash for the received ACK during the three way handshake goes into the inpcb entry. For UDP it's not so simple (and the inpcb doesn't get a hash entry for UDP - I'll explain why below.)

For an outgoing socket, all the information about the eventual destination CPU isn't necessarily available. If the application knows the source/destination IP and source/destination port then it (or the kernel) can calculate the RSS hash that the hardware would calculate upon frame reception and use that to populate the inpcb. However this isn't typically known - frequently the source IP and port won't be explicitly defined and it'll be up to the kernel to choose them for the application. So, during socket creation, the destination CPU can't be known.

So to make it simple (and to make it simple for me to ensure the driver and protocol stack parts are working right) my focus has been on incoming sockets and incoming packets, rather than trying to handle outgoing sockets. I can handle outbound sockets easily enough - I just need to do a software hash calculation once all of the required information is available (ie, the source IP and port is selected) and populate the inpcb with that particular value. But I decided to not have to try and debug that at the same time as I debugged the driver side and the protocol stack side, so it's a "later" task.

For TCP, traffic for a given connection will use the same source/destination IP and source/destination port values. So for a given socket, it'll always hash to the same value. However, for UDP, it's quite possible to get UDP traffic from a variety of different source IP/ports and respond from a variety of different source/IP ports. This means that the RSS hash value that we can store in the inpcb isn't at all guaranteed to be the same for all subsequent socket writes.

Ok, so given all of that above information, how exactly is this supposed to work?

Well, the slightly more interesting and pressing problem is how to break out incoming requests/packets to multiple receive threads. In traditional UNIX socket setups, there are a couple of common design patterns for farming off incoming requests to multiple worker threads:

  • There's one thread that just does accept() (for TCP) or recv() (for UDP) and it then farms off new connections to userland worker threads; or
  • There are multiple userland worker threads which all wait on a single socket for accept() or recv() - and hope that the OS will only wake up one thread to hand work to.
It turns out that the OS may wake up one thread at a time for accept() or recv(), but then userland threads will sit in a loop trying to accept connections / packets - and you tend to find they wake up frequently only to discover that another running worker thread has already stolen the workload. Oops.

I decided this wasn't really acceptable for the RSS work. I needed a way to redirect traffic to a thread that's also pinned to the same CPU as the receive RSS bucket. I decided the cheapest way would be to allow multiple PCB entries for the same socket details (eg, multiple TCP sockets listening on *:80). Since the PCBGROUPS code in this instance has one PCB hash per RSS bucket, all I had to do was to teach the stack that wildcard listen PCB entries (eg, *:80) could also exist in each PCB hash bucket and to use those in preference to the global PCB hash.

The idea behind this decision is pretty simple - Robert Watson already did all this great work in setting up and debugging PCBGROUPS and then made the RSS work leverage that. All I'd have to do is have one userland thread in each RSS bucket and have the listen socket for that thread be in the RSS bucket. Then any incoming packet would first check the PCBGROUP that matched the RSS bucket indicated by the RSS hash from the hardware - and it'd find the "right" PCB entry in the "right" PCBGROUP PCB hash table for the "right" RSS bucket.

That's what I did for both TCP and UDP.

So the programming model is thus:

  • First, query the RSS sysctl (net.inet.rss) for the RSS configuration - this gives the number of RSS buckets and the RSS bucket -> CPU mapping.
  • Then create one worker thread per RSS bucket..
  • .. and pin each thread to the indicated CPU.
  • Next, each worker thread creates one listen socket..
  • .. sets the IP_BINDANY or IPV6_BINDANY option to indicate that there'll be multiple RSS entries bound to the given listen details (eg, binding to *:80);
  • .. then IP_RSS_LISTEN_BUCKET to set which RSS bucket the incoming socket should live in;
  • Then for UDP - call bind()
  • Or for TCP - call bind(), then call listen()
Each worker thread will then receive TCP connections / UDP frames that are local to that CPU. Writing data out the TCP socket will also stay local to that CPU. Writing UDP frames out doesn't - and I'm about to cover that.
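To make the TCP flavour of that recipe concrete, here's a hedged sketch of one worker's socket setup; the bucket count and bucket-to-CPU map come from the net.inet.rss sysctl tree, and error handling is kept minimal:

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <string.h>
#include <err.h>

/* Create a *:port listen socket pinned to one RSS bucket. */
static int
rss_listen_socket(int rss_bucket, uint16_t port)
{
	struct sockaddr_in sin;
	int fd, on = 1;

	if ((fd = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP)) < 0)
		err(1, "socket");
	/* Allow multiple wildcard listeners on the same port. */
	if (setsockopt(fd, IPPROTO_IP, IP_BINDANY, &on, sizeof(on)) < 0)
		err(1, "IP_BINDANY");
	/* Place this listen PCB in the worker's RSS bucket. */
	if (setsockopt(fd, IPPROTO_IP, IP_RSS_LISTEN_BUCKET,
	    &rss_bucket, sizeof(rss_bucket)) < 0)
		err(1, "IP_RSS_LISTEN_BUCKET");

	memset(&sin, 0, sizeof(sin));
	sin.sin_len = sizeof(sin);
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);	/* bind to *:port */
	sin.sin_port = htons(port);
	if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0)
		err(1, "bind");
	if (listen(fd, 128) < 0)
		err(1, "listen");
	return (fd);
}

For UDP, the same recipe applies minus the listen() call.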

Yes, it's annoying because now you're not just able to choose an IO model that's convenient for your application / coding style. Oops.

Ok, so what's up with UDP?

The problem with UDP is that outbound responses may go to an arbitrary destination and thus may actually be considered "local" to another CPU. Most common services don't do this - they'll send the UDP response to the same remote IP and port that the request was sent from.

My plan for UDP (and TCP in some instances, see below!) is four-fold:

  • When receiving UDP frames, optionally mark them with RSS hash and flowid information.
  • When transmitting UDP frames, allow userspace to inform the kernel about a pre-calculated RSS hash / flow information.
  • For the fully-connected setup (ie, where a single socket is connect() ed to a given UDP remote IP:port and frame exchange only occurs between the fixed IP and port details) - cache the RSS flow information in the inpcb;
  • .. and for all other situations (if it's not connected, if there's no hint from userland, if it's going to a destination that isn't in the inpcb) - just do a software hash calculation on the outgoing details.
I mostly have the first two UDP options implemented (ie, where userland caches the information to re-use when transmitting the response) and I'll commit them to FreeBSD soon. The other two options are the "correct" way to do the default methods, but they'll take some time to get right.
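To give a flavour of the receive half, here's a sketch; the option names shown (IP_RECVFLOWID, IP_FLOWID) are assumptions about the final spelling and may differ from what actually gets committed:

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

/*
 * Sketch: request flowid delivery on a UDP socket, then fish the
 * flowid out of the control messages on each received datagram.
 */
static uint32_t
recv_with_flowid(int fd, char *buf, size_t len)
{
	char cbuf[CMSG_SPACE(sizeof(uint32_t)) * 2];
	struct msghdr msg;
	struct iovec iov;
	struct cmsghdr *cmsg;
	uint32_t flowid = 0;
	int on = 1;

	/* Normally done once, at socket setup time. */
	(void)setsockopt(fd, IPPROTO_IP, IP_RECVFLOWID, &on, sizeof(on));

	memset(&msg, 0, sizeof(msg));
	iov.iov_base = buf;
	iov.iov_len = len;
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = cbuf;
	msg.msg_controllen = sizeof(cbuf);
	if (recvmsg(fd, &msg, 0) < 0)
		return (0);

	for (cmsg = CMSG_FIRSTHDR(&msg); cmsg != NULL;
	    cmsg = CMSG_NXTHDR(&msg, cmsg)) {
		if (cmsg->cmsg_level == IPPROTO_IP &&
		    cmsg->cmsg_type == IP_FLOWID)
			memcpy(&flowid, CMSG_DATA(cmsg), sizeof(flowid));
	}
	return (flowid);	/* reuse this when sending the reply */
}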

Ok, so does it work?

I don't have graphs. Mostly because I'm slack. I'll do up some before I present this - likely at BSDCan 2015.

My testing has been done with Intel 1G and 10G NICs on desktop Ivy Bridge 4-core hardware. So yes, server class hardware will behave better.

For incoming TCP workloads (eg a webserver), yes - there's no lock contention between CPUs in the NIC driver or network stack any longer. The main lock contention between CPUs is in the VM and allocator paths. If you're doing disk IO then that'll also show up.

For incoming UDP workloads, I've seen it scale linearly on 10G NICs (ixgbe(4)) from one to four cores. This is with no defragmentation and 510-byte datagrams.

Ie, 1 core reception (ie, all flows to one core) was ~ 250,000 pps into userland with just straight UDP reception and no flow/hash information via recvmsg(); 135,000 pps into userland with UDP reception and flow/hash information via recvmsg().

4 core reception was ~ 1.1 million pps into userland, roughly ~ 255,000 pps per core. There's no contention between CPU cores at all.

Unfortunately what I was sending was markedly different. The driver quite happily received 1.1 million frames on one queue and up to 2.1 million when all four queues were busy. So there's definitely room for improvement.

Now, there is lock contention - it's just not between CPU cores. Now that I'm getting past the between-core contention, we see the within-core contention.

For TCP HTTP request reception and bulk response transmission, most of the contention I'm currently seeing is between the driver transmit paths. So, the following occurs:

  • TCP stack writes some data out;
  • NIC if_transmit() method is called;
  • It tries to grab the queue lock and succeeds;
It then appends the frame to the buf_ring and schedules a transmit out the NIC. This bit is fine.

But then, whilst the transmit lock is held because the driver is taking frames from the buf_ring to push into the NIC TX DMA queue:
  • The NIC queue interrupt fires, scheduling the software interrupt thread;
  • This pre-empts the existing running transmit thread;
  • The NIC code tries to grab the transmit lock to handle completed transmissions;
  • .. and it fails, because the code it preempted holds the transmit lock already.
So there's some context switching and thrashing going on there which needs to be addressed.

Ok, what about UDP? It turns out there's some lock contention with the socket receive buffer.

The soreceive_dgram() routine grabs the socket receive buffer (SOCKBUF_LOCK()) to see if there's anything to return. If not, and if it can sleep, it'll call sbwait() that will release the lock and msleep() waiting for the protocol stack to indicate that something has been received. However, since we're receiving packets at such a very high rate, it seems that the receive protocol path contends with the socket buffer lock that is held by the userland code trying to receive a datagram. It pre-empts the user thread, tries to grab the lock and fails - and then goes to sleep until the userland code finishes with the lock. soreceive_dgram() doesn't hold the lock for very long - but I do see upwards of a million context switches a second.

To wrap up - I'm pleased with how things are going. I've found and fixed some issues with the igb(4) and ixgbe(4) drivers that were partly my fault and the traffic is now quite happily and correctly being processed in parallel. There are issues with scaling within a core that are now being exposed and I'm glad to say I'm going to ignore them for now and focus on wrapping up what I've started.

There's a bunch more to talk about and I'm going to do it in follow-up posts.
  • what I'm going to do about UDP transmit in more detail;
  • what about creating outbound connections and how applications can be structured to handle this;
  • handling IP fragments and rehashing packets to be mostly in-order - and what happens when we can't guarantee ordering with the hardware hashing UDP frames to a 4-tuple;
  • CPU hash rebalancing - what if a specific bucket gets too much CPU load for some reason;
  • randomly creating a toeplitz RSS hash key at bootup and how that should be verified;
  • multi-socket CPU and IO domain awareness;
  • .. and whatever else I'm going to stumble across whilst I'm slowly fleshing this stuff out.
I hope to get the UDP transmit side of things completed in the next couple of weeks so I can teach memcached about TCP and UDP RSS. After that, who knows!

Using qemu-user to chroot and bootstrap other architectures on #FreeBSD

My last post spawned enough feedback that I thought I would dump some notes here for those interested in building a chroot on FreeBSD that allows you to test and prototype architectures, e.g. ARMv6 on AMD64.

The FreeBSD buildsys has many targets used for many things, the two we care about here are buildworld and distribution.  We will also be changing the output architecture through the use of TARGET and TARGET_ARCH command line variables.  I’ll assume csh is your shell here, just for simplicity.  You’ll need 10stable or 11current to do this, as it requires the binary activator via binmiscctl(8) which has not appeared in a release version of FreeBSD yet.

Checkout the FreeBSD source tree somewhere, your home directory will be fine and start a buildworld.  This will take a while, so get a cup of tea and relax.

make -s -j <number of cpus on your machine> buildworld TARGET=mips TARGET_ARCH=mips64 MAKEOBJDIRPREFIX=/var/tmp

Some valid combinations of TARGET/TARGET_ARCH are:

mips:mips
mips:mips64
arm:armv6
sparc64:sparc64
powerpc:powerpc
powerpc:powerpc64
i386:i386
amd64:amd64

Once this is done, you have an installable tree in /var/tmp.  You need to be root for the next few steps, su now and execute these steps:

make -s installworld TARGET=mips TARGET_ARCH=mips64 MAKEOBJDIRPREFIX=/var/tmp DESTDIR=/opt/test

DESTDIR is where you intend on placing the installed FreeBSD system.  I chose /opt/test here only because I wanted to be FAR away from anything in my running system.  Just to be clear here, this will crush and destroy your host computer without DESTDIR set.

Next, there are some tweaks that have to be done by the buildsys, so run this command as root:

make -s distribution TARGET=mips TARGET_ARCH=mips64 MAKEOBJDIRPREFIX=/var/tmp DESTDIR=/opt/test

Now we need to install the emulator tools (QEMU) to allow us to use the chroot on our system.  I suggest using emulators/qemu-user-static for this as Juergen Lock has set it up for exactly this purpose.  It will install only the tools you need here.

Once that is installed, via pkg or ports, set up your binary activator module for the architecture of your chroot.  Use the listed options on the QEMU user mode wiki page for the architecture you want.  I know the arguments are not straightforward, but there should be examples for the target that you are looking for.

For this mips/mips64 example:

binmiscctl add mips64elf --interpreter "/usr/local/bin/qemu-mips64-static" \
    --magic "\x7f\x45\x4c\x46\x02\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x08" \
    --mask "\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff" \
    --size 20 --set-enabled

Copy the qemu binary that you set up in this step *into* the chroot environment:

mkdir -p /opt/test/usr/local/bin

cp /usr/local/bin/qemu-mips64-static /opt/test/usr/local/bin/

Mount devfs into the chroot:

mount -t devfs devfs /opt/test/dev

Want to try building ports in your chroot?  Mount the ports tree in via nullfs:

mkdir /opt/test/usr/ports

mount -t nullfs /usr/ports /opt/test/usr/ports

And now, through the QEMU and FreeBSD, you can simply chroot into the environment:

chroot /opt/test

Hopefully, you can now “do” things as though you were running on a MIPS64 or whatever architecture machine you have as a target.

arm:armv6, mips:mips and mips:mips64 are working at about 80-90% functionality.  powerpc:powerpc64 and powerpc:powerpc are still a work in progress and need more work.  sparc64:sparc64 immediately aborts and probably needs someone with an eye familiar with the architecture to give QEMU a look.  If you are interested in further development of the qemu-user targets, please see my github repo and clone away.

If you are looking to see what needs to be done, Stacey Son has kept an excellent log of open items on the FreeBSD Wiki.

DNS improvements in FreeBSD 11

Erwin Lansing just posted a summary of the DNS session at the FreeBSD DevSummit that was held in conjunction with BSDCan 2014 in May. It gives a good overview of the current state of affairs, including known bugs and plans for the future.

I’ve been working on some of these issues recently (in between $dayjob and other projects). I fixed two issues in the last 48 hours, and am working on two more.

Reverse lookups in private networks

Fixed in 11.

In its default configuration, Unbound 1.4.22 does not allow reverse lookups for private addresses (RFC 1918 and the like). NLNet backported a patch from the development version of Unbound which adds a configuration option, unblock-lan-zones, which disables this filtering. But that alone is not enough, because the reverse zones are actually signed (EDIT: the problem is more subtle than that, details in comments); Unbound will attempt to validate the reply, and will reject it because the zone is supposed to be empty. Thus, for reverse lookups to work, the reverse zones for all private address ranges must be declared as insecure:

server:
    # Unblock reverse lookups for LAN addresses
    unblock-lan-zones: yes
    domain-insecure:      10.in-addr.arpa.
    domain-insecure:     127.in-addr.arpa.
    domain-insecure: 254.169.in-addr.arpa.
    domain-insecure:  16.172.in-addr.arpa.
    # ...
    domain-insecure:  31.172.in-addr.arpa.
    domain-insecure: 168.192.in-addr.arpa.
    domain-insecure: 8.e.ip6.arpa.
    domain-insecure: 9.e.ip6.arpa.
    domain-insecure: a.e.ip6.arpa.
    domain-insecure: b.e.ip6.arpa.
    domain-insecure: d.f.ip6.arpa.

FreeBSD 11 now has both the unblock-lan-zones patch and an updated local-unbound-setup script which sets up the reverse zones. To take advantage of this, simply run the following command to regenerate your configuration:

# service local_unbound setup

This feature will also be available in FreeBSD 10.1.

Building libunbound writes to /usr/src (#190739)

Fixed in 11.

The configuration lexer and parser were included in the source tree instead of being generated at build time. Under certain circumstances, make(1) would decide that they needed to be regenerated. At best, this inserted spurious changes into the source tree; at worst, it broke the build.

Part of the reason for this is that Unbound uses preprocessor macros to place the code generated by lex(1) and yacc(1) in its own separate namespace. FreeBSD’s lex(1) is actually Flex, which has a command-line option to achieve the same effect in a much simpler manner, but to take advantage of this, the lexer needed to be cleaned up a bit.
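For illustration, Flex can apply such a prefix directly at generation time; the prefix and file names below are made up for the example, not necessarily the ones used in the FreeBSD build:

flex -P ub_c_ -o configlexer.c configlexer.lex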

Allow local domain control sockets

Work in progress

An Unbound server can be controlled by sending commands (such as “reload your configuration file”, “flush your cache”, “give me usage statistics”) over a control socket. Currently, this can only be a TCP socket. Ilya Bakulin has developed a patch, which I am currently testing, that allows Unbound to use a local domain (aka “unix”) socket instead.

Allow unauthenticated control sockets

Work in progress

If the control socket is a local domain socket instead of a TCP socket, there is no need for encryption and little need for authentication. In the local resolver case, only root on the local machine needs access, and this can be enforced by the ownership and file permissions of the socket. A second patch by Ilya Bakulin makes encryption and authentication optional so there is no need to generate and store a client certificate in order to use unbound-control(8).
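Put together, the relevant unbound.conf stanza might end up looking roughly like this; the option names are my guesses at the eventual spelling, based on the behaviour described above:

remote-control:
    control-enable: yes
    # a filesystem path instead of an address selects a local domain socket
    control-interface: /var/run/local_unbound.ctl
    # over a local socket, client certificates can be dispensed with
    control-use-cert: no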

Cross building ports with qemu-user and poudriere-devel on #FreeBSD

I’ve spent the last few months banging through the bits and pieces of the work that Stacey Son implemented for QEMU to allow us to more or less chroot into a foreign architecture as though it were a normal chroot.  This has opened up a lot of opportunities to bootstrap the non-x86 architectures on FreeBSD.

Before I get started, I’d like to thank Stacey Son, Ed Maste, Juergen Lock, Peter Wemm, Justin Hibbits, Alexander Kabaev, Baptiste Daroussin and Bryan Drewery for the group effort in getting us to the point of working ARMv6, MIPS32 and MIPS64 builds.  This has been a group effort for sure.

This will require a 10stable or 11current machine, as this uses Stacey’s binary activator patch to redirect execution of binaries through QEMU depending on the ELF header of the file.  See binmiscctl(8) for more details.

Mechanically, this is a pretty easy setup.  You’ll need to install ports-mgmt/poudriere-devel with the qemu-user option selected.  This will pull in the qemu-user code to emulate the environment we need to get things going.

I’ll pretend that you want an ARMv6 environment here.  This is suitable for building packages for the Raspberry Pi and BeagleBone Black.  Run this as root:

binmiscctl add armv6 --interpreter "/usr/local/bin/qemu-arm" \
    --magic "\x7f\x45\x4c\x46\x01\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x28\x00" \
    --mask "\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff\xff" \
    --size 20 --set-enabled

This magic will load the imgact_binmisc.ko kernel module.  The rest of the command line instructs the kernel to redirect execution through /usr/local/bin/qemu-arm if the ELF header of the file matches an ARMv6 signature.

Build your poudriere jail (remember to install poudriere-devel for now as it has not been moved to stable at the time of this writing) with the following command:

poudriere jail -c -j 11armv632 -m svn -a armv6 -v head

Once this is done, you will be able to start a package build via poudriere bulk as you normally would:

poudriere bulk -j 11armv632 -a

or

poudriere bulk -j 11armv632 -f <my_file_of_ports_to_build>

Currently, we are running the first builds in the FreeBSD project to determine what needs the most attention first.  Hopefully, soon we’ll have something that looks like a coherent package set for non-x86 architectures.

For more information, work in progress things and possible bugs in qemu-user code, see Stacey’s list of things at:

https://wiki.freebsd.org/QemuUserModeHowTo

https://wiki.freebsd.org/QemuUserModeToDo

FreeBSD Developer Summit – BSDCan 2014 – Ports and Packages WG

Baptiste Daroussin started the session with a status update on package building. All packages are now built with poudriere. The FreeBSD Foundation sponsored some large machines on which it takes around 16 hours to build a full tree. Each Wednesday at 01:00 UTC the tree is snapshotted and an incremental build is started for all supported releases, the two stable branches (9 and 10), and the quarterly branches for 9.x-RELEASE and 10.x-RELEASE. The catalogue is signed on a dedicated signing machine before upload. Packages can be downloaded from 4 mirrors (us-west, us-east, UK, and Russia) and feedback so far has been very positive.

He went on to note that ports people need better coordination with src people on ABI breakage. We currently only support i386 and amd64, with future plans for ARM and a MIPS variant. Distfiles are not currently mirrored (since fixed), and while it has seen no progress, it’s still a good idea to build a pkg of the ports tree itself.

pkg 1.3 will include a new solver, which will help 'pkg upgrade' understand that an old package needs to be replaced with a newer one, with no more need for 'pkg set' and other chicanery. Cross building ports has been added to the ports tree, but is waiting for pkg-1.3. All the dangerous operations in pkg have now been sandboxed as well.

EOL for pkg_tools has been set for September 1st. An errata notice has gone out that adds a default pkg.conf and keys to all supported branches, and nagging delays have been added to ports.

Quarterly branches based on a 3-month support cycle have been started on an evaluation basis. We’re still unsure about the manpower needed to maintain them. Every quarter a snapshot of the tree is created, and only security fixes, build and runtime fixes, and upgrades to pkg are allowed to be committed to it. Using the MFH tag in a commit message will automatically send an approval request to portmgr, and an mfh script in Tools/ makes it easy to do the merge.

Experience so far has been good, with some minor issues due to insufficient testing. MFHs should only contain the above-mentioned fixes; cleanups and other improvements should be done in separate commits to HEAD only. A policy needs to be written and announced about this. Do we want to automatically merge VuXML commits, or just remove VuXML from the branch and only use the one in HEAD?

A large number of new infrastructure changes have been introduced over the past few months, some of which require a huge migration of all ports. To speed these changes up, a new policy was set to allow some specific fixes to be committed without maintainer approval. Experience so far has been good: things actually are being fixed faster than before and not many maintainers have complained. There was agreement that the list of fixes allowed to be committed without explicit approval should be a specific whitelist published by portmgr, and not made too broad in scope.

Erwin Lansing quickly measured the temperature of the room on changing the default protocol for fetching distfiles from MASTER_SITE_BACKUP from ftp to http. Agreement all around, and erwin committed the change.

Ben Kaduk gave an introduction and update on MIT’s Athena Environment with some food for thought. While currently not FreeBSD-based, he would like to see it become so. Based on Debian/Ubuntu and rolled out on hundreds of machines, it now has its software split into about 150 different packages and metapackages.

Dag-Erling Smørgrav discussed changes to how dependencies are handled, especially splitting dependencies that are needed at install time (or staging time) and those needed at run time. This may break several things, but pkg-1.3 will come with better dependency tracking solving part of the problem.

Ed Maste presented the idea of “package transparency”, loosely based on Google’s Certificate Transparency. By logging certificate issuance to a log server, which can be publicly checked, domain owners can search for certificates issued for their domains, and notice when a certificate is issued without their authority. Can this model be extended to packages? Mostly useful for individually signed packages, while we currently only sign the catalogue. Can we do this with the current infrastructure?

Stacey Son gave an update on QEMU user mode, which is now working with QEMU 2.0.0. Both static and dynamic binaries are supported, though only a handful of system calls are supported.

Baptiste introduced the idea of having pre-/post-install scripts be a library of services, like Casper, for common actions. This reduces the ability of maintainers to perform arbitrary actions and can be sandboxed easily. This would be a huge security improvement and could also enhance performance.

Cross building is coming along quite well, and most of the tree should be buildable with a simple 'make package'. Major blockers include perl and python.

Bryan Drewery talked about a design for a PortsCI system. The idea is that a committer can easily schedule a build, be it an exp-run, reference, QAT, or other, either via a web interface or something similar to a pull request, which can fire off a build.

Steve Wills talked about using Jenkins for ports. The current system polls SVN for commits and batches several changes together for a build. It uses 8 bhyve VM instances, but is slow. Sean Bruno commented that there are several package building clusters right now - can they be unified? Also, how much hardware would be needed to speed up Jenkins? We could use Jenkins as a frontend for the system Bryan just talked about. Also, it should be able to integrate with Phabricator.

Erwin opened up the floor to talk about freebsd-version(1) once more. It was introduced as a mechanism to find out the version of the userland currently running, as uname -r only represents the kernel version and would thus miss updates of the base system that do not touch the kernel. Unfortunately, freebsd-version(1) cannot really be used like this in all cases; it may work for freebsd-update, but not in general. No real solution was found this time either.

The session ended with a discussion about packaging the base system. It’s a target for FreeBSD 11, but lots of questions are still to be answered. What granularity to use? What should be packaged into how many packages? How to handle options? Where do we put the metadata for this? How do upgrades work? How to replace shared libraries in multiuser mode? This part also included the quote of the day: “Our buildsystem is not a paragon of configurability, but a bunch of hacks that annoyed people the most.”

Thanks to all who participated in the working group, and thanks again to DK Hostmaster for sponsoring my trip to BSDCan this year, and see you at the Ports and Packages WG meet up at EuroBSDCon in Sofia in September.

FreeBSD Developer Summit – BSDCan 2014 – DNS WG

The DNS Working Group at the FreeBSD Developer Summit at BSDCan this year was off to a good start by noticing that DNSSEC validation could not work on the University of Ottawa’s wireless network. The university’s resolvers added additional records to the root zone, thus failing validation at the root. This led to some discussion on how to explain this to the user in an understandable, user-friendly way and give the user a choice of turning off validation or finding another network. This is certainly going to be a major problem when turning on validation by default, as broken resolvers are very common at hotels, coffee shops, etc.

On a more positive note, all the FreeBSD project’s zones are DNSSEC-signed and all project-owned servers have SSHFP records in the zone. Dog food was eaten.

Dag-Erling Smørgrav started off by giving an overview of the current state of affairs. ldns and unbound are imported into base in HEAD and 10.x. unbound is meant to act as a local resolver only and as it is not linked to libevent, it will not scale to anything else. For a network-wide resolver or any other configuration, it is recommended to install unbound from ports. DES further went into some of the implementation details on how the base unbound is installed to make sure it does not conflict with an unbound installed from ports.

DES explained some issues he encountered with local and RFC1918 zones which are filtered by default by unbound. Others reported no issues with the right configuration options, so more investigation is needed.

Some people reported having difficulty getting patches accepted upstream by NLNetLabs, which gave some cause for concern, as we clearly want a good and active working relationship with our DNS vendor. Others reported no problem working with NLNetLabs, quite the opposite: they are very interested to see the work going on in operating systems, so we’ll just need to build upon that relationship and make sure to invite them to the next WG meeting. Patches that are currently being worked on (DES has some code cleanups, Björn a DNS64 feature) should be submitted through the “normal” submission and review process with NLNetLabs, and we’ll see how that goes.

Erwin Lansing started the brainstorm session on future work. Some command line tools would be nice to have; drill does most things one wants, but people are too used to writing dig, and dig has many more options. Peter Wemm would like to see contrib scripts like ldns-dane, which are just really easy to use. The control socket should be a unix socket; there’s a patch floating around which should be submitted upstream.

The “Starbucks” problem came up again, with a proposal to turn on val-permissive-mode by default. Another solution may be to look at how dnssec-trigger does its magic.

After a coffee break, Peter Losher, ISC, went over some of the recent changes at ISC. BIND10 development has been handed over to a new project and ISC will concentrate on BIND9 and a stand-alone project for the DHCP component. BIND 9.10 was recently released and plans are in place for 9.11. ISC is open to suggestions and feature requests.

Peter brought up the topic of clientID, for which an IETF draft (draft-edns0-client-subnet) is available. This would help clients find the nearest CDN node, etc. ISC wants this to be opt-out in operating systems, as it peels off a layer of anonymisation, and it should be controllable by the user.

Next up was Michael Bentkofsky, Verisign, who, while not involved in the project himself, gave an introduction to the getDNS API, which is a replacement for getaddrinfo that allows the stub resolver to get validation information down at the client level. It’s available in ports. The discussion turned into more of a brainstorm on how applications should get DNS and DNSSEC information and who gets to make decisions about its security. There should be a clear separation between policy and mechanism, where application programmers should not have to worry about this; it should be a system policy. There should be a higher-level API where an application can basically ask the operating system for a “connection” and the operating system takes care of everything behind the scenes (DNS, DNSSEC, SSL, DANE, etc.) and just returns a socket, with some information on how the connection was established and which security mechanisms were used. In FreeBSD, it would make sense to let the Casper daemon hand out the different sub-tasks to ensure all lookups, cryptography, etc. are properly compartmentalised. One potential problem with passing on additional information is that all DNS lookups currently go through nsswitch, which would need to grow knowledge about that data as well. Are people still using other mechanisms for hostname lookups besides the hosts file and DNS? We can probably just remove nsswitch from the hostname lookups.

The session ended with some aims for the 11.0 release. We’ll need to have a wider discussion about the aforementioned removal of nsswitch out of the hostname lookups. We’ll also need a better understanding of what API capabilities applications may need. Can Casper provide all these? Can it run unbound behind the scenes to do all the DNS “stuff” for it? Can we capsicumize unbound and will that be accepted upstream? Enough food for thought and even more for writing code.

Thanks again to DK Hostmaster for sponsoring my trip to BSDCan this year, and see you at the DNS WG meet up at EuroBSDCon in Sofia in September.

2014Q3 Branched

The 2014Q3 branch has just been created and the package builder has been
updated to use it. This means that the next update of the quarterly
packages will be on the 2014Q3 branch.

What happened during the last 3 months:
- 177 different committers have participated
- 9918 commits happened
- diffstat says: 23646 files changed, 554070 insertions(+), 577210 deletions(-)

What does that mean for users:
- default Java is now 1.7
- massive conversion to stagedir (93% of the ports are now properly staged)
- massive improvement in the usage of libtool (which greatly reduces overlinking)
- new USES: mono, objc, drupal, gecko, cpe, gssapi, makeinfo
- new keywords for plist: @sample, @shell (see the example after this list)
- LibreOffice has been updated to 4.2.5
- Firefox has been updated to 30.0
- Firefox-esr has been updated to 24.6
- Default postgresql has moved from 9.0 to 9.2
- nginx has been updated to 1.6.0
- Default lua is 5.2
- subversion has been split into multiple ports, one per feature
- On FreeBSD 9-STABLE and 10-STABLE the default xorg is now 1.12.4 (for default
binary packages it is still 1.7.7)
- Improved QA checking in the infrastructure
- Info files are handled correctly even if base has been built WITHOUT_INFO
- Ancient emacs versions have been cleaned out

fossil in prison

Since my last post about how I host fossil, I have been asked to write about my new setup.

The jail content

I have created a minimal jail:

$ find /usr/local/jails/fossil -print
/usr/local/jails/fossil/var
/usr/local/jails/fossil/var/tmp
/usr/local/jails/fossil/libexec
/usr/local/jails/fossil/libexec/ld-elf.so.1
/usr/local/jails/fossil/bin
/usr/local/jails/fossil/bin/sh
/usr/local/jails/fossil/bin/fossil
/usr/local/jails/fossil/lib
/usr/local/jails/fossil/lib/libc.so.7
/usr/local/jails/fossil/lib/libssl.so.7
/usr/local/jails/fossil/lib/libreadline.so.8
/usr/local/jails/fossil/lib/libz.so.6
/usr/local/jails/fossil/lib/libcrypto.so.7
/usr/local/jails/fossil/lib/libncurses.so.8
/usr/local/jails/fossil/lib/libedit.so.7
/usr/local/jails/fossil/data
/usr/local/jails/fossil/dev

/bin/sh is necessary for the exec.start jail argument to work. /var/tmp is necessary so that fossil can open its temporary files (I created it with mode 1777). /data is an empty directory where the fossil files will be stored.
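
For reference, here is a rough sketch of how such a tree can be populated. This is not the exact script I used; it assumes fossil was installed from ports into /usr/local/bin and relies on ldd(1) to discover which libraries the two binaries need:

#!/bin/sh
# populate the minimal jail tree shown above
J=/usr/local/jails/fossil

mkdir -p "$J/bin" "$J/lib" "$J/libexec" "$J/data" "$J/dev" "$J/var/tmp"
chmod 1777 "$J/var/tmp"

# the two programs the jail runs, plus the runtime linker
cp /bin/sh "$J/bin/sh"
cp /usr/local/bin/fossil "$J/bin/fossil"
cp /libexec/ld-elf.so.1 "$J/libexec/"

# copy every shared library the binaries depend on into /lib
for bin in "$J/bin/sh" "$J/bin/fossil"; do
	ldd "$bin" | awk '/=>/ { print $3 }' | while read -r lib; do
		cp "$lib" "$J/lib/"
	done
done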

Jail configuration

The configuration file is the following:

fossil {
	path = "/usr/local/jails/fossil";
	host.hostname = "fossil.etoilebsd.net";
	mount.devfs;
	ip4.addr="127.0.0.1";
	exec.start = "/bin/fossil server -P 8084 --localhost --files *.json,*.html,*.js,*.css,*.txt --notfound /index.html /data &";
	exec.system_jail_user = "true";
	exec.jail_user = "www";
	exec.consolelog = "/var/log/jails/fossil.log" ;
}

More about fossil itself

In /data I created an index.html which is an almost empty HTML file with a bit of JavaScript.

When loaded, the JavaScript requests a list.txt file.

This file contains the list of repositories I want to show publicly (one per line).

For each of them, the JavaScript uses the JSON interface of fossil (meaning your fossil has to be built with JSON support) to gather the name and the description of the repository and print them on the index.
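
You can poke at that interface by hand to see what the JavaScript consumes. For example, assuming a repository file myrepo.fossil in /data (the /json/stat endpoint shown here and the exact fields it returns may vary between fossil versions):

$ curl -s http://127.0.0.1:8084/myrepo/json/stat

The returned JSON carries the project name in its payload, which is essentially all the index page needs.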

Starting/Stopping the service

Two simple commands are all that is needed to manage the service:

Starting up:

# jail -c fossil

Stopping:

# jail -r fossil

The service listens only on localhost; it is up to you to set up your reverse proxy. In my case I use nginx with the following configuration:

server {
	server_name fossil.etoilebsd.net;
	listen       [::]:443 ssl;
	listen       443 ssl;
	ssl_certificate     ssl/fossil.crt;
	ssl_certificate_key ssl/fossil.key;

	location / {
		client_max_body_size 10M;
		proxy_buffering off;
		proxy_pass http://127.0.0.1:8084/;
		proxy_set_header HTTPS on;
		proxy_set_header   Host             $host;
		proxy_set_header   X-Real-IP        $remote_addr;
		proxy_set_header   X-Forwarded-For  $proxy_add_x_forwarded_for;
	}
}

Hacking on Receive Side Scaling (RSS) on FreeBSD

RSS is a Microsoft invention that tries to keep a given TCP or UDP flow (and I think IP, but I haven't yet tried that) on a given CPU core. The idea is to try and keep both flow-local data and flow-local locking on a single CPU core, increasing the chances that data is hot in the CPU core cache and reducing the chance of lock overhead.

You can find the RSS overview and programming details here:

http://msdn.microsoft.com/en-us/library/windows/hardware/ff567236(v=vs.85).aspx

RSS and supporting technology have been making their way into FreeBSD for quite some time, but they're not yet in any real shape for application developers to take advantage of.

Firstly, there's "PCBGROUPS", which groups PCB (protocol control block) data for a connection local to a CPU. Instead of one global PCB table for the system (well, per VIMAGE for FreeBSD - each virtual image instance has its own PCB table) with one lock protecting it, there are now multiple PCB tables, one per "thing". Here, the "thing" is whatever the kernel developer thinks is worth grouping connections by.

http://www.ece.rice.edu/~willmann/pubs/paranet_usenix.pdf

Now, until the RSS work went in, this code was in FreeBSD but sat unused. A kernel developer could provide the hooks needed to map TCP (and maybe UDP later) flows to a "thing" and have that map to a PCB group table - but it required some glue to stamp incoming connections and outgoing packets with an identifier (which we call a "flowid" in FreeBSD) that can map to said "thing". Then, whenever a PCB lookup was needed, it would first try the lookup in the table selected by the mapping between the "flowid" and the "thing" - and if that succeeded, it wouldn't have to use the global PCB table to do the lookup.

This is only good for established connections - creating and destroying a connection still requires manipulating that global PCB table and the single PCB table lock. I'm going to ignore fixing that for now, as that is a bigger issue.

Then Robert Watson added the RSS work, done under contract to Juniper Networks, Inc. RSS provides one such mapping, between the flowid from the NIC and the CPU to run work on. So that part worked great - but there wasn't any way for the application user to take advantage of it. Additionally, there's no driver awareness of it yet - I'll discuss this shortly.
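
For anyone wanting to experiment, both pieces are compile-time kernel options in FreeBSD-HEAD; a minimal sketch of the kernel configuration additions (option names as used by this work):

options 	RSS
options 	PCBGROUP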

So I grabbed a bunch of this work whilst at Netflix and tried to make sense of it. It turns out that if you can keep the work local to a CPU, a lot of the lock contention in the networking stack melts away. Here's what's going on:

  • The receive thread(s) in the NIC driver processing packets are typically doing direct dispatch to the network stack - so they're running the receive side of the TCP stack;
  • .. and the receive side of the network stack includes ACKs, which triggers the transmit side of the network stack;
  • There's typically some deferred thread(s) in the NIC driver transmitting packets to each NIC queue;
  • There's also application threads trying to queue data to the TCP socket, which also can dig into the socket and TCP stack state, which involves grabbing locks;
  • And there's also timers firing to update state, and doing this involves grabbing locks.
Without RSS and without lining everything up on CPU cores, all of the above can run on different cores. Whenever any of them try to run at the same time, lock contention can occur and that particular task stalls. If the lock contention blocks the transmit or receive NIC threads, then not only is that connection affected - the whole NIC's processing is affected.

There's still lock contention in the network stack - especially if you're doing a lot of new, short connections. The good folk at Verisign are working on that particular corner of the problem so I'm happy to defer to them.

So, I ended up doing a bunch of little pieces to get this lined up right:
  • The per-CPU timer callwheels can now be optionally pinned to their CPU cores, so timer events running on CPU X actually do run on CPU X (yes, that was amusing to find..);
  • There's support in the TCP stack for per-CPU timers, but it's not enabled by default;
  • ... and it also didn't query RSS, netisr or anything to figure out how to map a flowid to a given CPU to run a timer on;
  • Then to make matters worse, incoming TCP sessions didn't have a flowid assigned to the PCB until after the first data packet was read - which meant that the initial timer work would all assume CPU 0 and any queries on that particular PCB would return flowid=0 - so it would not find it in the right PCBGROUP.
So those are fixed in FreeBSD-HEAD. The per-CPU TCP timer and pinned-CPU timers aren't enabled by default - I'll only flip that on when I'm confident that the RSS stuff is working.

So that lets all the RSS stuff work correctly. But there wasn't a nice way to query the per-connection flowid or RSS information, so I extended netstat to take 'R' as a flag - it returns the flowid and the flowid type. I'll add RSS information once I have a nice way to extract it in bulk. It's still a good diagnostic tool to ensure that the IPv4/IPv6 hashing is working correctly.
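
Using it is just a matter of adding the flag to an ordinary netstat invocation; for example, to inspect TCP connections (output omitted here):

# netstat -aRn -p tcp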

Then I had to teach a driver about RSS so I could actually test it all out. I have some igb(4) hardware at home, so I did the minimal work required to teach it about the RSS key and assigning things to the correct CPUs. It's still incomplete but it's good enough to get off the ground. I'll go into more details about the driver requirements in a follow-up blogpost.

Finally, how are application developers supposed to use it? I'll cover that particular bit in another follow-up blog post as there's quite a lot to cover there.

Playing nice with others. git(1) and patches on #FreeBSD

I’ve been spending a lot of time massaging a branch of patches and other assorted bits and pieces for QEMU user mode on github.

This led me down the path of being a good git user and contributor, so I’ll leave these notes for myself and others in the event you come into a situation where you need FreeBSD to play nice with people who are very git(1) centric.

After an update by [email protected] to the devel/git port, you can now install git(1) and have it work out of the box.  The most frustrating thing, after using git for like 5 minutes, is figuring out how to extract a patch out of it and send it all pretty-like to the mailing list(s) that will be consuming the patch.

In its simplest incarnation, you can simply reference a commit hash and use it to generate a patch via git format-patch, but this will give you a patch for every commit between the referenced revision and HEAD.  This, in my case, generated approximately 3000 patch files.

e.g. git format-patch --output-directory ~/patches --to="[email protected]" c60a1f1b2823a4937535ecb97ddf21d06cfd3d3b

What I want is the diff of a single revision, which requires a start and an end hash:

git format-patch --output-directory ~/patches --to="[email protected]" c60a1f1b2823a4937535ecb97ddf21d06cfd3d3b...c6ad44bb288c1fe85d4695b6a48d89823823552b

Now I send this to the mailing lists via my client.  Here is where I kind of head-desked a bit.  If you are like me, run a mail server yourself, and use SSL with self-signed certs, then this little bit is for you.  I lost about an hour figuring it out.

The way to send the patches from your patch directory (~/patches) is to use:

git send-email patches/*

This will use the following variables in your git environment:

sendemail.smtpserver=mail.ignoranthack.me
sendemail.smtpencryption=ssl
sendemail.smtpuser=[email protected]
sendemail.smtpserverport=465
sendemail.smtpsslcertpath=
sendemail.annotate=yes

Notice the empty “sendemail.smtpsslcertpath” variable.  Without that set to empty, git would repeatedly fail on the self-signed cert that I use, so I’m pretty sure something still isn’t set up correctly.  However, it must be set to empty and not left undefined; otherwise you will repeatedly fail with certificate validation errors.
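
These are ordinary git configuration variables, so they can be set with git config rather than by editing the file by hand; for example (substitute your own server and address):

git config --global sendemail.smtpserver mail.ignoranthack.me
git config --global sendemail.smtpencryption ssl
git config --global sendemail.smtpuser "[email protected]"
git config --global sendemail.smtpserverport 465
git config --global sendemail.smtpsslcertpath ""
git config --global sendemail.annotate yes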

Getting to know your portmgr@ – Steve Wills

It is my pleasure to introduce Steve Wills, the newest member of the portmgr team. Steve has done tremendous work on the ports tree, especially in the field of testing and quality. Here is a short interview to get to know him better.

Name
Steve Wills

Committer name
swills

Inspiration for your IRC nick
Boring, it’s my userid.

TLD of origin
.us

Current TLD (if different from above)
Same.

Occupation
Sysadmin.

When did you join portmgr@
2014

Blog
Used to have one, use twitter more now (@swills)

Inspiration for using FreeBSD
Simplicity and learning.

Who was your first contact in FreeBSD
Can’t recall, it was ages ago.

Who was your mentor(s)
pgollucci

What was your most embarrassing moment in FreeBSD
Trying to migrate Ruby default version from 1.8 to 1.9 and having to roll back.

Boxers / Briefs / other
Heh, the question assumes the survey taker is male, which I am, but I think we
need to work on diversity (not in that “hey, let’s work on diversity and get
some women” way, but more in that we make something everyone wants to use)

What is your role in your circle of friends
The FreeBSD user. ;)

vi(m) / emacs / other
vi(m)

What keeps you motivated in FreeBSD
New users, new committers.

Favourite musician/band
I listen to a decent variety of stuff, but I suppose the thing I come back to
most is NIN.

What book do you have on your bedside table
I have an iPad by my bed, which I bought to read, but mostly I browse news on
it.

coffee / tea / other
Don’t drink caffeine, so don’t drink coffee much. I do drink good beer tho.

Do you have a guilty pleasure
Good dark chocolate. :)

How would you describe yourself
Mostly standard in many ways, husband, father, FreeBSD hacker, sysadmin, in
that order.

sendmail / postfix / other
Sendmail, tho dma is nice too.

Do you have a hobby outside of FreeBSD
Used to play guitar, still have one, don’t find time to pick it up much any
more.

What is your favourite TV show
Futurama

Claim to Fame
Ported Acidwarp from DOS to svgalib.

What did you have for breakfast today
Everything bagel with plain cream cheese.

What sports team do you support
The only sport I watch is University of North Carolina Basketball.

What else do you do in the world of FreeBSD
ruby ports, perl ports sometimes

What can you tell us about yourself that most people don’t know
I was an employee at Red Hat way way back

Any parting words you want to share
Not really.

What is your .sig at the moment
Null.

Steve

Frederic Culot takes over as portmgr-secretary@

It is with great pleasure that the FreeBSD Ports Management Team announces that Frederic Culot (culot@) will take over the responsibilities of team secretary, effective immediately.

Frederic became a ports committer in October 2010, and joined the ranks of portmgr-lurkers@ in March 2014 as the shadow secretary.

Please drop him a note and congratulate him (or offer condolences).

Thomas
on behalf of portmgr@

Have you never been to BSDCan?

I remember a time when I’d never been to a conference related to my passions. Once I went, things changed. I realized that making strong working relationships with others who share my passion is important. Not only does this solidify the community of which you are a member, it also helps you personally. Every conference […]

The Short List #8: Using #lldb with a core file on #FreeBSD

I was debugging qemu this evening, and it took me a minute or two to figure out the syntax for debugging a core file with lldb.

lldb mips-bsd-user/qemu-mips -c /mipsbuild/qemu-mips.core

Make sure you have permissions to access both the binary and the core, else you get a super unhelpful error of:

error: Unable to find process plug-in for core file '/mipsbuild/qemu-mips.core'

But, after that, you can start poking around:

Core file '/mipsbuild/qemu-mips.core' (x86_64) was loaded.

Process 0 stopped
* thread #1: tid = 0, 0x00000000601816fa qemu-mips`_kill + 10, name = 'qemu-mips', stop reason = signal SIGILL
    frame #0: 0x00000000601816fa qemu-mips`_kill + 10
qemu-mips`_kill + 10:
-> 0x601816fa: jb   0x60182f5c  ; .cerror
   0x60181700: ret
   0x60181701: nop
   0x60181702: nop
(lldb) bt
* thread #1: tid = 0, 0x00000000601816fa qemu-mips`_kill + 10, name = 'qemu-mips', stop reason = signal SIGILL
  * frame #0: 0x00000000601816fa qemu-mips`_kill + 10
    frame #1: 0x000000006003753b qemu-mips`force_sig(target_sig=<unavailable>) + 283 at signal.c:352
    frame #2: 0x00000000600376dc qemu-mips`queue_signal(env=<unavailable>, sig=4, info=0x00007ffffffe8878) + 380 at signal.c:395
    frame #3: 0x0000000060035566 qemu-mips`cpu_loop [inlined] target_cpu_loop(env=<unavailable>) + 1266 at target_arch_cpu.h:239
    frame #4: 0x0000000060035074 qemu-mips`cpu_loop(env=<unavailable>) + 20 at main.c:201
    frame #5: 0x00000000600362ae qemu-mips`main(argc=1623883776, argv=0x00007fffffffd898) + 2542 at main.c:588
    frame #6: 0x000000006000030f qemu-mips`_start + 367