FreeBSD Lectures Captioning Project Complete

Murray Stokely has completed his captioning project, which was funded by the FreeBSD Foundation, and provides the following update:

A pilot project to improve the machine generated captions of technical conference lectures from the BSD Conferences YouTube channel has been completed. The 73 videos in this channel have been viewed over 200,000 times since the channel launched in late 2008, and the addition of human-edited transcripts to some of our most popular videos makes this content more accessible to people around the world.

In addition to the benefits to the hearing impaired, captions are very useful for international viewers as well as for the improved discoverability of this content by search engines. The improved quality of the English language transcripts also improves the quality of the automated translation of the captions into over 45 different languages. It is also now possible to search for words and phrases in the audio transcripts and get a link directly to videos that contain spoken content of that word or phrase.

For example, try searching for a famous line from one of Dr. Kirk McKusick's FreeBSD Kernel Internal Lectures. The above link will take you to the Google Video Search Result page where one of Dr. McKusick's lectures containing the phrase "as long as dinosaurs and mainframes" is the first result, along with a snippet of the transcript from his lecture, just as you would see the snippet from text content on a web page. A dozen of our most popular videos of FreeBSD technical content are now captioned and fully indexed, allowing users to search for very technical terms and get access to lecture material from BSD Conferences.

The captions were improved by two passes of paid human editing through Amazon Mechanical Turk.

 

 

New Director

The FreeBSD Foundation is pleased to announce that Erwin Lansing has joined the Board of Directors. For those of you who haven't met Erwin, here is his bio:

Erwin previously worked for a rapidly expanding webhosting startup and now holds a position as Network Systems Engineer at the Danish incumbent ISP, TDC. He joined the FreeBSD Ports Development Team in 2003 and has been a member of the Ports Management Team since 2005. He is mainly working on the package building cluster, creating and distributing ready-to-install binary packages of 3rd party software for FreeBSD, in addition to regression testing the integration of FreeBSD with 3rd party software projects.

 

 

Google’s new RE engine

I stumbled over Google’s new RE engine. Unfortunately it does not handle backreferences, so it is not a drop-in replacement for the regular expression code in FreeBSD. It has a POSIX mode, but this only seems to be enough for the egrep syntax. For people who need backreferences, they refer to Google Chrome’s RE engine, irregexp, which in turn references a 2007 paper titled Regular Expression Matching Can Be Simple And Fast.

The techniques in the paper cannot be applied to the irregexp engine, but they might help to speed up awk, egrep and similar programs.

I think it would be interesting to compare those recent developments to what we have in FreeBSD, and if they are faster, to see if it is possible to improve the FreeBSD implementation based upon them (either by writing new code, or by importing existing code, depending on the corresponding license and the language the code is written in).

Maybe a candidate for the GSoC?


 

 

Reflections after one year

Today marks my 1 year anniversary as a FreeBSD developer. I opened my first Problem Report in 2006 and after roughly three years of hacking on the ports system, wxs@ offered to mentor me and on March 11th, 2009 I received an email saying that the port-mgr@ team approved his request for a commit bit for me. I happened to be on vacation in Mexico when I got the email, and just like that a good day turned even better.

The first port I ever created was for mail/p5-WWW-Hotmail. I was working for an Internet startup, and I took the job simply because they were using FreeBSD and I had never had an opportunity to use FreeBSD in a production environment. One of the tasks that landed in my lap was to automate the process of checking that our newsletter was not being delivered to the spam folder of the 3 big email providers. I took a day or so to learn how to make ports and packages to make it easier for me to roll out all the perl modules I needed.

I wasn’t too happy working at that company, and after a while hacking on ports became a form of therapy for me. I got more and more involved with the FreeBSD project and eventually ended up here.

In the past year I managed to make 148 commits and introduced several new ports into the tree. As of today, there are 21,636 ports available, and it feels pretty good to be a small part of that. While I would have liked to have been able to dedicate more time, other things kept getting in the way. My 1 year resolution is to figure out a way to better manage my time and try and set aside a few hours per week to hack on ports.

I want to say thank you to wxs@, because without him, I wouldn’t be a part of this. I was the first person he mentored, and I consider myself very fortunate to have gotten the chance to work with him. He is extremely bright, very patient, and just an all-around good guy. Even today when I paint myself into a corner, I can always ask him for help and every time he has managed to guide me in the direction I wanted to go. I owe a great deal to him and consider him to be a very valuable addition to the FreeBSD developer community as a whole.

 

 

Call for Testing X.org 7.5 for FreeBSD

Howdy! We’re happy to announce that Xorg 7.5 is ready for public testing. The ATI and Intel drivers were patched to work with the new server; please report any problems to us! The drivers for Vesa, NV and NVIDIA have been tested thoroughly and seem to work fine. A note to FreeBSD 6.X users: Unfortunately you’ll have to compile gcc 4.2+ first [...]

 

 

Unit Testing Uncovers Bugs

As part of the ‘utility’ library in one of the projects we are using at work, I wrote two small wrappers around strtol() and strtoul(). These two functions support a much more useful error reporting mechanism than the plain atoi() and atol() functions, but getting the error checking right in all the places they are called is a bit boring and cumbersome. This is probably part of the reason why there are still programs out there that use atoi() and atol().

For example here’s how I usually check for errors in calls to the strtol() and strtoul() functions:

{
        char *endp;
        long x;

        endp = NULL;
        errno = 0;
        x = strtol(str, &endp, base);
        if (errno != 0 || (endp != NULL && *endp != '\0' &&
            (isdigit(*endp) != 0 || isspace(*endp) == 0))) {
                /* Return 'endp' if possible. */
                return -1;
        }
        /* At this point 'x' contains the parsed value. */
}

This is a lot of code for parsing a single long value. For one or two input strings it may be ok to repeat the code in the places where the numeric parsing code is needed. For more than a couple of input strings it really feels boring to repeat this code again and again.

When I set out to write the wrapper code for strtol() and strtoul() my goal was to make it very easy to parse input strings. A typical call to the parsing function should be a single line of code; it should be very clear if the parsing attempt succeeded or failed; it should also be possible to get both the parsing success or failure and the numeric value we just parsed; it should also be possible to get hold of the last character we managed to parse, so that strings like “100 200 300” can be parsed efficiently without having to manually find where the textual representation of the first number ends or the second one starts.

That’s quite a list of goals for a single function, but the function call style I envisioned looked something like this:

{
        long value;
        char *endp = NULL;

        if (parselong("0x12345678", &endp, 16, &value) != 0) {
                err(1, "parse error");
        }
}

The return value of parselong() makes it very clear if the parsing attempt succeeded or failed. A return value of zero means success. Any other return value means failure.

The parsed value is returned through the &value pointer. If the parsing attempt has failed, parselong() can leave the value unmodified, so that a failed attempt to parse an input string does not inflict spurious side-effects on the calling code.

If the parsing attempt has succeeded, &endp may be set to point right after the last character that was successfully parsed. This is actually part of the documented interface of strtol() and strtoul(), so it comes for free by wrapping these functions.
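As a rough illustration of why the end pointer matters, here is a minimal sketch that walks through a string like “100 200 300” with plain strtol(). The helper name parse_long_list() and its return convention are assumptions made for this example; they are not part of the utility library described here.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper (illustration only): parse up to 'max'
 * whitespace-separated decimal numbers from 'str' into 'out'.
 * Returns the number of values parsed, or -1 if a value is out of
 * range or anything other than whitespace is left over.
 */
int
parse_long_list(const char *str, long *out, int max)
{
        const char *p = str;
        char *endp;
        long x;
        int n = 0;

        while (n < max) {
                errno = 0;
                x = strtol(p, &endp, 10);
                if (endp == p)          /* no digits were consumed */
                        break;
                if (errno != 0)         /* e.g. ERANGE on overflow */
                        return -1;
                out[n++] = x;
                p = endp;               /* resume right after the last digit */
        }

        /* Only trailing whitespace may remain at this point. */
        while (isspace((unsigned char)*p))
                p++;
        return (*p == '\0') ? n : -1;
}
```

The loop never searches for the separators itself: each call to strtol() skips the leading whitespace and reports through endp where the next number may start.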

Finally, parsing a long value is a single function call. It is a lot easier to call the parsing function without having to repeat all the error checking boilerplate at each calling site. It’s even easy to “chain” multiple parsing attempts using a style similar to:

{
        long value1, value2, value3;

        if (parselong("0x12345678", NULL, 16, &value1) != 0 ||
            parselong("0xdeadbeef", NULL, 16, &value2) != 0 ||
            parselong("0xf00fc0de", NULL, 16, &value3) != 0)
                err(1, "parse error");
}

Not that this is a good style of reporting errors, but it is possible, just because it’s now easy to parse a value and check if it was parsed correctly with a single line of code.

The Unit Tests Fail on Linux

Several months passed after I wrote the initial parselong() and parseulong() functions. In the meantime I had to port the program using them to other platforms. The initial target platform was FreeBSD.

This is a bug that lurked for a few months in the initial code of parselong() until I had to port the function to another platform and started writing unit tests to verify that it works the way I expected it to work on all possible systems. In retrospect I should have started by writing the unit tests, but that’s something I can say now because I finally got around to doing it and they did serve a very useful purpose.

When I had to port my ‘utility’ functions to work on several Linux versions too, I wrote a collection of unit tests for parselong() and parseulong(). The testing framework I used was CUnit because of the way it nicely integrates with plain ANSI C code.

One of the test functions I wrote was supposed to check for failures returned by parselong() for invalid input strings. The bulk of the test function was something like this:

#include "CUnit/Basic.h"

void
test_parselong_failures(void)
{
        long value = TEST_VALUE_ULONG_MAGIC;

        CU_ASSERT_EQUAL(parselong("xxx", NULL, 0, &value), -1);
        CU_ASSERT_EQUAL(value, TEST_VALUE_ULONG_MAGIC);

        CU_ASSERT_EQUAL(parselong("+", NULL, 0, &value), -1);
        CU_ASSERT_EQUAL(value, TEST_VALUE_ULONG_MAGIC);

        CU_ASSERT_EQUAL(parselong("-", NULL, 0, &value), -1);
        CU_ASSERT_EQUAL(value, TEST_VALUE_ULONG_MAGIC);
        ...
        CU_PASS("parselong() failures for invalid values look ok");
}

Running the unit tests on FreeBSD seemed to work fine. After all, the initial version of the parselong() function had been manually tested with the same input strings earlier.

When I tried running the same test cases on Linux though, they failed. Apparently parselong() was not detecting that strtol() failed to parse the input string “xxx” or any other input strings from the ones tested in the test_parselong_failures() function!

The Bug Uncovered

Adding a couple of debugging printf() calls to parselong() itself showed that on Linux parselong() was returning zero for invalid input strings when strtol() could parse no character at all from the input string.

The initial version of the error checking code for strtol() was similar to:

{
        char *endp;
        long x;

        endp = NULL;
        errno = 0;
        x = strtol(str, &endp, base);
        if (errno != 0 || (endp != NULL && endp != str && *endp != '\0' &&
            (isdigit(*endp) != 0 || isspace(*endp) == 0))) {
                /* Return 'endp' if possible. */
                return -1;
        }
        /* At this point 'x' contains the parsed value. */
}

The (endp != str) part of the error checking code assumes that strtol() will move the ‘endp’ pointer at least one character past the start of the input string. Apparently on Linux this is not the case. The strtol() function of Linux does not move ‘endp’ at all if it cannot parse even a single character of the input string. This seems to be the correct behavior for strtol(), but the bug stayed hidden for a while, lurking in the original parselong() code, until I ran the unit tests of the function on Debian GNU/Linux.

The CUnit driver program that I used to run the test cases failed on Linux with error messages like:

  1. test_parselong.c:63  - CU_ASSERT_EQUAL(parselong("xxx", NULL, 0, &value),-1)
  2. test_parselong.c:64  - CU_ASSERT_EQUAL(value, TEST_VALUE_ULONG_MAGIC)
  3. test_parselong.c:66  - CU_ASSERT_EQUAL(parselong("+", NULL, 0, &value), -1)
  4. test_parselong.c:67  - CU_ASSERT_EQUAL(value, TEST_VALUE_ULONG_MAGIC)

The culprit for these test case failures was the assumption that Linux would set errno to a non-zero value for an invalid input string… Apparently, it doesn’t. The following small program prints different output on BSD vs. Linux:

$ cat strtest.c
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
        long value;
        const char *input = "xxx";
        char *endp = NULL;

        errno = 0;
        value = strtol(input, &endp, 0);
        printf("str = %p = \"%s\"\n", input, input);
        printf("endp = %p \"%s\"\n", endp, endp ? endp : "(null)");
        if (endp != NULL) {
                printf("endp[0] = '%c' (%d 0%03o #x%02x)\n",
                  *endp, *endp, *endp, *endp);
        }
        printf("errno = %d\n", errno);
        printf("value = %ld 0%lo #x%lx\n", value, value, value);
        return EXIT_SUCCESS;
}

On FreeBSD the output of this program includes an errno value of EINVAL:

freebsd$ cc strtest.c
freebsd$ ./a.out
str = 0x8048604 = "xxx"
endp = 0x8048604 "xxx"
endp[0] = 'x' (120 0170 #x78)
errno = 22
value = 0 00 #x0
freebsd$ fgrep 22 /usr/include/sys/errno.h
#define EINVAL          22              /* Invalid argument */
freebsd$

On a recent update of Debian GNU/Linux “testing” the output is slightly different:

debian$ cc strtest.c
debian$ ./a.out
str = 0x8048630 = "xxx"
endp = 0x8048630 "xxx"
endp[0] = 'x' (120 0170 #x78)
errno = 0
value = 0 00 #x0
debian$

This means that the only indication we have that the Linux version of strtol() failed to parse some of the input text is the value of ‘endp’: it’s the same as the input string. The error-checking code of the original parselong() wrapper was:

        x = strtol(str, &endp, base);
        if (errno != 0 || (endp != NULL && endp != str && *endp != '\0' &&
            (isdigit(*endp) != 0 || isspace(*endp) == 0)))
                error(...);

But on Linux both of the following are true:

  • errno is not set to a non-zero value.
  • If strtol() could not parse even one input character, endp == str.

This caused parselong() to bypass the error checking code and try to return a ‘valid’ result even though the Linux strtol() version had failed. Hence the failure of the unit tests.

Removing the (endp != str) conditional expression means that the error checking code works equally well on Linux and BSD. The BSD version of strtol() returns a non-zero errno value, triggering the first part of the error checking code. The Linux version returns an endp pointer that is non-null and fails the '\0' check later on. The new parselong() function is slightly shorter and it passes the unit tests on both BSD and Linux.
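A minimal sketch of what such a wrapper could look like after the fix follows. Only the call signature and the 0 / -1 return convention are taken from the examples earlier in this post; the body is an assumption, not the original code. It also adds an explicit endp == str test, which is not part of the fix described here, so that input with no digits at all (an empty string, for instance) is rejected even on systems that leave errno at zero, and it casts to unsigned char before calling the ctype functions.

```c
#include <ctype.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Sketch of a parselong()-style wrapper (assumed body).
 * Returns 0 on success and stores the result in *value.
 * Returns -1 on failure and leaves *value untouched.
 * If 'endpp' is not NULL it receives strtol()'s end pointer.
 */
int
parselong(const char *str, char **endpp, int base, long *value)
{
        char *endp = NULL;
        long x;

        errno = 0;
        x = strtol(str, &endp, base);

        /* Always hand back the end pointer when the caller asked for it. */
        if (endpp != NULL)
                *endpp = endp;

        /*
         * Fail when errno is set, when nothing was consumed, or when
         * the first unparsed character is a digit (invalid in this
         * base) or anything other than whitespace.
         */
        if (errno != 0 || endp == str || (*endp != '\0' &&
            (isdigit((unsigned char)*endp) != 0 ||
            isspace((unsigned char)*endp) == 0)))
                return -1;

        *value = x;
        return 0;
}
```

With this shape the failure cases from test_parselong_failures() ("xxx", "+", "-") should be rejected through the endp == str test alone, regardless of what the platform does with errno.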

Conclusions

There is something thrilling about fixing bugs by removing code. This bug was one of the few cases I’ve come across during the last couple of months where removing code was an improvement. There’s probably a joke about “writing too much code” and the bug-resolving debt each line of new code introduces. I think I’ll leave that for another time though.

The most important conclusion of today’s bug hunting session was that Unit Testing really does work and it pays back in real, quite tangible ways. Had I not spent a bit of time to think about what the parselong() and parseulong() functions are supposed to do, when they are supposed to fail and how they are allowed to fail, I would not have spent the time to write test cases for them. Had I not written the test cases, I wouldn’t have noticed that there is a failing test case on Linux. Had I not seen that, I wouldn’t have realized that sometimes the two functions were returning completely bogus results on Linux systems.

The central place the unit testing code has in this story is an important and serious lesson for me:

KEEP TESTING!


 

 

Returning committer: Niels Heinen (ports)

 

 

Partial ports thaw

The ports tree is now tagged and partially thawed. Until 7.3 is released, sweeping commits still need explicit approval from portmgr to ensure that tags can be slipped for potential security issues. For more information on what constitutes a sweeping change, see the portmgr web pages.


 

 

Sun V210 Donation

Thanks to a generous donation by Nathan Whitehorn, a Sun SunFire v210 is sitting on the floor of my office waiting to have FreeBSD installed on it. Since this is the fastest sparc64 machine we have, Mark Linimon and I are planning on using it as a package building machine. However, if there is any other developer who would like to use it as a reference platform, please get in touch with me and I will set you up with access.

In the past, due to our limited access to the sparc64 platform, we were not able to support this architecture as well as we would have liked to. Packages available for sparc64 have fallen behind packages for other architectures such as i386 and amd64. However, once this machine is up and running, I have a strong feeling it will become a valuable resource to the FreeBSD developers working on making sparc64 a Tier-1 architecture.

 

 

FreeBSD 7.3-RC2 Available

The second Release Candidate build for the FreeBSD-7.3 release cycle is now available. ISO images for Tier-1 architectures are now available on most of the FreeBSD mirror sites.

 

 

New committer: Neel Natu (src)

 

 

PC-BSD 8.0 Released

PC-BSD 8.0 has been released. PC-BSD is a successful desktop operating system based on FreeBSD that focuses on providing an easy to use desktop system for casual computer users. A list of new features/updates since the last version can be found here.

 

 

Accepting Travel Grant Applications for BSDCan

The FreeBSD Foundation is now accepting travel grant applications for BSDCan 2010. If interested, please fill out the Travel Grant application by April 9, 2010 to apply for this grant.

 

 

Improved Conference Captions from Amazon Mechanical Turk (2)


After my initial experiments last month, I applied to the FreeBSD Foundation for funds to pay for additional human editing of the YouTube machine generated transcripts. The screenshot on the left shows an example HIT (Human Intelligence Task) available on Amazon Mechanical Turk.

The task description on the left is based on a template I created with three variables: $VIDEO_URL, $VIDEO_TITLE, and $CAPTIONS_URL. New HITs are then created by uploading a CSV file with three columns for each of those variables, e.g.

VIDEO_URL,VIDEO_TITLE,CAPTIONS_URL
http://www.youtube.com/watch?v=mMmbjJI5su0,"BSD v. GPL, Jason Dixon, NYCBSDCon 2008",http://people.FreeBSD.org/~murray/improved-captions-bsdvsgpl.sbv
http://www.youtube.com/watch?v=Pe8LdJpBGJ4,"Isolating Cluster Jobs for Performance and Predictability, Brooks Davis (DCBSDCon 2009)",http://people.FreeBSD.org/~murray/improved-captions-isolatingcluster.sbv


Using this method I created 12 HITs for the first pass of editing for which I offered between $9 and $14 per video. A slightly modified template with the same three variables was used to pay ~$7 per video for a second pass to further improve the transcripts improved in the first pass.

The template has gotten more detailed over the past month in response to all of the minor ways that workers submitted less than perfect transcripts. The actual SBV file format used by YouTube captions is not formally specified anywhere as far as I can tell, but the 60 character maximum width and simple format can be verified in submitted transcripts with a few emacs macros.
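The width check can also be scripted. The following is a minimal sketch in C, not the emacs macros mentioned above: a hypothetical helper, count_overlong_lines(), that counts the lines of a transcript wider than a given limit. It counts bytes rather than characters, which is only an approximation for transcripts containing non-ASCII text, and it makes no attempt to validate the rest of the SBV format.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical checker (illustration only): return how many lines
 * read from 'fp' are wider than 'max_width' bytes. Newlines and
 * carriage returns are not counted towards the width.
 */
long
count_overlong_lines(FILE *fp, size_t max_width)
{
        long overlong = 0;
        size_t width = 0;
        int c;

        while ((c = getc(fp)) != EOF) {
                if (c == '\n') {
                        if (width > max_width)
                                overlong++;
                        width = 0;
                } else if (c != '\r') {
                        width++;
                }
        }
        if (width > max_width)  /* last line without a newline */
                overlong++;
        return overlong;
}
```

Calling count_overlong_lines(fp, 60) on an opened transcript file and getting zero back would indicate that every line respects the 60 character limit.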

The transcript files have been checked into the FreeBSD Doc CVS Repository. The full list of videos with human-edited English language transcripts is:

 

 

MongoDB and durability

It looks like my post about MongoDB got a lot more popular than usual, and also provoked a sort-of official response from the MongoDB developer(s). It is fair to mention them together to allow people finding one part of the story to find the other. Since my original post talks about multiple issues and the comments wander through various topics, I want to summarize the part of the discussion about durability here.


 

 

HAST Project is Complete!

Late yesterday, Paweł Jakub Dawidek committed HAST to HEAD, marking the completion of this Foundation sponsored project. We asked Paweł to write a few words about the project. He says:

HAST is ready!

I'm very happy to report to FreeBSD users that the HAST project I was working on for the last three months is ready for testing and already committed to the HEAD branch.

I'll describe what HAST does in a few words. HAST allows for synchronous block-level replication of any storage media (called GEOM providers, using FreeBSD nomenclature) over a TCP/IP network for fast failure recovery. HAST provides storage using the GEOM infrastructure, meaning it is file system and application independent and can be combined with any existing GEOM class. In case of a primary node failure, the cluster will automatically switch to the secondary node, check and mount the UFS file system or import the ZFS pool, and continue to work without missing a single bit of data.

I must admit the project was quite challenging, not only from the technical point of view, but also because it was sponsored by the FreeBSD Foundation. The FreeBSD Foundation has a great reputation and is known to select the projects it funds very carefully. I felt strong pressure that should I fail, the FreeBSD Foundation's reputation might be hurt. Of course, not a single dollar would be spent on a failed project, but the FreeBSD community's expectations were very high and I really wanted to do a good job.

During the work a number of people contacted me privately offering help, explaining how important HAST is for FreeBSD and giving me the motivation to soldier on.

I hope that HAST will meet the community's expectations and I myself am looking forward to using it :)

Once again, I'd like to thank the HAST sponsors: the FreeBSD Foundation, OMCnet Internet Service GmbH, and TransIP BV.

 

 

Preliminary Arduino Port for FreeBSD

One of the things that has been on my TODO list for quite some time was to port the Arduino IDE over to FreeBSD. Fortunately, Warren Block took the time to sit down and work on a port, and he is pleased to announce that a preliminary version of it is ready for testing.

I’ll be testing it out over the next few days and I encourage you to do the same. As always, any feedback or patches will be much appreciated. If all goes well, I will be committing it to the tree in the very near future.

Download: tarball || shar

 

 

Call for Help: Xorg Team Needs Fresh Blood!

Howdy all! As you all know, Robert Noland is our X guy, but his new job takes most of his time and x11 is too much for one person. Robert spends most of his time on the X work on the src side, and we now need some people to help him on the ports side. Beat@ and I [...]

 

 

ports feature freeze now in effect

In preparation for 7.3-RELEASE, the ports tree is now in feature freeze.

Normal upgrades, new ports, and changes that only affect other branches are allowed without prior approval but with the extra Feature safe: yes tag in the commit message. Any sweeping commit, i.e. one that touches a large number of ports, makes infrastructural changes, touches ports with an unusually high number of dependent ports, or otherwise requires the rebuilding of many packages, is not allowed without prior explicit approval from portmgr after that date.

When in doubt, please do not hesitate to contact portmgr.


 

 

FreeBSD 7.3-RC1 Available

The first Release Candidate build for the FreeBSD-7.3 release cycle is now available. ISO images for Tier-1 architectures are now available on most of the FreeBSD mirror sites.