August 5th, 2008 by konrad
I spent the last few weeks importing the interesting parts of Apple’s libc into FreeBSD. I ended up with this patch: http://versus.ath.cx/patches/fbsd_7.0_collation.patch, which I then applied to the CURRENT libc and committed to my p4 libc branch.
It wasn’t, however, without accident - I suffered two serious laptop breakages, ending with an essentially dead machine. First the display broke down - it stopped showing anything - I think it was the backlight. I was lucky - I had an external monitor enabled and configured for Dual Head. When I connected it, however, I started to get IDE timeout problems when booting FreeBSD. So my hard drive had also died. I managed to boot off a live CD and copy my GSoC data, and that was it.
After installing everything on my newly bought laptop, I got back to finishing my work on libc, and was able to get it into a - I think - fully working state. There are still some little things I want to fix or enhance, but the general functionality is there.
Now I’m focusing on writing regression tests; the manpages will come after that. At the end there will be a call for testers.
Posted in Uncategorized | 3 Comments »
July 12th, 2008 by konrad
Last time I said that it would be nice to use some of Apple’s work in the area of collation. I exchanged a few emails with Jordan K. Hubbard and it seems we can use their code without problems - all the interesting parts are still under the BSD licence. That’s because this is still our code, only extended by Apple. Even the copyrights weren’t changed a bit (so we don’t know who did the extending).
Anyway, as the code is fairly mature, I decided to use it. The libc part of the code is the one I am most interested in, but to see how it works, I first had to port the userland tool - Apple’s version of colldef. While doing this I extended it a little, so that it does not choke on expansions. As I don’t have the locale data that Apple is using, I made the tool work on my data - at the same time making it more POSIX compliant. There were many little issues while porting the code - and I wanted it to work perfectly before I submitted it - so it took me more than a week to complete the porting. I even made it compile with “-ansi -Wall -Wextra -pedantic”, something I always do with my code.
Now that the tool is complete and I have done a final cleanup, I will test it on a larger amount of data, and then proceed to port the libc part. I’m really excited to see how it works. When those two things are completed I will have to make a few more extensions to Apple’s code to make it fully compliant with UCA.
Posted in Uncategorized | No Comments »
June 16th, 2008 by konrad
Last time I wrote about the purpose of collation. Now is the time to write a little about how I want to deal with it.
I’ve been a little busy with my exams lately (who hasn’t), but I have the last one on June 26th. Anyway, I’ve managed to gather some basic building blocks, which will support the rest of my project:
- imported the Unicode “Common Locale Data Repository” (CLDR) - the source of all locale data that you will ever need - into my p4 tree
- written converter scripts to change the symbolic character names as found in this repository into UTF-8 sequences
- written a program called colldef.c that uses the data output from the scripts and builds the binary collation table, doing some fancy compression/reduction on the way, so that all character weights fit within one byte.
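The conversion step in the list above can be sketched in C. This is only an illustration, not the actual converter scripts: it assumes the symbolic names have the `<UXXXX>` form used in POSIX-style locale sources, and the function names `parse_symbolic` and `utf8_encode` are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: parse a symbolic name of the assumed form "<UXXXX>"
 * into a code point. Returns 1 on success, 0 on malformed input. */
int parse_symbolic(const char *name, uint32_t *cp)
{
    char *end;
    unsigned long v;

    if (name[0] != '<' || name[1] != 'U')
        return 0;
    if (!isxdigit((unsigned char)name[2]))
        return 0;                      /* reject signs and whitespace */
    v = strtoul(name + 2, &end, 16);
    if (end[0] != '>' || end[1] != '\0' || v > 0x10FFFFUL)
        return 0;
    *cp = (uint32_t)v;
    return 1;
}

/* Encode one code point as UTF-8. Writes up to 4 bytes to out and returns
 * the number of bytes written, or 0 for surrogates and out-of-range values. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                      /* surrogates are not valid scalars */
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, parsing `<U00E9>` and encoding the result yields the two bytes 0xC3 0xA9.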
The next step will be writing the libc part - the one that uses the binary table and does the sorting/collation. I will have to rewrite most of string/strcoll.c and locale/collate.c.
I’ve been contacted by Alexander Leidinger recently, who told me that Apple has already done a full conversion of their base system to UTF-8. I skimmed through their strcoll.c and collate.c and I can confirm this. It would be nice if we could use some part of this work.
Posted in Uncategorized | No Comments »
June 12th, 2008 by konrad
I’d like to introduce newcomers to the topic - what collation is and why we need it - e.g. why not just use strcmp.
In the simplest form - comparing English words - we don’t need collation at all (save for case differences) - the binary character encodings (called code points in Unicode) are all we need. However, when we have to deal with, for example, accents, our task is more difficult - in most languages, differences in accents should be ignored if there are any differences in the base letters - even in base letters which come _later_ in the string. Then come the differences in case, which should be even less important than differences in accent, and last are the differences in punctuation.
This way we end up with 3 or 4 comparison levels: the first one is always performed, and the others are conditional, applied only if the earlier levels showed no difference between the strings. Add to this contractions - when two characters have to be treated as one - and expansions - when one character should behave in sorting as two - and you have a basic idea of what collation is.
On top of this, each language of course has its own rules, so we need to tailor the collation to the current locale - we basically have to have data files for all supported languages.
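The level-by-level comparison can be sketched with a toy example in C. This is a minimal illustration and not the libc implementation: it assumes Latin-1 encoded input, knows only two accented letters, and leaves out contractions, expansions and the punctuation level. The names `level_weight` and `toy_collate` are hypothetical.

```c
#include <stddef.h>

/* Toy weights for one Latin-1 byte at a given level:
 * level 0 = base letter, level 1 = accent, level 2 = case. */
static int level_weight(unsigned char c, int level)
{
    int accent = 0, upper = 0;

    switch (c) {
    case 0xE9: c = 'e'; accent = 1; break;   /* e with acute */
    case 0xC9: c = 'E'; accent = 1; break;   /* E with acute */
    case 0xF4: c = 'o'; accent = 2; break;   /* o with circumflex */
    case 0xD4: c = 'O'; accent = 2; break;   /* O with circumflex */
    default: break;
    }
    if (c >= 'A' && c <= 'Z') {
        c = (unsigned char)(c - 'A' + 'a');
        upper = 1;
    }
    if (level == 0)
        return c;
    return level == 1 ? accent : upper;
}

/* Compare two strings level by level. A later level is consulted only
 * when every earlier level found the strings equal. */
int toy_collate(const char *a, const char *b)
{
    int level;

    for (level = 0; level < 3; level++) {
        const unsigned char *p = (const unsigned char *)a;
        const unsigned char *q = (const unsigned char *)b;

        for (; *p != '\0' && *q != '\0'; p++, q++) {
            int wa = level_weight(*p, level);
            int wb = level_weight(*q, level);

            if (wa != wb)
                return wa < wb ? -1 : 1;
        }
        if (*p != '\0' || *q != '\0')
            return *p != '\0' ? 1 : -1;      /* the shorter string sorts first */
    }
    return 0;
}
```

With these toy weights, `toy_collate("r\xF4le", "roles")` is negative: the extra base letter at the end of "roles" decides the result at the first level, so the accent earlier in the string is never consulted. A plain strcmp would order the two the other way round, because the byte 0xF4 is greater than 'o'.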
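From the application side, the standard C interface to locale-tailored collation is setlocale together with strcoll. The sketch below sorts an array of strings under a named locale; the wrapper `sort_with_locale` is a hypothetical helper, and which locale names are available (for example "pl_PL.UTF-8") is an assumption about the installed system.

```c
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* qsort callback: compare two strings using the current LC_COLLATE rules. */
static int cmp_coll(const void *a, const void *b)
{
    const char *const *sa = (const char *const *)a;
    const char *const *sb = (const char *const *)b;

    return strcoll(*sa, *sb);
}

/* Hypothetical helper: switch LC_COLLATE to the named locale and sort the
 * words in place. Returns 1 on success, 0 if the locale is not available. */
int sort_with_locale(const char *locale, const char **words, size_t n)
{
    if (setlocale(LC_COLLATE, locale) == NULL)
        return 0;
    qsort(words, n, sizeof *words, cmp_coll);
    return 1;
}
```

In the "C" locale strcoll orders strings the same way strcmp does, so "Zebra" sorts before "apple"; under a language-specific locale the result depends on that locale's collation data.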
Posted in soc2008 | No Comments »