Archive for June, 2008

Gathering the basic elements

Monday, June 16th, 2008

Last time I wrote about the purpose of collation. Now it is time to write a little about how I want to deal with it.

I’ve been a little busy with my exams lately (who hasn’t), but the last one is on June 26th. Anyway, I’ve managed to gather some basic building blocks that will support the rest of my project:

  1. imported the “Common Locale Data Repository” (CLDR) - the source of all the locale data that you will ever need - into my p4 tree
  2. written converter scripts to change the symbolic character names as found in this repository into UTF-8 sequences
  3. written a program called colldef.c that uses the data output from the scripts and builds the binary collation table, doing some fancy compression/reduction on the way, so that all character weights fit within one byte.
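To make step 2 more concrete, here is a minimal sketch in C of what such a conversion involves. It is not the actual converter script: it assumes the symbolic names look like `<U00E9>` (a hexadecimal code point between `<U` and `>`), and the function names `utf8_encode` and `symbol_to_utf8` are made up for this illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Encode one code point as UTF-8. Returns the byte count, or 0 if the
 * value is a surrogate or lies beyond U+10FFFF. */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Convert a symbolic name of the assumed form "<Uhhhh>" to UTF-8.
 * Returns the number of bytes written to out, or 0 on malformed input. */
size_t symbol_to_utf8(const char *sym, unsigned char out[4])
{
    char *end;
    unsigned long cp;

    assert(sym != NULL && out != NULL);
    if (sym[0] != '<' || sym[1] != 'U' || !isxdigit((unsigned char)sym[2]))
        return 0;
    cp = strtoul(sym + 2, &end, 16);
    if (end[0] != '>' || end[1] != '\0' || cp > 0x10FFFF)
        return 0;
    return utf8_encode((uint32_t)cp, out);
}
```

With this sketch, `symbol_to_utf8("<U00E9>", buf)` writes the two bytes 0xC3 0xA9 (the UTF-8 form of é) into `buf` and returns 2.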

The next step will be writing the libc part - the one that uses the binary table and does the sorting/collation. I will have to rewrite most of string/strcoll.c and locale/collate.c.

Alexander Leidinger contacted me recently and told me that Apple has already done a full conversion of their base system to UTF-8. I skimmed through their strcoll.c and collate.c and I can confirm this. It would be nice if we could use some part of this work.

A few words about collation

Thursday, June 12th, 2008

I’d like to introduce newcomers to the topic - what collation is and why we need it - e.g. why not just use strcmp.

In the simplest form - comparing English words - we don’t need collation at all (save for case differences) - the binary character encodings (called code points in Unicode) are all we need. However, when we have to deal with, for example, accents, our task is more difficult - in most languages, differences in accents should be ignored if there are any differences in the base letters - even in base letters that come _later_ in the string. Then come the differences in case, which should be even less important than differences in accent, and last of all come differences in punctuation.
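A small illustration of why strcmp alone is not enough: it compares raw bytes, so the result depends only on the encoding. The helper name `byte_order` is invented for this sketch, and the examples assume ASCII/UTF-8 encoded strings.

```c
#include <assert.h>
#include <string.h>

/* Reduce strcmp's result to -1, 0 or 1, since the standard only
 * guarantees its sign, not its magnitude. */
int byte_order(const char *a, const char *b)
{
    int r;

    assert(a != NULL && b != NULL);
    r = strcmp(a, b);
    return (r > 0) - (r < 0);
}
```

Here `byte_order("Zebra", "apple")` is -1, because every uppercase ASCII letter has a smaller byte value than any lowercase one. Likewise `byte_order("\xC3\xA9" "a", "eb")` is 1: the UTF-8 lead byte of é (0xC3) is larger than 'e', so "éa" lands after "eb" even though the base letters say it should come first.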

This way we end up with 3 or 4 comparison levels: the first one is always performed, and each later one only if the earlier levels showed no difference between the strings. Add to this contractions - when two characters have to be treated as one - and expansions - when one character should behave in sorting as two - and you have a basic idea of what collation is.
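The level-by-level idea can be sketched in a few lines of C. This is only a toy under stated assumptions, not the planned libc code: strings are zero-terminated arrays of code points, the weight table is hard-coded and knows only a-z, A-Z, é (U+00E9) and É (U+00C9), there are three levels (base letter, accent, case), and contractions and expansions are left out. All names (`toy_weight`, `toy_compare_level`, `toy_collate`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LEVEL_BASE, LEVEL_ACCENT, LEVEL_CASE, LEVEL_COUNT };

/* Toy weight table. Returns 0 for characters it does not know, which
 * the comparison below treats as ignorable. */
static uint8_t toy_weight(uint32_t cp, int lvl)
{
    uint32_t base = cp;
    uint8_t accent = 1, upper = 0;

    if (cp == 0x00E9) {             /* e with acute */
        base = 'e';
        accent = 2;
    } else if (cp == 0x00C9) {      /* E with acute */
        base = 'E';
        accent = 2;
    }
    if (base >= 'A' && base <= 'Z') {
        base = base - 'A' + 'a';
        upper = 1;
    }
    if (base < 'a' || base > 'z')
        return 0;

    switch (lvl) {
    case LEVEL_BASE:   return (uint8_t)(base - 'a' + 1);
    case LEVEL_ACCENT: return accent;
    default:           return (uint8_t)(1 + upper);
    }
}

/* Compare two strings using the weights of a single level only. */
static int toy_compare_level(const uint32_t *a, const uint32_t *b, int lvl)
{
    for (;;) {
        uint8_t wa = 0, wb = 0;

        /* Skip characters that carry no weight at this level;
         * afterwards a weight of 0 means "end of string". */
        while (*a != 0 && (wa = toy_weight(*a, lvl)) == 0)
            a++;
        while (*b != 0 && (wb = toy_weight(*b, lvl)) == 0)
            b++;
        if (wa != wb)
            return wa < wb ? -1 : 1;
        if (wa == 0)
            return 0;               /* both strings ended together */
        a++;
        b++;
    }
}

/* Try each level in turn; a later level only matters if every
 * earlier level found the strings equal. */
int toy_collate(const uint32_t *a, const uint32_t *b)
{
    int lvl;

    assert(a != NULL && b != NULL);
    for (lvl = 0; lvl < LEVEL_COUNT; lvl++) {
        int r = toy_compare_level(a, b, lvl);
        if (r != 0)
            return r;
    }
    return 0;
}
```

With these toy weights "éa" sorts before "eb", because the base letters already differ (a before b) and the accent is never consulted. "e" sorts before "é" only at the accent level, and "e" before "E" only at the case level.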

On top of this, each language of course has its own rules, so we need to tailor the collation to the current locale - we basically have to have data files for all supported languages.