Gathering the basic elements
Monday, June 16th, 2008
Last time I wrote about the purpose of collation. Now it is time to write a little about how I want to deal with it.
I’ve been a little busy with my exams lately (who hasn’t?), but my last one is on June 26th. Anyway, I’ve managed to gather some basic building blocks that will support the rest of my project:
- imported the “Common Locale Data Repository” (CLDR) - the source of all the locale data that you will ever need - into my p4 tree
- written converter scripts to change the symbolic character names as found in this repository into UTF-8 sequences
- written a program called colldef.c that takes the output of those scripts and builds the binary collation table, doing some fancy compression/reduction on the way, so that all character weights fit within one byte.
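To give an idea of what the converter step does, here is a minimal sketch in C rather than the actual scripts. It assumes the symbolic names look like `<U00E9>` (a `U` followed by 4 to 8 hex digits inside angle brackets); the function names `utf8_encode` and `symbol_to_utf8` are made up for this illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one Unicode code point as UTF-8.
 * Returns the number of bytes written (1-4), or 0 if the code point
 * is a surrogate or lies beyond U+10FFFF.
 */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;               /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/*
 * Convert a symbolic name of the assumed form "<UXXXX>" into UTF-8.
 * Returns the number of bytes written, or 0 if the name is malformed.
 */
size_t symbol_to_utf8(const char *sym, unsigned char out[4])
{
    uint32_t cp = 0;
    size_t ndigits = 0;
    const char *p;

    if (sym == NULL || sym[0] != '<' || sym[1] != 'U')
        return 0;
    for (p = sym + 2; isxdigit((unsigned char)*p); p++) {
        int c = (unsigned char)*p;
        int d = isdigit(c) ? c - '0' : toupper(c) - 'A' + 10;

        if (++ndigits > 8)
            return 0;               /* too long to be a code point */
        cp = (cp << 4) | (uint32_t)d;
    }
    if (ndigits < 4 || p[0] != '>' || p[1] != '\0')
        return 0;
    return utf8_encode(cp, out);
}
```

With this, `symbol_to_utf8("<U00E9>", buf)` fills `buf` with the two bytes 0xC3 0xA9, and a malformed name simply yields 0 so the caller can report it.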
The next step will be writing the libc part - the one that uses the binary table and does the sorting/collation. I will have to rewrite most of string/strcoll.c and locale/collate.c.
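Roughly, the libc side boils down to comparing strings by looked-up weights instead of raw byte values. The sketch below is only an illustration under heavy assumptions, not the planned implementation: it uses a hypothetical table with one weight per byte (so no multi-byte UTF-8 handling and no secondary levels), where weight 0 means "ignore this byte". Both `fill_demo_weights` and `table_coll` are invented names.

```c
#include <assert.h>
#include <string.h>

/*
 * Fill a demo table (an assumption for this example, ASCII only):
 * space sorts first, letters sort case-insensitively after it,
 * and every other byte gets weight 0, meaning it is ignored.
 */
void fill_demo_weights(unsigned char weights[256])
{
    int c;

    memset(weights, 0, 256);
    weights[' '] = 1;
    for (c = 'a'; c <= 'z'; c++) {
        weights[c] = (unsigned char)(2 + (c - 'a'));
        weights[c - 'a' + 'A'] = weights[c];
    }
}

/*
 * Compare two strings by table weight, in the spirit of strcoll().
 * Returns a negative value, 0, or a positive value. Strings that differ
 * only in ignored bytes or in case compare equal here; a real
 * implementation would break such ties on further levels.
 */
int table_coll(const unsigned char weights[256], const char *a, const char *b)
{
    const unsigned char *pa = (const unsigned char *)a;
    const unsigned char *pb = (const unsigned char *)b;

    assert(weights != NULL && a != NULL && b != NULL);
    for (;;) {
        /* Skip bytes that carry no weight. */
        while (*pa != '\0' && weights[*pa] == 0)
            pa++;
        while (*pb != '\0' && weights[*pb] == 0)
            pb++;
        if (*pa == '\0' || *pb == '\0')
            break;
        if (weights[*pa] != weights[*pb])
            return weights[*pa] < weights[*pb] ? -1 : 1;
        pa++;
        pb++;
    }
    if (*pa == '\0' && *pb == '\0')
        return 0;
    return *pa == '\0' ? -1 : 1;    /* the shorter string sorts first */
}
```

With the demo table, "apple" sorts before "Banana" (plain strcmp() would put "Banana" first), and "co-op" compares equal to "coop" because the hyphen has no weight.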
Alexander Leidinger contacted me recently and told me that Apple has already done a full conversion of their base system to UTF-8. I skimmed through their strcoll.c and collate.c and I can confirm this. It would be nice if we could use some part of this work.