This is about an experience with internationalization and Unicode I’ve had recently.
My uncle has our family tree in an old version of Family Tree Maker (FTM). I figured family trees should be on the Net, like Google Documents, because they are common to so many people. A quick search revealed the existence of at least three family tree hosting sites. However, our tree is in Russian, in an 8-bit character encoding, so I had to take care of i18n.
Of the three sites, two used the Latin-1 character set, and so wouldn’t show Russian at all. Actually, I could import the tree to one of those sites, then make my browser display a page in the Cyrillic (Windows) encoding, and it would render correctly. But then I would have to change the encoding every time I loaded a new page. Annoying.
The third site used UTF-8, so I tried converting my tree to that encoding. The standard format for exchanging genealogical data between programs, GEDCOM, is text-based. As an aside, it was devised by The Church of Jesus Christ of Latter-day Saints, better known as the Mormons; apparently polygamy makes genealogy quite confusing. According to the spec, GEDCOM files may be in UTF-8. So I uploaded the same file converted to UTF-8, but no luck there: the text came out garbled. I wrote to their tech support, and they answered that Unicode wasn't supported.
I was rather at a loss, so I just tried entering some Russian text into the garbled tree on the UTF-8 site. When I entered it from a page on the site itself, it was displayed correctly. Just for the sake of it, I tried the same on one of the Latin-1 sites. Amazingly, it worked! I tried “View selection source” and saw Cyrillic letters in the source, which should be impossible in a Latin-1 page. However, saving the page and opening it in an editor revealed that what I had entered was encoded as numeric character references (NCRs) like this: &#1234; where 1234 is the decimal code point. So here is what I learned:
- Numeric character references work even in 8-bit encodings.
- When I enter text in an edit box in the browser, it’s encoded into NCRs where needed.
- Firefox’s “View selection source” window is misleading here: it shows the parsed text rather than the raw markup, so NCRs appear as the characters they stand for.
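The first point is easy to check outside the browser. Here is a minimal sketch in Python (not the Perl I ended up using); the helper name `to_ncr` is made up for illustration. It turns every non-ASCII character into a decimal NCR, shows that the result fits into Latin-1, and decodes it back the way a browser would.

```python
import html


def to_ncr(text: str) -> str:
    """Replace every non-ASCII character with a decimal NCR like &#1234;."""
    # 'xmlcharrefreplace' is a standard Python codec error handler.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


encoded = to_ncr("Иван")
print(encoded)                 # &#1048;&#1074;&#1072;&#1085;

# The encoded text is pure ASCII, so it survives any 8-bit encoding.
latin1_bytes = encoded.encode("latin-1")

# html.unescape resolves the references back into characters.
print(html.unescape(latin1_bytes.decode("latin-1")))  # Иван
```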
The next thing I did was encode all Cyrillic letters in the GEDCOM file as NCRs and upload it once more. It mostly worked, but the solution wasn’t perfect. On one site, most of the info was shown OK, but the names were clipped because their length was limited. Note that the NCR for each Cyrillic letter takes 7 bytes (for example, &#1040;) instead of the single byte it takes in the 8-bit encoding. Hex saves one digit but adds the x, so it’s still 7. On the other site there was no clipping, but when I typed a name to jump to, the drop-down list of suggestions showed me the raw NCRs.
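To make the size arithmetic concrete, here is a rough Python sketch of the conversion. It assumes the original file is in Windows-1251 (the “Cyrillic (Windows)” encoding); the function name `cp1251_to_ncr` and the sample surname are only illustrative.

```python
def cp1251_to_ncr(raw: bytes) -> bytes:
    """Decode Windows-1251 bytes and re-encode non-ASCII characters as NCRs."""
    return raw.decode("cp1251").encode("ascii", "xmlcharrefreplace")


surname = "Иванов"
original = surname.encode("cp1251")
converted = cp1251_to_ncr(original)

print(len(original))    # 6: one byte per letter in the 8-bit encoding
print(len(converted))   # 42: seven bytes per letter as decimal NCRs

# The hexadecimal form is no shorter for these letters.
hex_form = "".join(f"&#x{ord(c):x};" for c in surname)
print(len(hex_form))    # 42
```

A field limited to a few dozen bytes therefore runs out after a handful of letters, which would explain the clipped names.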
So I wrote a program in Perl to transliterate the names. I also made a few other changes I needed. For example, Family Tree Maker had saved some free-form data as “Christening” and “Burial” instead of “Notes”, so I fixed this, too. On the way I learned to use Unicode in Perl.
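The transliteration step might look roughly like the sketch below. This is not the Perl program itself but a Python illustration; the `TRANSLIT` table is one plausible Russian-to-Latin scheme that I'm assuming for the example, and `transliterate` is a hypothetical helper name.

```python
# Assumed mapping for illustration; other romanization schemes differ.
TRANSLIT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "yo",
    "ж": "zh", "з": "z", "и": "i", "й": "y", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "kh", "ц": "ts", "ч": "ch", "ш": "sh", "щ": "shch",
    "ъ": "", "ы": "y", "ь": "", "э": "e", "ю": "yu", "я": "ya",
}


def transliterate(text: str) -> str:
    """Replace Russian letters with Latin ones; leave everything else alone."""
    out = []
    for ch in text:
        latin = TRANSLIT.get(ch.lower())
        if latin is None:
            out.append(ch)                    # digits, tags, slashes, spaces
        elif ch.isupper():
            out.append(latin.capitalize())    # "Ч" becomes "Ch", not "CH"
        else:
            out.append(latin)
    return "".join(out)


# GEDCOM puts the surname between slashes on a NAME line.
print(transliterate("1 NAME Иван /Иванов/"))  # 1 NAME Ivan /Ivanov/
```

Because ASCII characters pass through unchanged, the same function can be run over whole GEDCOM lines without damaging the level numbers and tags.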
I looked for a place to post the program, but couldn’t find one. CodeProject doesn’t list Perl in its choice of languages. That doesn’t mean posting Perl is prohibited there, but it suggests that few of the site’s readers would be interested. Putting code on SourceForge implies maintenance responsibility. So I’ll just post the code here on my blog as soon as I’ve cleaned it up.