Midgard and international URL transliteration


Midgard was an early supporter of internationalization among open source CMSs, adding UTF-8 support as early as 1999. Today, however, I got an innocent request:

One thing to be considered is i18n and Unicode support, since the community wiki is the perfect place to host translated docs.

I was quite confident that things would work out OK, but I knew that the Wiki component had had only minimal internationalization testing and so might have issues. So I went and created some pages in Russian, Georgian and Arabic. So far so good:

Wiki-Russian

All Wiki functionality worked as you would expect. Wiki links, Markdown formatting, backlinks, even version comparisons:

Wiki-Arabic-Diff-1

One issue remained, however: MidCOM has functionality for generating nice, readable URL names from object titles. This functionality depended on the PECL translit extension which, in our tests, proved to be troublesome with some languages.

After a bit of googling we ran into the PHP UTF-8 project, and more specifically its utf8_to_ascii tool, a PHP port of the Perl Unidecode package. This library was small enough to be bundled into MidCOM itself, removing the dependency on an additional PHP extension, and it seemed to cope with various languages much better.
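To make the idea of generating a URL name from a title concrete, here is a minimal sketch in Python. It is not MidCOM's code and not a port of utf8_to_ascii: the function name `generate_url_name` is hypothetical, and instead of Unidecode-style lookup tables it uses only Unicode decomposition from the standard library. That means it handles accented Latin letters but simply drops characters from other scripts, which is exactly the gap a table-based transliterator is meant to fill.

```python
import re
import unicodedata


def generate_url_name(title: str) -> str:
    """Hypothetical helper: turn a title into a lowercase ASCII URL name."""
    # Split accented characters into base letter + combining mark (e.g. "ä" -> "a" + U+0308).
    decomposed = unicodedata.normalize("NFKD", title)
    # Drop the combining marks so only the base letters remain.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Discard whatever is still outside ASCII; non-Latin scripts are lost at this step.
    ascii_only = stripped.encode("ascii", "ignore").decode("ascii")
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_only.lower()).strip("-")


print(generate_url_name("Hyvää päivää"))   # accented Latin letters survive as plain ASCII
print(generate_url_name("Привет"))         # Cyrillic yields an empty name with this approach
```

A Russian title comes out empty here, which shows why a per-character transliteration table in the style of Unidecode is needed for anything beyond accented Latin text.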

The results were not perfect, of course, but at least West European and Scandinavian languages, Russian, Polish, Greek, Maori and Amharic worked perfectly. Arabic, Hebrew, Chinese, Korean, Thai and Vietnamese produced results that were possibly correct. Japanese (hiragana and katakana), Devanagari and Georgian did not work at all. A good start nevertheless. Here are some tests:

Utf8-Transliteration-Tests


