Revision [1460]

This is an old revision of DartarI18N made by DotMG on 2004-09-28 05:41:33.

 

DarTar's approach to I18N


 


I'd like to share with you some thoughts on a straightforward way to have both internationalization (translation of kernel/action messages into other languages) and UTF-8 multilanguage support (the possibility to display/edit content in other charsets). This is meant as a partial answer to DotMG's WikkaInternationalization problem with the character "ç", which is treated as an HTML entity.

My idea basically consists of two steps:

  1. Make the wiki engine UTF-8 compliant. This is done by following AndreaRossato's HandlingUTF8 instructions. A working version of a Wikka Wiki supporting UTF-8 can be found here. Together with the Mod040fSmartPageTitles Smart-title feature, this gives beautiful page titles for wikka pages typed in different languages.
  2. Handle the language-specific strings directly from Wikka pages.

I'll assume that step 1 is already done and show how one can easily manage the translation of wikka strings from internal wikka pages (step 2).


A. Build language description pages

A language description page [LDP] is a wikka page containing a list of translated kernel/action messages. The name of an LDP might be - for ease of reference - the ISO 639 code of the corresponding language. Kernel/action messages are identified by a unique key. The syntax is elementary:

E.g.
key1: translated string1
key2: translated string2
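
For example, a hypothetical French LDP (saved, say, as a page named FR; the key names here are invented for illustration, with wp reused from section C below) might contain:
wp: Désolé, vous avez entré un mauvais mot de passe.
nf: Désolé, cette page n'existe pas.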
 


The Russian and Chinese LDPs, for example, will look like this:

 (image: http://www.openformats.org/images/ru.jpg)

 (image: http://www.openformats.org/images/ch.jpg)

Note: Apologies for the bad choice of key names (ru1, ru2, etc.). Keys identify messages independently of a specific language, so for a given key, every LDP will have a different representation.

B. Build a LDP parser

We then need to parse an LDP and make every translated string available through its key.
The following action (I will call it actions/getlang.php) shows how to do this in a few lines of code:

<?php
// Get the LDP tag from the action parameters ({{getlang lang="..."}});
// in a Wikka action the parameters are available in the $vars array.
$lang = isset($vars["lang"]) ? $vars["lang"] : "";
$page = $this->LoadPage($lang);
if ($page) {
    // parse page
    $output = $this->Format($page["body"]);
    $decl = explode("\n", $output);
    foreach ($decl as $row) {
        // limit to 2 fields, so translated strings may themselves contain ": "
        $l = explode(": ", $row, 2);
        // skip lines that do not contain a key/value pair
        if (count($l) < 2) continue;
        // set key
        $key = strip_tags($l[0]);
        // set translated string
        $value = strip_tags($l[1]);
        print $this->Format("Variable: **".$key."** has value: '".$value."' ---");
    }
} else {
    print $this->Format("Sorry, no language definition page was found!");
}
?>


This sample action (to be used as {{getlang lang="LDP tag"}}) gives, for Russian and Chinese respectively, the following output:

 (image: http://www.openformats.org/images/ru_parsed.jpg)

 (image: http://www.openformats.org/images/ch_parsed.jpg)

Note: The examples above show, by the way, that ":" is probably not the best field separator for an LDP: ru3 in the Russian LDP is truncated after the first ":". Passing a limit of 2 as the third argument to explode() avoids the truncation, but a separator that cannot appear in translated strings would still be safer. Other suggestions are welcome.


With some minor modifications, a similar parser can be implemented as a kernel function (let's call it TranslateString()) which, once a language is specified (see below), will load an LDP, build an array with all the translated strings associated with the corresponding keys, and return the required string.
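
As a rough sketch of the idea (the function names are hypothetical; inside Wikka this would be a method of the kernel class fetching the page body via $this->LoadPage() rather than free functions):

```php
<?php
// Sketch of the LDP-parsing half of the proposed TranslateString()
// kernel function: turn an LDP page body into a key => translation map.
function ParseLDP($body)
{
    $strings = array();
    foreach (explode("\n", $body) as $row) {
        // limit to 2 fields, so translated strings may contain ": "
        $l = explode(": ", $row, 2);
        if (count($l) == 2) {
            $strings[trim(strip_tags($l[0]))] = trim(strip_tags($l[1]));
        }
    }
    return $strings;
}

// Look up one key in the parsed map; fall back to the key itself
// if the LDP has no translation for it.
function TranslateString($strings, $key)
{
    return isset($strings[$key]) ? $strings[$key] : $key;
}
```

In the kernel version, the parsed array would be built once per request and cached, so every later call to TranslateString() is a simple array lookup.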


C. Replace any occurrence of English kernel/action messages with calls to the translation function

For instance, instead of:

  $newerror = "Sorry, you entered the wrong password.";


we will have something like

  $newerror = $this->TranslateString("wp");


where wp is the key associated with the translations of "Sorry, you entered the wrong password." in the different LDPs.

D. Let the user choose his/her preferred language

Once this big replacement work is done in wikka.php, handlers/*, formatters/* and actions/*, and the first LDPs are built (DotMG has already done a lot of translation work), users will be able to choose a specific LDP as the wiki's main language in their personal settings.
This option (stored in a dedicated column of the wikka_users table, or alternatively set as a default by Wikka admins in the configuration file) will tell the TranslateString() function which LDP has to be used for generating the translated kernel/action strings.
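
The lookup of the preferred language could be sketched as follows (the "language" field stands for the proposed wikka_users column and is an assumption, as is the shape of the $user record; in Wikka the record would come from something like $this->GetUser()):

```php
<?php
// Sketch: decide which LDP to use for the current request.
// Prefer the user's stored choice; fall back to the site-wide
// default set by the admins in the configuration file.
function GetUserLanguage($user, $config_default)
{
    if (is_array($user) && !empty($user["language"])) {
        return $user["language"];
    }
    return $config_default;
}
```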

That's all folks!

The implementation of a multilanguage/localized version of Wikka, following the above instructions, should be quite straightforward. The benefit of this approach is that translators can contribute their strings by typing them directly into the corresponding wikka pages from their browsers (no need to bother with external files and text-encoding problems: all the encoding work is done through Andrea's conversion functions). Complete LDPs might then be distributed together with the default install.
Now, the big question: what is the impact on general performance of a call to the database every time a page is generated?

Your thoughts and comments are welcome

-- DarTar

I am not very keen on UTF-8.
For me, the best way to perform i18n is to let the charset be chosen dynamically for every page. One page may be iso-8859-1, another UTF-8. If we set it statically to UTF-8, the page won't allow ç or à, and we must convert every page to be UTF-8 compliant. Won't that decrease performance significantly?
Let's take openformats.org as an example. Suppose it has a French translation and a Chinese translation. Chinese words won't appear in a French page, nor French words in Chinese pages. So we can set the charset to iso-8859-1 for the French translation (and the page will contain ç or à), and a Chinese charset for Chinese pages.
-- DotMG