To improve the I18N of Wikka, it would be a good idea to add support for different interface translations. The best way to handle this kind of translation is gettext, a mature localization framework that is widely used (WordPress, phpwiki) and pretty much the de facto standard in the open source/free software world.

Download
You can download the testing package with gettext support below; some directories have been removed to reduce the file size.
It should be installed like the normal Wikka; the only additional option in the setup process is the language selection. There are two languages available right now: Catalan and English. You can try installing it in Catalan to see that the final result is a Wikka site, but in Catalan (a few strings are still missing).

wikka-1.1.5.3-gettext.tar.gz (140k) Updated on October 6, 2004

Tasks
  1. gettext support -- done
  2. different domains (wikka and wikka-setup) -- done (see the setup sketch below)
  3. locale selection from the installer -- done
  4. script to update po files -- done
  5. make strings translatable -- 142 strings so far
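For reference, a minimal sketch of how the gettext setup with these two domains might look; the domain names come from the task list above, while the locale path and language value are only assumptions for illustration.

<?php
    // Sketch only: two text domains, one for the runtime interface ("wikka")
    // and one for the pages created during setup ("wikka-setup").
    // The locale path and language below are assumptions, not shipped defaults.
    putenv("LANG=ca_ES");
    setlocale(LC_ALL, "ca_ES");

    bindtextdomain("wikka", "./locale");
    bindtextdomain("wikka-setup", "./locale");
    textdomain("wikka"); // default domain for _() and gettext()

    print(_("Hello Wikka!"));                               // looked up in wikka.mo
    print(dgettext("wikka-setup", "Welcome to this wiki")); // looked up in wikka-setup.mo
?>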

Known problems
Hardcoded WikiPages
actions/textsearch.php:22 TextSearchExpand
actions/footer.php:3 TextSearch
actions/usersettings.php:118 MyPages
actions/usersettings.php:119 MyChanges
actions/category.php:8 Category Category

If we want to let users choose their preferred language from usersettings, we can't translate hardcoded WikiPages. These always stay the same, since they are in the database and were translated during the setup process. Possible solutions:

PHP-gettext, a class that can read .mo files directly
At http://savannah.nongnu.org/projects/php-gettext/ you can find a class, consisting of 2 files, that can read .mo files even if you don't have gettext support installed. It requires nothing other than PHP and is very small. I (-- DotMG --) was not a fan of gettext until I found this class. It is GPL-licensed.
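A rough usage sketch, assuming the FileReader and gettext_reader classes from that project; the .mo file path below is made up.

<?php
    // Sketch only: read a compiled catalog with the PHP-gettext class instead
    // of the gettext extension. The catalog path is an assumption.
    require_once("streams.php");  // provides FileReader
    require_once("gettext.php");  // provides gettext_reader

    $stream = new FileReader("./locale/ca_ES/LC_MESSAGES/wikka.mo");
    $translator = new gettext_reader($stream);

    print($translator->translate("Hello Wikka!"));
?>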

FAQ
How do I make strings translatable with gettext?
It is simple: just wrap them in the _() or gettext() function.

Example (simple):
<?php
    print(_("Hello Wikka!"));
?>

Example (formatted):
<?php
    $here = $this->Format('[[PasswordForgotten | '._("here").']]');
    printf(_("If you need a temporary password, click %s."), $here);
?>
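Plural forms are also supported, via ngettext(); the example below is hypothetical (it is not a string from the current patch), but shows the pattern.

Example (plurals, hypothetical):
<?php
    // Hypothetical string, not part of the current patch; ngettext() picks the
    // singular or plural form depending on $n.
    $n = 3; // e.g. number of search results
    printf(ngettext("Found %d page.", "Found %d pages.", $n), $n);
?>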


What do we gain with gettext?

Categories
CategoryDevelopmentI18n
Comments
Comment by DarTar
2004-10-05 16:41:20
Jorda,
I've downloaded and installed your distribution, but it speaks... plain English, even if the locale is set to Ca_ES. Am I missing something?

BTW it would be nice to have the user be able to choose his/her preferred language from usersettings.
Comment by JordaPolo
2004-10-05 18:29:27
Thanks for downloading it, at least there is someone interested here! ;)

It should be ca_ES, not Ca_ES.

Can you try to add the following line in wikka.php, under "set gettext configuration":
putenv("LANG=ca_ES");

Or perhaps try using "./locale" instead of "locale" (the "locale_path" value in wikka.config.php).

If this does not work, can you tell me something about your configuration? Operating System?
Comment by JordaPolo
2004-10-05 18:32:38
I have just created #wikka on irc.freenode.net. You can join to talk about Wikka (and Gettext, of course).
Comment by LaurentBurgbacher
2004-10-06 05:32:37
Hello, I haven't had time to test it yet, but I've been waiting for a gettextized version since I started using Wikka. So thanks a lot!

My question is: will this gettext version be integrated into the standard Wikka version?
Comment by JordaPolo
2004-10-06 05:49:25
I hope so, though we will have to fix some problems first.
I have already mailed Jason... no response yet.
Comment by JsnX
2004-10-06 16:47:20
I like the idea of using gettext in the standard Wikka distribution, but I wanted to wait a few days to find out if anyone would come forward with objections. There haven't been any objections. However, if anyone has objections to using gettext, now is the time to mention them....
Comment by DarTar
2004-10-06 17:34:52
JordaPolo, I'd like to know whether your gettext-based approach is extensible to L10n in charsets other than ISO-8859-1, in particular:
1) What is the appropriate charset for string-translation files?
2) How does your approach deal (if at all) with translated strings encoded in different charsets? If one day Wikka is internationalized so as to accept content in different charsets, it would be nice to have l10n strings in the same charsets without needing to reinvent l10n from scratch. This might require being able to detect whether a file is using ISO-8859-1, ISO-8859-8, etc. Or maybe translated strings should be saved as Unicode?
Comment by JsnX
2004-10-06 18:04:42
Oh yeah, now I remember that DarTar did raise a few questions (objections). ;)
Comment by JordaPolo
2004-10-06 18:10:26
DarTar, it is possible to define the charset in the .po files.

But I think migrating to UTF8 would be the best way to handle proper internationalization. It is also the easiest way: run the .php files through iconv and change the charset (though that would probably require dropping support for MySQL 3).
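(Sketch only: if the catalogs were converted to UTF-8, the output charset could presumably be forced per domain with bind_textdomain_codeset(); the path is an assumption.)

<?php
    // Sketch: ask gettext to return "wikka" strings as UTF-8, whatever charset
    // the .po/.mo files declare. The locale path is an assumption.
    bindtextdomain("wikka", "./locale");
    bind_textdomain_codeset("wikka", "UTF-8");
    textdomain("wikka");
?>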
Comment by DotMG
2004-10-07 07:20:09
I know nothing about gettext. How does it convert a string between languages? My concern is that "big" languages like French and English won't have any problem, while "little" languages like Malagasy (spoken by only 16 million people) have only a small number of computer scientists who can manage the translation. As far as I know, there is still NO Malagasy application using gettext.
And what about a Wikka used on an intranet, on Windows machines not connected to the Internet?
Comment by JordaPolo
2004-10-07 08:05:56
DotMG, I don't see the problem. Malagasy is supported by utf-8, and gettext supports utf-8.

AFAIK, gettext supports almost any language; see the KDE i18n statistics, which include many minority languages.
http://i18n.kde.org/stats/gui/HEAD/index.php

Oh, and Catalan (my main language) is only spoken by 10 million people ;)
Comment by JsnX
2004-10-08 00:04:00
Are we getting closer to a consensus on gettext?
Comment by JavaWoman
2004-10-08 07:08:00
UTF-8 support would be great - but if that means at the same time dropping support for MySQL 3.x, you'll be leaving behind a great many users who don't have version 4.x available on their hosts (I know I don't).
There should be a way in the program to get around that limitation of MySQL 3.x - for instance by encoding the text. URL-encoding would work (I'm using this on a site of my own); you would then apply the encoding only depending on the MySQL version.
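(A minimal sketch of what I mean, with made-up helper names; in practice the encoding would only be applied when an old MySQL version is detected.)

<?php
    // Hypothetical helpers: store text URL-encoded so MySQL 3.x only ever sees
    // plain ASCII, and decode it again when reading it back.
    function encode_for_mysql3($text)    { return rawurlencode($text); }
    function decode_from_mysql3($stored) { return rawurldecode($stored); }

    $body   = "Això és una pàgina";            // UTF-8 input
    $stored = encode_for_mysql3($body);        // ASCII-safe for MySQL 3.x
    $shown  = decode_from_mysql3($stored);     // original text again
?>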
Comment by LaurentBurgbacher
2004-10-08 07:22:23
But we're speaking about encoding wikka itself in UTF-8, not the content. The content can still be saved as ASCII + HTML entities in MySQL; this way we keep MySQL 3 support. Is this a problem? See HandlingUTF8. Am I wrong?
Comment by JavaWoman
2004-10-08 09:17:24
LaurentBurgbacher,
Sorry, I'm getting confused now.
"But we're speaking about encoding wikka in UTF-8, not the content." - What's "wikka" and what's "content" in that statement? For me, what I'm looking at in my browser (or listening to, as the case may be) is the "content" (while the HTML head section is not content, and this not rendered). (IOW, the text on that button that reeads "Delete Comment" is content - no matter where it comes from.)

Anyhow:
- I wholeheartedly support using UTF-8 (Unicode is the only solution for multi-language pages)
- but *only* if this (in Wikka) does *not* require MySQL 4+ (or the latest-and-greatest version of PHP for that matter - we already have dependencies now that shouldn't be there IMO).
Comment by JordaPolo
2004-10-08 11:18:02
An option would be to add support for gettext first (and get more feedback from the users), and then decide how we want to support UTF8.
Comment by LaurentBurgbacher
2004-10-08 14:29:49
JavaWoman, sorry to confuse you. I'll try to be more explicit. We have (in my opinion) two sources of text used to generate a wikka page:
1) the content, i.e. what the user wrote, which goes to the database and can be encoded as ASCII + HTML entities (this covers the UTF8 charset, right?), which does not require more than MySQL 3.0 (see the sketch below)
2) the wikka text, i.e. the text of the buttons, header, footer... which comes from the gettext files (actually the php files) and can be encoded in UTF8 with no influence on the "content" (1)
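(A minimal sketch of point 1, assuming the mbstring extension is available; this is only an illustration of the idea, not the actual Wikka code.)

<?php
    // Sketch: turn non-ASCII characters into numeric HTML entities so the
    // stored page body is plain ASCII and MySQL 3.x can handle it.
    $convmap = array(0x80, 0x10FFFF, 0, 0xFFFFFF);
    $body    = "déjà vu";                                            // UTF-8 input
    $stored  = mb_encode_numericentity($body, $convmap, "UTF-8");    // "d&#233;j&#224; vu"
    $back    = mb_decode_numericentity($stored, $convmap, "UTF-8");  // original again
?>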

About the other points: like you, I'm for the UTF8 solution (because I often use characters which are not in latin-1).

JordaPolo, I think we should go for UTF8 now, because it's the only way to do I18N. UniWakka could be a good source of inspiration. I haven't had time to check its current state, but they claim to support UTF8 (I think...).
Comment by JavaWoman
2004-10-08 17:31:57
LaurentBurgbacher,
Thanks, that's a little clearer now - something along the lines of "user-written content" vs. "system-generated user interface" coming from PHP programs.

I see a number of gray areas here though:
- pages that are "created" on system installation: they have content like "user-written content", but form a part of the user interface of the system as a whole
- actions used in "user-written content" that may have parameters (is the parameter content translatable?)

Looking at the archive I downloaded I see that the "key" texts can have variables - that's good. But there are other issues with i18n, for instance:
- homonyms (English "key" means "sleutel" or "toets" in Dutch, depending on context/semantics; or verb forms that are the same in English for infinitive and command, but different in other languages)
- synonyms
- capitalization rules that are different for various languages
- punctuation rules that are different for various languages (for instance, if you have prompts in a form that in English would end with a colon, the colon should be part of the translatable text, since English would place the colon immediately after the text, French would precede it with a space, and some languages might not normally use a colon at all)
How are such issues handled? Homonyms and capitalization can be major issues; punctuation will be unless punctuation is taken as part of translatable content from the start, and consistently. Consistency of wording used in different functions is also a major concern (especially in combination with the homonyms problem).

Finally, can a user choose (configure) a run-time language?
If so, obviously, this should apply to the "system-generated user interface" (which should not be a problem) as well as the *content* of system-generated pages. The latter would be a problem unless page "content" can also make use of translatable strings. Can it?

I would prefer that there be a user-configurable run-time language; but this would ideally require an end-user interface to translatable strings, or (at least) creation of the system-generated pages in all available languages and easy extensibility when new languages become available.

Platform support: I note two scripts, translate.sh and translate-setup.sh; what about Windows support?

I'm all for i18n, but having been intimately involved with localization of a (commercial) software application, I would caution against being too hasty in implementing a particular solution. Let's look carefully at _all_ issues and tick them off as either "taken care of", "can be done later" (how?) or "not supported".

It's important. But it's *not* easy, however easy a tool may make it seem.
Comment by JordaPolo
2004-10-08 19:01:32
JavaWoman,
Pages that are "created" on system installation are already translatable. We have 2 domains: wikka and wikka-setup. The first one contains user interface, errors... all the strings that appear on actions/*.php, handlers/*.php... The second one, wikka-setup, contains the pages that are created on system installation.

You can change the language, but only the strings in the wikka domain will change, since the wikka-setup strings are already stored in the database.
Comment by JavaWoman
2004-10-08 22:15:18
JordaPolo,
"You can change the language, but only the strings on wikka domain will be changed since wikka-setup strings are already stored in the database."

The system pages that are created on setup are, as I indicated, actually part of the system's user interface. So you'll have a user interface that's only partly in the user's language? In fact, in only one single language, instead of as many languages as are available?
Not good.

There should then be a mechanism to include *dynamically* translatable text in a page's content. Perhaps a special formatter action.
Comment by JordaPolo
2004-10-09 06:23:24
JavaWoman,
The idea behind WikkaGettext is to create internationalized versions of Wikka, not a multilingual version. Yes, it is possible to partially change the user's language, but I'm not interested in that. 1 wikka -> 1 language, but you can choose it.
Comment by JavaWoman
2004-10-09 07:03:30
JordaPolo,
Actually, the idea behind (GNU) *gettext* is internationalization; it seems what you're doing with WikkaGettext is not actually internationalization (which means supporting multiple languages and locales), but *localization*.

The idea is nice, but while we're at it, why not internationalize it?

Apart from that, all of the other issues I've raised haven't been addressed yet...

And then there is the fact that you need a version of PHP that actually supports gettext - something that may not be the case for all hosted versions of PHP.
Comment by JordaPolo
2004-10-09 08:24:06
JavaWoman,
About the other issues:
- punctuation: it is taken as part of the translatable content
- capitalization: I don't see the problem here; translators can choose how to capitalize every string...
- homonyms, synonyms: it could be possible to use some kind of contextualization on problematic strings, for example: context|string
- scripts: only needed by translator coordinator, translators use tools like poedit, gtranslator, kbabel, vim...
- php gettext: it is possible to create our own _() function if gettext is not installed (as it is done in phpwiki)
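(A minimal sketch of such a fallback _() function, pass-through only; not phpwiki's actual code.)

<?php
    // Sketch: if the gettext extension is missing, define _() as a simple
    // pass-through so the untranslated strings are shown unchanged.
    if (!function_exists('_')) {
        function _($string) {
            return $string;
        }
    }
?>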

I have been doing both internationalization (adding support for multiple languages) and localization (adapting an internationalized program, in this case to Catalan). The only problem (as you see it) is that you can't change the language on the fly. And that could be fixed, but...

Well, if you want to change the language of wikka on the fly, then I think gettext is not the way to go. Why? It would require the server to have all the locales installed, and that could be a real problem on some servers.
Comment by JavaWoman
2004-10-10 08:45:38
Just a short note - mainly to JsnX, who wanted to know if we are "getting closer to a consensus on gettext":

1) I'm all in favor of enabling Wikka for l10n and/or i18n. And enabling UTF-8 as well.
2) I'm definitely opposed to doing it *now*.
Not to detract from JordaPolo's work in creating this demo (which nicely shows that it /can/ work - thanks for all the work involved!), but in my opinion the code base of Wikka (as it is now) is simply not ready for proper i18n (and this demo actually illustrates that, if you look closely enough). The problem is not with JordaPolo's work, but rather with the lack of coding standards that should _enable_ i18n.

The issues I've raised in my comments on this page actually only scratch the surface. I'll start a separate page to present my ideas more clearly than I can do in comments here - but have a little patience, I'm still working on my email stuff and that takes a fair amount of concentration. ;-)

In short: yes, let's do it. But PLEASE let's not do it *now*.
Comment by DotMG
2004-10-11 11:09:12
>>
- scripts: only needed by translator coordinator, translators use tools like poedit, gtranslator, kbabel, vim...
- php gettext: it is possible to create our own _() function if gettext is not installed (as it is done in phpwiki)
<<

If that means gettext can be used in a non-Unix environment, I'm OK. Actually, I get this error message on the Windows platform: "Fatal error: Call to undefined function: bindtextdomain() in path_to\wikka.php on line 910"

Creating our own _() function is a good idea, as I will use Wikka on intranet machines running under Windows. We (I) also need instructions on how to perform the translator's tasks.

I absolutely agree with JavaWoman: yes, let's do it. But PLEASE let's not do it *now*.
Comment by JavaWoman
2004-10-11 11:26:39
DotMG,
Try a Google search for ["Call to undefined function: bindtextdomain()"] (where [] indicates the textbox); it's a common problem. Put simply, you need to 1) install the Windows version of gettext (runtime _and_ tools), which you can get from the GNU ftp site; and 2) enable it for PHP (php.ini).
If you're running on an intranet that should not be a problem - for hosted sites it might be.
At the very least, Wikka with gettext support should come with a clear manual on how to enable it on various platforms, as well as what to do when it isn't (can't be) supported. There are many more gotchas than just this "little" one, as I found in my research last weekend; the documentation at http://php.net/manual/en/ref.gettext.php (and all the functions!) will give you a taste - in particular the comments on those pages; and that's just for starters.
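(Sketch only: an explicit check at startup would at least make the failure clearer than the fatal error DotMG saw; the message text is made up.)

<?php
    // Sketch: fail with a readable message instead of a fatal error when the
    // gettext extension is not available. Message wording is made up.
    if (!function_exists("bindtextdomain")) {
        die("PHP gettext support is not available. Install the GNU gettext "
          . "runtime and tools and enable the gettext extension in php.ini.");
    }
?>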
Comment by JordaPolo
2004-10-12 07:11:28
It seems like a good idea to clear things up first.
Again, depending on what kind of I18N you want to have in Wikka, gettext might (or might not) be a good solution.
Comment by AndreaRossato
2004-10-12 08:46:04
I must confess I do not like the gettext approach:
- it requires PHP to be compiled with a non-default option,
- AFAIK it is far from simple to set up.

Well, I should not raise the first objection, since in UniWakka, the wakka fork I'm developing, I'm using mbstring to handle utf-8 (there is a test version that does not require it, but I did not have time to test it in depth and so I did not include it in the released version - check it out at http://wikka.jsnx.com/HandlingUTF8).
The second objection could be due to the fact that I'm not familiar with gettext... ;-)

Anyway, the approach I would like to take is the one adopted by CooCooWakka:
http://www.hsfz.net.cn/coo/wiki/HomePage
Get the source and have a look. They support Chinese and English (they claim to support utf-8, but I did not find any code related to it and did not test it).

I18n without proper utf-8 support is quite useless, IMHO. I18n makes sense insofar as the application can support a wide variety of languages, and multiple languages at once. With the CooCooWakka approach every user can choose the UI locale. With utf-8 support you do not need to take charset definitions into account, and stupid users will not mess everything up. Have a look here to understand what I mean: http://wiki.cs.cityu.edu.hk/CitiWiki/Comment
CitiWiki defined utf-8 as the default charset, but wakka (CitiWiki is a wakka fork too) does not support it! If you let users define the charset, this kind of issue will show up, that's for sure. Users will blame the developer. And they are right, even if stupid... ;-)

Now, I've tried to demonstrate that real multi-language support is possible without having to drop mysql 3.x.x support. AFAIK DarTar ported my API to Wikka, and he said it is working just fine. I refined it so that the memory consumption is not so burdensome, and, since it does not require mbstring support in PHP, it can be easily adopted. Further improvements are always possible.

As usual, my 2 cents.
Comment by AndreaRossato
2004-10-12 09:16:17
On utf-8 with mysql 3.x.x: there are downsides, obviously. If you store multi-byte characters as single-byte unicode entities, you cannot use mysql's FULLTEXT search capabilities, and recreating those capabilities with a complex query is far from simple. I did not even bother to try (also because these capabilities are quite obscure to me).
That's a trade-off, and I decided that multi-language support with mysql 3.x.x was worth the price. In the Far East there are many people involved in wiki technologies. They need multi-byte support, even though mysql and php support for multi-byte sucks. Letting them use UniWakka was one of my goals.
When mysql-4.1.x is common enough, this approach can be changed.
Comment by JavaWoman
2004-10-30 09:43:21
Just an interesting blog post I stumbled over, making an excellent point about i18n:
Computing in Zulu -
http://itre.cis.upenn.edu/~myl/languagelog/archives/001596.html