HandlingUTF8:Wikka

Real Multilanguage Support

Please note that Wikka 1.4 will be released with full UTF-8 support, and is currently available for testing at http://wush.net/svn/wikka/trunk

See also:

WikkaLocalization

List of sites powered by Wikka in 35 languages.

Current i18n/l10n development pages.

Test page for multilanguage support

Here's some code to provide real multilanguage support.
The first 3 functions are helpers used by the functions that do the real encoding conversions.
str2utf8, str2ascii and str2iso8859 can take any encoded string and convert it into the desired encoding: ASCII plus Unicode entities for HTML output, ISO-8859-1 plus Unicode entities for database storage, and UTF-8 for forms.
Unfortunately the ASCII and ISO-8859-1 output is not compatible with htmlspecialchars, which would escape the & of every entity. This is the reason for the valid_xml function: it does the same job as htmlspecialchars, but handles & correctly.
How do you use these functions? For instance,

in formatters/wakka.php you should use:

print($this->str2ascii($text));

in wikka.php, function SavePage you should use:

"body = '".mysql_escape_string(trim($this->str2iso8859($body)))."'");

in handlers/page/edit.php you should use:

"<textarea rows=\"40\" cols=\"60\" onkeydown=\"fKeyDown()\" name=\"body\" style=\"width: 100%; height: 400px\">".$this->valid_xml($this->str2utf8($body))."</textarea><br />\n"

And so on....
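
To see why htmlspecialchars cannot be applied to text that already contains numeric entities, here is a minimal standalone sketch. The function name valid_xml_sketch and the sample string are made up for illustration; the real valid_xml is a method of the Wakka class, listed further down.

```php
<?php
// Hypothetical standalone variant of valid_xml, for illustration only.
function valid_xml_sketch(string $str): string
{
    // Escape quotes and angle brackets first.
    $str = str_replace(['"', '<', '>'], ['&quot;', '&lt;', '&gt;'], $str);
    // Escape only those ampersands that do not already start an entity.
    return preg_replace('/&(?![a-zA-Z0-9#]+;)/', '&amp;', $str);
}

$stored = 'a & b &#1055; <i>';          // text as stored: one bare &, one entity
$broken = htmlspecialchars($stored);    // escapes the & of the entity as well
$kept   = valid_xml_sketch($stored);    // leaves the entity intact

echo $broken, "\n"; // a &amp; b &amp;#1055; &lt;i&gt;
echo $kept, "\n";   // a &amp; b &#1055; &lt;i&gt;
```

With htmlspecialchars the browser would show the literal text &#1055; instead of the Cyrillic letter, which is the incompatibility mentioned above.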

Update (2004-08-14): I changed the functions that do the conversion to improve speed and reduce memory usage.
--AndreaRossato

Check it out here.

The bits:

<?php
//Multilanguage support. We will use: utf-8 for user input, iso8859-1 + unicode for database storage and ascii + unicode for printing
function utf8_to_unicode($str) {
$unicode = array();
$values = array();
$lookingFor = 1;
for ($i = 0; $i < strlen($str); $i++ ) {
$thisValue = ord( $str[$i] );
if ( $thisValue < 128 ) $unicode[] = $thisValue;
else {
if ( count( $values ) == 0 ) $lookingFor = ( $thisValue < 224 ) ? 2 : 3;
$values[] = $thisValue;
if ( count( $values ) == $lookingFor ) {
$number = ( $lookingFor == 3 ) ?
( ( $values[0] % 16 ) * 4096 ) + ( ( $values[1] % 64 ) * 64 ) + ( $values[2] % 64 ):
( ( $values[0] % 32 ) * 64 ) + ( $values[1] % 64 );
$unicode[] = $number;
$values = array();
$lookingFor = 1;
}
}
}
return $unicode;
}
function deCP1252 ($str) {
//bytes 128-159 arrive here as numeric entities; map them from CP1252 to their real unicode code points
$str = str_replace("&#128;", "&#8364;", $str);
$str = str_replace("&#129;", "", $str);
$str = str_replace("&#130;", "&#8218;", $str);
$str = str_replace("&#131;", "&#402;", $str);
$str = str_replace("&#132;", "&#8222;", $str);
$str = str_replace("&#133;", "&#8230;", $str);
$str = str_replace("&#134;", "&#8224;", $str);
$str = str_replace("&#135;", "&#8225;", $str);
$str = str_replace("&#136;", "&#710;", $str);
$str = str_replace("&#137;", "&#8240;", $str);
$str = str_replace("&#138;", "&#352;", $str);
$str = str_replace("&#139;", "&#8249;", $str);
$str = str_replace("&#140;", "&#338;", $str);
$str = str_replace("&#145;", "&#8216;", $str);
$str = str_replace("&#146;", "&#8217;", $str);
$str = str_replace("&#147;", "&#8220;", $str);
$str = str_replace("&#148;", "&#8221;", $str);
$str = str_replace("&#149;", "&#8226;", $str);
$str = str_replace("&#150;", "-", $str);
$str = str_replace("&#151;", "&#8212;", $str);
$str = str_replace("&#152;", "&#732;", $str);
$str = str_replace("&#153;", "&#8482;", $str);
$str = str_replace("&#154;", "&#353;", $str);
$str = str_replace("&#155;", "&#8250;", $str);
$str = str_replace("&#156;", "&#339;", $str);
$str = str_replace("&#159;", "&#376;", $str);
return $str;
}
function code2utf($num){
if($num<128)return chr($num);
if($num<2048)return chr(($num>>6)+192).chr(($num&63)+128);
if($num<65536)return chr(($num>>12)+224).chr((($num>>6)&63)+128).chr(($num&63)+128);
if($num<2097152)return chr(($num>>18)+240).chr((($num>>12)&63)+128).chr((($num>>6)&63)+128). chr(($num&63)+128);
return '';
}
//to print in a form
function str2utf8($str) {
mb_detect_order("ASCII, UTF-8, ISO-8859-1");
if (mb_detect_encoding($str) == "UTF-8") {
preg_match_all("/&#([0-9]*?);/", $str, $unicode);
foreach( $unicode[0] as $key => $value) {
$str = preg_replace("/".$value."/", $this->code2utf($unicode[1][$key]), $str);
}
return $str;
} else {
$mystr = $str;
$str = "";
for ($i = 0; $i < strlen($mystr); $i++ ) {
$code = ord( $mystr[$i] );
if ($code >= 128 && $code < 160) {
$str .= "&#".$code.";";
} else {
$str .= $this->code2utf($code);
}
}
$str = $this->deCP1252($str);
preg_match_all("/&#([0-9]*?);/", $str, $unicode);
foreach( $unicode[0] as $key => $value) {
$str = preg_replace("/".$value."/", $this->code2utf($unicode[1][$key]), $str);
}

return $str;
}
}

//to print html
function str2ascii ($str) {
mb_detect_order("ASCII, UTF-8, ISO-8859-1");
$encoding = mb_detect_encoding($str);
switch ($encoding) {

case "UTF-8":
preg_match_all("/&#([0-9]*?);/", $str, $unicode);
foreach( $unicode[0] as $key => $value) {
$str = preg_replace("/".$value."/", $this->code2utf($unicode[1][$key]), $str);
}
$unicode = $this->utf8_to_unicode($str);
$entities = '';
foreach( $unicode as $value ) {
$entities .= ( $value > 127 ) ? '&#' . $value . ';' : chr( $value );
}
return $this->deCP1252($entities);
break;

case "ISO-8859-1":
$constr = '';
for ($i = 0; $i < strlen($str); $i++ ) {
$value = ord( $str[$i] );
if ($value <= 127)
$constr .= chr( $value );
else $constr .= '&#' . $value . ';';
}//for

return $this->deCP1252($constr);
break;
case "ASCII":
return $this->deCP1252($str);
break;
}
}

//for database storage
function str2iso8859 ($str) {
mb_detect_order("ASCII, UTF-8, ISO-8859-1");
$encoding = mb_detect_encoding($str);
switch ($encoding) {

case "UTF-8":
preg_match_all("/&#([0-9]*?);/", $str, $unicode);
foreach( $unicode[0] as $key => $value) {
$str = preg_replace("/".$value."/", $this->code2utf($unicode[1][$key]), $str);
}
$unicode = $this->utf8_to_unicode($str);
$entities = '';
foreach( $unicode as $value ) {
if ($value <= 127)
$entities .= chr( $value );
elseif ($value > 159 && $value <= 255 )
$entities .= chr( $value );
else $entities .= '&#' . $value . ';';
}
return $this->deCP1252($entities);
break;

case "ISO-8859-1":
$constr = '';
for ($i = 0; $i < strlen($str); $i++ ) {
$value = ord( $str[$i] );
if ($value > 127 && $value < 160 )
$constr .= '&#' . $value . ';'; //CP1252 range: entity, remapped by deCP1252
else $constr .= chr( $value ); //ascii and 160-255 stay single bytes
}//for

return $this->deCP1252($constr);
break;
case "ASCII":
//plain ascii needs no conversion
return $this->deCP1252($str);
break;
}
}

function valid_xml ($str) {
$str = str_replace("\"", "&quot;", $str);
$str = str_replace("<", "&lt;", $str);
$str = str_replace(">", "&gt;", $str);
//escape only the ampersands that do not already start an entity
$str = preg_replace("/&(?![a-zA-Z0-9#]+?;)/", "&amp;", $str);
return $str;
}
?>

--AndreaRossato
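
Note that utf8_to_unicode above only distinguishes two-byte and three-byte sequences, so characters outside the Basic Multilingual Plane (four bytes in UTF-8) are not decoded correctly, although code2utf can encode them. A minimal sketch of a decoder that also covers four-byte sequences follows. The function name utf8_to_codepoints is hypothetical and not part of the patch; it does not validate continuation bytes.

```php
<?php
// Hypothetical helper, not part of the original patch: decodes a UTF-8 string
// into an array of code points, including four-byte sequences.
// Continuation bytes are not validated; invalid lead bytes are skipped.
function utf8_to_codepoints(string $str): array
{
    $codepoints = [];
    $length = strlen($str);
    for ($i = 0; $i < $length; $i++) {
        $byte = ord($str[$i]);
        if ($byte < 0x80) {                   // one byte: plain ASCII
            $codepoints[] = $byte;
            continue;
        }
        if (($byte & 0xE0) === 0xC0) {        // lead byte of a 2-byte sequence
            $extra = 1;
            $value = $byte & 0x1F;
        } elseif (($byte & 0xF0) === 0xE0) {  // lead byte of a 3-byte sequence
            $extra = 2;
            $value = $byte & 0x0F;
        } elseif (($byte & 0xF8) === 0xF0) {  // lead byte of a 4-byte sequence
            $extra = 3;
            $value = $byte & 0x07;
        } else {
            continue;                         // stray continuation or invalid byte
        }
        if ($i + $extra >= $length) {
            break;                            // truncated sequence at end of input
        }
        for ($j = 0; $j < $extra; $j++) {
            // Each continuation byte contributes its low six bits.
            $value = ($value << 6) | (ord($str[++$i]) & 0x3F);
        }
        $codepoints[] = $value;
    }
    return $codepoints;
}

// "A", e-acute, euro sign and U+1F600, written as explicit UTF-8 bytes.
print_r(utf8_to_codepoints("A\xC3\xA9\xE2\x82\xAC\xF0\x9F\x98\x80"));
```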

hmm... i may have the solution but i need to understand the problem ;)

first i don't know why not to take the utf8_decode and utf8_encode functions to handle the conversion itself (but maybe there is a reason i didn't think about) (the correct functions would have been html_entity_decode($string, ENT_QUOTES, 'UTF-8') and htmlentities($string, ENT_QUOTES, 'UTF-8'), but these functions aren't able to handle multibyte chars yet. the mbstring lib might give a more straightforward and performant solution. andrea's sample code should be valuable to understand what happens, but i am still looking for a variant that doesn't "contaminate" the code too much and keeps it maintainable. a good start might be to introduce two functions "Formstring()" and "DBstring()" which do all the conversion stuff, including mysql_escape_string and such, so that the conversion stuff is maintained in one central place in future steps)

second it's not perfectly clear to me, how to treat clients that don't accept utf-8 encoding. i haven't had much time to get into the stuff, but so far i think the following tasks have to be managed:

determine the most convenient charset (that's easy, just have a look at $HTTP_ACCEPT_CHARSET)

set the appropriate http-header in header.php and - if needed - set a flag $this->config["use_utf8"] = true;

do the conversions on form-data if use_utf8 is set (this sounds like a busy task)

convert the $_POST data back to iso-8859-1 (the charset we'll internally work with)

leave the formatter untouched, which should be fed with iso-data (and entities), if i made no mistake in the points above. instead, take the buffered output stored in the variable $output at $wakka->includebuffered and convert it at once, namely to utf-8, which is what the client expects if it sends utf-8 form data.

what i don't understand yet is:

what to do with the wikiword-recognition, which is designed for the latin alphabet. i think at least the [[forced | links]] should work in every language.

will the diff-engine work (not worse than now), when it's fed with html-entities and nothing but entities (this will happen with a page that only stores the quotation of an aramaic bible-text)

how will the fulltext-search behave

and of course am i right with the tasklist above. something's missing? something's wrong?

btw: what wakka-forks already exist, that are redesigned for the needs of a foreign charset? isn't wackowiki a russian spin off? do we have some cyrillic speaking wikka-fans out there? ;)

-- dreckfehler
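
The first two tasks in the list above (pick a charset from the Accept-Charset request header, then send the matching HTTP header) could be sketched roughly as follows. The function name client_accepts_utf8 and the variables are hypothetical; this is an illustration under those assumptions, not code from Wikka.

```php
<?php
// Hypothetical helper: decides whether an Accept-Charset value allows UTF-8.
function client_accepts_utf8(?string $acceptCharset): bool
{
    // A missing header means that any charset is acceptable.
    if ($acceptCharset === null || trim($acceptCharset) === '') {
        return true;
    }
    $utf8Q = null; // quality value given explicitly for utf-8, if any
    $starQ = null; // quality value given for the "*" wildcard, if any
    foreach (explode(',', $acceptCharset) as $item) {
        $parts = explode(';', $item);
        $name = strtolower(trim($parts[0]));
        $q = 1.0; // default quality when no q parameter is given
        foreach (array_slice($parts, 1) as $param) {
            if (preg_match('/^\s*q\s*=\s*([0-9.]+)\s*$/i', $param, $m)) {
                $q = (float) $m[1];
            }
        }
        if ($name === 'utf-8') {
            $utf8Q = $q;
        } elseif ($name === '*') {
            $starQ = $q;
        }
    }
    // An explicit utf-8 entry wins over the wildcard.
    if ($utf8Q !== null) {
        return $utf8Q > 0;
    }
    return $starQ !== null && $starQ > 0;
}

$useUtf8 = client_accepts_utf8($_SERVER['HTTP_ACCEPT_CHARSET'] ?? null);
$charset = $useUtf8 ? 'utf-8' : 'iso-8859-1';
if (!headers_sent()) {
    // Must run before any output, e.g. at the top of header.php.
    header('Content-Type: text/html; charset=' . $charset);
}
```

The result of the check could then be stored in a flag such as the use_utf8 setting proposed above, so the later conversion steps know which charset the client was promised.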

There's a Wakka fork redesigned to support multi-language: UniWakka -:).

I'll try to clarify the problem, as far as I can ;)
The problem with character encoding is that UTF-8 is a multi-byte encoding. ASCII text is already valid UTF-8, since the first 128 characters of UTF-8 are encoded as the same single bytes as in ASCII. The problem is the remaining characters, which are encoded with more than 1 byte...
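
A small illustration of the point, with the characters written as explicit byte escapes so the result does not depend on the encoding of the source file (the variable names are only for this example):

```php
<?php
$ascii  = "e";             // U+0065: one byte, same in ASCII and UTF-8
$eAcute = "\xC3\xA9";      // U+00E9 (e with acute accent) in UTF-8: two bytes
$euro   = "\xE2\x82\xAC";  // U+20AC (euro sign) in UTF-8: three bytes

// strlen counts bytes, not characters.
$lengths = [strlen($ascii), strlen($eAcute), strlen($euro)];

// Indexing a string also works on bytes: this is only the lead byte.
$firstByte = ord($eAcute[0]);

print_r($lengths);              // 1, 2, 3
printf("0x%X\n", $firstByte);   // 0xC3
```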

Now, there are two different approaches:
1. you can use an 8-bit encoding (iso-8859-*). That is to say: if you have cyrillic characters you can use iso-8859-5 (or cp-1251). Ascii characters are the same, but above chr(128) you have cyrillic chars. In this case you can use cyrillic but not, for instance, french accented letters (these are not included in iso-8859-5).
This approach lets you use charset metatags to define the encoding. PHP will be able to handle it, since the characters are plain 8-bit. This cannot be called multi-language support: you can only use a very limited set of languages at a time. Period.
This is the Wacko approach.

2. If you want to have cyrillic letters and Italian (or French) accented letters in the same wiki, then you need UTF-8, that is to say, multi-byte characters. PHP will not be able to handle strings with multi-byte encodings: preg_match and preg_replace will not work.
You need to convert those strings into single-byte characters. The only way I was able to find to manipulate those strings is to use iso-8859-1 plus unicode entities.
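
A minimal sketch of the difference, using explicit byte escapes. A byte-oriented pattern sees two "characters" in a UTF-8 encoded letter, while the same letter written as a numeric entity is plain ASCII and matches as expected. The last line uses the PCRE /u modifier, which is not part of the approach described here; it is shown only for comparison.

```php
<?php
$utf8   = "\xC3\xA9";  // e with acute accent as UTF-8: two bytes
$entity = '&#233;';    // the same letter as a numeric entity: plain ASCII

// Without /u the dot matches exactly one byte, so a two-byte letter fails.
$byteMode = preg_match('/^.$/', $utf8);

// The entity form can be matched safely with ordinary byte-oriented patterns.
$entityMatch = preg_match('/^&#[0-9]+;$/', $entity);

// For comparison only: with /u the pattern is applied to code points.
$unicodeMode = preg_match('/^.$/u', $utf8);

var_dump($byteMode, $entityMatch, $unicodeMode); // int(0), int(1), int(1)
```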

WikiWords must be plain ascii, as every URI.
I did not study the WikkaWiki diff engine. But there shouldn't be any problem as long as you use unicode entities for everything above ascii (or iso-8859-1) characters.
The same applies to full-text search. The string to be searched is converted into iso-8859-1 plus unicode entities. And unicode entities can be searched. Have a try here.
html_entity_decode and htmlentities work only with single-byte characters, like every PHP function. As you said, for multi-byte you need to use the mbstring lib. But if you want to use the lib you are going to rewrite every wakka-derived wiki, and you cannot use Perl regular expressions. And this is not going to avoid "contamination" of the code.

Moreover, I would like to ask you to name some user agents that do not support UTF-8. IE, Gecko-derived browsers, Konqueror and Opera do support it. As far as I know Google pages are utf-8 encoded.

--AndreaRossato

"Modern" user agents support UTF-8 - but as far as I know only the graphical ones (i.e., not Lynx or Links - or maybe they do on Unix, but certainly not on Windows); IE at least as far back as 5.01 - don't know about 4.0 (yes there are people that use this); Netscape 4.x has I think only limited support (if at all), and as you say the Gecko-based browsers are OK, as is Opera (6+ at least, not sure about 5).
-- JavaWoman

The "We don't like mbstring" version of the code

<?php
function is_utf8($Str) {
for ($i=0; $i<strlen($Str); $i++) {
if (ord($Str[$i]) < 0x80) continue;
elseif ((ord($Str[$i]) & 0xE0) == 0xC0) $n=1;
elseif ((ord($Str[$i]) & 0xF0) == 0xE0) $n=2;
elseif ((ord($Str[$i]) & 0xF8) == 0xF0) $n=3;
elseif ((ord($Str[$i]) & 0xFC) == 0xF8) $n=4;
elseif ((ord($Str[$i]) & 0xFE) == 0xFC) $n=5;
else return false;
for ($j=0; $j<$n; $j++) {
if ((++$i == strlen($Str)) || ((ord($Str[$i]) & 0xC0) != 0x80))
return false;
}
}
return true;
}

//to print in a form
function str2utf8($str) {
if ($this->is_utf8($str)) {
preg_match_all("/&#([0-9]*?);/", $str, $unicode);
foreach( $unicode[0] as $key => $value) {
$str = preg_replace("/".$value."/", $this->code2utf($unicode[1][$key]), $str);
}
return $str;
} else {
$mystr = $str;
$str = "";
for ($i = 0; $i < strlen($mystr); $i++ ) {
$code = ord( $mystr[$i] );
if ($code >= 128 && $code < 160) {
$str .= "&#".$code.";";
} else {
$str .= $this->code2utf($code);
}
}
$str = $this->deCP1252($str);
preg_match_all("/&#([0-9]*?);/", $str, $unicode);
foreach( $unicode[0] as $key => $value) {
$str = preg_replace("/".$value."/", $this->code2utf($unicode[1][$key]), $str);
}

return $str;
}
}

//ascii for xhtml
function str2ascii ($str) {
if ($this->is_utf8($str)) {

preg_match_all("/&#([0-9]*?);/", $str, $unicode);
foreach( $unicode[0] as $key => $value) {
$str = preg_replace("/".$value."/", $this->code2utf($unicode[1][$key]), $str);
}
$unicode = $this->utf8_to_unicode($str);
$entities = '';
foreach( $unicode as $value ) {
$entities .= ( $value > 127 ) ? '&#' . $value . ';' : chr( $value );
}
return $this->deCP1252($entities);
} else {
$constr = '';
for ($i = 0; $i < strlen($str); $i++ ) {
$value = ord( $str[$i] );
if ($value <= 127)
$constr .= chr( $value );
else $constr .= '&#' . $value . ';';
}//for

return $this->deCP1252($constr);

}
}
//iso8859 for database storage (so we do not need mysql 4.1)
function str2iso8859 ($str) {
if ($this->is_utf8($str)) {
preg_match_all("/&#([0-9]*?);/", $str, $unicode);
foreach( $unicode[0] as $key => $value) {
$str = preg_replace("/".$value."/", $this->code2utf($unicode[1][$key]), $str);
}
$unicode = $this->utf8_to_unicode($str);
$entities = '';
foreach( $unicode as $value ) {
if ($value <= 127)
$entities .= chr( $value );
elseif ($value > 159 && $value <= 255 )
$entities .= chr( $value );
else $entities .= '&#' . $value . ';';
}
return $this->deCP1252($entities);
} else {
$constr = '';
for ($i = 0; $i < strlen($str); $i++ ) {
$value = ord( $str[$i] );
if ($value > 127 && $value < 160 )
$constr .= '&#' . $value . ';'; //CP1252 range: entity, remapped by deCP1252
else $constr .= chr( $value ); //ascii and 160-255 stay single bytes
}
return $this->deCP1252($constr);

}
}
?>

--AndreaRossato
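
One caveat about is_utf8 above: it checks only the shape of the byte sequences, so it also accepts overlong forms and the obsolete five- and six-byte sequences. If a stricter test is wanted without mbstring, one possible sketch relies on the PCRE /u modifier, which validates the subject string before matching. The function name is_valid_utf8 is hypothetical, and this assumes PCRE was built with UTF-8 support.

```php
<?php
// Hypothetical stricter check: with the /u modifier, preg_match returns false
// when the subject is not valid UTF-8, and 1 when the empty pattern matches.
function is_valid_utf8(string $str): bool
{
    return preg_match('//u', $str) === 1;
}

var_dump(is_valid_utf8("\xC3\xA9"));  // valid two-byte sequence
var_dump(is_valid_utf8("\xE9"));      // lone ISO-8859-1 byte
var_dump(is_valid_utf8("\xC0\xAF"));  // overlong encoding of "/"
```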

Links to information on other sites:

The Absolute Minimum Every Software Developer Must Know About Unicode and Character Sets

CategoryDevelopmentI18n

Comments

Comment by DarTar

2004-11-11 10:13:02

An interesting implementation of Wikka supporting chinese charset:
http://notes.siuying.net/wikka.php?wakka=HomePage

Comment by JavaWoman

2004-11-11 13:47:46

No "Chinese charset" there - the pages are simply served with charset=utf-8 which /enables/ Chinese as well as a multitude of other languages.

I noted before that the site had links with Chinese characters (something Wikka doesn't have (yet) - see my remarks on the bugs page) but I had seen no comments yet (also a problem in Wikka with UTF-8 support); I just added a comment to the SandBox page there (http://notes.siuying.net/wikka.php?wakka=SandBox) - and saw the characters (pasted from the Home Page) appearing there, too.

I'm sure it must be only a few simple changes to enable this in Wikka - and seeing all the tests in *our* SandBox it's obviously a feature many people are looking for!

Comment by DarTar

2004-11-11 19:15:08

JW, sorry: I actually wanted to say "chinese characters" and hadn't realized that this site had already been mentioned ;-)

I've noticed too an increasing number of tests in asiatic languages on openformats.org, a site I administer which BTW _already_ supports UTF-8 encoded content, comments and links (see this page: http://www.openformats.org/TestUTF8).

A general-purpose, UTF-8-supporting, php-based and light wiki-engine might have quite a success. Maybe we should postpone the delicate localization issues and accelerate the multilanguage support development. As I have already said somewhere, it took me about 1 hour to patch a Wikka distribution and make it UTF-8 compliant following Andrea's suggestions...

Comment by JavaWoman

2004-11-11 20:05:58

@DarTar,
"Maybe we should postpone the delicate localization issues and accelerate the multilanguage support development."

Yes, I think that would be a good approach; and meanwhile we could also work on gradually "sanitizing" the code so it will better support internationalization. We'll need some coding guidelines for that, I think. (I'm trying to work out an approach while I'm working on my own software modules...)

(How's Leiden? :))

Comment by DarTar

2004-11-12 08:54:48

We might consider, for example, splitting the development of Wikka into a main branch (wikka core unchanged, bug correction, new 'local' features) and an experimental branch with UTF-8 support regularly synchronized with the main branch.

I agree that coding guidelines are needed as well as a word from Jason about the whole plan.
As for coding standards, a good starting point would be to begin writing some brief documentation on the wikka core, any volunteers? ;-) http://wikka.jsnx.com/WikkaCore/edit

Leiden is fine, check your mailbox to learn more ;-)

Comment by JavaWoman

2004-11-12 16:01:42

I like the idea of an experimental "fork" - as long as it's temporary and with the intention of ploughing successful experiments back into the Wikka stable main branch. That could possibly speed up some developments by not having to worry about breaking what's stable now.

Hello JsnX - Ping Jason?

Comment by JavaWoman

2004-12-03 12:38:55

Just came across this PHP class that might be helpful here:
Class: UTF8
http://www.phpclasses.org/browse/package/1974.html
(Nominee for the November 2004 innovation awards)

Comment by JavaWoman

2004-12-06 17:42:55

"The Absolute Minimum Every Software Developer Must Know About Unicode and Character Sets"
Great article! Bookmarked.

Comment by YanB

2005-02-27 14:19:10

Question: any progress on a stable (& 'official') wikka release that supports utf-8?

Comment by NilsLindenberg

2005-02-28 15:49:34

hi YanB. At the moment there is some discussion about a better file structure for Wikka (on WikkaCodeStructure) and some of us are working on setting up the/a CVS. After that we can try out things like the experimental fork mentioned above.

Comment by UserFolderol

2005-08-18 00:38:53

Hello, the instructions at the top state that in handlers/page/edit.php the <textarea> should be initialized with $this->valid_xml($this->str2utf8($body)) instead of htmlspecialchars($body). However, when following that advice, the page looks ok anywhere except in the <textarea>. It doesn't show unicode entities (number-like) anymore (which is a good thing), but instead of displaying the japanese characters, it showed nonsense symbols like this: æœ...

I found that using either str2iso8859 or str2ascii instead of str2utf8 there fixes the <textarea> as it seems at first, but some more testing revealed that by not using str2utf8 as indicated, some pages seem to lose all formatting when saving, as others saved ok, very strange...

Comment by UserFolderol

2005-08-18 02:23:25

Ok, great! Thanks to the article about unicode and character sets (great article btw), I started looking and found out that the code here is completely correct after all and str2utf8 is to be used as indicated.

However, Firefox and IE kept thinking the page was ISO-8859-1 encoded. That explained the junk characters.. and about twice as many as there should've been japanese characters. *even though* I modified the line in the <head> as follows:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

After some toying around, I found out I could actually force utf-8 encoding this way (the first three lines of my header.php now look like this):

<?php
header('Content-type: text/html; charset=utf-8');
$message = $this->GetRedirectMessage();

Only now, the page seems to be correctly interpreted as being UTF-8 encoded and all seems OK now. Seems that just changing the <meta> tag wasn't enough somehow...

I hope this helps any other people that may encounter this problem.

Comment by DarTar

2005-08-18 08:39:30

UserFolderol, thanks for sharing your experiences: I'll add the header modification you suggest to WikkaLocalization.

Comment by TonZijlstra

2005-12-04 11:40:21

How would I go about using both Cyrillic and Roman character sets in the same wiki?
At http://blogwalk.interdependent.biz we are now looking at adding pages in Russian for an event that will take place in Moscow. All other pages are in English, and should remain so.

Comment by DarTar

2005-12-04 13:37:17

For Cyrillic+Latin you can use ISO-8859-5 or UTF-8.

Comment by TonZijlstra

2005-12-05 19:48:12

Ok, I added above functions to wikka.php
then proceeded to edit formatters and handlers as instructed.
Also changed the header with the info of UserFolderol in the comments above.

It now works for me supporting both Cyrillic and Latin scripts.
Sadly forced links do not work. I tried meddling with the wakka2callback function a bit, but that didn't work, as you can't call $this-> from there, which makes sense.
Now forced Cyrillic links work in the edit window, but not in the screen version of a page. There character codes are shown.

Comment by DarTar

2005-12-05 20:44:54

I haven't tested the latest version of Andrea's code. Andrea's previous hack, though, worked fine for my needs (I used it at openformats.org - see http://openformats.org/TestUTF8). There you can have forced links like [[zh 简体中文]].
Can you give an example of what you call a forced link that won't work?
If it's [[Правда]], this is understandable. But forced links with a latin tag like [[pr Правда]] should work, if you just follow the instructions given at http://wikka.jsnx.com/WikkaLocalization#hn_Using_different_charsets

Comment by TonZijlstra

2005-12-06 10:20:49

It's the [[pr Правда]] thing that doesn't work, but shows character codes on screen.
I used the second version that Andrea gave of several functions.
I'll have another look.

Comment by TonZijlstra

2005-12-08 21:06:53

Had another look. The first version of Andrea's code I cannot use at this point:
mb_detect_order is not supported in my PHP version.

I'll see if upgrading the PHP version works.

Comment by AduC812

2010-11-14 14:07:19

At my site http://romantic-ustu.ru I use utf-8, and I made it possible for page names to be in Russian. It seems that the later versions of PHP and MySQL allow that with some corrections in the regexps (more on my page here). A URI no longer has to be plain ASCII - there are even domains being registered in any language.
If somebody is interested in my way of internationalization, I can contribute. If somebody finds that my way is not good (bad security or something??) - please tell me.

Comment by BrianKoontz

2010-11-14 22:45:54

AduC812, on behalf of the dev team, we accept! We've made quite a bit of I18N progress on version 1.3 of Wikka. In fact, if you're interested, maybe you can help us debug a gettext version of 1.3 by providing a Russian translation. BTW, 1.3 does support UTF-8 enabled URIs.

Comment by AduC812

2010-11-21 04:43:25

At my page here I have put a diff file, which adds to the 1.3-gettext version link markup for camelcase WikiNames, plain URLs and interwiki links in unicode, such as ВикиИмя, http://кремль.рф; and ВикиПедия:Формат. Hope this will help you in development. I know that the camelcase requirement is deprecated, however, it is handy and I would not advise removing camelcase link support in the future.
Maybe somewhat later I will start providing a Russian translation, however, I cannot promise to make it reasonably fast.
And one more thing: if we now use unicode in Wikka, it can be useful to convert all source text files into unicode. Sometimes there are some latin-1 specific characters there, and they won't work with a unicode engine. Moreover, non-European users will not be able to read them in the source if they have no latin-1 encoding available.
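
As a rough idea of what a Unicode-aware camelcase test can look like (this is not the regexp from the diff mentioned above, and is_wiki_name is a hypothetical, simplified helper), Unicode character properties together with the /u modifier let the same pattern accept Latin and Cyrillic names. The source file has to be saved as UTF-8 for the literals to work.

```php
<?php
// Hypothetical, simplified camelcase test: two or more words, each made of one
// uppercase letter followed by lowercase letters, in any script.
function is_wiki_name(string $name): bool
{
    // \p{Lu} = uppercase letter, \p{Ll} = lowercase letter; /u enables UTF-8 mode.
    return preg_match('/^\p{Lu}\p{Ll}+(?:\p{Lu}\p{Ll}+)+$/u', $name) === 1;
}

var_dump(is_wiki_name('WikiName'));   // true
var_dump(is_wiki_name('ВикиИмя'));    // true
var_dump(is_wiki_name('Википедия'));  // false: only one capitalised word
```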

Comment by BrianKoontz

2010-11-24 10:14:51

Thanks! I'll take a look at this and try to pull it into the upcoming release. Now that I re-read your comment, I realize I misunderstood: Support for UTF-8 is enabled now for pages, and you can embed these into URLs, but you were talking about UTF-8 support for addresses. Sorry about that.
