HandlingUTF8:Wikka

Revision [796]

This is an old revision of HandlingUTF8 made by DreckFehler on 2004-07-29 12:44:44.

Real Multilanguage Support

Here's some code to provide real multilanguage support.
The first three functions are helpers used by the functions that do the actual encoding conversions.
str2utf8, str2ascii and str2iso8859 take a string in any of the detected encodings (ASCII, UTF-8 or ISO-8859-1) and convert it to the desired one: ASCII plus Unicode entities for HTML output, ISO-8859-1 plus Unicode entities for database storage, and UTF-8 for forms.
Unfortunately, the ASCII and ISO-8859-1 output is not compatible with htmlspecialchars, which would escape the & of every entity a second time. That is the reason for the valid_xml function: it serves the same purpose as htmlspecialchars but leaves the & of existing entities intact.
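The incompatibility is easy to reproduce. The sketch below is not part of the page's code: escape_keep_entities is a hypothetical stand-alone copy of the valid_xml idea, and the last call relies on the double_encode parameter that htmlspecialchars gained in PHP 5.2.3, which may be an alternative on newer installations.

```php
<?php
// Hypothetical stand-alone variant of valid_xml, for illustration only.
function escape_keep_entities($str) {
    $str = str_replace('"', '&quot;', $str);
    $str = str_replace('<', '&lt;', $str);
    $str = str_replace('>', '&gt;', $str);
    // Escape only those ampersands that do not already start an entity.
    return preg_replace('/&(?![a-zA-Z0-9#]+;)/', '&amp;', $str);
}

$text = 'price: &#8364; 5 & <b>more</b>';

// htmlspecialchars escapes every ampersand, so the stored entity is broken.
echo htmlspecialchars($text), "\n";
// The lookahead-based version leaves the existing entity alone.
echo escape_keep_entities($text), "\n";
// With double_encode = false, htmlspecialchars also keeps existing entities.
echo htmlspecialchars($text, ENT_COMPAT, 'ISO-8859-1', false), "\n";
```

The first line prints the entity as &amp;#8364;, which a browser shows literally instead of as a euro sign; the other two keep &#8364; as it is.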
How to use these functions? For instance,

in formatters/wakka.php you should use:

print($this->str2ascii($text));

in wakka.php, function SavePage you should use:

"body = '".mysql_escape_string(trim($this->str2iso8859($body)))."'");

in handlers/page/edit.php you should use:

"<textarea rows=\"40\" cols=\"60\" onkeydown=\"fKeyDown()\" name=\"body\" style=\"width: 100%; height: 400px\">".$this->valid_xml($this->str2utf8($body))."</textarea><br />\n"

And so on....

Check it out here.

The bits:

<?php
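//Multilanguage support. We will use: utf-8 for user input, iso8859-1 + unicode for database storage and ascii + unicode for printing
// Decode a UTF-8 string into an array of Unicode code points.
function utf8_to_unicode($str) {
    $unicode = array();
    $values = array();   // bytes of the multi-byte sequence being collected
    $lookingFor = 1;     // expected length of that sequence in bytes
    for ($i = 0; $i < strlen($str); $i++) {
        $thisValue = ord($str[$i]);
        if ($thisValue < 128) $unicode[] = $thisValue;
        else {
            // The lead byte tells how long the sequence is. 4-byte sequences
            // are handled too, because code2utf can produce them.
            if (count($values) == 0) $lookingFor = ($thisValue < 224) ? 2 : (($thisValue < 240) ? 3 : 4);
            $values[] = $thisValue;
            if (count($values) == $lookingFor) {
                if ($lookingFor == 4)
                    $number = (($values[0] % 8) * 262144) + (($values[1] % 64) * 4096) + (($values[2] % 64) * 64) + ($values[3] % 64);
                elseif ($lookingFor == 3)
                    $number = (($values[0] % 16) * 4096) + (($values[1] % 64) * 64) + ($values[2] % 64);
                else
                    $number = (($values[0] % 32) * 64) + ($values[1] % 64);
                $unicode[] = $number;
                $values = array();
                $lookingFor = 1;
            }
        }
    }
    return $unicode;
}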
// Replace the numeric entities of the CP1252-only bytes 128-159 with the
// entities of the Unicode characters they stand for (e.g. the euro sign).
function deCP1252 ($str) {
    $str = str_replace("&#128;", "&#8364;", $str);
    $str = str_replace("&#129;", "", $str);
    $str = str_replace("&#130;", "&#8218;", $str);
    $str = str_replace("&#131;", "&#402;", $str);
    $str = str_replace("&#132;", "&#8222;", $str);
    $str = str_replace("&#133;", "&#8230;", $str);
    $str = str_replace("&#134;", "&#8224;", $str);
    $str = str_replace("&#135;", "&#8225;", $str);
    $str = str_replace("&#136;", "&#710;", $str);
    $str = str_replace("&#137;", "&#8240;", $str);
    $str = str_replace("&#138;", "&#352;", $str);
    $str = str_replace("&#139;", "&#8249;", $str);
    $str = str_replace("&#140;", "&#338;", $str);
    $str = str_replace("&#145;", "&#8216;", $str);
    $str = str_replace("&#146;", "&#8217;", $str);
    $str = str_replace("&#147;", "&#8220;", $str);
    $str = str_replace("&#148;", "&#8221;", $str);
    $str = str_replace("&#149;", "&#8226;", $str);
    $str = str_replace("&#150;", "&#8211;", $str);
    $str = str_replace("&#151;", "&#8212;", $str);
    $str = str_replace("&#152;", "&#732;", $str);
    $str = str_replace("&#153;", "&#8482;", $str);
    $str = str_replace("&#154;", "&#353;", $str);
    $str = str_replace("&#155;", "&#8250;", $str);
    $str = str_replace("&#156;", "&#339;", $str);
    $str = str_replace("&#159;", "&#376;", $str);
    return $str;
}
// Encode a single Unicode code point as a UTF-8 byte sequence (1 to 4 bytes).
function code2utf($num) {
    if ($num < 128) return chr($num);
    if ($num < 2048) return chr(($num >> 6) + 192) . chr(($num & 63) + 128);
    if ($num < 65536) return chr(($num >> 12) + 224) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
    if ($num < 2097152) return chr(($num >> 18) + 240) . chr((($num >> 12) & 63) + 128) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
    return '';
}
//to print in a form
function str2utf8($str) {
    mb_detect_order("ASCII, UTF-8, ISO-8859-1");
    if (mb_detect_encoding($str) == "UTF-8") {
        // Already UTF-8: only turn numeric entities into real UTF-8 characters.
        // [0-9]+ (instead of [0-9]*?) skips the empty entity "&#;".
        preg_match_all("/&#([0-9]+);/", $str, $unicode);
        foreach ($unicode[0] as $key => $value) {
            $str = preg_replace("/".$value."/", $this->code2utf((int) $unicode[1][$key]), $str);
        }
        return $str;
    } else {
        // ASCII or ISO-8859-1: re-encode byte by byte.
        $mystr = $str;
        $str = "";
        for ($i = 0; $i < strlen($mystr); $i++) {
            $code = ord($mystr[$i]);
            if ($code >= 128 && $code < 160) {
                // CP1252-only range: emit an entity and let deCP1252 map it.
                $str .= "&#".$code.";";
            } else {
                $str .= $this->code2utf($code);
            }
        }
        $str = $this->deCP1252($str);
        preg_match_all("/&#([0-9]+);/", $str, $unicode);
        foreach ($unicode[0] as $key => $value) {
            $str = preg_replace("/".$value."/", $this->code2utf((int) $unicode[1][$key]), $str);
        }

        return $str;
    }
}

//to print html
function str2ascii ($str) {
    $str = $this->str2utf8($str);
    $unicode = $this->utf8_to_unicode($str);
    $entities = '';
    foreach ($unicode as $value) {
        // Everything outside ASCII becomes a numeric entity.
        $entities .= ($value > 127) ? '&#' . $value . ';' : chr($value);
    } //foreach
    return $this->deCP1252($entities);
}
//for database storage
function str2iso8859 ($str) {
    $str = $this->str2utf8($str);
    $unicode = $this->utf8_to_unicode($str);
    $entities = "";
    foreach ($unicode as $value) {
        if ($value <= 127)
            $entities .= chr($value);
        elseif ($value > 159 && $value <= 255)
            // Printable ISO-8859-1 range: store as a single byte.
            $entities .= chr($value);
        else $entities .= '&#' . $value . ';';
    } //foreach
    return $this->deCP1252($entities);
}
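// Like htmlspecialchars, but leaves the & of existing entities untouched.
function valid_xml ($str) {
    $str = str_replace("\"", "&quot;", $str);
    $str = str_replace("<", "&lt;", $str);
    $str = str_replace(">", "&gt;", $str);
    // Escape only ampersands that do not already start an entity.
    $str = preg_replace("/&(?![a-zA-Z0-9#]+?;)/", "&amp;", $str);
    return $str;
}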
?>

--AndreaRossato
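
As a worked example of the bit arithmetic in code2utf, the following sketch copies the function as a plain function (outside the class, an assumption made only so that it runs on its own) and prints the bytes it produces for two code points.

```php
<?php
// Stand-alone copy of code2utf from the listing above, for illustration only.
function code2utf($num) {
    if ($num < 128) return chr($num);
    if ($num < 2048) return chr(($num >> 6) + 192) . chr(($num & 63) + 128);
    if ($num < 65536) return chr(($num >> 12) + 224) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
    if ($num < 2097152) return chr(($num >> 18) + 240) . chr((($num >> 12) & 63) + 128) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
    return '';
}

// U+00E9 (233): 233 >> 6 = 3 -> 192 + 3 = 0xC3; 233 & 63 = 41 -> 128 + 41 = 0xA9
echo bin2hex(code2utf(233)), "\n";   // c3a9
// U+20AC (8364): 8364 >> 12 = 2 -> 0xE2; (8364 >> 6) & 63 = 2 -> 0x82; 8364 & 63 = 44 -> 0xAC
echo bin2hex(code2utf(8364)), "\n";  // e282ac
```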

hmm... i may have the solution but i need to understand the problem ;)

first, i don't know why we shouldn't just use php's utf8_decode and utf8_encode functions for the conversion itself (but maybe there is a reason i didn't think about). second, it's not perfectly clear to me how to treat clients that don't accept utf-8 encoding. i haven't had much time to get into the stuff, but so far i think the following tasks have to be managed:

determine the most convenient charset (that's easy, just have a look at $HTTP_ACCEPT_CHARSET)

set the appropriate http-header in header.php and - if needed - set a flag $this->config["use_utf8"] = true;

do the conversions on form-data if use_utf8 is set (this sounds like a busy task)

convert the $_POST data back to iso-8859-1 (the charset we'll internally work with)

leave the formatter untouched; it should be fed with iso-data (and entities), if i haven't made a mistake in the points above. instead, take the buffered output, which is stored in the variable $output at $wakka->includebuffered, and convert it in one go, namely to utf-8, which is what the client expects if it sends utf-8 form data.
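
A rough sketch of the task list above, assuming the mbstring extension is available. The helper names client_accepts_utf8, post_to_internal and output_to_client are hypothetical and not part of Wikka, and q-values in the Accept-Charset header are ignored for brevity.

```php
<?php
// Hypothetical helper: does the Accept-Charset header allow UTF-8?
// Simplification: q-values (including q=0) are ignored.
function client_accepts_utf8($acceptCharset) {
    foreach (explode(',', strtolower($acceptCharset)) as $item) {
        $parts = explode(';', $item);
        $name = trim($parts[0]);
        if ($name === 'utf-8' || $name === '*') return true;
    }
    return false;
}

// Hypothetical helper: UTF-8 form data -> ISO-8859-1 plus numeric entities.
function post_to_internal($value) {
    // Characters above U+00FF do not fit into ISO-8859-1, so keep them as entities.
    $value = mb_encode_numericentity($value, array(0x100, 0x10FFFF, 0, 0xFFFFFF), 'UTF-8');
    return mb_convert_encoding($value, 'ISO-8859-1', 'UTF-8');
}

// Hypothetical helper: convert the buffered page output in one go.
function output_to_client($output, $useUtf8) {
    // Numeric entities are plain ASCII and stay valid in either charset.
    return $useUtf8 ? mb_convert_encoding($output, 'UTF-8', 'ISO-8859-1') : $output;
}

$useUtf8 = client_accepts_utf8('ISO-8859-1,utf-8;q=0.7,*;q=0.7');
$stored  = post_to_internal("caf\xC3\xA9 \xE2\x82\xAC");
echo bin2hex($stored), "\n";
echo output_to_client($stored, $useUtf8), "\n";
```

In a real handler the header string would come from the request, and the matching Content-Type header with the chosen charset would have to be sent before any output.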

what i don't understand yet is:

what to do with the wikiword-recognition, which is designed for the latin alphabet. i think at least the [[forced links]] should work in every language.

will the diff engine work (no worse than now) when it's fed with html entities and nothing but entities? (this will happen with a page that only stores a quotation of an aramaic bible text)

how will the fulltext search behave?

and of course: am i right with the task list above? is something missing? is something wrong?

btw: what wakka-forks already exist, that are redesigned for the needs of a foreign charset? isn't wackowiki a russian spin off? do we have some cyrillic speaking wikka-fans out there? ;)

-- dreckfehler
