===Htmlspecialchars===
//discussion before the release of 1.1.6.0//
**htmlspecialchars_unicode()**:
Looking at the code, I'm sure this will not work correctly. It will "accept" numeric entity references but not named entity references, so those still won't work. (And actually, its operation has nothing to do with **Unicode**, only with entities, which //may// encode Unicode characters but are not themselves Unicode.)
~//See my item 'Non-breaking space in forced links' in the SuggestionBox: this is one case that actually **shows** the function htmlspecialchars_unicode() does not work correctly!//
Also, there is no option for the ##quote_style## and ##charset## parameters as in the PHP original, so we lose functionality here, too. It's probably better to have a "wrapper" function around the PHP one, which (after applying the PHP function, passing on the extra parameters) merely restores the ampersands that belong to entity references (numeric or named); the wrapper function should therefore accept all parameters the PHP function does. And since we are supposed to produce XHTML, ENT_QUOTES should probably be the default for ##quote_style##; maybe UTF-8 should also be the default for ##charset##?
Note: the INI code formatter used ENT_QUOTES and this has now disappeared (it was there for a reason!). But an entity used in **code** should be visible as an entity, not as the character it encodes. Conclusion: a code formatter //should// use the PHP function htmlspecialchars() so that any entities are "escaped".
~Here's the solution: **a new function ##htmlspecialchars_ent()##** to replace the proposed (beta) ##htmlspecialchars_unicode()##:
~%%(php) /**
* Wrapper around PHP's htmlspecialchars() which preserves (repairs) entity references.
*
* The function accepts the same parameters as htmlspecialchars() in PHP and passes them on
* to that function.
*
* One default here differs from that of htmlspecialchars() in PHP:
* charset is set to UTF-8 so we're ready for UTF-8 support (and as long as we don't support
* that there should be no difference with Latin-1); on systems where the charset parameter
* is not available or UTF-8 is not supported this will revert to Latin-1 (ISO-8859-1).
*
* The function first applies htmlspecialchars() to the input string and then "unescapes"
* character entity references and numeric character references (both decimal and hexadecimal).
* Entities are recognized also if the ending semicolon is omitted at the end or before a
* newline or tag but for consistency the semicolon is always added in the output where it was
* omitted.
*
* NOTE:
* Where code should be rendered _as_code_ the original PHP function should be used so that
* entity references are also rendered as such instead of as their corresponding characters.
*
* @access public
* @since wikka 1.1.6.0
* @version 1.0
* @todo (later) support full range of situations where (in SGML) a terminating ; may legally
* be omitted (end, newline and tag are merely the most common ones).
*
* @param string $text required: text to be converted
* @param integer $quote_style optional: quoting style - can be ENT_COMPAT (default, escape
* only double quotes), ENT_QUOTES (escape both double and single quotes) or
* ENT_NOQUOTES (don't escape any quotes)
* @param string $charset optional: charset to use while converting; default UTF-8
* (overriding PHP's default ISO-8859-1)
* @return string converted string with escaped special characters but entity references intact
*/
function htmlspecialchars_ent($text,$quote_style=ENT_COMPAT,$charset='UTF-8')
{
	// define patterns
	// NOTE: these are matched against the output of htmlspecialchars(),
	// where & has already become &amp; and < has become &lt;
	$alpha = '[a-z]+'; # character entity reference
	$numdec = '#[0-9]+'; # numeric character reference (decimal)
	$numhex = '#x[0-9a-f]+'; # numeric character reference (hexadecimal)
	$terminator = ';|(?=($|[\n<]|&lt;))'; # semicolon; or end-of-string, newline or tag
	$entitystring = $alpha.'|'.$numdec.'|'.$numhex;
	$escaped_entity = '&amp;('.$entitystring.')('.$terminator.')';
	// execute PHP built-in function, passing on optional parameters
	$output = htmlspecialchars($text,$quote_style,$charset);
	// "repair" escaped entities
	// modifiers: s = across lines, i = case-insensitive
	$output = preg_replace('/'.$escaped_entity.'/si',"&$1;",$output);
	// return output
	return $output;
}
%%
~~I created a test harness for it (tested with ##htmlspecialchars()## in the Link() function in ##wikka.php## replaced by ##htmlspecialchars_ent()##): ---
~~~1)""[[HomePage | word & notherword]]"" (to be escaped)---=> [[HomePage | word & notherword]]
~~~1)""[[JavaWoman | Java&nbsp;Woman]]"" (alpha: no-breaking space)---=> [[JavaWoman | Java&nbsp;Woman]]
~~~1)""[[JavaWoman | &Auml;hnlich]]"" (alpha with uppercase)---=> [[JavaWoman | &Auml;hnlich]]
~~~1)no terminating ; before tag "<span style="color:blue;">blue</span>" (alpha: text, not link)---=> ""no terminating ; before tag "<span style="color:blue;">blue</span>"""
~~~1)""[[CategoryDevelopmentTest | test no terminating ; before end &quot]]"" (alpha)---=> [[CategoryDevelopmentTest | test no terminating ; before end &quot]]
~~~1)""[[Docs:FormattingRules | <b>no ; before tag &#39</b>]]"" (numeric decimal)---=> [[FormattingRules | <b>no ; before tag &#39</b>]]
~~~1)""[[SandBox missing ; before &#x3f5""---
~~~""|<- newline]]"" (numeric hex)---=> [[SandBox missing ; before &#x3f5
|<- newline]]
~~With the current code, test cases 2, 3, 5, 6 and 7 are not converted correctly; with my proposed solution all test cases are converted correctly.
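~~For anyone who wants to check the description texts from the test cases outside the wiki, here is a minimal standalone sketch. It assumes the wrapper is meant to match ##&amp;amp;## and ##&amp;lt;## in the output of ##htmlspecialchars()##; the entity spellings (##&amp;#39##, ##&amp;#x3f5##) and the ##$cases## array are illustrative assumptions, not the exact harness used above:
~~%%(php)
<?php
// Standalone copy of the proposed wrapper (condensed), so the script runs without Wikka.
function htmlspecialchars_ent($text, $quote_style = ENT_COMPAT, $charset = 'UTF-8')
{
    $entitystring   = '[a-z]+|#[0-9]+|#x[0-9a-f]+';   // named, decimal, hexadecimal
    $terminator     = ';|(?=($|[\n<]|&lt;))';         // semicolon; or end, newline or tag
    $escaped_entity = '&amp;(' . $entitystring . ')(' . $terminator . ')';
    $output = htmlspecialchars($text, $quote_style, $charset);
    return preg_replace('/' . $escaped_entity . '/si', '&$1;', $output);
}

// input => expected output (illustrative description texts only, no link markup)
$cases = array(
    'word & notherword'             => 'word &amp; notherword',
    'Java&nbsp;Woman'               => 'Java&nbsp;Woman',
    '&Auml;hnlich'                  => '&Auml;hnlich',
    'before end &quot'              => 'before end &quot;',
    '<b>no ; before tag &#39</b>'   => '&lt;b&gt;no ; before tag &#39;&lt;/b&gt;',
    "before &#x3f5\n|<- newline"    => "before &#x3f5;\n|&lt;- newline",
);

foreach ($cases as $input => $expected) {
    $actual = htmlspecialchars_ent($input);
    echo ($actual === $expected ? 'ok   ' : 'FAIL ') . json_encode($input) . "\n";
}
%%
~~A bare ampersand stays escaped, complete entities pass through unchanged, and entities missing their semicolon before end-of-string, a tag or a newline get one added.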
~~**Note:** The formatter does not recognize forced links with a newline in the description text (test case 7)! There is nothing wrong with having a newline inside a link description. => This requires a small fix in the wakka formatter (**##./formatters/wakka.php##**), as follows:
~~replace:
~~%%(php) // forced links
// \S : any character that is not a whitespace character
// \s : any whitespace character
else if (preg_match("/^\[\[(\S*)(\s+(.+))?\]\]$/", $thing, $matches))
%%
~~by:
~~%%(php) // forced links
// \S : any character that is not a whitespace character
// \s : any whitespace character
else if (preg_match("/^\[\[(\S*)(\s+(.+))?\]\]$/s", $thing, $matches)) # s modifier: recognize forced links across lines
%%
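~~To illustrate why the ##s## modifier matters here, a small sketch comparing the two patterns on a forced link whose description contains a newline (the sample string and variable names are made up for the illustration):
~~%%(php)
<?php
// A forced link with a newline inside the description text (hypothetical sample).
$thing = "[[SandBox first line\nsecond line]]";

$without_s = '/^\[\[(\S*)(\s+(.+))?\]\]$/';   // current pattern: . does not match newline
$with_s    = '/^\[\[(\S*)(\s+(.+))?\]\]$/s';  // proposed pattern: . also matches newline

$matched_without = preg_match($without_s, $thing);
$matched_with    = preg_match($with_s, $thing, $matches);

var_dump($matched_without);   // int(0): link is not recognized
var_dump($matched_with);      // int(1): link is recognized
var_dump($matches[1]);        // page name: "SandBox"
var_dump($matches[3]);        // description, including the newline
%%
~~Without the modifier, ##.+## stops at the newline and the closing ##]]## can never be reached, so the whole match fails.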
~~Finally: the call in **##./formatters/ini.php##** should be restored to:
~~%%(php)$text = htmlspecialchars($text, ENT_QUOTES);%%
~~and in **##./formatters/code.php##** we should also use %%(php)htmlspecialchars($text, ENT_QUOTES);%% (i.e., add the ENT_QUOTES parameter).
~~All other calls to htmlspecialchars_unicode() (as well as the definition) should be replaced by htmlspecialchars_ent() above.
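~~As a quick illustration of what the ENT_QUOTES parameter changes for the code formatters, here is a sketch with a made-up INI-style line (the ##$code## sample is an assumption for the example). Note that an entity in the code is escaped in both cases, so it stays visible as an entity:
~~%%(php)
<?php
// Hypothetical line of code containing both quote types and an entity.
$code = "name = 'caf&eacute;' ; \"quoted\"";

$compat = htmlspecialchars($code, ENT_COMPAT);  // escapes double quotes only
$quotes = htmlspecialchars($code, ENT_QUOTES);  // escapes double and single quotes

echo $compat, "\n";
echo $quotes, "\n";
%%
~~With ENT_COMPAT the single quotes pass through untouched; with ENT_QUOTES they come out as ##&amp;#039;##. In both cases ##&amp;eacute;## becomes ##&amp;amp;eacute;##, which is what a code formatter wants.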
----
CategoryDevelopmentCore CategoryDevelopmentFormatters CategoryDevelopmentTest