Wikka : HtmlSpecialChars


Htmlspecialchars

Discussion before the release of 1.1.6.0

htmlspecialchars_unicode():
Looking at the code, I'm sure this will not work correctly. It will "accept" numeric entity references but not named entity references, so those still won't work. (And actually, its operation doesn't have anything to do with Unicode, only with entities: these may encode Unicode characters, but entities are not themselves Unicode.)
See my item 'Non-breaking space in forced links' in the SuggestionBox: this is one case that actually shows the function htmlspecialchars_unicode() does not work correctly!
Also, there is no option for the quote_style and charset parameters as in the PHP original, so we lose functionality here, too. It's probably better to have a "wrapper" function around the PHP one, which (after applying the PHP function, passing on the extra parameters) merely reverts all ampersands that are for any entity references (numerical or named); thus the wrapper function should accept all parameters the PHP function does. And since we are supposed to produce XHTML, ENT_QUOTES should probably be the default for quote_style; maybe also UTF-8 should be the default for charset?
Note: the INI code formatter used ENT_QUOTES and this has now disappeared (it was there for a reason!). But an entity used in code should be visible as an entity, not as the character it encodes. Conclusion: a code formatter should use the PHP function htmlspecialchars() so that any entities are "escaped".
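To make the distinction concrete, here is a small sketch of what the stock PHP function does with an existing entity reference. The sample string and variable names are only illustrations; the quote style and charset are passed explicitly because PHP's defaults differ between versions.

```php
<?php
// Illustration only: the built-in function escapes every ampersand,
// including the one that starts an existing entity reference.
$link_text = 'Java&nbsp;Woman';
$escaped   = htmlspecialchars($link_text, ENT_QUOTES, 'UTF-8');

// $escaped is now 'Java&amp;nbsp;Woman': a browser shows the literal
// text "&nbsp;" rather than a non-breaking space.
echo $escaped, "\n";
```

That is exactly what a code formatter wants (the entity stays visible as written), and exactly what breaks a link description.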
Here's the solution: a new function htmlspecialchars_ent() to replace the proposed (beta) htmlspecialchars_unicode():
    /**
    * Wrapper around PHP's htmlspecialchars() which preserves (repairs) entity references.
    *
    * The function accepts the same parameters as htmlspecialchars() in PHP and passes them on
    * to that function.
    *
    * One default here differs from that of htmlspecialchars() in PHP:
    * charset is set to UTF-8 so we're ready for UTF-8 support (and as long as we don't support
    * that there should be no difference with Latin-1); on systems where the charset parameter
    * is not available or UTF-8 is not supported this will revert to Latin-1 (ISO-8859-1).
    *
    * The function first applies htmlspecialchars() to the input string and then "unescapes"
    * character entity references and numeric character references (both decimal and hexadecimal).
    * Entities are recognized also if the ending semicolon is omitted at the end or before a
    * newline or tag but for consistency the semicolon is always added in the output where it was
    * omitted.
    *
    * NOTE:
    * Where code should be rendered _as_code_ the original PHP function should be used so that
    * entity references are also rendered as such instead of as their corresponding characters.
    *
    * @access    public
    * @since    wikka 1.1.6.0
    * @version    1.0
    * @todo    (later) support full range of situations where (in SGML) a terminating ; may legally
    *            be omitted (end, newline and tag are merely the most common ones).
    *
    * @param    string    $text required: text to be converted
    * @param    integer    $quote_style optional: quoting style - can be ENT_COMPAT (default, escape
    *            only double quotes), ENT_QUOTES (escape both double and single quotes) or
    *            ENT_NOQUOTES (don't escape any quotes)
    * @param    string    $charset optional: charset to use while converting; default UTF-8
    *            (overriding PHP's default ISO-8859-1)
    * @return    string    converted string with escaped special characters but entity references intact
    */

    function htmlspecialchars_ent($text,$quote_style=ENT_COMPAT,$charset='UTF-8')
    {
        // define patterns
        $alpha  = '[a-z]+';                            # character entity reference
        $numdec = '#[0-9]+';                        # numeric character reference (decimal)
        $numhex = '#x[0-9a-f]+';                    # numeric character reference (hexadecimal)
        $terminator = ';|(?=($|[\n<]|&lt;))';        # semicolon; or end-of-string, newline or tag
        $entitystring = $alpha.'|'.$numdec.'|'.$numhex;
        $escaped_entity = '&amp;('.$entitystring.')('.$terminator.')';

        // execute PHP built-in function, passing on optional parameters
        $output = htmlspecialchars($text,$quote_style,$charset);
        // "repair" escaped entities
        // modifiers: s = across lines, i = case-insensitive
        $output = preg_replace('/'.$escaped_entity.'/si',"&$1;",$output);
        // return output
        return $output;
    }
I created a test harness for it (I tested with htmlspecialchars() in the Link() function in wikka.php replaced by htmlspecialchars_ent()):
  1. [[HomePage word & notherword]] (to be escaped)
    => word & notherword
  2. [[JavaWoman Java&nbsp;Woman]] (alpha: non-breaking space)
    => Java Woman
  3. [[JavaWoman &Auml;hnlich]] (alpha with uppercase)
    => Ähnlich
  4. no terminating ; before tag &quot<span style="color:blue;">blue</span>&quot; (alpha: text, not link)
    => no terminating ; before tag "blue"
  5. [[CategoryDevelopmentTest test no terminating ; before end &quot]] (alpha)
    => test no terminating ; before end "
  6. [[Docs:FormattingRules <b>no ; before tag &#039</b>]] (numeric decimal)
    => <b>no ; before tag '</b>
  7. [[SandBox missing ; before &#x3f5
|<- newline]] (numeric hex)
    => missing ; before ϵ |<- newline
With the current code, test cases 2, 3, 5, 6 and 7 fail; with my proposed solution all test cases are converted correctly.
Note: The formatter does not recognize forced links with a newline in the description text (test case 7), although there is nothing wrong with having a newline inside a link description. This requires a small fix in the wakka formatter (./formatters/wakka.php), as follows:
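For anyone who wants to check this outside the wiki, here is a rough standalone sketch. It repeats the wrapper in condensed form so the snippet runs on its own, and feeds it only the description texts from test cases 1, 2, 3, 5 and 7 (not the full [[...]] markup, which the formatter handles). The $cases array and the loop are my own scaffolding, not part of Wikka.

```php
<?php
// Condensed copy of the proposed wrapper, so this snippet is self-contained.
function htmlspecialchars_ent($text, $quote_style = ENT_COMPAT, $charset = 'UTF-8')
{
    $entitystring   = '[a-z]+|#[0-9]+|#x[0-9a-f]+';   // named, decimal, hexadecimal
    $terminator     = ';|(?=($|[\n<]|&lt;))';         // semicolon; or end, newline, tag
    $escaped_entity = '&amp;(' . $entitystring . ')(' . $terminator . ')';

    $output = htmlspecialchars($text, $quote_style, $charset);
    // "repair" entity references that the built-in function escaped
    return preg_replace('/' . $escaped_entity . '/si', '&$1;', $output);
}

// Hypothetical scaffolding: input text => expected HTML source.
$cases = array(
    'word & notherword'                      => 'word &amp; notherword',
    'Java&nbsp;Woman'                        => 'Java&nbsp;Woman',
    '&Auml;hnlich'                           => '&Auml;hnlich',
    'test no terminating ; before end &quot' => 'test no terminating ; before end &quot;',
    "missing ; before &#x3f5\n|<- newline"   => "missing ; before &#x3f5;\n|&lt;- newline",
);

foreach ($cases as $input => $expected) {
    $actual = htmlspecialchars_ent($input);
    echo ($actual === $expected ? 'ok  ' : 'FAIL'), ' ', $actual, "\n";
}
```

Note that the lone ampersand in the first case stays escaped, while the entity references come through intact and gain a semicolon where one was missing.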
replace:
        // forced links
        // \S : any character that is not a whitespace character
        // \s : any whitespace character
        else if (preg_match("/^\[\[(\S*)(\s+(.+))?\]\]$/", $thing, $matches))
by:
        // forced links
        // \S : any character that is not a whitespace character
        // \s : any whitespace character
        else if (preg_match("/^\[\[(\S*)(\s+(.+))?\]\]$/s", $thing, $matches))        # s modifier: recognize forced links across lines
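The effect of the s modifier can be seen in isolation with a sketch like the following. The sample string is the forced link from test case 7; the variable names are mine and the snippet is not taken from the formatter itself.

```php
<?php
// Forced link from test case 7: the description contains a newline.
$thing = "[[SandBox missing ; before &#x3f5\n|<- newline]]";

$without_s = '/^\[\[(\S*)(\s+(.+))?\]\]$/';   // . stops at the newline
$with_s    = '/^\[\[(\S*)(\s+(.+))?\]\]$/s';  // . also matches the newline

$matched_without = preg_match($without_s, $thing);
$matched_with    = preg_match($with_s, $thing, $matches);

var_dump($matched_without);  // int(0): not recognized as a forced link
var_dump($matched_with);     // int(1): recognized
var_dump($matches[1]);       // the link target, "SandBox"
```

Without the modifier the dot in (.+) cannot cross the line break, so the closing ]] is never reached and the whole pattern fails.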
 
Finally: the call in ./formatters/ini.php should be restored to:
$text = htmlspecialchars($text, ENT_QUOTES);
and in ./formatters/code.php we should also use
htmlspecialchars($text, ENT_QUOTES);
(i.e., add the ENT_QUOTES parameter).
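As a quick illustration of why the parameter matters in a code formatter, compare the two quote styles on a made-up INI line (the sample text is hypothetical, and the charset is passed explicitly only to keep the result independent of the PHP version):

```php
<?php
// Hypothetical INI line with both kinds of quotes.
$text = "title = 'Tom \"and\" Jerry'";

$compat = htmlspecialchars($text, ENT_COMPAT, 'UTF-8');  // double quotes only
$quotes = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');  // single quotes as well

echo $compat, "\n";  // title = 'Tom &quot;and&quot; Jerry'
echo $quotes, "\n";  // title = &#039;Tom &quot;and&quot; Jerry&#039;
```

With ENT_QUOTES the output is safe inside attribute values delimited by either kind of quote.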
All other calls to htmlspecialchars_unicode() (as well as the definition) should be replaced by htmlspecialchars_ent() above.


CategoryDevelopmentCore CategoryDevelopmentFormatters CategoryDevelopmentTest