Revision [9141]

This is an old revision of ImprovedFormatter made by JavaWoman on 2005-06-12 11:44:05.

 

Improved Formatter


This is the development page for an improved version of "the Formatter", specifically, the code in ./formatters/wakka.php (as opposed to the AdvancedFormatter page which deals with "advanced" formatting in other ways as well, such as standardized code generation utilities).
 

Why?


While our current (version 1.1.6.0) Formatter is quite capable, it has some quirks and bugs, doesn't always generate valid XHTML (though it tries hard), and misses a few things that would be nice to have or that would enable things that would be nice to have (such as page TOCs; see TableofcontentsAction). The improved version presented here tries to address some of these issues (with more likely to follow).


What?


Here is a short summary of what has changed (details below):

The code presented below is still considered a beta version and as such contains many lines of (commented-out) debug code. These will of course be removed before final release. Any reference to line numbers is (for now) to the new (beta) code, since this is a complete drop-in replacement for the original file.


Closing open tags


The current version (Wikka 1.1.6.0) of the Formatter has a bit of code contributed by DotMG to close any left-open tags at the very end of a page. While that can solve some problems with rendering and including pages, the code did not cover every kind of tag that can be left open. A particular problem was still-open lists and indents, which weren't handled at all (see "List parsing bug?" on WikkaBugs). Also, this code would directly echo output instead of returning a string as the rest of the Formatter's main function does.

The new version addresses all of these problems.

Closing of indents and (open) lists was already happening when encountering a newline that doesn't start with a TAB or a ~, so this bit is separated out as a function:

if (!function_exists('close_indents'))
{
    function close_indents(&$indentClosers,&$oldIndentLevel,&$oldIndentLength,&$newIndentSpace)
    {
        $result='';

        // emit the stored closing tags, most recently opened first
        $c = count($indentClosers);
        for ($i = 0; $i < $c; $i++)
        {
            $result .= array_pop($indentClosers);
            $br = 0;    # NOTE: $br is local to this function, so this does not reset the caller's $br
        }
        // reset the indent state (all passed by reference)
        $oldIndentLevel = 0;
        $oldIndentLength= 0;
        $newIndentSpace=array();

        return $result;
    }
}

The section that handles newlines now only needs to call this function:
            $result .= close_indents($indentClosers,$oldIndentLevel,$oldIndentLength,$newIndentSpace);

            $result .= ($br) ? "<br />\n" : "\n";
            $br = 1;
            return $result;
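To see what the function does in isolation, here is a minimal sketch. It reproduces the logic of close_indents() above so that it runs on its own; the closing-tag strings and state values are hypothetical and only illustrate a list nested inside an indent, not the exact strings the Formatter stores.

```php
<?php
// Same logic as close_indents() above, repeated so this sketch is self-contained.
function close_indents(&$indentClosers, &$oldIndentLevel, &$oldIndentLength, &$newIndentSpace)
{
    $result = '';
    // Pop the stored closers so the most recently opened element is closed first.
    while (count($indentClosers) > 0) {
        $result .= array_pop($indentClosers);
    }
    // Reset the indent state for whatever follows.
    $oldIndentLevel  = 0;
    $oldIndentLength = 0;
    $newIndentSpace  = array();
    return $result;
}

// Hypothetical state: an indent was opened first, then a list inside it.
$indentClosers   = array("</div>\n", "</li></ul>\n");
$oldIndentLevel  = 2;
$oldIndentLength = 2;
$newIndentSpace  = array(1 => 'x');

$closing = close_indents($indentClosers, $oldIndentLevel, $oldIndentLength, $newIndentSpace);
echo $closing;
// prints "</li></ul>" and then "</div>", each on its own line
```

Because the closers are kept as a stack, the inner list is closed before the outer indent, and the same call can be reused both at a newline and at the end of the page.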


To close open tags at the end of the page, the new code now calls this function first, and then handles all other open tags, in an order chosen to at least minimize incorrect tag nesting (but see "Not a complete solution!" below):

        if ((!is_array($things)) && ($things == 'closetags'))
        {
            $result .= close_indents($indentClosers,$oldIndentLevel,$oldIndentLength,$newIndentSpace);

            if ($trigger_bold % 2) $result .= '</strong>';
            if ($trigger_italic % 2) $result .= '</em>';
            if ($trigger_keys % 2) $result .= '</kbd>';
            if ($trigger_monospace % 2) $result .= '</tt>';

            if ($trigger_underline % 2) $result .= '</span>';
            if ($trigger_notes % 2) $result .= '</span>';
            if ($trigger_strike % 2) $result .= '</span>';
            if ($trigger_inserted % 2) $result .= '</span>';
            if ($trigger_deleted % 2) $result .= '</span>';

            if ($trigger_center % 2) $result .= '</div>';
            if ($trigger_floatl % 2) $result .= '</div>';
            if ($trigger_floatr % 2) $result .= '</div>';                   # JW added
            for ($i = 1; $i<=5; $i ++)
            {
                if ($trigger_l[$i] % 2) $result .= ("</h$i>");
            }

            $trigger_bold = $trigger_italic = $trigger_keys = $trigger_monospace = 0;
            $trigger_underline = $trigger_notes = $trigger_strike = $trigger_inserted = $trigger_deleted = 0;
            $trigger_center = $trigger_floatl = $trigger_floatr = 0;
            $trigger_l = array(-1, 0, 0, 0, 0, 0);
            return $result;
        }
        else
        {
            $thing = $things[1];
        }

This is now used like this:
$text .= wakka2callback('closetags');                   # JW changed logic
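The odd/even "trigger" counters used above can be illustrated with a toy formatter. This is only a sketch under simplifying assumptions: toy_format() and its two-marker syntax table are hypothetical names for illustration, and it handles just ** and //, not the full wiki markup.

```php
<?php
// Toy formatter: only ** (bold) and // (italic) are recognized.
function toy_format($text)
{
    $triggers = array('**' => 0, '//' => 0);
    $tags     = array('**' => 'strong', '//' => 'em');

    $result = preg_replace_callback(
        '/\*\*|\/\//',
        function ($m) use (&$triggers, $tags) {
            $tag = $tags[$m[0]];
            // Odd count: the marker opens the element; even count: it closes it.
            return (++$triggers[$m[0]] % 2) ? "<$tag>" : "</$tag>";
        },
        $text
    );

    // End of page: any counter that is still odd means an element was left open.
    foreach ($triggers as $marker => $count) {
        if ($count % 2) {
            $result .= '</' . $tags[$marker] . '>';
        }
    }
    return $result;
}

echo toy_format('**bold** and //unfinished'), "\n";
// prints: <strong>bold</strong> and <em>unfinished</em>
```

The end-of-page pass only knows that an element is open, not where it was opened relative to others, which is why this approach can balance tags but cannot guarantee correct nesting.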


Not a complete solution!
A big problem remains, however: in order to produce valid (X)HTML, open tags cannot just be closed anywhere: there are rules for which elements can contain which other elements. For instance, an inline element (like <em>) can never contain a block element (like a list). So if the inline element is left open (which happens if someone types // to start emphasized text but doesn't close it before starting an indent or list), closing the generated opening <em> tag at the end of the page may prevent display problems in some browsers, but the result is still not valid (X)HTML. This type of problem can only really be addressed with a completely different mechanism for a formatter. This should definitely be tackled at some time, but is outside the scope of the current improvements, which are designed to work within the current Formatter's mechanism.

Escaping single ampersands


While there are a few cases where it's actually allowed to use a plain & in HTML, in most cases where an ampersand is not part of an entity reference it needs to be escaped as &amp;. The current (version 1.1.6.0) Formatter escapes the < and > special characters, but not &, so the result may be invalid XHTML.

We need to find the ampersands that are not part of an entity reference. So we first build a RegEx to recognize the part of an entity reference that follows the ampersand that starts it. This can be a named entity, or a decimal or hexadecimal numeric reference. In most cases it is terminated by a semicolon (;), but there are a few cases where it's legal to leave off the terminating semicolon. To make it easier to read, we build the RegEx to express all that from its constituent parts:
// define entity patterns
// NOTE most also used in wikka.php for htmlentities_ent(): REGEX library!
$alpha  = '[a-z]+';                         # character entity reference
$numdec = '#[0-9]+';                        # numeric character reference (decimal)
$numhex = '#x[0-9a-f]+';                    # numeric character reference (hexadecimal)
$terminator = ';|(?=($|[\n<]|&lt;))';       # semicolon; or end-of-string, newline or tag
$entitypat = '('.$alpha.'|'.$numdec.'|'.$numhex.')('.$terminator.')';   # defines entity pattern without the starting &
$entityref = '&'.$entitypat;                # entity reference

So now we can define a 'lone' ampersand as one that is not followed by the expression $entitypat:
$loneamp = '&(?!'.$entitypat.')';               # ampersand NOT part of an entity
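The pattern can be tried out on its own with a small sketch. The pattern definitions are copied from above; the helper escape_lone_amps(), the sample string and the use of plain preg_replace() with no pattern modifiers are assumptions made for this illustration, not how the Formatter itself applies the pattern.

```php
<?php
// Entity patterns as defined above.
$alpha  = '[a-z]+';                         # character entity reference
$numdec = '#[0-9]+';                        # numeric character reference (decimal)
$numhex = '#x[0-9a-f]+';                    # numeric character reference (hexadecimal)
$terminator = ';|(?=($|[\n<]|&lt;))';       # semicolon; or end-of-string, newline or tag
$entitypat = '('.$alpha.'|'.$numdec.'|'.$numhex.')('.$terminator.')';
$loneamp = '&(?!'.$entitypat.')';           # ampersand NOT part of an entity

// Hypothetical helper: escape every lone ampersand in a string.
function escape_lone_amps($text, $loneamp)
{
    return preg_replace('/'.$loneamp.'/', '&amp;', $text);
}

$in = 'Tom & Jerry &amp; co &#169; &#xa9; AT&T rocks';
echo escape_lone_amps($in, $loneamp), "\n";
// prints: Tom &amp; Jerry &amp; co &#169; &#xa9; AT&amp;T rocks
```

Note that the character classes are lowercase only, so without a case-insensitive modifier a reference written with uppercase letters is treated as a lone ampersand in this sketch.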


This then becomes part of the big expression that's used in the preg_replace_callback() near the end of the file, as the last thing to consider before a newline:
    '<|>|'.                                                                             # HTML special chars - after wiki markup!
    $loneamp.'|'.                                                                       # HTML special chars - ampersand NOT part of an entity
    '\n'.                                                                               # new line


Now we can "escape" all HTML special characters, as we should:
        // convert HTML thingies (including ampersand NOT part of entity)
        if ($thing == '<')
            return '&lt;';
        else if ($thing == '>')
            return '&gt;';
        else if ($thing == '&')
            return '&amp;';


Nesting floats


I happened to find that the code for a left float (<<) would terminate a right float (>>) and vice versa, which would of course likely leave unclosed tags. It turned out that solving this actually made it possible to nest unlike floats, one level deep at least. No great feature, but it could be handy at times.

The solution is actually quite simple: there was just a single "trigger" to keep track of the start and end of a float; keeping a separate trigger for left and right floats (and not generating newlines) is all that's needed:

        // JW 2005-05-23: changed floats handling so they can be nested (one type within another only)
        // float box left
        else if ($thing == '<<')
        {
            #return (++$trigger_floatl % 2 ? '<div class="floatl">'."\n" : "\n</div>\n");
            return (++$trigger_floatl % 2 ? '<div class="floatl">' : '</div>'); # JW changed (no newline)
        }
        // float box right
        else if ($thing == '>>')
        {
            #return (++$trigger_floatl % 2 ? '<div class="floatr">'."\n" : "\n</div>\n");
            return (++$trigger_floatr  % 2 ? '<div class="floatr">' : '</div>');    # JW changed (trigger, no newline)
        }

Note line 111 (the active return statement in the >> branch), where we now use $trigger_floatr instead of $trigger_floatl: this solves the bug and creates a new micro-feature at the same time.
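The difference between one shared trigger and two separate ones can be shown with a small sketch. The function toy_floats() and its $separate switch are hypothetical and exist only for this comparison; the real Formatter handles the markers inside its main callback.

```php
<?php
// Toy handler for the << and >> float markers.
// $separate = false mimics the old behaviour (one counter shared by both markers);
// $separate = true keeps one counter per marker, as in the new code.
function toy_floats($text, $separate)
{
    $trigger = array('<<' => 0, '>>' => 0);
    $class   = array('<<' => 'floatl', '>>' => 'floatr');

    return preg_replace_callback(
        '/<<|>>/',
        function ($m) use (&$trigger, $class, $separate) {
            $key = $separate ? $m[0] : '<<';
            // Odd count opens a float box, even count closes one.
            return (++$trigger[$key] % 2)
                ? '<div class="' . $class[$m[0]] . '">'
                : '</div>';
        },
        $text
    );
}

$in = '<<left >>right>> more<<';
echo toy_floats($in, false), "\n";
// prints: <div class="floatl">left </div>right<div class="floatr"> more</div>
echo toy_floats($in, true), "\n";
// prints: <div class="floatl">left <div class="floatr">right</div> more</div>
```

With the shared counter, the first >> closes the left float instead of opening a right one; with separate counters the right float ends up nested inside the left one, as intended.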

Ids in embedded code

follows

Heading ids

Creating ids for headings is (you guessed it) the first (and necessary) piece of the puzzle to enable generating page TOCs (see TableofcontentsAction), but other bits will be needed for that as well, such as actually gathering the references to headings (and their levels), and the ability to link to page fragments (something our current core, WikkaCore, does not support yet). So: we cannot generate TOCs yet, but we are getting there; the code is also designed so that it can be extended to generate TOCs not just for headings, but also for things like images, tables and code blocks.

A method for generating a TOC has not been decided yet (we may even provide alternatives), but one thing we certainly need is ids for headings (see TableofcontentsAction for more background on this); and even if we do not (yet) generate a TOC, being able to link to a page fragment (the obvious next step) will be useful in itself.

Some thought went into the method of generating the ids. Ideally they should be 'recognizable', so that creating links to a page fragment with a heading will be easy, and they should be as 'constant' as possible, so that a link to a section remains a link to that section even if the section is moved to a different position on the page or another one is inserted before it. This implies that any method that simply generates a sequential id will not fulfill our requirements. We also don't want to burden the writer with coming up with ids (or even needing to think about them): they should be able to just concentrate on the content. Instead, we use the following approach:


The result is an id that is almost always derived directly from the heading content, giving a high chance that it will remain constant even if the page content is re-arranged: thus it provides a reliable target for a link.

The Code


Here's the code (all of it). This replaces the file ./formatters/wakka.php
follows