=====Improved Formatter=====
//Installed as a [[WikkaBetaFeatures beta feature]] on this server as of 2005-06-12.//
>>**see also:**
~-AdvancedFormatter
~-GenerateUniqueId
~-GrabCodeHandler
~-TableofcontentsAction
>>This is the development page for an improved version of "the Formatter", specifically, the code in ##./formatters/wakka.php## (as opposed to the AdvancedFormatter page which deals with "advanced" formatting in other ways as well, such as standardized code //generation// utilities).::c::
====Why?====
While our current (//version 1.1.6.0//) Formatter is quite capable, it has some quirks and bugs, doesn't always generate valid XHTML (though it tries hard), and misses a few things that would be nice to have or that would //enable// things that would be nice to have (such as [[TableofcontentsAction page TOC]]s). The improved version presented here tries to address some of these issues (with more likely to follow).
====What?====
Here is a short summary of what has changed (details below):
~- using single quotes wherever possible making [[RegEx]]es and generated HTML easier to read;
~- better closing of open tags at end of document, including open indents and lists (a long-standing bug!) ''Now improved''
~- better handling of nested lists so change of list "type" is actually detected and coded correctly; also produces nicely-formatted HTML code for lists and indents now, especially more readable for nested lists. ''New!''
~- escaping single & (not part of an entity) (another long-standing problem);
~- ability to nest one type of float within another (so a right float can contain a left float and vice versa)
~- handling ids (and making them unique) as provided in embedded code, using the ##""makeId()""## method;
~- creating ids for headings based on content ('afterburner' type formatting so this //includes// originally embedded code); this code uses not only the ##""makeId()""## method but also PHP's ##html_entity_decode()## function, which is built in only from PHP 4.3.0 onwards, so older PHP versions need a substitute for it.
''The code presented below is still considered a beta version and as such contains many lines of (commented-out) debug code. These will of course be removed before final release. Any reference to line numbers is (for now) to the new (beta) code since this is a complete drop-in replacement for the original file.''
===Closing open tags===
The current version (Wikka //1.1.6.0//) of the Formatter has a bit of code contributed by DotMG to close any left-open tags at the very end of a page. While that can solve some problems with rendering and including pages, the code was incomplete in //which// open tags were closed. A particular problem was still-open lists and indents which weren't handled at all (see "List parsing bug?" on WikkaBugs). Also, this code would directly ##echo## output instead of returning a string as the rest of the Formatter's main function does.
The new version addresses all of these problems.
Closing of indents and (open) lists was already happening when encountering a newline that **doesn't** start with a TAB or a **##~##**, so this bit is separated out as a function. ''Improved now by removing superfluous variables and corresponding parameters.''
%%(php;10)if (!function_exists('close_indents'))
{
	function close_indents(&$indentClosers,&$oldIndentLevel) # JW 2005-07-11 removed superfluous variables
	{
		$result='';
		$c = count($indentClosers);
		for ($i = 0; $i < $c; $i++)
		{
			$result .= array_pop($indentClosers);
			$br = 0;	# NOTE: local to this function; does not change the static $br in wakka2callback()
		}
		$oldIndentLevel = 0;
		return $result;
	}
}%%
The section that handles newlines now only needs to call this function:
%%(php;487) $result .= close_indents($indentClosers,$oldIndentLevel); # JW 2005-07-11 removed superfluous variables
$result .= ($br) ? "<br />\n" : "\n";
$br = 1;
return $result;%%
To close open tags at the end of the page, the new code now calls this function first, and then handles all other open tags, in an order that at least minimizes incorrect tag nesting (but see "**Not a complete solution!**" below):
%%(php;61) if ((!is_array($things)) && ($things == 'closetags'))
{
$result .= close_indents($indentClosers,$oldIndentLevel); # JW 2005-07-11 removed superfluous variables
if ($trigger_bold % 2) $result .= '</strong>';
if ($trigger_italic % 2) $result .= '</em>';
if ($trigger_keys % 2) $result .= '</kbd>';
if ($trigger_monospace % 2) $result .= '</tt>';
if ($trigger_underline % 2) $result .= '</span>';
if ($trigger_notes % 2) $result .= '</span>';
if ($trigger_strike % 2) $result .= '</span>';
if ($trigger_inserted % 2) $result .= '</span>';
if ($trigger_deleted % 2) $result .= '</span>';
if ($trigger_center % 2) $result .= '</div>';
if ($trigger_floatl % 2) $result .= '</div>';
if ($trigger_floatr % 2) $result .= '</div>'; # JW added
for ($i = 1; $i<=5; $i ++)
{
if ($trigger_l[$i] % 2) $result .= ("</h$i>");
}
$trigger_bold = $trigger_italic = $trigger_keys = $trigger_monospace = 0;
$trigger_underline = $trigger_notes = $trigger_strike = $trigger_inserted = $trigger_deleted = 0;
$trigger_center = $trigger_floatl = $trigger_floatr = 0;
$trigger_l = array(-1, 0, 0, 0, 0, 0);
return $result;
}
else
{
$thing = $things[1];
}%%
This is now used like this:
%%(php;684) $text .= wakka2callback('closetags'); # JW changed logic%%
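The principle behind the 'closetags' branch can be shown in isolation. What follows is only a minimal sketch, not the Formatter's own code: the function ##close_open_tags()## and the ##$triggers## and ##$closers## arrays are made up for this illustration. Each trigger counts how often a markup toggle was seen, so an odd count means the corresponding tag is still open.

%%(php)<?php
// Hypothetical helper (not part of wakka.php): each trigger counts how often
// a markup toggle was seen; an odd count means the tag is still open.
function close_open_tags($triggers, $closers)
{
	$result = '';
	foreach ($triggers as $name => $count)
	{
		if ($count % 2)	# odd: opened but never closed
		{
			$result .= $closers[$name];
		}
	}
	return $result;
}

$closers  = array('bold' => '</strong>', 'italic' => '</em>', 'keys' => '</kbd>');
$triggers = array('bold' => 3, 'italic' => 2, 'keys' => 1);	# bold and keys left open
echo close_open_tags($triggers, $closers);	# prints </strong></kbd>
%%

Note that the closing order here simply follows the order of the array, just as the real code follows a fixed order of ##if## statements; neither knows in which order the tags were actually opened.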
==Not a complete solution!==
A big problem remains, however: in order to produce valid (X)HTML, open tags cannot just be closed //anywhere//: there are rules for which elements can contain which other elements. For instance, an inline element (like <em>) can never contain a block element (like a list). So if the inline element is left open (which happens if someone types ""//"" to start emphasized text but doesn't close it before starting an indent or list), closing the generated opening <em> tag at the end of the page may prevent display problems in some browsers, but the result is still not valid (X)HTML. This type of problem can only really be addressed with a completely different mechanism for a formatter. This should definitely be tackled at some time, but is outside the scope of the current improvements, which are designed to work //within// the current Formatter's mechanism.
===Better handling of nested lists and indents===
//**New** as of 2005-07-12//
There were still some issues with nested lists and indents, in particular when the **type** of list changed without a "level" change or when changing to a higher level ("outdent"). At the same time, a list or indent right at the start of a page was not detected or handled at all. (Although that is bad style, and a page should start with a heading, it still should be handled correctly by the formatter, of course.) Finally, accented (Umlaut) characters were treated as a list type.
In order to detect a list or indent at the start of the page as well as after a newline (and avoid Umlauts) the detection ""RegEx"" for a list or indent is now coded as follows (in the wakka2callback call):
%%(php;670) '(^|\n)([\t~]+)(-|&|[0-9a-zA-Z]+\))?|'. # indents and lists # JW FIXED 2005-07-12 also match tab or ~ at start of document%%
By using **##(^|\n)##** as an anchor for matching instead of merely **##\n##** the start of the page is also matched.
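The effect of the changed anchor is easy to check outside the Formatter. This is a small sketch with a made-up sample page (the ##$page## string is an assumption for the demo); it uses only the list/indent part of the expression and counts the matches with and without the ##(^|\n)## anchor.

%%(php)<?php
// Sample page source (an assumption for this demo) that starts with a list item.
$page = "\t-first item\n\t-second item\nplain text";

// Old-style anchor: only a newline can precede the indent.
$old = preg_match_all('/\n([\t~]+)(-|&|[0-9a-zA-Z]+\))?/', $page, $m1);
// New-style anchor: start of the string OR a newline.
$new = preg_match_all('/(^|\n)([\t~]+)(-|&|[0-9a-zA-Z]+\))?/', $page, $m2);

echo "old: $old, new: $new\n";	# prints old: 1, new: 2
%%

With the old anchor the first list item is missed because nothing precedes it; with the new anchor both items are found.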
The actual code for //handling// a list or indent line was completely rewritten to properly handle change of list types and to produce readable and nicely indented XHTML code. Note that this section now also starts with the ##(^|\n)## anchor:
%%(php;368) // indented text
# JW FIXED 2005-07-09 accented chars not used for ordered lists
# JW FIXED 2005-07-12 this does not cover the case where a list item is followed by an inline comment of the *same* level
# JW FIXED 2005-07-12 as with the expression in the /edit handler this does not cover tab or ~ at the start of the document
elseif (preg_match('/(^|\n)([\t~]+)(-|&|[0-9a-zA-Z]+\))?(\n|$)/s', $thing, $matches))
{
$br = 0; # no break needed after a block
// get new indent level
$newIndentLevel = strlen($matches[2]); # JW 2005-07-12 also match tab or ~ at start of document
// derive code indent
$codeIndent = str_repeat("\t",$newIndentLevel-1);
$nlTabs = "\n".$codeIndent;
$nlTabsOut = $nlTabs."\t";
// find out which indent type we want
$newIndentType = $matches[3]; # JW 2005-07-12 also match tab or ~ at start of document
// derive code fragments
if ($newIndentType == '') # plain indent
{
$opener = '<div class="indent">';
$closer = '</div>'/*.$nlTabs*/;
}
elseif ($newIndentType == '-') # unordered list
{
$opener = '<ul>'.$nlTabs.'<li>';
$closer = '</li>'.$nlTabs.'</ul>';
}
elseif ($newIndentType == '&') # inline comment
{
$opener = '<ul class="thread">'.$nlTabs.'<li>';
$closer = '</li>'.$nlTabs.'</ul>';
}
else # ordered list
{
$opener = '<ol type="'.substr($newIndentType, 0, 1).'">'.$nlTabs.'<li>';
$closer = '</li>'.$nlTabs.'</ol>';
$newIndentType = 'o';
}
// do an indent
if ($newIndentLevel > $oldIndentLevel)
{
for ($i = 0; $i < $newIndentLevel - $oldIndentLevel; $i++)
{
$result .= $nlTabs./*'<!--nested item '.$newIndentLevel.'-->'.*/$opener;
array_push($indentClosers, $closer);
#$result .= '<!--pushed type: '.$oldIndentType.' -->'; # @@@
array_push($indentTypes, $oldIndentType); # remember type hierarchically
}
}
// do an outdent or stay at the same level
else if ($newIndentLevel <= $oldIndentLevel)
{
$bOutdent = FALSE;
if ($newIndentLevel < $oldIndentLevel)
{
$bOutdent = TRUE; # remember we're outdenting, for correct layout
// do the outdenting
for ($i = 0; $i < $oldIndentLevel - $newIndentLevel; $i++)
{
if ($i > 0)
{
$result .= $nlTabsOut;
}
$result .= array_pop($indentClosers)/*.'<!--outdent to '.$newIndentLevel.'-->'*/;
$oldIndentType = array_pop($indentTypes); # make sure we will compare with "correct" previous type
#$result .= '<!--popped type: '.$oldIndentType.' -->'; # @@@
}
}
if ($bOutdent) # outdenting: put close tag on new line
{
$result .= $nlTabs/*.'<!--outdent: close tag on new line-->'*/;
}
// JW 2005-07-11 new item of different type
if ($newIndentType != $oldIndentType)
{
$result .= array_pop($indentClosers);
$result .= /*'<!--type change follows (old: '.$oldIndentType.' new: '.$newIndentType.') -->'.*/$nlTabs.$opener;
array_push($indentClosers, $closer);
}
// new item of same type
else
{
// plain indent
if ($newIndentType == '')
{
$result .= $closer./*'<!--same type ('.$newIndentType.') same level-->'.*/$nlTabs.$opener;
}
// list or inline comment
else
{
$result .= '</li>'.$nlTabs.'<li>'/*.'<!--back to same type-->'*/;
}
}
}
$oldIndentType = $newIndentType; # remember type sequentially
$oldIndentLevel = $newIndentLevel;
return $result;
}%%
Since the new code avoids adding an extra ##<br />## before a list (##ul##, ##ol##) or indent (##div##) - these are block-level elements and line breaks should not be used to separate them (they really should be used only within flowing **text**) - the stylesheet had to be tweaked a little since it actually (implicitly) assumes a line break is there. Change the following in ##css/wikka.css## or your own "skin":
%%(css;100)ul, ol {
margin-top: 0px;
margin-bottom: 0px;
padding-top: 0px;
padding-bottom: 0px;
}
%%---
to:
%%(css;100)ul, ol {
/*margin-top: 0px;*/ /* keep natural margin; an extra <br/> is no longer generated */
margin-bottom: 0px;
padding-top: 0px;
padding-bottom: 0px;
}
ul ul, ol ol, ul ol, ol ul { /* keep suppressing margin for nested lists */
margin-top: 0px;
}
%%---(Since we're dealing with beta code anyway, line numbers refer to the stylesheet as implemented on this server.)
Also, the styling for inline comments (line 66) should be moved so it actually overrides the generic style on line 100 etc. instead of the other way round:
%%(css;66)/* ul.thread styles moved so they come after the generic ul style */%%
%%(css;308)/* these ul.thread styles must come after the generic ul style in order to override it */
ul.thread {
list-style-type: none;
border-left: 2px #666 solid;
padding-left: 10px;
margin: 5px 0px;
}
ul.thread li {
color: #333;
font-size: 12px;
}%%
===Escaping single ampersands===
While there are a few cases where it's actually allowed to use a plain **&** in HTML, in most cases where an ampersand is not part of an entity reference it needs to be escaped as **&amp;**. The current (//version 1.1.6.0//) Formatter escapes the **<** and **>** special characters, but not **&**, so the result may be invalid XHTML.
We need to find the ampersands that are **not** part of an entity reference. So we first build a RegEx to recognize the part of an entity reference that //follows// the ampersand that starts it; it can be a named entity, or a decimal or a hex numerical entity; and it can be terminated by a semicolon (;) in most cases, but there are a few cases where it's legal to leave off the terminating semicolon. To make it easier to read, we build the RegEx to express all that from its constituent parts:
%%(php;649)// define entity patterns
// NOTE most also used in wikka.php for htmlentities_ent(): REGEX library!
$alpha = '[a-z]+'; # character entity reference
$numdec = '#[0-9]+'; # numeric character reference (decimal)
$numhex = '#x[0-9a-f]+'; # numeric character reference (hexadecimal)
$terminator = ';|(?=($|[\n<]|&lt;))'; # semicolon; or end-of-string, newline or tag
$entitypat = '('.$alpha.'|'.$numdec.'|'.$numhex.')('.$terminator.')'; # defines entity pattern without the starting &
$entityref = '&'.$entitypat; # entity reference%%
So now we can define a 'lone' ampersand as one that is **not** followed by the expression **##$entitypat##**:
%%(php;675)$loneamp = '&(?!'.$entitypat.')'; # ampersand NOT part of an entity%%
This then becomes part of the big expression that's used in the **##preg_replace_callback()##** near the end of the file, as the last thing to consider before a newline:
%%(php;674) '<|>|'. # HTML special chars - after wiki markup!
$loneamp.'|'. # HTML special chars - ampersand NOT part of an entity
'\n'. # new line%%
Now we can "escape" all HTML special characters, as we should:
%%(php;96) // convert HTML thingies (including ampersand NOT part of entity)
if ($thing == '<')
	return '&lt;';
else if ($thing == '>')
	return '&gt;';
else if ($thing == '&')
	return '&amp;';%%
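To see what the 'lone ampersand' expression does and does not catch, it can be tried on its own. The sketch below builds the pattern the same way as the fragments above and applies it with a plain ##preg_replace()## instead of the Formatter's callback; the sample input string is an assumption for the demo.

%%(php)<?php
// Patterns built the same way as in the Formatter fragments above.
$alpha      = '[a-z]+';			# character entity reference
$numdec     = '#[0-9]+';		# numeric character reference (decimal)
$numhex     = '#x[0-9a-f]+';	# numeric character reference (hexadecimal)
$terminator = ';|(?=($|[\n<]|&lt;))';
$entitypat  = '('.$alpha.'|'.$numdec.'|'.$numhex.')('.$terminator.')';
$loneamp    = '&(?!'.$entitypat.')';	# ampersand NOT part of an entity

// Sample input (an assumption for this demo): two lone ampersands, three entities.
$in  = 'Q & A, x&y=1 &copy; &#169; &#xa9;';
$out = preg_replace('/'.$loneamp.'/', '&amp;', $in);
echo $out;	# prints Q &amp; A, x&amp;y=1 &copy; &#169; &#xa9;
%%

Only the two ampersands that do not start an entity reference are escaped; the named, decimal and hexadecimal references are left alone.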
===Nesting floats===
I happened to find that the code for a left float (""<<"") would terminate a right float ("">>"") and vice versa, which would of course likely leave unclosed tags. It turned out that solving that actually made it possible to nest //unlike// floats - one level deep, at least. No great feature, but it could be handy at times. The solution is actually quite simple: there was just a single "trigger" to keep track of the start and end of a float; keeping a separate trigger for left and right floats (and not generating newlines) is all that's needed:
%%(php;103) // JW 2005-05-23: changed floats handling so they can be nested (one type within another only)
// float box left
else if ($thing == '<<')
{
#return (++$trigger_floatl % 2 ? '<div class="floatl">'."\n" : "\n</div>\n");
return (++$trigger_floatl % 2 ? '<div class="floatl">' : '</div>'); # JW changed (no newline)
}
// float box right
else if ($thing == '>>')
{
#return (++$trigger_floatl % 2 ? '<div class="floatr">'."\n" : "\n</div>\n");
return (++$trigger_floatr % 2 ? '<div class="floatr">' : '</div>'); # JW changed (trigger, no newline)
}%%---
Note line 114 where we now use a **##$trigger_floatr##** instead of **##$trigger_floatl##**: this solves the bug and creates a new micro-feature at the same time.
''Now that the improved formatter has been installed as a beta feature, I added a small demo in the SandBox (in case it disappears: look for the edit made on 2005-06-14 21:45:57 in the revisions). --JW''
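The difference between one shared trigger and two separate triggers can also be reproduced with a toy example. The function ##render_floats()## below is hypothetical and exists only for this demo: it handles nothing but the two float markers, and its second parameter switches between the old (shared trigger) and new (separate triggers) behaviour.

%%(php)<?php
// Hypothetical mini-formatter for this demo only: handles just the two float markers.
function render_floats($tokens, $separateTriggers)
{
	$triggerL = 0;
	$triggerR = 0;
	$out = '';
	foreach ($tokens as $token)
	{
		if ($token == '<<')
		{
			$out .= (++$triggerL % 2) ? '<div class="floatl">' : '</div>';
		}
		elseif ($token == '>>')
		{
			if ($separateTriggers)
			{
				$out .= (++$triggerR % 2) ? '<div class="floatr">' : '</div>';
			}
			else	# old behaviour: the right float shares the left float's trigger
			{
				$out .= (++$triggerL % 2) ? '<div class="floatr">' : '</div>';
			}
		}
		else
		{
			$out .= $token;
		}
	}
	return $out;
}

// a right float containing a left float
$tokens = array('>>', 'A', '<<', 'B', '<<', 'C', '>>');
echo render_floats($tokens, FALSE)."\n";	# prints <div class="floatr">A</div>B<div class="floatl">C</div>
echo render_floats($tokens, TRUE)."\n";		# prints <div class="floatr">A<div class="floatl">B</div>C</div>
%%

With the shared trigger the left-float marker closes the right float; with separate triggers the left float ends up nested inside the right one.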
===Ids in embedded code===
Since an ##id## must be unique in a page, embedding HTML code, and combining that with generated code, creates a problem. In order to ensure that the page is valid XHTML, every id attribute **must** have a unique value, regardless of where it's coming from.
When **generating** code that should contain ids, this is simple: just use the ##[[GenerateUniqueId makeId()]]## method to generate one, with or without specifying parameters. Still, the result //could// conflict with id attributes in embedded HTML code so we must handle those as well.
We analyze the whole block of embedded code, run each id through ##""makeId()""##; if this method detects the id already exists, it will return an amended value with a sequence suffix; if it finds the id value wasn't valid, it will create a new one and return that. The formatter then replaces every id for which a different value was returned:
%%(php;219) // escaped text
else if (preg_match('/^""(.*)""$/s', $thing, $matches))
{
/*
echo 'embedded content<br/>';
*/
// get config
# $allowed_double_doublequote_html = $wakka->GetConfigValue('double_doublequote_html');
$ddquotes_policy = $wakka->config['double_doublequote_html'];
/*
echo 'double quotes: '.$ddquotes_policy.'<br/>';
*/
// get embedded code
$embedded = $matches[1];
// handle embedded id attributes for 'safe' and 'raw'
if ($ddquotes_policy == 'safe' || $ddquotes_policy == 'raw')
{
// get tags with id attributes
$patTagWithId = '((<[a-z].*?)(id=("|\')(.*?)\\4)(.*?>))';
// with PREG_SET_ORDER we get an array for each match: easy to use with list()!
// we do the match case-insensitive so we catch uppercase HTML as well;
// SafeHTML will treat this but 'raw' may end up with invalid code!
$tags2 = preg_match_all('/'.$patTagWithId.'/i',$embedded,$matches2,PREG_SET_ORDER); # use backref to match both single and double quotes
/*
echo '# of matches (2): '.$tags2.'<br/>';
echo '<!--found (set order):'."\n";
print_r($matches2);
echo '-->'."\n";
*/
// step through code, replacing tags with ids with tags with new ('repaired') ids
$tmpembedded = $embedded;
$newembedded = '';
for ($i=0; $i < $tags2; $i++)
{
list(,$tag,$tagstart,$attrid,$quote,$id,$tagend) = $matches2[$i]; # $attrid not needed, just for clarity
$parts = explode($tag,$tmpembedded,2); # split in two at matched tag
if ($id != ($newid = $wakka->makeId('embed',$id))) # replace if we got a new value
{
/*
echo 'replacing tag - old id: '.$id.' new id: '.$newid.'<br/>';
*/
$tag = $tagstart.'id='.$quote.$newid.$quote.$tagend;
}
/*
echo "<!--old: $tag -->\n";
echo "<!--new: $replacetag -->\n";
*/
$newembedded .= $parts[0].$tag; # append (replacement) tag to first part
$tmpembedded = $parts[1]; # after tag: next bit to handle
}
$newembedded .= $tmpembedded; # add last part
/*
echo '<!--translation:'."\n";
echo $newembedded;
echo '-->'."\n";
*/
}
// return (treated) embedded content according to config
// NOTE: we apply SafeHTML *after* id treatment so it won't be throwing away invalid ids that we're repairing instead!
switch ($ddquotes_policy)
{
case 'safe':
return $wakka->ReturnSafeHTML($newembedded);
case 'raw':
return $newembedded; # may still be invalid code - 'raw' will not be corrected!
default:
return $wakka->htmlspecialchars_ent($embedded); # display only
}
}%%
As long as ids in the embedded code are valid and unique, they remain unchanged because ##""makeId()""## is called with the 'embed' parameter which tells it not to add an id group prefix.
Still, it is important to remember that ids can only truly be guaranteed to be unique if **every** bit of code that generates HTML with ids is actually using the ##""makeId()""## method - and **that includes user-contributed extensions**.
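The idea of "run every id through one central method and replace it if something different comes back" can be sketched independently of Wikka. In the example below ##unique_id()## is a hypothetical stand-in for ##""makeId()""## (the real method also validates ids and knows about id groups), and the ##_2## style suffix is an assumption, not necessarily the format the real method produces.

%%(php)<?php
// Hypothetical stand-in for makeId(); the real method does more (validation, groups).
function unique_id($id)
{
	static $seen = array();	# ids handed out so far on this "page"
	if (!isset($seen[$id]))
	{
		$seen[$id] = 1;
		return $id;	# first use: unchanged
	}
	// duplicate: append a sequence suffix until the result is unused
	do
	{
		$seen[$id]++;
		$candidate = $id.'_'.$seen[$id];
	}
	while (isset($seen[$candidate]));
	$seen[$candidate] = 1;
	return $candidate;
}

// Callback: rebuild the matched tag start with a (possibly changed) id.
function repair_ids_callback($m)
{
	return $m[1].$m[2].unique_id($m[3]).$m[2];
}

// Rewrite id attributes in a fragment of embedded HTML.
function repair_ids($html)
{
	# backreference \2 matches the same quote character that opened the value
	return preg_replace_callback('/(<[a-z][^>]*?\sid=)("|\')(.*?)\2/i', 'repair_ids_callback', $html);
}

echo repair_ids('<p id="intro">a</p><p id="intro">b</p>');
# prints <p id="intro">a</p><p id="intro_2">b</p>
%%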
''The "Fatal error: Call to a member function on a non-object" bug referred to on WikkaBugsResolved is also fixed here (line 282).''
''**TODO:** There is still one problem to be solved here: when embedded HTML code contains an id, it's entirely possible that it (or a following embedded section) contains a reference to that id. When the ##""makeId()""## method finds it is necessary to change an id because it conflicts with a pre-existing one, any reference to it should also be updated.
The current code does not (yet) take care of this - it's a fairly complicated problem to solve correctly, but will be tackled soon.''
===Heading ids===
Creating ids for headings is (you guessed it) the first (and necessary) piece of the puzzle to enable generating [[TableofcontentsAction page TOC]]s, but other bits will be needed for that as well, such as actually **gathering** the references to headings (and their levels), and the ability to **link** to page fragments (something our [[Docs:WikkaCore current core]] does not support yet). So: we cannot generate TOCs - yet - but we are getting there; the code is also designed to make it possible to extend it to generate TOCs not just for headings, but also for things like images, tables and code blocks.
A method for generating a TOC has not been decided yet (we may even provide alternatives), but one thing we //certainly// need is ids for headings (see TableofcontentsAction for more background on this); and even if we do not (yet) generate a TOC, being able to link to a page fragment (the obvious next step) will be useful in itself.
Some thought went into the method of generating the ids: Ideally they should be 'recognizable' so creating links **to** a page fragment with a heading will be easy, and they should be as 'constant' as possible so a link to a section //remains// a link to //that// section, even if that is moved to a different position on the page, or another is inserted before it. This implies that all methods that simply generate a sequential id will not fulfill our requirements. We also don't want to burden the writer with coming up with ids (or even needing to think about them): they should be able to just concentrate on the **content**. Instead, we use the following approach:
~-The actual content of the heading is the basis for the id: this makes it very likely the id will remain the same even if page sections are re-arranged or new sections inserted.
~-A heading may contain images; if the images have an alt text, they are replaced by this (after all, alt text is meant for precisely that: to replace an image where it cannot be shown!); all other tags are simply stripped.
~-A heading may also contain entity references; where possible these are translated into ASCII letters (using ##html_entity_decode()##) while all other ones are removed.
~-Any character (except whitespace) that is not valid in an id is then removed: the result is a string that consists only of characters valid for an id - but it could now possibly be an empty string.
~-A valid id can only contain letters, digits, hyphens, underscores, colons and periods, and must start with an (ASCII) letter; this implies that spaces (and whitespace in general) are not allowed. However, the ##""makeId()""## method tackles this by first transforming all whitespace into underscores.
~-The resulting string is examined for uniqueness within the group of "heading" ids; if necessary, a sequence number is added, or a new id hash is generated.
All this is implemented as an "afterburner" type of formatter which is applied after all basic formatting has already taken place and we already have the XHTML output of that process. This ensures that **all** headings are taken into account, whether they are generated from Wikka markup or from embedded HTML code. The afterburner ##preg_replace_callback()## function is designed to be extended with other types of code fragments we might want to generate ids (and maybe [[TableofcontentsAction page TOC]]s...) for.
The 'afterburner' function is defined like this:
%%(php;531)if (!function_exists('wakka3callback'))
{
/**
* "Afterburner" formatting: extra handling of already-generated XHTML code.
*
* 1.
* Ensure every heading has an id, either specified or generated. (May be
* extended to generate section TOC data.)
* If an id is specified, that is used without any modification.
* If no id is specified, it is generated on the basis of the heading content:
* - any image tag is replaced by its alt text (if specified)
* - all tags are stripped
* - all characters that are not valid in an id are stripped (except whitespace)
* - the resulting string is then used by makeId() to generate an id out of it
*
* @access private
* @uses Wakka::makeId()
*
* @param array $things required: matches of the regex in the preg_replace_callback
* @return string heading with an id attribute
*/
function wakka3callback($things)
{
global $wakka;
$thing = $things[1];
// heading
if (preg_match('#^<(h[1-6])(.*?)>(.*?)</\\1>$#s', $thing, $matches)) # note that we don't match headings that are not valid XHTML!
{
/*
echo 'heading:<pre>';
print_r($matches);
echo '</pre>';
*/
list($element,$tagname,$attribs,$heading) = $matches;
#if (preg_match('/(id=("|\')(.*?)\\2)/',$attribs,$matches)) # use backref to match both single and double quotes
if (preg_match('/(id=("|\')(.*?)\\2)/',$attribs)) # use backref to match both single and double quotes
{
// existing id attribute: nothing to do (assume already treated as embedded code)
// @@@ we *may* want to gather ids and heading text for a TOC here ...
// heading text should then get partly the same treatment as when we're creating ids:
// at least replace images and strip tags - we can leave entities etc. alone - so we end up with
// plain text-only
// do this if we have a condition set to generate a TOC
return $element;
}
else
{
// no id: we'll have to create one
#echo 'no id provided - create one<br/>';
$tmpheading = trim($heading);
// first find and replace any image with its alt text
// @@@ can we use preg_match_all here? would it help?
while (preg_match('/(<img.*?alt=("|\')(.*?)\\2.*?>)/',$tmpheading,$matches))
{
#echo 'image found: '.$tmpheading.'<br/>';
# 1 = whole element
# 3 = alt text
list(,$element, ,$alttext) = $matches;
/*
echo 'embedded image:<pre>';
print_r($matches);
echo '</pre>';
*/
// gather data for replacement
$search = '/'.str_replace('/','\/',$element).'/'; # whole element (delimiter chars escaped!) @@@ use preg_quote as well?
$replace = trim($alttext); # alt text
/*
echo 'pat_repl:<pre>';
echo 'search: '.$search.'<br/>';
echo 'search: '.$replace.'<br/>';
echo '</pre>';
*/
// now replace img tag by corresponding alt text
$tmpheading = preg_replace($search,$replace,$tmpheading); # replace image by alt text
}
$headingtext = $tmpheading;
#echo 'headingtext (no img): '.$headingtext.'<br/>';
// @@@ 2005-05-27 now first replace linebreaks <br/> with spaces!!
// remove all other tags
$headingtext = strip_tags($headingtext);
#echo 'headingtext (no tags): '.$headingtext.'<br/>';
// @@@ this all-text result is usable for a TOC!!!
// do this if we have a condition set to generate a TOC
// replace entities that can be interpreted
// use default charset ISO-8859-1 because other chars won't be valid for an id anyway
$headingtext = html_entity_decode($headingtext,ENT_NOQUOTES);
// remove any remaining entities (so we don't end up with strange words and numbers in the id text)
$headingtext = preg_replace('/&[#]?.+?;/','',$headingtext);
#echo 'headingtext (entities decoded/removed): '.$headingtext.'<br/>';
// finally remove non-id characters (except whitespace which is handled by makeId())
$headingtext = preg_replace('/[^A-Za-z0-9_:.\s-]/','',$headingtext); # hyphen placed last so it is always a literal, never a range
#echo 'headingtext (id-ready): '.$headingtext.'<br/>';
// now create id based on resulting heading text
$id = $wakka->makeId('hn',$headingtext);
#echo 'id: '.$id.'<br/>';
// rebuild element, adding id
return '<'.$tagname.$attribs.' id="'.$id.'">'.$heading.'</'.$tagname.'>';
}
}
// other elements to be treated go here (tables, images, code sections...)
}
}%%
This is called (after the primary formatter) as follows:
%%(php;687)// add ids to heading elements
// @@@ LATER:
// - extend with other elements (tables, images, code blocks)
// - also create array(s) for TOC(s)
$idstart = getmicrotime();
$text = preg_replace_callback(
'#('.
'<h[1-6].*?>.*?</h[1-6]>'.
// other elements to be treated go here
')#ms','wakka3callback',$text);
printf('<!-- Header id generation took %.6f seconds -->', (getmicrotime() - $idstart));%%
The result is an id that is almost always derived directly from the heading content, giving a high chance that it will remain constant even if the page content is re-arranged: thus it provides a reliable target for a link.
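The steps for deriving an id from heading content can be condensed into a few lines. The following is a simplified sketch, not the code used in the Formatter: ##heading_to_id()## is a hypothetical helper, it replaces all images in one pass, and it leaves out what ##""makeId()""## adds (checking uniqueness and making sure the id starts with a letter).

%%(php)<?php
// Hypothetical helper for this demo: turns heading HTML into an id candidate.
// Uniqueness checking (done by makeId() in the Formatter) is left out.
function heading_to_id($heading)
{
	// replace each image by its alt text
	$text = preg_replace('/<img[^>]*?alt=("|\')(.*?)\1[^>]*>/i', '$2', $heading);
	// strip all remaining tags
	$text = strip_tags($text);
	// decode entities that can be interpreted, drop the rest
	$text = html_entity_decode($text, ENT_NOQUOTES, 'ISO-8859-1');
	$text = preg_replace('/&#?[a-z0-9]+;/i', '', $text);
	// remove characters not valid in an id (whitespace is kept for now)
	$text = preg_replace('/[^A-Za-z0-9_:.\s-]/', '', $text);
	// whitespace becomes underscores
	return preg_replace('/\s+/', '_', trim($text));
}

echo heading_to_id('Tips &amp; <em>Tricks</em> <img src="new.png" alt="new" />');
# prints Tips_Tricks_new
%%

Because the id depends only on the heading's own content, moving the heading elsewhere on the page does not change it.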
===Keeping track of recursion level===
Since the Formatter is now "better" at closing any tags left open at the end of the string it's handling, a new issue arose, turning up with some of the beta code on this server: when the Formatter is being called recursively by an action (Formatter -> Action -> Formatter...) the "second-level" formatter will close all tags that were opened by the "first-level" formatter. When an action using the formatter is embedded in something like a heading or a list element (which in most cases should be entirely valid) the heading or list item (and whatever else is "open") is closed at the end of the action rather than at the point where it should be. While it's not really a good idea for an action (which is interpreted **by** the Formatter calling the Action() method) to call the Formatter in its turn (and usually redundant), the Formatter should handle this more elegantly and close tags only in its "topmost" instance.
The solution is to keep track of the recursion level and close open tags at the end only at the "outermost" level. The first thing we need is a variable to keep track of the level; we add this at the start of the Wakka class in ##wikka.php##, after the other object variables:
%%(php;105) var $callLevel = 0; # JW 2005-07-15 keep track of recursion levels of the formatter%%---
Then in the Formatter, right before the ##wakka2callback()## function is called to do the actual formatting, we increment the variable:
%%(php;659)$this->callLevel++; # JW 2005-07-15 recursion level: getting in
$text = preg_replace_callback(
%%---
and right afterwards we decrement it again, after which we execute the 'closetags' routine only when the call level is back at 0:
%%(php;681)$this->callLevel--; # JW 2005-07-15 recursion level: getting out
if ($this->callLevel == 0) # JW 2005-07-15 only for "outmost" call level
{
$text .= wakka2callback('closetags'); # JW changed logic
}
%%
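The counting technique itself can be tried out without Wikka. In this sketch everything is a hypothetical stand-in: the global ##$callLevel## mirrors the object variable added above, ##format_text()## plays the Formatter, the ##""{{inner}}""## placeholder plays an action that calls the Formatter again, and the ##[closetags]## marker shows where open tags would be closed.

%%(php)<?php
// Hypothetical stand-ins for this demo (see the lead-in); not Wikka code.
$callLevel = 0;

function format_text($text)
{
	global $callLevel;
	$callLevel++;	# recursion level: getting in
	$out = $text;
	if (strpos($out, '{{inner}}') !== FALSE)
	{
		// the "action" calls the formatter again
		$out = str_replace('{{inner}}', format_text('nested'), $out);
	}
	$callLevel--;	# recursion level: getting out
	if ($callLevel == 0)	# only the outermost call closes open tags
	{
		$out .= '[closetags]';
	}
	return $out;
}

echo format_text('outer {{inner}} text');	# prints outer nested text[closetags]
%%

The marker appears once, at the very end, rather than directly after the nested text.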
====The Code====
Here's the code (all of it). This **replaces** the file **##./formatters/wakka.php##**.
''This //incorporates// the small change needed to support DarTar's GrabCodeHandler, slightly extended to take advantage of the ability of the new [[AdvancedFormOpen advanced FormOpen()]] method to add a class to a form (lines 338-347) so the form can be properly styled.
If you want to test this improved formatter, you should either also grab DarTar's GrabCodeHandler or comment out these lines and **un**comment line 337.''
%%(php;1)<?php
// This may look a bit strange, but all possible formatting tags have to be in a single regular expression for this to work correctly. Yup!
// #dotmg [many lines] : Unclosed tags fix! For more info, [email protected]
// JavaWoman - corrected and improved unclosed tags handling, including missing ones and indents
// ------------- define the necessary functions -----------
if (!function_exists('close_indents'))
{
function close_indents(&$indentClosers,&$oldIndentLevel) # JW 2005-07-11 removed superfluous variables
{
$result='';
$c = count($indentClosers);
for ($i = 0; $i < $c; $i++)
{
$result .= array_pop($indentClosers);
$br = 0; # NOTE: local to this function; does not change the static $br in wakka2callback()
}
$oldIndentLevel = 0;
return $result;
}
}
if (!function_exists('wakka2callback'))
{
function wakka2callback($things)
{
$result='';
static $oldIndentType = ''; # JW 2005-07-12 added
static $oldIndentLevel = 0;
#static $oldIndentLength= 0; # JW 2005-07-12 removed superfluous variables
static $indentClosers = array();
static $indentTypes = array(); # JW 2005-07-12 added
#static $newIndentSpace = array(); # JW 2005-07-12 removed superfluous variables
static $br = 1;
static $trigger_bold = 0;
static $trigger_italic = 0;
static $trigger_keys = 0;
static $trigger_monospace = 0;
static $trigger_underline = 0;
static $trigger_notes = 0;
static $trigger_strike = 0;
static $trigger_inserted = 0;
static $trigger_deleted = 0;
static $trigger_center = 0;
static $trigger_floatl = 0;
static $trigger_floatr = 0; # JW added
static $trigger_l = array(-1, 0, 0, 0, 0, 0);
global $wakka; # @@@ should be capitalized but requires change in wikka.php (etc.)
if ((!is_array($things)) && ($things == 'closetags'))
{
$result .= close_indents($indentClosers,$oldIndentLevel); # JW 2005-07-11 removed superfluous variables
if ($trigger_bold % 2) $result .= '</strong>';
if ($trigger_italic % 2) $result .= '</em>';
if ($trigger_keys % 2) $result .= '</kbd>';
if ($trigger_monospace % 2) $result .= '</tt>';
if ($trigger_underline % 2) $result .= '</span>';
if ($trigger_notes % 2) $result .= '</span>';
if ($trigger_strike % 2) $result .= '</span>';
if ($trigger_inserted % 2) $result .= '</span>';
if ($trigger_deleted % 2) $result .= '</span>';
if ($trigger_center % 2) $result .= '</div>';
if ($trigger_floatl % 2) $result .= '</div>';
if ($trigger_floatr % 2) $result .= '</div>'; # JW added
for ($i = 1; $i<=5; $i ++)
{
if ($trigger_l[$i] % 2) $result .= ("</h$i>");
}
$trigger_bold = $trigger_italic = $trigger_keys = $trigger_monospace = 0;
$trigger_underline = $trigger_notes = $trigger_strike = $trigger_inserted = $trigger_deleted = 0;
$trigger_center = $trigger_floatl = $trigger_floatr = 0;
$trigger_l = array(-1, 0, 0, 0, 0, 0);
return $result;
}
else
{
$thing = $things[1];
}
// convert HTML thingies (including ampersand NOT part of entity)
if ($thing == '<')
return '&lt;';
else if ($thing == '>')
return '&gt;';
else if ($thing == '&')
return '&amp;';
// JW 2005-05-23: changed floats handling so they can be nested (one type within another only)
// float box left
else if ($thing == '<<')
{
#return (++$trigger_floatl % 2 ? '<div class="floatl">'."\n" : "\n</div>\n");
return (++$trigger_floatl % 2 ? '<div class="floatl">' : '</div>'); # JW changed (no newline)
}
// float box right
else if ($thing == '>>')
{
#return (++$trigger_floatl % 2 ? '<div class="floatr">'."\n" : "\n</div>\n");
return (++$trigger_floatr % 2 ? '<div class="floatr">' : '</div>'); # JW changed (trigger, no newline)
}
// clear floated box
else if ($thing == '::c::')
{
return ('<div class="clear">&nbsp;</div>'."\n");
}
// keyboard
else if ($thing == '#%')
{
return (++$trigger_keys % 2 ? '<kbd class="keys">' : '</kbd>');
}
// bold
else if ($thing == '**')
{
return (++$trigger_bold % 2 ? '<strong>' : '</strong>');
}
// italic
else if ($thing == '//')
{
return (++$trigger_italic % 2 ? '<em>' : '</em>');
}
// monospace
else if ($thing == '##')
{
return (++$trigger_monospace % 2 ? '<tt>' : '</tt>');
}
// underline
else if ($thing == '__')
{
return (++$trigger_underline % 2 ? '<span class="underline">' : '</span>');
}
// notes
else if ($thing == "''")
{
return (++$trigger_notes % 2 ? '<span class="notes">' : '</span>');
}
// strikethrough
else if ($thing == '++')
{
return (++$trigger_strike % 2 ? '<span class="strikethrough">' : '</span>');
}
// additions
else if ($thing == '££')
{
return (++$trigger_inserted % 2 ? '<span class="additions">' : '</span>');
}
// deletions
else if ($thing == '¥¥')
{
return (++$trigger_deleted % 2 ? '<span class="deletions">' : '</span>');
}
// center
else if ($thing == '@@')
{
return (++$trigger_center % 2 ? '<div class="center">'."\n" : "\n</div>\n");
}
// urls
else if (preg_match('/^([a-z]+:\/\/\S+?)([^[:alnum:]^\/])?$/', $thing, $matches))
{
$url = $matches[1];
if (preg_match('/^(.*)\.(gif|jpg|png)/si', $url)) {
return '<img src="'.$url.'" alt="image" />'.$matches[2];
} else
// Mind Mapping Mod
if (preg_match('/^(.*)\.(mm)/si', $url)) {
return $wakka->Action('mindmap '.$url);
} else
return $wakka->Link($url).$matches[2];
}
// header level 5
else if ($thing == '==')
{
$br = 0;
return (++$trigger_l[5] % 2 ? '<h5>' : "</h5>\n");
}
// header level 4
else if ($thing == '===')
{
$br = 0;
return (++$trigger_l[4] % 2 ? '<h4>' : "</h4>\n");
}
// header level 3
else if ($thing == '====')
{
$br = 0;
return (++$trigger_l[3] % 2 ? '<h3>' : "</h3>\n");
}
// header level 2
else if ($thing == '=====')
{
$br = 0;
return (++$trigger_l[2] % 2 ? '<h2>' : "</h2>\n");
}
// header level 1
else if ($thing == '======')
{
$br = 0;
return (++$trigger_l[1] % 2 ? '<h1>' : "</h1>\n");
}
// forced line breaks
else if ($thing == "---")
{
return '<br />';
}
// escaped text
else if (preg_match('/^""(.*)""$/s', $thing, $matches))
{
/*
echo 'embedded content<br/>';
*/
// get config
# $allowed_double_doublequote_html = $wakka->GetConfigValue('double_doublequote_html');
$ddquotes_policy = $wakka->config['double_doublequote_html'];
/*
echo 'double quotes: '.$ddquotes_policy.'<br/>';
*/
// get embedded code
$embedded = $matches[1];
// handle embedded id attributes for 'safe' and 'raw'
if ($ddquotes_policy == 'safe' || $ddquotes_policy == 'raw')
{
// get tags with id attributes
$patTagWithId = '((<[a-z].*?)(id=("|\')(.*?)\\4)(.*?>))';
// with PREG_SET_ORDER we get an array for each match: easy to use with list()!
// we do the match case-insensitive so we catch uppercase HTML as well;
// SafeHTML will treat this but 'raw' may end up with invalid code!
$tags2 = preg_match_all('/'.$patTagWithId.'/i',$embedded,$matches2,PREG_SET_ORDER); # use backref to match both single and double quotes
/*
echo '# of matches (2): '.$tags2.'<br/>';
echo '<!--found (set order):'."\n";
print_r($matches2);
echo '-->'."\n";
*/
// step through code, replacing tags with ids with tags with new ('repaired') ids
$tmpembedded = $embedded;
$newembedded = '';
for ($i=0; $i < $tags2; $i++)
{
list(,$tag,$tagstart,$attrid,$quote,$id,$tagend) = $matches2[$i]; # $attrid not needed, just for clarity
$parts = explode($tag,$tmpembedded,2); # split in two at matched tag
if ($id != ($newid = $wakka->makeId('embed',$id))) # replace if we got a new value
{
/*
echo 'replacing tag - old id: '.$id.' new id: '.$newid.'<br/>';
*/
$tag = $tagstart.'id='.$quote.$newid.$quote.$tagend;
}
/*
echo "<!--old: $tag -->\n";
echo "<!--new: $replacetag -->\n";
*/
$newembedded .= $parts[0].$tag; # append (replacement) tag to first part
$tmpembedded = $parts[1]; # after tag: next bit to handle
}
$newembedded .= $tmpembedded; # add last part
/*
echo '<!--translation:'."\n";
echo $newembedded;
echo '-->'."\n";
*/
}
// return (treated) embedded content according to config
// NOTE: we apply SafeHTML *after* id treatment so it won't be throwing away invalid ids that we're repairing instead!
switch ($ddquotes_policy)
{
case 'safe':
return $wakka->ReturnSafeHTML($newembedded);
case 'raw':
return $newembedded; # may still be invalid code - 'raw' will not be corrected!
default:
return $wakka->htmlspecialchars_ent($embedded); # display only
}
}
// code text
else if (preg_match('/^% %(.*?)% %$/s', $thing, $matches))
{
/*
* Note: this routine is rewritten such that (new) language formatters
* will automatically be found, whether they are GeSHi language config files
* or "internal" Wikka formatters.
* Path to GeSHi language files and Wikka formatters MUST be defined in config.
* For line numbering (GeSHi only) a starting line can be specified after the language
* code, separated by a ; e.g., % %(php;27)....% %.
* Specifying >= 1 turns on line numbering if this is enabled in the configuration.
*/
$code = $matches[1];
// if configuration path isn't set, make sure we'll get an invalid path so we
// don't match anything in the home directory
$geshi_hi_path = isset($wakka->config['geshi_languages_path']) ? $wakka->config['geshi_languages_path'] : '/:/';
$wikka_hi_path = isset($wakka->config['wikka_highlighters_path']) ? $wakka->config['wikka_highlighters_path'] : '/:/';
// check if a language (and starting line) has been specified
if (preg_match("/^\((.+?)(;([0-9]+))??\)(.*)$/s", $code, $matches))
{
list(, $language, , $start, $code) = $matches;
}
// get rid of newlines at start and end (and preceding/following whitespace)
// Note: unlike trim(), this preserves any tabs at the start of the first "real" line
$code = preg_replace('/^\s*\n+|\n+\s*$/','',$code);
// check if GeSHi path is set and we have a GeSHi hilighter for this language
if (isset($language) && isset($wakka->config['geshi_path']) && file_exists($geshi_hi_path.'/'.$language.'.php'))
{
// use GeSHi for hilighting
$output = $wakka->GeSHi_Highlight($code, $language, $start);
}
// check Wikka highlighter path is set and if we have an internal Wikka hilighter
elseif (isset($language) && isset($wakka->config['wikka_formatter_path']) && file_exists($wikka_hi_path.'/'.$language.'.php') && 'wakka' != $language)
{
// use internal Wikka hilighter
$output = '<div class="code">'."\n";
$output .= $wakka->Format($code, $language);
$output .= "</div>\n";
}
// no language defined or no formatter found: make default code block;
// IncludeBuffered() will complain if 'code' formatter doesn't exist
else
{
$output = '<div class="code">'."\n";
$output .= $wakka->Format($code, 'code');
$output .= "</div>\n";
}
#return $output;
// START DarTar modified 2005-02-17
// slight mod JavaWoman 2005-06-12: coding style, class for form
//build form
$form = $wakka->FormOpen('grabcode','','post','','grabcode');
$form .= '<input type="submit" name="save" class="grabcodebutton" style="line-height:10px; float:right; vertical-align: middle; margin-right:20px; margin-top:0px; font-size: 10px; color: #000; font-weight: normal; font-family: Verdana, Arial, sans-serif; background-color: #DDD; text-decoration: none; height:18px;" value="Grab" title="Download this code" />';
$form .= '<input type="hidden" name="code" value="'.urlencode($code).'" />';
$form .= $wakka->FormClose();
// output
return $output."\n".$form;
// END DarTar modified 2005-02-17
}
// forced links
// \S : any character that is not a whitespace character
// \s : any whitespace character
else if (preg_match('/^\[\[(\S*)(\s+(.+))?\]\]$/s', $thing, $matches)) # recognize forced links across lines
{
list(, $url, , $text) = $matches;
if ($url)
{
//if ($url!=($url=(preg_replace("/@@|££||\[\[/","",$url))))$result="</span>";
if (!$text) $text = $url;
//$text=preg_replace("/@@|££|\[\[/","",$text);
return $result.$wakka->Link($url,'', $text);
}
else
{
return '';
}
}
// indented text
# JW FIXED 2005-07-09 accented chars not used for ordered lists
# JW FIXED 2005-07-12 this does not cover the case where a list item is followed by an inline comment of the *same* level
# JW FIXED 2005-07-12 as with the expression in the /edit handler this does not cover tab or ~ at the start of the document
elseif (preg_match('/(^|\n)([\t~]+)(-|&|[0-9a-zA-Z]+\))?(\n|$)/s', $thing, $matches))
{
$br = 0; # no break needed after a block
// get new indent level
$newIndentLevel = strlen($matches[2]); # JW 2005-07-12 also match tab or ~ at start of document
// derive code indent
$codeIndent = str_repeat("\t",$newIndentLevel-1);
$nlTabs = "\n".$codeIndent;
$nlTabsOut = $nlTabs."\t";
// find out which indent type we want
$newIndentType = $matches[3]; # JW 2005-07-12 also match tab or ~ at start of document
// derive code fragments
if ($newIndentType == '') # plain indent
{
$opener = '<div class="indent">';
$closer = '</div>'/*.$nlTabs*/;
}
elseif ($newIndentType == '-') # unordered list
{
$opener = '<ul>'.$nlTabs.'<li>';
$closer = '</li>'.$nlTabs.'</ul>';
}
elseif ($newIndentType == '&') # inline comment
{
$opener = '<ul class="thread">'.$nlTabs.'<li>';
$closer = '</li>'.$nlTabs.'</ul>';
}
else # ordered list
{
$opener = '<ol type="'.substr($newIndentType, 0, 1).'">'.$nlTabs.'<li>';
$closer = '</li>'.$nlTabs.'</ol>';
$newIndentType = 'o';
}
// do an indent
if ($newIndentLevel > $oldIndentLevel)
{
for ($i = 0; $i < $newIndentLevel - $oldIndentLevel; $i++)
{
$result .= $nlTabs./*'<!--nested item '.$newIndentLevel.'-->'.*/$opener;
array_push($indentClosers, $closer);
#$result .= '<!--pushed type: '.$oldIndentType.' -->'; # @@@
array_push($indentTypes, $oldIndentType); # remember type hierarchically
}
}
// do an outdent or stay at the same level
else if ($newIndentLevel <= $oldIndentLevel)
{
$bOutdent = FALSE;
if ($newIndentLevel < $oldIndentLevel)
{
$bOutdent = TRUE; # remember we're outdenting, for correct layout
// do the outdenting
for ($i = 0; $i < $oldIndentLevel - $newIndentLevel; $i++)
{
if ($i > 0)
{
$result .= $nlTabsOut;
}
$result .= array_pop($indentClosers)/*.'<!--outdent to '.$newIndentLevel.'-->'*/;
$oldIndentType = array_pop($indentTypes); # make sure we will compare with "correct" previous type
#$result .= '<!--popped type: '.$oldIndentType.' -->'; # @@@
}
}
if ($bOutdent) # outdenting: put close tag on new line
{
$result .= $nlTabs/*.'<!--outdent: close tag on new line-->'*/;
}
// JW 2005-07-11 new item of different type
if ($newIndentType != $oldIndentType)
{
$result .= array_pop($indentClosers);
$result .= /*'<!--type change follows (old: '.$oldIndentType.' new: '.$newIndentType.') -->'.*/$nlTabs.$opener;
array_push($indentClosers, $closer);
}
// new item of same type
else
{
// plain indent
if ($newIndentType == '')
{
$result .= $closer./*'<!--same type ('.$newIndentType.') same level-->'.*/$nlTabs.$opener;
}
// list or inline comment
else
{
$result .= '</li>'.$nlTabs.'<li>'/*.'<!--back to same type-->'*/;
}
}
}
$oldIndentType = $newIndentType; # remember type sequentially
$oldIndentLevel = $newIndentLevel;
return $result;
}
// new lines
else if ($thing == "\n")
{
// if we got here, there was no tab (or ~) in the next line; this means that we can close all open indents.
// JW: we need to do the same thing at the end of the page to close indents NOT followed by newline: use a function
/*
$c = count($indentClosers);
for ($i = 0; $i < $c; $i++)
{
$result .= array_pop($indentClosers);
$br = 0;
}
$oldIndentLevel = 0;
#$oldIndentLength= 0; # superfluous
#$newIndentSpace=array(); # superfluous
*/
$result .= close_indents($indentClosers,$oldIndentLevel); # JW 2005-07-11 removed superfluous variables
$result .= ($br) ? "<br />\n" : "\n";
$br = 1;
return $result;
}
// Actions
else if (preg_match('/^\{\{(.*?)\}\}$/s', $thing, $matches))
{
if ($matches[1])
return $wakka->Action($matches[1]);
else
return '{{}}';
}
// interwiki links!
else if (preg_match('/^[A-ZÄÖÜ][A-Za-zÄÖÜßäöü]+[:]\S*$/s', $thing))
{
return $wakka->Link($thing);
}
// wiki links!
else if (preg_match('/^[A-ZÄÖÜ]+[a-zßäöü]+[A-Z0-9ÄÖÜ][A-Za-z0-9ÄÖÜßäöü]*$/s', $thing))
{
return $wakka->Link($thing);
}
// separators
else if (preg_match('/-{4,}/', $thing, $matches))
{
// TODO: This could probably be improved for situations where someone puts text on the same line as a separator.
// Which is a stupid thing to do anyway! HAW HAW! Ahem.
$br = 0;
return "<hr />\n";
}
// mind map xml
else if (preg_match('/^<map.*<\/map>$/s', $thing))
{
return $wakka->Action('mindmap '.$wakka->Href().'/mindmap.mm');
}
// if we reach this point, it must have been an accident.
// @@@ JW: or a detailed regex that excludes something that was included in the
// preg_replace_callback expression
return $thing;
}
}
if (!function_exists('wakka3callback'))
{
/**
* "Afterburner" formatting: extra handling of already-generated XHTML code.
*
* 1.
* Ensure every heading has an id, either specified or generated. (May be
* extended to generate section TOC data.)
* If an id is specified, that is used without any modification.
* If no id is specified, it is generated on the basis of the heading context:
* - any image tag is replaced by its alt text (if specified)
* - all tags are stripped
* - all characters that are not valid in an id are stripped (except whitespace)
* - the resulting string is then used by makeId() to generate an id out of it
*
* @access private
* @uses Wakka::makeId()
*
* @param array $things required: matches of the regex in the preg_replace_callback
* @return string heading with an id attribute
*/
function wakka3callback($things)
{
global $wakka;
$thing = $things[1];
// heading
if (preg_match('#^<(h[1-6])(.*?)>(.*?)</\\1>$#s', $thing, $matches)) # note that we don't match headings that are not valid XHTML!
{
/*
echo 'heading:<pre>';
print_r($matches);
echo '</pre>';
*/
list($element,$tagname,$attribs,$heading) = $matches;
#if (preg_match('/(id=("|\')(.*?)\\2)/',$attribs,$matches)) # use backref to match both single and double quotes
if (preg_match('/(id=("|\')(.*?)\\2)/',$attribs)) # use backref to match both single and double quotes
{
// existing id attribute: nothing to do (assume already treated as embedded code)
// @@@ we *may* want to gather ids and heading text for a TOC here ...
// heading text should then get partly the same treatment as when we're creating ids:
// at least replace images and strip tags - we can leave entities etc. alone - so we end up with
// plain text-only
// do this if we have a condition set to generate a TOC
return $element;
}
else
{
// no id: we'll have to create one
#echo 'no id provided - create one<br/>';
$tmpheading = trim($heading);
// first find and replace any image with its alt text
// @@@ can we use preg_match_all here? would it help?
while (preg_match('/(<img.*?alt=("|\')(.*?)\\2.*?>)/',$tmpheading,$matches))
{
#echo 'image found: '.$tmpheading.'<br/>';
# 1 = whole element
# 3 = alt text
list(,$element, ,$alttext) = $matches;
/*
echo 'embedded image:<pre>';
print_r($matches);
echo '</pre>';
*/
// gather data for replacement
$search = '/'.preg_quote($element,'/').'/'; # whole element as a literal pattern (regex metacharacters and the delimiter escaped)
$replace = trim($alttext); # alt text
/*
echo 'pat_repl:<pre>';
echo 'search: '.$search.'<br/>';
echo 'search: '.$replace.'<br/>';
echo '</pre>';
*/
// now replace img tag by corresponding alt text
$tmpheading = preg_replace($search,$replace,$tmpheading); # replace image by alt text
}
$headingtext = $tmpheading;
#echo 'headingtext (no img): '.$headingtext.'<br/>';
// @@@ 2005-05-27 now first replace linebreaks <br/> with spaces!!
// remove all other tags
$headingtext = strip_tags($headingtext);
#echo 'headingtext (no tags): '.$headingtext.'<br/>';
// @@@ this all-text result is usable for a TOC!!!
// do this if we have a condition set to generate a TOC
// replace entities that can be interpreted
// use default charset ISO-8859-1 because other chars won't be valid for an id anyway
$headingtext = html_entity_decode($headingtext,ENT_NOQUOTES);
// remove any remaining entities (so we don't end up with strange words and numbers in the id text)
$headingtext = preg_replace('/&[#]?.+?;/','',$headingtext);
#echo 'headingtext (entities decoded/removed): '.$headingtext.'<br/>';
// finally remove non-id characters (except whitespace which is handled by makeId())
$headingtext = preg_replace('/[^A-Za-z0-9_:.\s-]/','',$headingtext); # hyphen placed last so it is always read as a literal
#echo 'headingtext (id-ready): '.$headingtext.'<br/>';
// now create id based on resulting heading text
$id = $wakka->makeId('hn',$headingtext);
#echo 'id: '.$id.'<br/>';
// rebuild element, adding id
return '<'.$tagname.$attribs.' id="'.$id.'">'.$heading.'</'.$tagname.'>';
}
}
// other elements to be treated go here (tables, images, code sections...)
}
}
// ------------- do the work -----------
$text = str_replace("\r\n", "\n", $text);
// replace 4 consecutive spaces at the beginning of a line with tab character
// $text = preg_replace("/\n[ ]{4}/", "\n\t", $text); // moved to edit.php
if ($this->method == 'show') $mind_map_pattern = '<map.*?<\/map>|'; else $mind_map_pattern = '';
// define entity patterns
// NOTE most also used in wikka.php for htmlentities_ent(): REGEX library!
$alpha = '[a-z]+'; # character entity reference
$numdec = '#[0-9]+'; # numeric character reference (decimal)
$numhex = '#x[0-9a-f]+'; # numeric character reference (hexadecimal)
$terminator = ';|(?=($|[\n<]|&lt;))'; # semicolon; or end-of-string, newline or tag
$entitypat = '('.$alpha.'|'.$numdec.'|'.$numhex.')('.$terminator.')'; # defines entity pattern without the starting &
$entityref = '&'.$entitypat; # entity reference
$loneamp = '&(?!'.$entitypat.')'; # ampersand NOT part of an entity
$this->callLevel++; # JW 2005-07-15 recursion level: getting in
$text = preg_replace_callback(
'/('.
'% %.*?% %|'. # code
'"".*?""|'. # literal
$mind_map_pattern.
'\[\[[^\[]*?\]\]|'. # forced link
'-{4,}|---|'. # separator, new line
'\b[a-z]+:\/\/\S+|'. # URL
'\*\*|\'\'|\#\#|\#\%|@@|::c::|\>\>|\<\<|££|¥¥|\+\+|__|\/\/|'. # Wiki markup
'======|=====|====|===|==|'. # headings
'(^|\n)([\t~]+)(-|&|[0-9a-zA-Z]+\))?|'. # indents and lists # JW FIXED 2005-07-12 also match tab or ~ at start of document
'\{\{.*?\}\}|'. # action
'\b[A-ZÄÖÜ][A-Za-zÄÖÜßäöü]+[:](?![=_])\S*\b|'. # InterWiki link
'\b([A-ZÄÖÜ]+[a-zßäöü]+[A-Z0-9ÄÖÜ][A-Za-z0-9ÄÖÜßäöü]*)\b|'. # CamelWords
'<|>|'. # HTML special chars - after wiki markup!
$loneamp.'|'. # HTML special chars - ampersand NOT part of an entity
'\n'. # new line
')/ms','wakka2callback',$text);
// we're cutting the last <br />
$text = preg_replace('/<br \/>$/','',$text);
$this->callLevel--; # JW 2005-07-15 recursion level: getting out
if ($this->callLevel == 0) # JW 2005-07-15 only for "outmost" call level
{
$text .= wakka2callback('closetags'); # JW changed logic
}
// add ids to heading elements
// @@@ LATER:
// - extend with other elements (tables, images, code blocks)
// - also create array(s) for TOC(s)
$idstart = getmicrotime();
$text = preg_replace_callback(
'#('.
'<h[1-6].*?>.*?</h[1-6]>'.
// other elements to be treated go here
')#ms','wakka3callback',$text);
printf('<!-- Header id generation took %.6f seconds -->', (getmicrotime() - $idstart));
echo $text;
?>
%%
''Make sure you replace every occurrence of '**##% %##**' in this code with '**##""%%""##**'!''
====Supporting code====
Only a single **new** [[WikkaCore core]] method is needed for this improved formatter (other new functions are part of the formatter script itself):
===##""makeId()""##===
Used here both for handling ids in embedded HTML code and for generating a unique id for headings; see GenerateUniqueId for the code and where to insert it.
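As a rough illustration of how heading text is reduced to something id-ready before it is handed to ##""makeId()""##, the clean-up steps from ##wakka3callback()## can be sketched as a stand-alone function. This is only a sketch: the function name ##heading_id_text()## is made up for this example, the image pattern is a simplified variant, and the real ##""makeId()""## (see GenerateUniqueId) is not reproduced here.
%%(php)
<?php
// Hypothetical helper (not part of Wikka): reduce heading markup to id-ready text.
function heading_id_text($heading)
{
	$text = trim($heading);
	// replace each image by its alt text (simplified pattern, an assumption)
	$text = preg_replace('/<img\b[^>]*?\balt=("|\')(.*?)\1[^>]*>/i', '$2', $text);
	// remove all other tags
	$text = strip_tags($text);
	// decode entities that can be interpreted, then drop any that remain
	$text = html_entity_decode($text, ENT_NOQUOTES);
	$text = preg_replace('/&[#]?.+?;/', '', $text);
	// finally remove characters that are not valid in an id (whitespace is kept)
	return preg_replace('/[^A-Za-z0-9_:.\s-]/', '', $text);
}

// sample heading content with inline markup, an entity and an image
echo heading_id_text('<em>Hello</em> &amp; <img src="x.png" alt="World" />!'), "\n";
%%
Whitespace is deliberately left in the result, since the formatter leaves it to ##""makeId()""## to deal with.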
====Todo====
==Bugs==
~-Solve conflict with ""::c::"" appearing right after a page name being interpreted as an interwiki link (see comment by TimoK below)
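The conflict can be reproduced in isolation. The snippet below uses an ASCII-only simplification of the ""InterWiki"" alternative from the formatter's main expression (the accented characters are left out, which is an assumption made to keep the example independent of the file encoding). The second pattern, with the colon added to the negative lookahead, is merely one conceivable adjustment and has not been tested against real pages.
%%(php)
<?php
// ASCII-only simplification of the InterWiki alternative in the main expression
$interwiki = '/\b[A-Z][A-Za-z]+[:](?![=_])\S*\b/';
// hypothetical variant: also refuse a second colon right after the first one
$adjusted  = '/\b[A-Z][A-Za-z]+[:](?![=_:])\S*\b/';

$text = 'HomePage::c::';

// the page name plus part of the clear marker is taken for an interwiki link
$hit = preg_match($interwiki, $text, $m) ? $m[0] : '';
echo 'current:  ', $hit, "\n";

// with the extended lookahead nothing matches, so ::c:: stays available as markup
$hit2 = preg_match($adjusted, $text, $m2) ? $m2[0] : '';
echo 'adjusted: ', $hit2, "\n";

// a regular interwiki link is still recognized by the adjusted pattern
$hit3 = preg_match($adjusted, 'Wikipedia:Wiki', $m3) ? $m3[0] : '';
echo 'link:     ', $hit3, "\n";
%%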
==Extensions==
~-The obvious next step would be to add code that generates a [[TableofcontentsAction page TOC]]
~-Handling tables, images and code blocks in a similar way (to create separate TOCs on request) would also be nice
~-Find a way to generate proper paragraphs (possibly based on [[IanAndolina]]'s [[SemanticMarkup method]] - see comments below)
~-When ids in embedded HTML code are //changed// to avoid duplicates, any references to them should be changed accordingly (not so easy)
~-Later (much later) a complete rewrite will be needed to better handle closing tags, ensure valid XHTML and generate proper paragraphs instead of text separated by <br /> tags, which isn't very structural code (and is bad for accessibility).
====Test? Comments?====
Go ahead and test it - either on your own Wikka installation or on this site where it is now installed as a [[WikkaBetaFeatures beta feature]].
Comments and suggestions are more than welcome, as always.
----
CategoryDevelopmentFormatters
''The code presented below is still considered a beta version and as such contains many lines of (commented-out) debug code. These will of course be removed before final release. Any reference to line numbers is (for now) to the new (beta) code since this is a complete drop-in replacement for the original file.''
===Closing open tags===
The current version (Wikka //1.1.6.0//) of the Formatter has a bit of code contributed by DotMG to close any left-open tags at the very end of a page. While that can solve some problems with rendering and including pages, the code was incomplete in //which// open tags were closed. A particular problem was still-open lists and indents which weren't handled at all (see "List parsing bug?" on WikkaBugs). Also, this code would directly ##echo## output instead of returning a string as the rest of the Formatter's main function does.
The new version addresses all of these problems.
Closing of indents and (open) lists was already happening when encountering a newline that is **not** followed by a TAB or a **##~##**, so this bit is now separated out as a function. ''Improved now by removing superfluous variables and corresponding parameters.''
%%(php;10)if (!function_exists('close_indents'))
{
function close_indents(&$indentClosers,&$oldIndentLevel) # JW 2005-07-11 removed superfluous variables
{
$result='';
$c = count($indentClosers);
for ($i = 0; $i < $c; $i++)
{
$result .= array_pop($indentClosers);
$br = 0;
}
$oldIndentLevel = 0;
return $result;
}
}%%
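To see the function in isolation, here is a small stand-alone run. The closer strings and the indent level are made-up sample state, not values taken from a real page. The ##$br = 0;## line is left out of this copy: inside ##close_indents()## that variable is local, so the assignment does not reach the static ##$br## of ##wakka2callback()##.
%%(php)
<?php
// Copy of close_indents() from the listing, without the local $br assignment
function close_indents(&$indentClosers, &$oldIndentLevel)
{
	$result = '';
	$c = count($indentClosers);
	for ($i = 0; $i < $c; $i++)
	{
		// closers come off the stack in reverse order: innermost block first
		$result .= array_pop($indentClosers);
	}
	$oldIndentLevel = 0;
	return $result;
}

// sample state (an assumption): an ordered list nested inside an unordered list
$indentClosers  = array('</li></ul>', '</li></ol>');
$oldIndentLevel = 2;

$html = close_indents($indentClosers, $oldIndentLevel);
echo $html, "\n";
%%
Because both parameters are passed by reference, the caller's stack is empty and its level is back at 0 after the call, which is what allows the same function to be used for a newline and for the end of the page.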
The section that handles newlines now only needs to call this function:
%%(php;487) $result .= close_indents($indentClosers,$oldIndentLevel); # JW 2005-07-11 removed superfluous variables
$result .= ($br) ? "<br />\n" : "\n";
$br = 1;
return $result;%%
To close open tags at the end of the page, the new code now calls this function first, and then handles all other open tags, in an order chosen to at least minimize incorrect tag nesting (but see "**Not a complete solution!**" below):
%%(php;61) if ((!is_array($things)) && ($things == 'closetags'))
{
$result .= close_indents($indentClosers,$oldIndentLevel); # JW 2005-07-11 removed superfluous variables
if ($trigger_bold % 2) $result .= '</strong>';
if ($trigger_italic % 2) $result .= '</em>';
if ($trigger_keys % 2) $result .= '</kbd>';
if ($trigger_monospace % 2) $result .= '</tt>';
if ($trigger_underline % 2) $result .= '</span>';
if ($trigger_notes % 2) $result .= '</span>';
if ($trigger_strike % 2) $result .= '</span>';
if ($trigger_inserted % 2) $result .= '</span>';
if ($trigger_deleted % 2) $result .= '</span>';
if ($trigger_center % 2) $result .= '</div>';
if ($trigger_floatl % 2) $result .= '</div>';
if ($trigger_floatr % 2) $result .= '</div>'; # JW added
for ($i = 1; $i<=5; $i ++)
{
if ($trigger_l[$i] % 2) $result .= ("</h$i>");
}
$trigger_bold = $trigger_italic = $trigger_keys = $trigger_monospace = 0;
$trigger_underline = $trigger_notes = $trigger_strike = $trigger_inserted = $trigger_deleted = 0;
$trigger_center = $trigger_floatl = $trigger_floatr = 0;
$trigger_l = array(-1, 0, 0, 0, 0, 0);
return $result;
}
else
{
$thing = $things[1];
}%%
This is now used like this:
%%(php;684) $text .= wakka2callback('closetags'); # JW changed logic%%
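The counter-based bookkeeping behind this can be tried out with a miniature stand-in for the formatter. It is a sketch under assumptions: ##mini_format()## is a made-up name, only bold and italic are modelled, and it uses a closure (so it needs PHP 5.3 or later, unlike the formatter itself).
%%(php)
<?php
// Hypothetical miniature of the trigger logic; not the actual formatter.
function mini_format($text)
{
	$triggers = array('**' => 0, '//' => 0);           // one counter per markup type
	$tags     = array('**' => 'strong', '//' => 'em');

	$out = preg_replace_callback('/\*\*|\/\//', function ($m) use (&$triggers, $tags) {
		$tag = $tags[$m[0]];
		// odd count: this occurrence opens the element; even count: it closes it
		return (++$triggers[$m[0]] % 2) ? "<$tag>" : "</$tag>";
	}, $text);

	// "closetags" step: an odd counter means the element was left open
	foreach ($triggers as $mark => $count)
	{
		if ($count % 2) $out .= '</'.$tags[$mark].'>';
	}
	return $out;
}

echo mini_format('**bold** and //unclosed italic'), "\n";
echo mini_format('**a //b'), "\n";
%%
In the second call both elements do get closed, but in a fixed order rather than in reverse order of opening, so the tags end up incorrectly nested.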
==Not a complete solution!==
A big problem remains, however: in order to produce valid (X)HTML, open tags cannot just be closed //anywhere//: there are rules for which elements can contain which other elements. For instance, an inline element (like <em>) can never contain a block element (like a list). So if the inline element is left open (which happens if someone types ""//"" to start emphasized text but doesn't close it before starting an indent or list), closing the generated opening <em> tag at the end of the page may prevent display problems in some browsers, but the result is still not valid (X)HTML. This type of problem can only really be addressed with a completely different mechanism for a formatter. This should definitely be tackled at some point, but it is outside the scope of the current improvements, which are designed to work //within// the current Formatter's mechanism.
===Better handling of nested lists and indents===
//**New** as of 2005-07-12//
There were still some issues with nested lists and indents, in particular when the **type** of list changed without a "level" change or when changing to a higher level ("outdent"). At the same time, a list or indent right at the start of a page was not detected or handled at all. (Although that is bad style, and a page should start with a heading, it still should be handled correctly by the formatter, of course.) Finally, accented (Umlaut) characters were treated as a list type.
In order to detect a list or indent at the start of the page as well as after a newline (and to avoid treating Umlauts as list markers), the detection ""RegEx"" for a list or indent is now coded as follows (in the ##preg_replace_callback()## call that invokes ##wakka2callback()##):
%%(php;670) '(^|\n)([\t~]+)(-|&|[0-9a-zA-Z]+\))?|'. # indents and lists # JW FIXED 2005-07-12 also match tab or ~ at start of document%%
By using **##(^|\n)##** as an anchor for matching instead of merely **##\n##** the start of the page is also matched.
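The difference between the two anchors can be checked with a two-line page that starts with a list item. The first pattern is an ASCII-only reconstruction of the ##\n##-only variant (an assumption based on the description above); the second is the new expression, with the same ##ms## modifiers as in the listing.
%%(php)
<?php
// a page whose very first line is already a list item
$page = "~-first item\n~-second item";

$old = '/\n([\t~]+)(-|&|[0-9a-zA-Z]+\))?/ms';      // newline anchor only (reconstructed)
$new = '/(^|\n)([\t~]+)(-|&|[0-9a-zA-Z]+\))?/ms';  // start of page or newline

$oldCount = preg_match_all($old, $page, $oldMatches);
$newCount = preg_match_all($new, $page, $newMatches);

echo "newline anchor only: $oldCount item(s) detected\n";
echo "with start anchor:   $newCount item(s) detected\n";
%%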
The actual code for //handling// a list or indent line was completely rewritten to properly handle change of list types and to produce readable and nicely indented XHTML code. Note that this section now also starts with the ##(^|\n)## anchor:
%%(php;368) // indented text
# JW FIXED 2005-07-09 accented chars not used for ordered lists
# JW FIXED 2005-07-12 this does not cover the case where a list item is followed by an inline comment of the *same* level
# JW FIXED 2005-07-12 as with the expression in the /edit handler this does not cover tab or ~ at the start of the document
elseif (preg_match('/(^|\n)([\t~]+)(-|&|[0-9a-zA-Z]+\))?(\n|$)/s', $thing, $matches))
{
$br = 0; # no break needed after a block
// get new indent level
$newIndentLevel = strlen($matches[2]); # JW 2005-07-12 also match tab or ~ at start of document
// derive code indent
$codeIndent = str_repeat("\t",$newIndentLevel-1);
$nlTabs = "\n".$codeIndent;
$nlTabsOut = $nlTabs."\t";
// find out which indent type we want
$newIndentType = $matches[3]; # JW 2005-07-12 also match tab or ~ at start of document
// derive code fragments
if ($newIndentType == '') # plain indent
{
$opener = '<div class="indent">';
$closer = '</div>'/*.$nlTabs*/;
}
elseif ($newIndentType == '-') # unordered list
{
$opener = '<ul>'.$nlTabs.'<li>';
$closer = '</li>'.$nlTabs.'</ul>';
}
elseif ($newIndentType == '&') # inline comment
{
$opener = '<ul class="thread">'.$nlTabs.'<li>';
$closer = '</li>'.$nlTabs.'</ul>';
}
else # ordered list
{
$opener = '<ol type="'.substr($newIndentType, 0, 1).'">'.$nlTabs.'<li>';
$closer = '</li>'.$nlTabs.'</ol>';
$newIndentType = 'o';
}
// do an indent
if ($newIndentLevel > $oldIndentLevel)
{
for ($i = 0; $i < $newIndentLevel - $oldIndentLevel; $i++)
{
$result .= $nlTabs./*'<!--nested item '.$newIndentLevel.'-->'.*/$opener;
array_push($indentClosers, $closer);
#$result .= '<!--pushed type: '.$oldIndentType.' -->'; # @@@
array_push($indentTypes, $oldIndentType); # remember type hierarchically
}
}
// do an outdent or stay at the same level
else if ($newIndentLevel <= $oldIndentLevel)
{
$bOutdent = FALSE;
if ($newIndentLevel < $oldIndentLevel)
{
$bOutdent = TRUE; # remember we're outdenting, for correct layout
// do the outdenting
for ($i = 0; $i < $oldIndentLevel - $newIndentLevel; $i++)
{
if ($i > 0)
{
$result .= $nlTabsOut;
}
$result .= array_pop($indentClosers)/*.'<!--outdent to '.$newIndentLevel.'-->'*/;
$oldIndentType = array_pop($indentTypes); # make sure we will compare with "correct" previous type
#$result .= '<!--popped type: '.$oldIndentType.' -->'; # @@@
}
}
if ($bOutdent) # outdenting: put close tag on new line
{
$result .= $nlTabs/*.'<!--outdent: close tag on new line-->'*/;
}
// JW 2005-07-11 new item of different type
if ($newIndentType != $oldIndentType)
{
$result .= array_pop($indentClosers);
$result .= /*'<!--type change follows (old: '.$oldIndentType.' new: '.$newIndentType.') -->'.*/$nlTabs.$opener;
array_push($indentClosers, $closer);
}
// new item of same type
else
{
// plain indent
if ($newIndentType == '')
{
$result .= $closer./*'<!--same type ('.$newIndentType.') same level-->'.*/$nlTabs.$opener;
}
// list or inline comment
else
{
$result .= '</li>'.$nlTabs.'<li>'/*.'<!--back to same type-->'*/;
}
}
}
$oldIndentType = $newIndentType; # remember type sequentially
$oldIndentLevel = $newIndentLevel;
return $result;
}%%
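To get a feel for what the expression at the top of this block captures, here is a minimal sketch that runs it against a few markers. The helper ##""classifyIndent()""## is hypothetical (it is not part of ##wakka.php##); it only reports the indent level and the list type the formatter would derive from a marker:
%%(php)<?php
// Hypothetical helper for illustration only: classify one indent/list marker
// using the same expression as the formatter.
function classifyIndent($thing)
{
	if (!preg_match('/(^|\n)([\t~]+)(-|&|[0-9a-zA-Z]+\))?(\n|$)/s', $thing, $matches))
	{
		return null;						# not an indent or list marker
	}
	$level = strlen($matches[2]);			# number of tabs or tildes
	$type = $matches[3];					# '', '-', '&' or something like '1)'
	if ($type === '')
	{
		$name = 'indent';
	}
	elseif ($type === '-')
	{
		$name = 'ul';
	}
	elseif ($type === '&')
	{
		$name = 'thread';
	}
	else
	{
		$name = 'ol type '.substr($type, 0, 1);
	}
	return array($level, $name);
}

print_r(classifyIndent("\n~~-"));			# level 2, unordered list
print_r(classifyIndent("\t1)"));			# level 1, ordered list at start of document
print_r(classifyIndent("\n~&"));			# level 1, inline comment
%%
Because of the ##(^|\n)## anchor, the second call matches even though there is no preceding newline, which is the "start of the document" case mentioned in the comments.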
Since the new code avoids adding an extra ##<br />## before a list (##ul##, ##ol##) or indent (##div##) - these are block-level elements and line breaks should not be used to separate them (they really should be used only within flowing **text**) - the stylesheet had to be tweaked a little since it actually (implicitly) assumes a line break is there. Change the following in ##css/wikka.css## or your own "skin":
%%(css;100)ul, ol {
margin-top: 0px;
margin-bottom: 0px;
padding-top: 0px;
padding-bottom: 0px;
}
%%---
to:
%%(css;100)ul, ol {
/*margin-top: 0px;*/ /* keep natural margin; an extra <br/> is no longer generated */
margin-bottom: 0px;
padding-top: 0px;
padding-bottom: 0px;
}
ul ul, ol ol, ul ol, ol ul { /* keep suppressing margin for nested lists */
margin-top: 0px;
}
%%---(Since we're dealing with beta code anyway, line numbers refer to the stylesheet as implemented on this server.)
Also, the styling for inline comments (line 66) should be moved so it actually overrides the generic style on line 100 etc. instead of the other way round:
%%(css;66)/* ul.thread styles moved so they come after the generic ul style */%%
%%(css;308)/* these ul.thread styles must come after the generic ul style in order to override it */
ul.thread {
list-style-type: none;
border-left: 2px #666 solid;
padding-left: 10px;
margin: 5px 0px;
}
ul.thread li {
color: #333;
font-size: 12px;
}%%
===Escaping single ampersands===
While there are a few cases where it's actually allowed to use a plain **&** in HTML, in most cases where an ampersand is not part of an entity reference it needs to be escaped as &amp;. The current (//version 1.1.6.0//) Formatter escapes the **<** and **>** special characters, but not **&**, so the result may be invalid XHTML.
We need to find the ampersands that are **not** part of an entity reference. So we first build a RegEx to recognize the part of an entity reference that //follows// the ampersand that starts it; it can be a named entity, or a decimal or a hex numerical entity; and it can be terminated by a semicolon (;) in most cases, but there are a few cases where it's legal to leave off the terminating semicolon. To make it easier to read, we build the RegEx to express all that from its constituent parts:
%%(php;649)// define entity patterns
// NOTE most also used in wikka.php for htmlentities_ent(): REGEX library!
$alpha = '[a-z]+'; # character entity reference
$numdec = '#[0-9]+'; # numeric character reference (decimal)
$numhex = '#x[0-9a-f]+'; # numeric character reference (hexadecimal)
$terminator = ';|(?=($|[\n<]|&lt;))'; # semicolon; or end-of-string, newline or tag
$entitypat = '('.$alpha.'|'.$numdec.'|'.$numhex.')('.$terminator.')'; # defines entity pattern without the starting &
$entityref = '&'.$entitypat; # entity reference%%
So now we can define a 'lone' ampersand as one that is **not** followed by the expression **##$entitypat##**:
%%(php;675)$loneamp = '&(?!'.$entitypat.')'; # ampersand NOT part of an entity%%
This then becomes part of the big expression that's used in the **##preg_replace_callback()##** near the end of the file, as the last thing to consider before a newline:
%%(php;674) '<|>|'. # HTML special chars - after wiki markup!
$loneamp.'|'. # HTML special chars - ampersand NOT part of an entity
'\n'. # new line%%
Now we can "escape" all HTML special characters, as we should:
%%(php;96) // convert HTML thingies (including ampersand NOT part of entity)
if ($thing == '<')
return '&lt;';
else if ($thing == '>')
return '&gt;';
else if ($thing == '&')
return '&amp;';%%
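As a quick check of the 'lone ampersand' expression, the following sketch builds the same pattern parts and applies them with a plain ##preg_replace()## instead of the big callback expression. The function ##""escapeLoneAmps()""## is a hypothetical helper for this demonstration only; note that no case-insensitive flag is used here, which is an assumption of the sketch:
%%(php)<?php
// entity patterns, built from the same constituent parts as in the formatter
$alpha = '[a-z]+';						# character entity reference
$numdec = '#[0-9]+';					# numeric character reference (decimal)
$numhex = '#x[0-9a-f]+';				# numeric character reference (hexadecimal)
$terminator = ';|(?=($|[\n<]|&lt;))';	# semicolon; or end-of-string, newline or tag
$entitypat = '('.$alpha.'|'.$numdec.'|'.$numhex.')('.$terminator.')';
$loneamp = '&(?!'.$entitypat.')';		# ampersand NOT part of an entity

// Hypothetical helper (not in wakka.php): escape only the lone ampersands
function escapeLoneAmps($text, $loneamp)
{
	return preg_replace('/'.$loneamp.'/', '&amp;', $text);
}

echo escapeLoneAmps('a & b', $loneamp)."\n";			# ampersand gets escaped
echo escapeLoneAmps('&copy; 2005 &#169;', $loneamp)."\n";	# entities are left alone
%%
The negative lookahead ##(?!...)## is what keeps existing entity references intact: the ampersand only matches when no entity pattern follows it.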
===Nesting floats===
I happened to find that the code for a left float (""<<"") would terminate a right float ("">>"") and vice versa, which would of course likely leave unclosed tags. It turned out that by solving that it actually became possible to nest //unlike// floats - one level deep, at least. No great feature, but it could be handy at times. The solution is actually quite simple: there was just a single "trigger" to keep track of start and end of a float; keeping a separate trigger for left and right floats (and not generating newlines) is all that's needed:
%%(php;103) // JW 2005-05-23: changed floats handling so they can be nested (one type within another only)
// float box left
else if ($thing == '<<')
{
#return (++$trigger_floatl % 2 ? '<div class="floatl">'."\n" : "\n</div>\n");
return (++$trigger_floatl % 2 ? '<div class="floatl">' : '</div>'); # JW changed (no newline)
}
// float box right
else if ($thing == '>>')
{
#return (++$trigger_floatl % 2 ? '<div class="floatr">'."\n" : "\n</div>\n");
return (++$trigger_floatr % 2 ? '<div class="floatr">' : '</div>'); # JW changed (trigger, no newline)
}%%---
Note line 114 where we now use a **##$trigger_floatr##** instead of **##$trigger_floatl##**: this solves the bug and creates a new micro-feature at the same time.
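The effect of the two separate triggers can be tried out in isolation. This is only a sketch: ##""floatMarker()""## is a hypothetical stand-in that handles nothing but the two float markers, using the same toggle logic as the formatter:
%%(php)<?php
// Hypothetical stand-in for the float handling in wakka2callback()
function floatMarker($thing)
{
	static $trigger_floatl = 0;
	static $trigger_floatr = 0;			# separate trigger for right floats
	if ($thing == '<<')
	{
		return (++$trigger_floatl % 2 ? '<div class="floatl">' : '</div>');
	}
	if ($thing == '>>')
	{
		return (++$trigger_floatr % 2 ? '<div class="floatr">' : '</div>');
	}
	return $thing;						# anything else is passed through
}

// a right float containing a left float
$out = '';
foreach (array('>>', 'right ', '<<', 'left', '<<', ' more', '>>') as $part)
{
	$out .= floatMarker($part);
}
echo $out;
// <div class="floatr">right <div class="floatl">left</div> more</div>
%%
With a single shared trigger, the first ""<<"" would have produced a closing tag for the right float instead of opening the nested left float.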
''Now that the improved formatter has been installed as a beta feature, I added a small demo in the SandBox (in case it disappears: look for the edit made on 2005-06-14 21:45:57 in the revisions). --JW''
===Ids in embedded code===
Since an ##id## must be unique within a page, embedding HTML code, and combining that with generated code, creates a problem. In order to ensure that the page is valid XHTML, every id attribute **must** have a unique value, regardless of where it's coming from.
When **generating** code that should contain ids, this is simple: just use the ##[[GenerateUniqueId makeId()]]## method to generate one, with or without specifying parameters. Still, the result //could// conflict with id attributes in embedded HTML code so we must handle those as well.
We analyze the whole block of embedded code, run each id through ##""makeId()""##; if this method detects the id already exists, it will return an amended value with a sequence suffix; if it finds the id value wasn't valid, it will create a new one and return that. The formatter then replaces every id for which a different value was returned:
%%(php;219) // escaped text
else if (preg_match('/^""(.*)""$/s', $thing, $matches))
{
/*
echo 'embedded content<br/>';
*/
// get config
# $allowed_double_doublequote_html = $wakka->GetConfigValue('double_doublequote_html');
$ddquotes_policy = $wakka->config['double_doublequote_html'];
/*
echo 'double quotes: '.$ddquotes_policy.'<br/>';
*/
// get embedded code
$embedded = $matches[1];
// handle embedded id attributes for 'safe' and 'raw'
if ($ddquotes_policy == 'safe' || $ddquotes_policy == 'raw')
{
// get tags with id attributes
$patTagWithId = '((<[a-z].*?)(id=("|\')(.*?)\\4)(.*?>))';
// with PREG_SET_ORDER we get an array for each match: easy to use with list()!
// we do the match case-insensitive so we catch uppercase HTML as well;
// SafeHTML will treat this but 'raw' may end up with invalid code!
$tags2 = preg_match_all('/'.$patTagWithId.'/i',$embedded,$matches2,PREG_SET_ORDER); # use backref to match both single and double quotes
/*
echo '# of matches (2): '.$tags2.'<br/>';
echo '<!--found (set order):'."\n";
print_r($matches2);
echo '-->'."\n";
*/
// step through code, replacing tags with ids with tags with new ('repaired') ids
$tmpembedded = $embedded;
$newembedded = '';
for ($i=0; $i < $tags2; $i++)
{
list(,$tag,$tagstart,$attrid,$quote,$id,$tagend) = $matches2[$i]; # $attrid not needed, just for clarity
$parts = explode($tag,$tmpembedded,2); # split in two at matched tag
if ($id != ($newid = $wakka->makeId('embed',$id))) # replace if we got a new value
{
/*
echo 'replacing tag - old id: '.$id.' new id: '.$newid.'<br/>';
*/
$tag = $tagstart.'id='.$quote.$newid.$quote.$tagend;
}
/*
echo "<!--old: $tag -->\n";
echo "<!--new: $replacetag -->\n";
*/
$newembedded .= $parts[0].$tag; # append (replacement) tag to first part
$tmpembedded = $parts[1]; # after tag: next bit to handle
}
$newembedded .= $tmpembedded; # add last part
/*
echo '<!--translation:'."\n";
echo $newembedded;
echo '-->'."\n";
*/
}
// return (treated) embedded content according to config
// NOTE: we apply SafeHTML *after* id treatment so it won't be throwing away invalid ids that we're repairing instead!
switch ($ddquotes_policy)
{
case 'safe':
return $wakka->ReturnSafeHTML($newembedded);
case 'raw':
return $newembedded; # may still be invalid code - 'raw' will not be corrected!
default:
return $wakka->htmlspecialchars_ent($embedded); # display only
}
}%%
As long as ids in the embedded code are valid and unique, they remain unchanged because ##""makeId()""## is called with the 'embed' parameter which tells it not to add an id group prefix.
Still, it is important to remember that ids can only truly be guaranteed to be unique if **every** bit of code that generates HTML with ids is actually using the ##""makeId()""## method - and **that includes user-contributed extensions**.
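To illustrate the replacement loop on its own, here is a minimal sketch. ##""demoMakeId()""## is a hypothetical stand-in that only tracks uniqueness and appends a numeric suffix; the real ##""makeId()""## method does more (validation, id groups) and its suffix format may well differ. ##""repairIds()""## is likewise just a name chosen for this sketch:
%%(php)<?php
// Hypothetical stand-in for Wakka::makeId(): only guarantees uniqueness
function demoMakeId($id)
{
	static $seen = array();
	$newid = $id;
	$n = 1;
	while (isset($seen[$newid]))
	{
		$newid = $id.'_'.(++$n);		# assumed suffix format, for this demo only
	}
	$seen[$newid] = true;
	return $newid;
}

// Same approach as the formatter: step through the tags that carry an id
function repairIds($embedded)
{
	$patTagWithId = '((<[a-z].*?)(id=("|\')(.*?)\\4)(.*?>))';
	$count = preg_match_all('/'.$patTagWithId.'/i', $embedded, $matches2, PREG_SET_ORDER);
	$rest = $embedded;
	$result = '';
	for ($i = 0; $i < $count; $i++)
	{
		list(, $tag, $tagstart, , $quote, $id, $tagend) = $matches2[$i];
		$parts = explode($tag, $rest, 2);	# split in two at matched tag
		$newid = demoMakeId($id);
		if ($id != $newid)					# replace only if we got a new value
		{
			$tag = $tagstart.'id='.$quote.$newid.$quote.$tagend;
		}
		$result .= $parts[0].$tag;
		$rest = $parts[1];
	}
	return $result.$rest;					# add last part
}

$fixed = repairIds('<p id="a">one</p><p id="a">two</p>');
echo $fixed;	// <p id="a">one</p><p id="a_2">two</p>
%%
The first id is valid and not yet in use, so it stays as it is; only the duplicate is rewritten.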
''The "Fatal error: Call to a member function on a non-object" bug referred to on WikkaBugsResolved is also fixed here (line 282).''
''**TODO:** There is still one problem to be solved here: when embedded HTML code contains an id, it's entirely possible that it (or a following embedded section) contains a reference to that id. When the ##""makeId()""## method finds it is necessary to change an id because it conflicts with a pre-existing one, any reference to it should also be updated.
The current code does not (yet) take care of this - it's a fairly complicated problem to solve correctly, but will be tackled soon.''
===Heading ids===
Creating ids for headings is (you guessed it) the first (and necessary) piece of the puzzle to enable generating [[TableofcontentsAction page TOC]]s, but other bits will be needed for that as well, such as actually **gathering** the references to headings (and their levels), and the ability to **link** to page fragments (something our [[Docs:WikkaCore current core]] does not support yet). So: we cannot generate TOCs - yet - but we are getting there; the code is also designed to make it possible to extend it to generate TOCs not just for headings, but also for things like images, tables and code blocks.
A method for generating a TOC has not been decided yet (we may even provide alternatives), but one thing we //certainly// need is ids for headings (see TableofcontentsAction for more background on this); and even if we do not (yet) generate a TOC, being able to link to a page fragment (the obvious next step) will be useful in itself.
Some thought went into the method of generating the ids: Ideally they should be 'recognizable' so creating links **to** a page fragment with a heading will be easy, and they should be as 'constant' as possible so a link to a section //remains// a link to //that// section, even if that is moved to a different position on the page, or another is inserted before it. This implies that all methods that simply generate a sequential id will not fulfill our requirements. We also don't burden the writer with coming up with ids (or even needing to think about them): they should be able to just concentrate on the **content**. Instead, we use the following approach:
~-The actual content of the heading is the basis for the id: this makes it very likely the id will remain the same even if page sections are re-arranged or new sections inserted.
~-A heading may contain images; if the images have an alt text, they are replaced by this (after all, alt text is meant for precisely that: to replace an image where it cannot be shown!); all other tags are simply stripped.
~-A heading may also contain entity references; where possible these are translated into ASCII letters (using ##html_entity_decode()##) while all other ones are removed.
~-Any character (except whitespace) that is not valid in an id is then removed: the result is a string that consists only of characters valid for an id - but it could now possibly be an empty string.
~-A valid id can only contain letters, numbers, dashes, underscores and periods, and must start with an (ASCII) letter; this implies that spaces (and whitespace in general) are not allowed. However, the ##""makeId()""## method tackles this by first transforming all whitespace into underscores.
~-The resulting string is examined for uniqueness within the group of "heading" ids; if necessary, a sequence number is added, or a new id hash is generated.
All this is implemented as an "afterburner" type of formatter which is applied after all basic formatting has already taken place and we already have the XHTML output of that process. This ensures that **all** headings are taken into account, whether they are generated from Wikka markup or from embedded HTML code. The afterburner ##preg_replace_callback()## function is designed to be extended with other types of code fragments we might want to generate ids (and maybe [[TableofcontentsAction page TOC]]s...) for.
The 'afterburner' function is defined like this:
%%(php;531)if (!function_exists('wakka3callback'))
{
/**
* "Afterburner" formatting: extra handling of already-generated XHTML code.
*
* 1.
* Ensure every heading has an id, either specified or generated. (May be
* extended to generate section TOC data.)
* If an id is specified, that is used without any modification.
* If no id is specified, it is generated on the basis of the heading content:
* - any image tag is replaced by its alt text (if specified)
* - all tags are stripped
* - all characters that are not valid in an id are stripped (except whitespace)
* - the resulting string is then used by makeId() to generate an id out of it
*
* @access private
* @uses Wakka::makeId()
*
* @param array $things required: matches of the regex in the preg_replace_callback
* @return string heading with an id attribute
*/
function wakka3callback($things)
{
global $wakka;
$thing = $things[1];
// heading
if (preg_match('#^<(h[1-6])(.*?)>(.*?)</\\1>$#s', $thing, $matches)) # note that we don't match headings that are not valid XHTML!
{
/*
echo 'heading:<pre>';
print_r($matches);
echo '</pre>';
*/
list($element,$tagname,$attribs,$heading) = $matches;
#if (preg_match('/(id=("|\')(.*?)\\2)/',$attribs,$matches)) # use backref to match both single and double quotes
if (preg_match('/(id=("|\')(.*?)\\2)/',$attribs)) # use backref to match both single and double quotes
{
// existing id attribute: nothing to do (assume already treated as embedded code)
// @@@ we *may* want to gather ids and heading text for a TOC here ...
// heading text should then get partly the same treatment as when we're creating ids:
// at least replace images and strip tags - we can leave entities etc. alone - so we end up with
// plain text-only
// do this if we have a condition set to generate a TOC
return $element;
}
else
{
// no id: we'll have to create one
#echo 'no id provided - create one<br/>';
$tmpheading = trim($heading);
// first find and replace any image with its alt text
// @@@ can we use preg_match_all here? would it help?
while (preg_match('/(<img.*?alt=("|\')(.*?)\\2.*?>)/',$tmpheading,$matches))
{
#echo 'image found: '.$tmpheading.'<br/>';
# 1 = whole element
# 3 = alt text
list(,$element, ,$alttext) = $matches;
/*
echo 'embedded image:<pre>';
print_r($matches);
echo '</pre>';
*/
// gather data for replacement
$search = '/'.str_replace('/','\/',$element).'/'; # whole element (delimiter chars escaped!) @@@ use preg_quote as well?
$replace = trim($alttext); # alt text
/*
echo 'pat_repl:<pre>';
echo 'search: '.$search.'<br/>';
echo 'search: '.$replace.'<br/>';
echo '</pre>';
*/
// now replace img tag by corresponding alt text
$tmpheading = preg_replace($search,$replace,$tmpheading); # replace image by alt text
}
$headingtext = $tmpheading;
#echo 'headingtext (no img): '.$headingtext.'<br/>';
// @@@ 2005-05-27 now first replace linebreaks <br/> with spaces!!
// remove all other tags
$headingtext = strip_tags($headingtext);
#echo 'headingtext (no tags): '.$headingtext.'<br/>';
// @@@ this all-text result is usable for a TOC!!!
// do this if we have a condition set to generate a TOC
// replace entities that can be interpreted
// use default charset ISO-8859-1 because other chars won't be valid for an id anyway
$headingtext = html_entity_decode($headingtext,ENT_NOQUOTES);
// remove any remaining entities (so we don't end up with strange words and numbers in the id text)
$headingtext = preg_replace('/&[#]?.+?;/','',$headingtext);
#echo 'headingtext (entities decoded/removed): '.$headingtext.'<br/>';
// finally remove non-id characters (except whitespace which is handled by makeId())
$headingtext = preg_replace('/[^A-Za-z0-9_:.-\s]/','',$headingtext);
#echo 'headingtext (id-ready): '.$headingtext.'<br/>';
// now create id based on resulting heading text
$id = $wakka->makeId('hn',$headingtext);
#echo 'id: '.$id.'<br/>';
// rebuild element, adding id
return '<'.$tagname.$attribs.' id="'.$id.'">'.$heading.'</'.$tagname.'>';
}
}
// other elements to be treated go here (tables, images, code sections...)
}
}%%
This is called (after the primary formatter) as follows:
%%(php;687)// add ids to heading elements
// @@@ LATER:
// - extend with other elements (tables, images, code blocks)
// - also create array(s) for TOC(s)
$idstart = getmicrotime();
$text = preg_replace_callback(
'#('.
'<h[1-6].*?>.*?</h[1-6]>'.
// other elements to be treated go here
')#ms','wakka3callback',$text);
printf('<!-- Header id generation took %.6f seconds -->', (getmicrotime() - $idstart));%%
The result is an id that is almost always derived directly from the heading content, giving a high chance that it will remain constant even if the page content is re-arranged: thus it provides a reliable target for a link.
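The text-preparation steps can be condensed into a few lines to see what kind of string ends up as the basis for an id. This is only a sketch: ##""headingIdText()""## is a hypothetical helper, and the final whitespace-to-underscore step is done inline here on the assumption that this mirrors what ##""makeId()""## does; uniqueness and the "must start with a letter" rule are left out:
%%(php)<?php
// Hypothetical helper condensing the steps of wakka3callback(); not part of wakka.php
function headingIdText($heading)
{
	$text = trim($heading);
	// replace each image by its alt text
	$text = preg_replace('/<img.*?alt=("|\')(.*?)\\1.*?>/', '$2', $text);
	// remove all other tags
	$text = strip_tags($text);
	// decode entities that can be interpreted, then drop any that remain
	$text = html_entity_decode($text, ENT_NOQUOTES, 'ISO-8859-1');
	$text = preg_replace('/&[#]?.+?;/', '', $text);
	// remove characters that are not valid in an id (whitespace is kept for now)
	$text = preg_replace('/[^A-Za-z0-9_:.\s-]/', '', $text);
	// assumption: makeId() transforms whitespace into underscores; done inline here
	return preg_replace('/\s+/', '_', trim($text));
}

echo headingIdText('Why <em>this</em>?')."\n";
echo headingIdText('<img src="logo.png" alt="Wikka" /> Formatter &amp; more')."\n";
%%
Since the string depends only on the heading itself, moving the section elsewhere on the page does not change it.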
===Keeping track of recursion level===
Since the Formatter is now "better" at closing any tags left open at the end of the string it's handling, a new issue arose with some of the beta code on this server: when the Formatter is being called recursively by an action (Formatter -> Action -> Formatter...) the "second-level" formatter will close all tags that were opened by the "first-level" formatter. When an action using the formatter is embedded in something like a heading or a list element (which in most cases should be entirely valid) the heading or list item (and whatever else is "open") is closed at the end of the action rather than at the point where it should be. While it's not really a good idea (and usually redundant) for an action (which is interpreted **by** the Formatter calling the Action() method) to call the Formatter in its turn, the Formatter should handle this more elegantly and close tags only in its "topmost" instance.
The solution is to keep track of the recursion level and close open tags at the end only at the "outermost" level. The first thing we need is a variable to keep track of the level; we add this at the start of the Wakka class in ##wikka.php##, after the other object variables:
%%(php;105) var $callLevel = 0; # JW 2005-07-15 keep track of recursion levels of the formatter%%---
Then in the Formatter, right before the ##wakka2callback()## function is called to do the actual formatting, we increment the variable:
%%(php;659)$this->callLevel++; # JW 2005-07-15 recursion level: getting in
$text = preg_replace_callback(
%%---
and right afterwards we decrement it again, after which we execute the 'closetags' routine only when the call level is back at 0:
%%(php;681)$this->callLevel--; # JW 2005-07-15 recursion level: getting out
if ($this->callLevel == 0) # JW 2005-07-15 only for "outmost" call level
{
$text .= wakka2callback('closetags'); # JW changed logic
}
%%
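The principle can be shown in miniature. In this sketch, ##""DemoFormatter""## is a hypothetical class (not part of Wikka) with a single shared bold toggle standing in for the static triggers; a nested array stands in for an action that calls the formatter again:
%%(php)<?php
// Hypothetical miniature of the recursion guard; not part of Wikka
class DemoFormatter
{
	var $callLevel = 0;		# recursion level of format()
	var $bold = 0;			# shared toggle, like the static $trigger_bold

	function format($items)
	{
		$this->callLevel++;					# getting in
		$out = '';
		foreach ($items as $item)
		{
			if (is_array($item))
			{
				$out .= $this->format($item);	# nested call, as from an action
			}
			elseif ($item == '**')
			{
				$out .= (++$this->bold % 2 ? '<strong>' : '</strong>');
			}
			else
			{
				$out .= $item;
			}
		}
		$this->callLevel--;					# getting out
		if ($this->callLevel == 0 && $this->bold % 2)	# close tags at outermost level only
		{
			$out .= '</strong>';
			$this->bold = 0;
		}
		return $out;
	}
}

$formatter = new DemoFormatter();
$html = $formatter->format(array('**', 'bold ', array('from action'), ' still bold'));
echo $html;	// <strong>bold from action still bold</strong>
%%
Without the check on the call level, the nested call would emit the closing tag right after "from action", which is exactly the problem described above.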
====The Code====
Here's the code (all of it). This **replaces** the file **##./formatters/wakka.php##**.
''This //incorporates// the small change needed to support DarTar's GrabCodeHandler, slightly extended to take advantage of the ability of the new [[AdvancedFormOpen advanced FormOpen()]] method to add a class to a form (lines 338-347) so the form can be properly styled.
If you want to test this improved formatter, you should either also grab DarTar's GrabCodeHandler or comment out these lines and **un**comment line 337.''
%%(php;1)<?php
// This may look a bit strange, but all possible formatting tags have to be in a single regular expression for this to work correctly. Yup!
// #dotmg [many lines] : Unclosed tags fix!
// JavaWoman - corrected and improved unclosed tags handling, including missing ones and indents
// ------------- define the necessary functions -----------
if (!function_exists('close_indents'))
{
function close_indents(&$indentClosers,&$oldIndentLevel) # JW 2005-07-11 removed superfluous variables
{
$result='';
$c = count($indentClosers);
for ($i = 0; $i < $c; $i++)
{
$result .= array_pop($indentClosers);
$br = 0;
}
$oldIndentLevel = 0;
return $result;
}
}
if (!function_exists('wakka2callback'))
{
function wakka2callback($things)
{
$result='';
static $oldIndentType = ''; # JW 2005-07-12 added
static $oldIndentLevel = 0;
#static $oldIndentLength= 0; # JW 2005-07-12 removed superfluous variables
static $indentClosers = array();
static $indentTypes = array(); # JW 2005-07-12 added
#static $newIndentSpace = array(); # JW 2005-07-12 removed superfluous variables
static $br = 1;
static $trigger_bold = 0;
static $trigger_italic = 0;
static $trigger_keys = 0;
static $trigger_monospace = 0;
static $trigger_underline = 0;
static $trigger_notes = 0;
static $trigger_strike = 0;
static $trigger_inserted = 0;
static $trigger_deleted = 0;
static $trigger_center = 0;
static $trigger_floatl = 0;
static $trigger_floatr = 0; # JW added
static $trigger_l = array(-1, 0, 0, 0, 0, 0);
global $wakka; # @@@ should be capitalized but requires change in wikka.php (etc.)
if ((!is_array($things)) && ($things == 'closetags'))
{
$result .= close_indents($indentClosers,$oldIndentLevel); # JW 2005-07-11 removed superfluous variables
if ($trigger_bold % 2) $result .= '</strong>';
if ($trigger_italic % 2) $result .= '</em>';
if ($trigger_keys % 2) $result .= '</kbd>';
if ($trigger_monospace % 2) $result .= '</tt>';
if ($trigger_underline % 2) $result .= '</span>';
if ($trigger_notes % 2) $result .= '</span>';
if ($trigger_strike % 2) $result .= '</span>';
if ($trigger_inserted % 2) $result .= '</span>';
if ($trigger_deleted % 2) $result .= '</span>';
if ($trigger_center % 2) $result .= '</div>';
if ($trigger_floatl % 2) $result .= '</div>';
if ($trigger_floatr % 2) $result .= '</div>'; # JW added
for ($i = 1; $i<=5; $i ++)
{
if ($trigger_l[$i] % 2) $result .= ("</h$i>");
}
$trigger_bold = $trigger_italic = $trigger_keys = $trigger_monospace = 0;
$trigger_underline = $trigger_notes = $trigger_strike = $trigger_inserted = $trigger_deleted = 0;
$trigger_center = $trigger_floatl = $trigger_floatr = 0;
$trigger_l = array(-1, 0, 0, 0, 0, 0);
return $result;
}
else
{
$thing = $things[1];
}
// convert HTML thingies (including ampersand NOT part of entity)
if ($thing == '<')
return '&lt;';
else if ($thing == '>')
return '&gt;';
else if ($thing == '&')
return '&amp;';
// JW 2005-05-23: changed floats handling so they can be nested (one type within another only)
// float box left
else if ($thing == '<<')
{
#return (++$trigger_floatl % 2 ? '<div class="floatl">'."\n" : "\n</div>\n");
return (++$trigger_floatl % 2 ? '<div class="floatl">' : '</div>'); # JW changed (no newline)
}
// float box right
else if ($thing == '>>')
{
#return (++$trigger_floatl % 2 ? '<div class="floatr">'."\n" : "\n</div>\n");
return (++$trigger_floatr % 2 ? '<div class="floatr">' : '</div>'); # JW changed (trigger, no newline)
}
// clear floated box
else if ($thing == '::c::')
{
return ('<div class="clear">&nbsp;</div>'."\n");
}
// keyboard
else if ($thing == '#%')
{
return (++$trigger_keys % 2 ? '<kbd class="keys">' : '</kbd>');
}
// bold
else if ($thing == '**')
{
return (++$trigger_bold % 2 ? '<strong>' : '</strong>');
}
// italic
else if ($thing == '//')
{
return (++$trigger_italic % 2 ? '<em>' : '</em>');
}
// monospace
else if ($thing == '##')
{
return (++$trigger_monospace % 2 ? '<tt>' : '</tt>');
}
// underline
else if ($thing == '__')
{
return (++$trigger_underline % 2 ? '<span class="underline">' : '</span>');
}
// notes
else if ($thing == "''")
{
return (++$trigger_notes % 2 ? '<span class="notes">' : '</span>');
}
// strikethrough
else if ($thing == '++')
{
return (++$trigger_strike % 2 ? '<span class="strikethrough">' : '</span>');
}
// additions
else if ($thing == '&pound;&pound;')
{
return (++$trigger_inserted % 2 ? '<span class="additions">' : '</span>');
}
// deletions
else if ($thing == '&yen;&yen;')
{
return (++$trigger_deleted % 2 ? '<span class="deletions">' : '</span>');
}
// center
else if ($thing == '@@')
{
return (++$trigger_center % 2 ? '<div class="center">'."\n" : "\n</div>\n");
}
// urls
else if (preg_match('/^([a-z]+:\/\/\S+?)([^[:alnum:]^\/])?$/', $thing, $matches))
{
$url = $matches[1];
if (preg_match('/^(.*)\.(gif|jpg|png)/si', $url)) {
return '<img src="'.$url.'" alt="image" />'.$matches[2];
} else
// Mind Mapping Mod
if (preg_match('/^(.*)\.(mm)/si', $url)) {
return $wakka->Action('mindmap '.$url);
} else
return $wakka->Link($url).$matches[2];
}
// header level 5
else if ($thing == '==')
{
$br = 0;
return (++$trigger_l[5] % 2 ? '<h5>' : "</h5>\n");
}
// header level 4
else if ($thing == '===')
{
$br = 0;
return (++$trigger_l[4] % 2 ? '<h4>' : "</h4>\n");
}
// header level 3
else if ($thing == '====')
{
$br = 0;
return (++$trigger_l[3] % 2 ? '<h3>' : "</h3>\n");
}
// header level 2
else if ($thing == '=====')
{
$br = 0;
return (++$trigger_l[2] % 2 ? '<h2>' : "</h2>\n");
}
// header level 1
else if ($thing == '======')
{
$br = 0;
return (++$trigger_l[1] % 2 ? '<h1>' : "</h1>\n");
}
// forced line breaks
else if ($thing == "---")
{
return '<br />';
}
// escaped text
else if (preg_match('/^""(.*)""$/s', $thing, $matches))
{
/*
echo 'embedded content<br/>';
*/
// get config
# $allowed_double_doublequote_html = $wakka->GetConfigValue('double_doublequote_html');
$ddquotes_policy = $wakka->config['double_doublequote_html'];
/*
echo 'double quotes: '.$ddquotes_policy.'<br/>';
*/
// get embedded code
$embedded = $matches[1];
// handle embedded id attributes for 'safe' and 'raw'
if ($ddquotes_policy == 'safe' || $ddquotes_policy == 'raw')
{
// get tags with id attributes
$patTagWithId = '((<[a-z].*?)(id=("|\')(.*?)\\4)(.*?>))';
// with PREG_SET_ORDER we get an array for each match: easy to use with list()!
// we do the match case-insensitive so we catch uppercase HTML as well;
// SafeHTML will treat this but 'raw' may end up with invalid code!
$tags2 = preg_match_all('/'.$patTagWithId.'/i',$embedded,$matches2,PREG_SET_ORDER); # use backref to match both single and double quotes
/*
echo '# of matches (2): '.$tags2.'<br/>';
echo '<!--found (set order):'."\n";
print_r($matches2);
echo '-->'."\n";
*/
// step through code, replacing tags with ids with tags with new ('repaired') ids
$tmpembedded = $embedded;
$newembedded = '';
for ($i=0; $i < $tags2; $i++)
{
list(,$tag,$tagstart,$attrid,$quote,$id,$tagend) = $matches2[$i]; # $attrid not needed, just for clarity
$parts = explode($tag,$tmpembedded,2); # split in two at matched tag
if ($id != ($newid = $wakka->makeId('embed',$id))) # replace if we got a new value
{
/*
echo 'replacing tag - old id: '.$id.' new id: '.$newid.'<br/>';
*/
$tag = $tagstart.'id='.$quote.$newid.$quote.$tagend;
}
/*
echo "<!--old: $tag -->\n";
echo "<!--new: $replacetag -->\n";
*/
$newembedded .= $parts[0].$tag; # append (replacement) tag to first part
$tmpembedded = $parts[1]; # after tag: next bit to handle
}
$newembedded .= $tmpembedded; # add last part
/*
echo '<!--translation:'."\n";
echo $newembedded;
echo '-->'."\n";
*/
}
// return (treated) embedded content according to config
// NOTE: we apply SafeHTML *after* id treatment so it won't be throwing away invalid ids that we're repairing instead!
switch ($ddquotes_policy)
{
case 'safe':
return $wakka->ReturnSafeHTML($newembedded);
case 'raw':
return $newembedded; # may still be invalid code - 'raw' will not be corrected!
default:
return $wakka->htmlspecialchars_ent($embedded); # display only
}
}
// code text
else if (preg_match('/^% %(.*?)% %$/s', $thing, $matches))
{
/*
* Note: this routine is rewritten such that (new) language formatters
* will automatically be found, whether they are GeSHi language config files
* or "internal" Wikka formatters.
* Path to GeSHi language files and Wikka formatters MUST be defined in config.
* For line numbering (GeSHi only) a starting line can be specified after the language
* code, separated by a ; e.g., % %(php;27)....% %.
* Specifying >= 1 turns on line numbering if this is enabled in the configuration.
*/
$code = $matches[1];
// if configuration path isn't set, make sure we'll get an invalid path so we
// don't match anything in the home directory
$geshi_hi_path = isset($wakka->config['geshi_languages_path']) ? $wakka->config['geshi_languages_path'] : '/:/';
$wikka_hi_path = isset($wakka->config['wikka_highlighters_path']) ? $wakka->config['wikka_highlighters_path'] : '/:/';
// check if a language (and starting line) has been specified
if (preg_match("/^\((.+?)(;([0-9]+))??\)(.*)$/s", $code, $matches))
{
list(, $language, , $start, $code) = $matches;
}
// get rid of newlines at start and end (and preceding/following whitespace)
// Note: unlike trim(), this preserves any tabs at the start of the first "real" line
$code = preg_replace('/^\s*\n+|\n+\s*$/','',$code);
// check if GeSHi path is set and we have a GeSHi hilighter for this language
if (isset($language) && isset($wakka->config['geshi_path']) && file_exists($geshi_hi_path.'/'.$language.'.php'))
{
// use GeSHi for hilighting
$output = $wakka->GeSHi_Highlight($code, $language, $start);
}
// check Wikka highlighter path is set and if we have an internal Wikka hilighter
elseif (isset($language) && isset($wakka->config['wikka_formatter_path']) && file_exists($wikka_hi_path.'/'.$language.'.php') && 'wakka' != $language)
{
// use internal Wikka hilighter
$output = '<div class="code">'."\n";
$output .= $wakka->Format($code, $language);
$output .= "</div>\n";
}
// no language defined or no formatter found: make default code block;
// IncludeBuffered() will complain if 'code' formatter doesn't exist
else
{
$output = '<div class="code">'."\n";
$output .= $wakka->Format($code, 'code');
$output .= "</div>\n";
}
#return $output;
// START DarTar modified 2005-02-17
// slight mod JavaWoman 2005-06-12: coding style, class for form
//build form
$form = $wakka->FormOpen('grabcode','','post','','grabcode');
$form .= '<input type="submit" name="save" class="grabcodebutton" style="line-height:10px; float:right; vertical-align: middle; margin-right:20px; margin-top:0px; font-size: 10px; color: #000; font-weight: normal; font-family: Verdana, Arial, sans-serif; background-color: #DDD; text-decoration: none; height:18px;" value="Grab" title="Download this code" />';
$form .= '<input type="hidden" name="code" value="'.urlencode($code).'" />';
$form .= $wakka->FormClose();
// output
return $output."\n".$form;
// END DarTar modified 2005-02-17
}
// forced links
// \S : any character that is not a whitespace character
// \s : any whitespace character
else if (preg_match('/^\[\[(\S*)(\s+(.+))?\]\]$/s', $thing, $matches)) # recognize forced links across lines
{
list(, $url, , $text) = $matches;
if ($url)
{
//if ($url!=($url=(preg_replace("/@@|££||\[\[/","",$url))))$result="</span>";
if (!$text) $text = $url;
//$text=preg_replace("/@@|££|\[\[/","",$text);
return $result.$wakka->Link($url,'', $text);
}
else
{
return '';
}
}
// indented text
# JW FIXED 2005-07-09 accented chars not used for ordered lists
# JW FIXED 2005-07-12 this does not cover the case where a list item is followed by an inline comment of the *same* level
# JW FIXED 2005-07-12 as with the expression in the /edit handler this does not cover tab or ~ at the start of the document
elseif (preg_match('/(^|\n)([\t~]+)(-|&|[0-9a-zA-Z]+\))?(\n|$)/s', $thing, $matches))
{
$br = 0; # no break needed after a block
// get new indent level
$newIndentLevel = strlen($matches[2]); # JW 2005-07-12 also match tab or ~ at start of document
// derive code indent
$codeIndent = str_repeat("\t",$newIndentLevel-1);
$nlTabs = "\n".$codeIndent;
$nlTabsOut = $nlTabs."\t";
// find out which indent type we want
$newIndentType = $matches[3]; # JW 2005-07-12 also match tab or ~ at start of document
// derive code fragments
if ($newIndentType == '') # plain indent
{
$opener = '<div class="indent">';
$closer = '</div>'/*.$nlTabs*/;
}
elseif ($newIndentType == '-') # unordered list
{
$opener = '<ul>'.$nlTabs.'<li>';
$closer = '</li>'.$nlTabs.'</ul>';
}
elseif ($newIndentType == '&') # inline comment
{
$opener = '<ul class="thread">'.$nlTabs.'<li>';
$closer = '</li>'.$nlTabs.'</ul>';
}
else # ordered list
{
$opener = '<ol type="'.substr($newIndentType, 0, 1).'">'.$nlTabs.'<li>';
$closer = '</li>'.$nlTabs.'</ol>';
$newIndentType = 'o';
}
// do an indent
if ($newIndentLevel > $oldIndentLevel)
{
for ($i = 0; $i < $newIndentLevel - $oldIndentLevel; $i++)
{
$result .= $nlTabs./*'<!--nested item '.$newIndentLevel.'-->'.*/$opener;
array_push($indentClosers, $closer);
#$result .= '<!--pushed type: '.$oldIndentType.' -->'; # @@@
array_push($indentTypes, $oldIndentType); # remember type hierarchically
}
}
// do an outdent or stay at the same level
else if ($newIndentLevel <= $oldIndentLevel)
{
$bOutdent = FALSE;
if ($newIndentLevel < $oldIndentLevel)
{
$bOutdent = TRUE; # remember we're outdenting, for correct layout
// do the outdenting
for ($i = 0; $i < $oldIndentLevel - $newIndentLevel; $i++)
{
if ($i > 0)
{
$result .= $nlTabsOut;
}
$result .= array_pop($indentClosers)/*.'<!--outdent to '.$newIndentLevel.'-->'*/;
$oldIndentType = array_pop($indentTypes); # make sure we will compare with "correct" previous type
#$result .= '<!--popped type: '.$oldIndentType.' -->'; # @@@
}
}
if ($bOutdent) # outdenting: put close tag on new line
{
$result .= $nlTabs/*.'<!--outdent: close tag on new line-->'*/;
}
// JW 2005-07-11 new item of different type
if ($newIndentType != $oldIndentType)
{
$result .= array_pop($indentClosers);
$result .= /*'<!--type change follows (old: '.$oldIndentType.' new: '.$newIndentType.') -->'.*/$nlTabs.$opener;
array_push($indentClosers, $closer);
}
// new item of same type
else
{
// plain indent
if ($newIndentType == '')
{
$result .= $closer./*'<!--same type ('.$newIndentType.') same level-->'.*/$nlTabs.$opener;
}
// list or inline comment
else
{
$result .= '</li>'.$nlTabs.'<li>'/*.'<!--back to same type-->'*/;
}
}
}
$oldIndentType = $newIndentType; # remember type sequentially
$oldIndentLevel = $newIndentLevel;
return $result;
}
// new lines
else if ($thing == "\n")
{
// if we got here, there was no tab (or ~) in the next line; this means that we can close all open indents.
// JW: we need to do the same thing at the end of the page to close indents NOT followed by newline: use a function
/*
$c = count($indentClosers);
for ($i = 0; $i < $c; $i++)
{
$result .= array_pop($indentClosers);
$br = 0;
}
$oldIndentLevel = 0;
#$oldIndentLength= 0; # superfluous
#$newIndentSpace=array(); # superfluous
*/
$result .= close_indents($indentClosers,$oldIndentLevel); # JW 2005-07-11 removed superfluous variables
$result .= ($br) ? "<br />\n" : "\n";
$br = 1;
return $result;
}
// Actions
else if (preg_match('/^\{\{(.*?)\}\}$/s', $thing, $matches))
{
if ($matches[1])
return $wakka->Action($matches[1]);
else
return '{{}}';
}
// interwiki links!
else if (preg_match('/^[A-ZÄÖÜ][A-Za-zÄÖÜßäöü]+[:]\S*$/s', $thing))
{
return $wakka->Link($thing);
}
// wiki links!
else if (preg_match('/^[A-ZÄÖÜ]+[a-zßäöü]+[A-Z0-9ÄÖÜ][A-Za-z0-9ÄÖÜßäöü]*$/s', $thing))
{
return $wakka->Link($thing);
}
// separators
else if (preg_match('/-{4,}/', $thing, $matches))
{
// TODO: This could probably be improved for situations where someone puts text on the same line as a separator.
// Which is a stupid thing to do anyway! HAW HAW! Ahem.
$br = 0;
return "<hr />\n";
}
// mind map xml
else if (preg_match('/^<map.*<\/map>$/s', $thing))
{
return $wakka->Action('mindmap '.$wakka->Href().'/mindmap.mm');
}
// if we reach this point, it must have been an accident.
// @@@ JW: or a detailed regex that excludes something that was included in the
// preg_replace_callback expression
return $thing;
}
}
if (!function_exists('wakka3callback'))
{
/**
* "Afterburner" formatting: extra handling of already-generated XHTML code.
*
* 1.
* Ensure every heading has an id, either specified or generated. (May be
* extended to generate section TOC data.)
* If an id is specified, that is used without any modification.
* If no id is specified, it is generated on the basis of the heading content:
* - any image tag is replaced by its alt text (if specified)
* - all tags are stripped
* - all characters that are not valid in an id are stripped (except whitespace)
* - the resulting string is then used by makeId() to generate an id out of it
*
* @access private
* @uses Wakka::makeId()
*
* @param array $things required: matches of the regex in the preg_replace_callback
* @return string heading with an id attribute
*/
function wakka3callback($things)
{
global $wakka;
$thing = $things[1];
// heading
if (preg_match('#^<(h[1-6])(.*?)>(.*?)</\\1>$#s', $thing, $matches)) # note that we don't match headings that are not valid XHTML!
{
/*
echo 'heading:<pre>';
print_r($matches);
echo '</pre>';
*/
list($element,$tagname,$attribs,$heading) = $matches;
#if (preg_match('/(id=("|\')(.*?)\\2)/',$attribs,$matches)) # use backref to match both single and double quotes
if (preg_match('/(id=("|\')(.*?)\\2)/',$attribs)) # use backref to match both single and double quotes
{
// existing id attribute: nothing to do (assume already treated as embedded code)
// @@@ we *may* want to gather ids and heading text for a TOC here ...
// heading text should then get partly the same treatment as when we're creating ids:
// at least replace images and strip tags - we can leave entities etc. alone - so we end up with
// plain text-only
// do this if we have a condition set to generate a TOC
return $element;
}
else
{
// no id: we'll have to create one
#echo 'no id provided - create one<br/>';
$tmpheading = trim($heading);
// first find and replace any image with its alt text
// @@@ can we use preg_match_all here? would it help?
while (preg_match('/(<img.*?alt=("|\')(.*?)\\2.*?>)/',$tmpheading,$matches))
{
#echo 'image found: '.$tmpheading.'<br/>';
# 1 = whole element
# 3 = alt text
list(,$element, ,$alttext) = $matches;
/*
echo 'embedded image:<pre>';
print_r($matches);
echo '</pre>';
*/
// gather data for replacement
$search = '/'.preg_quote($element,'/').'/'; # whole element; preg_quote() escapes regex metacharacters and the delimiter
$replace = trim($alttext); # alt text
/*
echo 'pat_repl:<pre>';
echo 'search: '.$search.'<br/>';
echo 'replace: '.$replace.'<br/>';
echo '</pre>';
*/
// now replace img tag by corresponding alt text
$tmpheading = preg_replace($search,$replace,$tmpheading); # replace image by alt text
}
$headingtext = $tmpheading;
#echo 'headingtext (no img): '.$headingtext.'<br/>';
// @@@ 2005-05-27 now first replace linebreaks <br/> with spaces!!
// remove all other tags
$headingtext = strip_tags($headingtext);
#echo 'headingtext (no tags): '.$headingtext.'<br/>';
// @@@ this all-text result is usable for a TOC!!!
// do this if we have a condition set to generate a TOC
// replace entities that can be interpreted
// use default charset ISO-8859-1 because other chars won't be valid for an id anyway
$headingtext = html_entity_decode($headingtext,ENT_NOQUOTES);
// remove any remaining entities (so we don't end up with strange words and numbers in the id text)
$headingtext = preg_replace('/&#?[A-Za-z0-9]+;/','',$headingtext); # match entity names/numbers only, so a lone & cannot swallow text up to the next semicolon
#echo 'headingtext (entities decoded/removed): '.$headingtext.'<br/>';
// finally remove non-id characters (except whitespace which is handled by makeId())
$headingtext = preg_replace('/[^A-Za-z0-9_:.\s-]/','',$headingtext); # hyphen placed last so it cannot be read as a range
#echo 'headingtext (id-ready): '.$headingtext.'<br/>';
// now create id based on resulting heading text
$id = $wakka->makeId('hn',$headingtext);
#echo 'id: '.$id.'<br/>';
// rebuild element, adding id
return '<'.$tagname.$attribs.' id="'.$id.'">'.$heading.'</'.$tagname.'>';
}
}
// other elements to be treated go here (tables, images, code sections...)
}
}
// ------------- do the work -----------
$text = str_replace("\r\n", "\n", $text);
// replace 4 consecutive spaces at the beginning of a line with tab character
// $text = preg_replace("/\n[ ]{4}/", "\n\t", $text); // moved to edit.php
if ($this->method == 'show') $mind_map_pattern = '<map.*?<\/map>|'; else $mind_map_pattern = '';
// define entity patterns
// NOTE most also used in wikka.php for htmlentities_ent(): REGEX library!
$alpha = '[a-z]+'; # character entity reference
$numdec = '#[0-9]+'; # numeric character reference (decimal)
$numhex = '#x[0-9a-f]+'; # numeric character reference (hexadecimal)
$terminator = ';|(?=($|[\n<]|&lt;))'; # semicolon; or end-of-string, newline or tag
$entitypat = '('.$alpha.'|'.$numdec.'|'.$numhex.')('.$terminator.')'; # defines entity pattern without the starting &
$entityref = '&'.$entitypat; # entity reference
$loneamp = '&(?!'.$entitypat.')'; # ampersand NOT part of an entity
$this->callLevel++; # JW 2005-07-15 recursion level: getting in
$text = preg_replace_callback(
'/('.
'% %.*?% %|'. # code
'"".*?""|'. # literal
$mind_map_pattern.
'\[\[[^\[]*?\]\]|'. # forced link
'-{4,}|---|'. # separator, new line
'\b[a-z]+:\/\/\S+|'. # URL
'\*\*|\'\'|\#\#|\#\%|@@|::c::|\>\>|\<\<|££|¥¥|\+\+|__|\/\/|'. # Wiki markup
'======|=====|====|===|==|'. # headings
'(^|\n)([\t~]+)(-|&|[0-9a-zA-Z]+\))?|'. # indents and lists # JW FIXED 2005-07-12 also match tab or ~ at start of document
'\{\{.*?\}\}|'. # action
'\b[A-ZÄÖÜ][A-Za-zÄÖÜßäöü]+[:](?![=_])\S*\b|'. # InterWiki link
'\b([A-ZÄÖÜ]+[a-zßäöü]+[A-Z0-9ÄÖÜ][A-Za-z0-9ÄÖÜßäöü]*)\b|'. # CamelWords
'<|>|'. # HTML special chars - after wiki markup!
$loneamp.'|'. # HTML special chars - ampersand NOT part of an entity
'\n'. # new line
')/ms','wakka2callback',$text);
// we're cutting the last <br />
$text = preg_replace('/<br \/>$/','',$text);
$this->callLevel--; # JW 2005-07-15 recursion level: getting out
if ($this->callLevel == 0) # JW 2005-07-15 only for "outmost" call level
{
$text .= wakka2callback('closetags'); # JW changed logic
}
// add ids to heading elements
// @@@ LATER:
// - extend with other elements (tables, images, code blocks)
// - also create array(s) for TOC(s)
$idstart = getmicrotime();
$text = preg_replace_callback(
'#('.
'<h[1-6].*?>.*?</h[1-6]>'.
// other elements to be treated go here
')#ms','wakka3callback',$text);
printf('<!-- Header id generation took %.6f seconds -->', (getmicrotime() - $idstart));
echo $text;
?>
%%
''Make sure you replace every occurrence of '**##% %##**' in this code with '**##""%%""##**'!''
====Supporting code====
Only a single **new** [[WikkaCore core]] method is needed for this improved formatter (other new functions are part of the formatter script itself):
===##""makeId()""##===
Used here both for handling ids in embedded HTML code and for generating a unique id for headings; see GenerateUniqueId for the code and where to insert it.
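The steps that turn heading markup into the text handed to ##""makeId()""## can be tried outside Wikka. The following is a minimal sketch, not the formatter's own code: the function name ##heading_id_text()## is hypothetical, images are replaced in a single ##preg_replace()## call instead of a loop, and the entity-removal pattern is assumed to cover entity names and numbers only.
%%(php)
<?php
/**
 * Hypothetical helper (not part of Wikka): reduce heading markup to the
 * plain text from which an id can be built.
 */
function heading_id_text($heading)
{
	$text = trim($heading);
	// replace each image by its alt text; \1 matches the opening quote style
	$text = preg_replace('/<img\b[^>]*?\balt=("|\')(.*?)\1[^>]*>/s', '$2', $text);
	// strip all remaining tags
	$text = strip_tags($text);
	// decode the entities that ISO-8859-1 can represent
	$text = html_entity_decode($text, ENT_NOQUOTES, 'ISO-8859-1');
	// drop entities that are still left (assumed pattern: name or number only)
	$text = preg_replace('/&#?[A-Za-z0-9]+;/', '', $text);
	// keep only characters valid in an id, plus whitespace (hyphen last in the class)
	return preg_replace('/[^A-Za-z0-9_:.\s-]/', '', $text);
}

// example call with an embedded image, inline markup and an entity
echo heading_id_text('Intro <img src="logo.png?v=2" alt="Wikka logo" /> &amp; <em>Setup</em>');
%%
In the formatter itself, the resulting string is then passed on as ##""$wakka->makeId('hn',$headingtext)""##, which takes care of whitespace and uniqueness.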
====Todo====
==Bugs==
~-Solve conflict with ""::c::"" appearing right after a page name being interpreted as an interwiki link (see comment by TimoK below)
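The conflict can be reproduced with the interwiki alternative of the main pattern on its own. The sketch below uses ASCII-only copies of that alternative (umlauts left out for brevity); the second pattern, which adds the colon to the existing negative lookahead, is only an assumed fix and has not been tested against the full formatter.
%%(php)
<?php
// ASCII-only copy of the InterWiki alternative from the formatter's main pattern
$current  = '/\b[A-Z][A-Za-z]+[:](?![=_])\S*\b/';
// assumed fix (hypothetical): also reject a second colon right after the first
$proposed = '/\b[A-Z][A-Za-z]+[:](?![=_:])\S*\b/';

// compare both patterns on a page name followed by ::c:: and on a real interwiki link
foreach (array('HomePage::c::', 'Wikipedia:PHP') as $sample)
{
	printf("%s: current=%d proposed=%d\n", $sample,
		preg_match($current, $sample), preg_match($proposed, $sample));
}
%%
With these patterns, the current expression matches ##""HomePage::c""## as an interwiki link, while the assumed fix leaves ##""HomePage::c::""## alone and still matches ##""Wikipedia:PHP""##.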
==Extensions==
~-The obvious next step would be to add code that generates a [[TableofcontentsAction page TOC]]
~-Handling tables, images and code blocks in a similar way (to create separate TOCs on request) would also be nice
~-Find a way to generate proper paragraphs (possibly based on [[IanAndolina]]'s [[SemanticMarkup method]] - see comments below)
~-When embedded HTML code contains an id and these are //changed// to avoid duplicates, any references should be changed accordingly (not so easy)
~-Later (much later) a complete rewrite will be needed to handle closing tags better, ensure valid XHTML, and generate proper paragraphs instead of text separated by <br /> tags, which is not very structural markup (and is bad for accessibility).
====Test? Comments?====
Go ahead and test it - either on your own Wikka installation or on this site, where it is now installed as a [[WikkaBetaFeatures beta feature]].
Comments and suggestions are more than welcome, as always.
----
CategoryDevelopmentFormatters