Revision [5963]

This is an old revision of SemanticMarkup made by NilsLindenberg on 2005-02-14 14:07:52.

 

Trying to Get Wikka to Produce More Semantic Markup


Because of the fast'n'furious approach of the original Wakka, there is still a lot to be improved in terms of Wikka producing semantically meaningful code. I do not mean (X)HTML validation, but the deeper concept of using minimal, yet meaningful XHTML elements.

The biggest crime of Wakka (and therefore Wikka) is the lack of any paragraphs and, instead, the sprinkling of redundant <br /> everywhere. Chunks of text sit in the document with no meaningful structure other than being contained by a parent element. The use of <br /> should be strictly minimal, as it pollutes the document with presentational elements. CSS is designed to deal with exactly such presentational issues, and XHTML should be left for the information.

A second issue is the choice of elements to use. XHTML has a range of elements that cover several needs. For example, instead of using:
<span class="additions">Some added text</span>

it always makes more sense to use:
<ins>Some added text</ins>


As another example, let's look at the RecentlyCommented output:
<br />
<strong>Sat, 12 Feb 2005:</strong><br />
    <a href="http://wikka.jsnx.com/WikkaLogicalDataModel?show_comments=1&showall=1#comment_1575" title="View comment">WikkaLogicalDataModel</a>, comment by NilsLindenberg: <br />
         <em>I had more or less the historical thought in mind :) Now you could add info about a changed ownership into the page histor<a href="http://wikka.jsnx.com/WikkaLogicalDataModel?show_comments=1&showall=1#comment_1575" title="View comment">[.... ]</a></em><br />

That would be much cleaner marked up as:
<h3>Sat, 12 Feb 2005:</h3>
<ul>
<li><a href="http://wikka.jsnx.com/WikkaLogicalDataModel?show_comments=1&showall=1#comment_1575" title="View comment">WikkaLogicalDataModel</a>, comment by NilsLindenberg:
<blockquote>I had more or less the historical thought in mind :) Now you could add info about a changed ownership into the page histor<a href="http://wikka.jsnx.com/WikkaLogicalDataModel?show_comments=1&showall=1#comment_1575" title="View comment">[.... ]</a></blockquote>
</li>
</ul>


Our wiki (a modified Wakka, though I try to keep up to date with some Wikka changes) is currently going through such purification. As an example you can see our Page Index. Note that because this wiki is predominantly used by Opera users, we have used some advanced CSS only supported by Opera (CSS3 media queries etc.). We have yet to add IE fixes, so it is broken there at the moment, but good engines like Gecko should display it fine. On the page, notice that the XHTML is quite minimal, yet the styling is very rich. It is a work in progress and not yet finished.

Action:

As a first contribution, I have modified the Wakka formatter to output proper paragraph markup instead of polluted anonymous text sections. You can see a demo of it here: Parser demo. It simply runs two RegExps after the main Wakka callback to find (1) lines of text with no elements and (2) lines starting with elements that signify inline content, and wraps each of them in a paragraph. I have had to make some minor and easy changes to the Wakka callback so that single newlines are still converted to <br /> while runs of two or more newlines are not. In code blocks, actions and escaped text I convert all newlines to \r as a precaution, to stop whitespace modification.

This is before the main callback:

//Start off using consistent line endings
$text = str_replace("\r", "", $text);
//This converts all single line-breaks into \r's so formatter can single them out from 2+ linebreaks
$text = preg_replace("/(?<!\n)\n(?!(\n|~|\"\"|%@|\t|<|>))/","\r",$text);
$text = "\n".trim($text)."\n";
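
To make the effect of this step easier to see, here is a minimal sketch that wraps the same three statements in a function and runs them on a small sample. The function name mark_single_line_breaks and the sample string are invented for illustration, and the type hints assume PHP 7 or later; only the statements inside the function come from the code above.

```php
<?php
// Hypothetical helper name; the body is the pre-processing step shown above.
function mark_single_line_breaks(string $text): string
{
    // Start off using consistent line endings.
    $text = str_replace("\r", "", $text);
    // A lone "\n" (no "\n" directly before or after it, and not followed by
    // one of the excluded markers ~ "" %@ tab < >) is turned into "\r".
    $text = preg_replace("/(?<!\n)\n(?!(\n|~|\"\"|%@|\t|<|>))/", "\r", $text);
    return "\n" . trim($text) . "\n";
}

// Invented sample: two lines of one paragraph (with a Windows line ending),
// a blank line, then a second paragraph.
$sample = "line one\r\nline two\n\nnew para";
$marked = mark_single_line_breaks($sample);

// json_encode makes the control characters visible.
echo json_encode($marked), "\n"; // "\nline one\rline two\n\nnew para\n"
```

The single line break inside the first paragraph is now a \r, while the blank line between the paragraphs is still two \n characters, so later steps can tell the two cases apart.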


The main callback is as normal (my formatting is a bit different from Wikka's). Notice that the last line of the pattern has been split into multiple \n and \r; the two cases are distinguished inside the callback (not shown):
$text = preg_replace_callback(
    "/(\%\%.*?\%\%|".
    "\"\".*?\"\"|".
    "\[\[.*?\]\]|".
    "\{\-\{.*?\}\-\}|\{\{.*?\}\}|".
    "\b[a-z]+:\/\/\S+|".
    "\*\*|\'\'|\#\#|\#\%|@@|::c::|::p::|\>\>|\<\<|&pound;&pound;|\+\+|__|<|>|\/\/|".
    "======|=====|====|===|==|".
    "-{4,}|---|".
    "\n([\t~]+)(-|[0-9a-zA-Z]+\))?|".
    "\b[A-ZÄÖÜ][A-Za-zÄÖÜßäöü]+[:]([A-Za-z0-9ÄÖÜßäöü]*)\b|".
    "\b([A-ZÄÖÜ][a-zßäöü]+[A-Z0-9ÄÖÜ][A-Za-z0-9ÄÖÜßäöü]*)\b|".
    "(?=\n+)\n|\r)/ms", "wakka2callback", $text);


The final piece of the puzzle is the two additional RegExps and some clean-up lines (note that this adds up to a tenth of the time taken by the main callback, so the performance impact is small):
if ($para == True) {
    //matches any line with no <element> (and variable leading space) - assume a paragraph
    $pattern1 = "/^(\040|\t)*(?!<|&|\t|\040)(.+)$/m";
    $replace1 = "\n<p>\\2</p>\n";
    $text = preg_replace($pattern1,$replace1,$text);
   
    //matches any <element> text lines not considered block formatting:
    $pattern2 = "/^(\040|\t)*(<(?=(a|em|b[r>]|i.[^p]|s[mptu]|kbd|tt|del)).*)$/m";
    $replace2 = "\n<p>\\2</p>\n";
    $text = preg_replace($pattern2,$replace2,$text);
}
$text = str_replace("<p><br />","<p>",$text); //get rid of leading line breaks
$text = str_replace("<p></p>","",$text); //get rid of empty paragraphs
$text = preg_replace("/\n{2,}/","\n",$text); //collapse runs of 2+ newlines into one
$text = str_replace("\r", "\n", $text); //convert back \r for consistent line ends
echo (trim($text));
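
As a rough illustration of what these lines do, the sketch below puts the same two patterns and clean-up calls into a function and feeds it a small piece of formatter-style output. The function name wrap_paragraphs and the sample input are invented for illustration (the real output of the modified callback is not shown on this page), and the type hints assume PHP 7 or later.

```php
<?php
// Hypothetical wrapper name; patterns and clean-up calls are the ones above.
function wrap_paragraphs(string $text, bool $para = true): string
{
    if ($para) {
        // Any line that does not start with an element or entity: a paragraph.
        $pattern1 = "/^(\040|\t)*(?!<|&|\t|\040)(.+)$/m";
        $text = preg_replace($pattern1, "\n<p>\\2</p>\n", $text);
        // Any line that starts with an element treated as inline content.
        $pattern2 = "/^(\040|\t)*(<(?=(a|em|b[r>]|i.[^p]|s[mptu]|kbd|tt|del)).*)$/m";
        $text = preg_replace($pattern2, "\n<p>\\2</p>\n", $text);
    }
    $text = str_replace("<p><br />", "<p>", $text); // leading line breaks
    $text = str_replace("<p></p>", "", $text);      // empty paragraphs
    $text = preg_replace("/\n{2,}/", "\n", $text);  // collapse newline runs
    $text = str_replace("\r", "\n", $text);         // restore single breaks
    return trim($text);
}

// Invented sample: a heading, a two-line paragraph whose single line break is
// still marked with "\r", a line starting with an inline element, and a list.
$html = "<h3>Title</h3>\nFirst line<br />\rsecond line\n\n"
      . "<em>Note:</em> inline start\n<ul><li>item</li></ul>\n";

$wrapped = wrap_paragraphs($html);
echo $wrapped, "\n";
// <h3>Title</h3>
// <p>First line<br />
// second line</p>
// <p><em>Note:</em> inline start</p>
// <ul><li>item</li></ul>
```

Because . does not match \n but does match \r, the two lines joined by \r stay inside a single paragraph, which is why the earlier step marks single line breaks that way. Block-level lines such as the heading and the list are left alone by both patterns.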


To be able to easily switch this on:
function Format($text, $formatter = "wakka", $para = True)

The Format function can be called using $para set to true or false. This allows you to maximise compatibility with older actions that don't expect proper semantic code back.

Anyone who wants the whole wakka.php let me know and I'll add it here.

Help:
I would love those with much better RegExp fu than I to suggest improvements to the RegExps, and any other ideas. This is unfinished, and the general question of semantic improvement is a long road.
