Revision [2742]

This is an old revision of LinkRewriting made by DarTar on 2004-12-02 09:48:59.

 

Last edited by DarTar:
Help needed on regex!
Thu, 02 Dec 2004 09:48 UTC


I'm opening this page to discuss a regex issue I ran into while developing the FetchRemote action (see IncludeRemote).
The aim of this action is to fetch raw page content from a remote Wikka server and rewrite it before printing it on screen.

1. What is raw content


Raw content is the source code of a Wikka page, including its WikkaSyntax tags (see FormattingRules). For example, the raw content of a page like WikiEngine is:

[[HelpInfo Wikka Documentation]]
----
===== What is a Wiki? =====

A **wiki** (pronounced "wicky" or "weeky" or "viki") is a website (or other hypertext document 
collection) that allows any user to add content, but also allows that content to be edited by any 
other user while keeping track of the different versions.
In short, a Wiki is one of the most powerful tools for **web-based collaborative editing**.

A WikiEngine is the software used to create and run such websites. For instance, this wiki runs on 
the [[HomePage WikkaWiki]] engine.


<<More information on Wikis is available on: [[http://en.wikipedia.org/wiki/Wiki Wikipedia]]<<


----
CategoryDocumentation - CategoryWiki


The new showpagecode handler (Mod042fShowPageCodeHandler) lets you display the raw content of any page by appending /showpagecode to its name in the URL:
http://wikka.jsnx.com/WikiEngine/showpagecode
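
As a rough illustration of how the action might obtain the raw content, here is a minimal sketch. Both function names are hypothetical, and it assumes the remote server uses URLs of the form <base>/<PageName>/showpagecode, as in the example above. I have not checked whether the handler returns plain text or wraps the code in HTML, so the fetched string may need further cleanup.

```php
<?php
// Hypothetical helper: builds the showpagecode URL for a page on a
// remote Wikka server. Assumes URLs of the form
// <base>/<PageName>/showpagecode, as in the example above.
function showpagecode_url($base, $page)
{
    return rtrim($base, '/') . '/' . rawurlencode($page) . '/showpagecode';
}

// Hypothetical helper: returns the fetched content as a string, or
// false on failure. Requires allow_url_fopen to be enabled in php.ini.
function fetch_raw_content($base, $page)
{
    return @file_get_contents(showpagecode_url($base, $page));
}

echo showpagecode_url('http://wikka.jsnx.com/', 'WikiEngine'), "\n";
// prints: http://wikka.jsnx.com/WikiEngine/showpagecode
```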


The FetchRemote action has to parse the raw content of the fetched page and rewrite its internal links in a specific way.
Two kinds of links have to be rewritten: forced internal links and CamelCase links.
For the action to work properly, forced internal links and CamelCase links in the fetched page should be rewritten, respectively, as follows:

[[HelpInfo A good link]] => <a href="FetchRemote?page=HelpInfo">A good link</a>


HelpInfo => <a href="FetchRemote?page=HelpInfo">HelpInfo</a>




To do so, I use the PHP preg_replace() function. I've almost managed to get both of the above cases parsed correctly using the following patterns:

$forced = "/\[\[([^ \/]+) ([^\]]+)\]\]/";
$camel = "/[^a-z=>\"\[\/\{]([A-Z]+[a-z]+[A-Z][A-Za-z0-9]+)+/";


I then rewrite the raw page content ($content) by applying preg_replace() twice:

// rewrite forced links
$content = preg_replace($forced, "\"\"<a href='".$this->Href("","","page=\\1")."'>\\2</a>\"\"", $content);

// rewrite camelcase links
$content = preg_replace($camel, "\"\" <a href='".$this->Href("","","page=\\1")."'>\\1</a>\"\"", $content);
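
To try the two passes outside Wikka, here is a minimal standalone sketch. It makes two assumptions. First, fetchremote_href() is a hypothetical stand-in for $this->Href("","","page=..."), and the real method may build a different URL. Second, because the stand-in returns a relative URL that directly follows the single quote of the href attribute, I have added ' to the excluded characters of the camelcase pattern; otherwise FetchRemote itself would be picked up by the second pass.

```php
<?php
// Hypothetical stand-in for $this->Href("", "", "page=...").
// The real Wikka method may produce a different URL.
function fetchremote_href($page)
{
    return 'FetchRemote?page=' . $page;
}

function rewrite_links_two_pass($content)
{
    $forced = '/\[\[([^ \/]+) ([^\]]+)\]\]/';
    // Same pattern as above, with ' added to the excluded characters
    // (an adjustment made for this sketch only).
    $camel  = '/[^a-z=>"\'\[\/\{]([A-Z]+[a-z]+[A-Z][A-Za-z0-9]+)+/';

    // Pass 1: rewrite forced links.
    $content = preg_replace(
        $forced,
        '""<a href=\'' . fetchremote_href('\\1') . '\'>\\2</a>""',
        $content
    );

    // Pass 2: rewrite camelcase links. The pattern also consumes the
    // character before the word, which is why the replacement contains
    // a space.
    $content = preg_replace(
        $camel,
        '"" <a href=\'' . fetchremote_href('\\1') . '\'>\\1</a>""',
        $content
    );

    return $content;
}

echo rewrite_links_two_pass('[[HelpInfo A good link]]'), "\n";
// prints: ""<a href='FetchRemote?page=HelpInfo'>A good link</a>""
echo rewrite_links_two_pass('See HelpInfo for details'), "\n";
// prints: See"" <a href='FetchRemote?page=HelpInfo'>HelpInfo</a>"" for details
```

Note that, because the camelcase pattern needs one character in front of the word, a WikiWord at the very start of the content is not matched by the second pass.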

     

4. Tricky cases

 
The link rewriting rules above work fine in most cases. What they still cannot handle are cases in which a WikiWord appears inside the anchor text of a forced internal link, for example:
[[HelpInfo This is the homepage of the WikkaWiki Documentation Project]]
Clearly, WikkaWiki should NOT be rewritten in this case: it is not a link, but part of the anchor text of a link.

If you take, for example, the raw content of WikiEngine displayed above, the preg_replace() patterns I'm using won't handle a link like [[HomePage WikkaWiki]] properly.

After the first preg_replace() application (forced link rewriting) this code is correctly rendered as:
""<a href='FetchRemote?page=HomePage'>WikkaWiki</a>""

But after the second preg_replace() application (camelcase links rewriting), this will be rendered as:
""<a href='FetchRemote?page=HomePage'>""<a href='FetchRemote?page=WikkaWiki'>WikkaWiki</a>""</a>""


5. Million-dollar question


Now, here comes the big question.
How can I make the camelcase rewriting rule match and rewrite every camelcase-formatted string except those that appear in the anchor text of an already rewritten link?

The question is tricky. In the above example, camelcase words that appear in the URI, like FetchRemote or HomePage, are easily dealt with by excluding camelcase words adjacent to characters like ", =, ' etc. A camelcase word within the anchor text, however, can be preceded and followed by other text, for example:

""<a href='FetchRemote?page=HomePage'>Here's some text preceding WikkaWiki, which is in turn followed by other text</a>""


How do I exclude WikkaWiki from being rewritten?
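
One direction that might be worth trying, sketched here under assumptions and not tested inside Wikka: do both rewrites in a single pass with preg_replace_callback(), using one pattern that matches either a whole forced link or a bare camelcase word. Since the regex engine consumes a forced link in one go, a WikiWord inside its anchor text is never offered to the camelcase branch. The function names are hypothetical, the URL is hard-coded instead of coming from $this->Href(), a lookbehind replaces the consumed leading character, and the closure syntax assumes PHP 5.3 or later.

```php
<?php
// Hypothetical helper: builds the rewritten link. In the real action
// the URL would come from $this->Href("", "", "page=...").
function fetchremote_link($page, $text)
{
    return '""<a href=\'FetchRemote?page=' . $page . '\'>' . $text . '</a>""';
}

function rewrite_links_single_pass($content)
{
    // Branch 1: forced link [[PageName anchor text]] (groups 1 and 2).
    // Branch 2: bare camelcase word (group 3), not preceded by any of
    //           the characters excluded in the original pattern.
    $pattern = '/\[\[([^ \/]+) ([^\]]+)\]\]'
             . '|(?<![a-z=>"\[\/\{])([A-Z]+[a-z]+[A-Z][A-Za-z0-9]+)/';

    return preg_replace_callback($pattern, function ($m) {
        if (!empty($m[3])) {
            // Camelcase branch matched: page name and anchor text coincide.
            return fetchremote_link($m[3], $m[3]);
        }
        // Forced-link branch matched: the anchor text is copied as-is,
        // so any WikiWord inside it is left alone.
        return fetchremote_link($m[1], $m[2]);
    }, $content);
}

echo rewrite_links_single_pass(
    '[[HelpInfo This is the homepage of the WikkaWiki Documentation Project]]'
), "\n";
// prints: ""<a href='FetchRemote?page=HelpInfo'>This is the homepage of the WikkaWiki Documentation Project</a>""
```

Applied to "this wiki runs on the [[HomePage WikkaWiki]] engine, see HelpInfo.", this should rewrite the forced link once and HelpInfo once, without nesting one anchor inside another.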

Thanks if you had the patience to read this long and boring page.

-- DarTar

CategoryRegex CategoryDevelopment