Fetching Remote Wikka Content

Last edited by DarTar:
Modified links pointing to docs server
Mon, 28 Jan 2008 00:14 UTC


FetchRemote v.0.6 available for testing
Download the source and save it as:
actions/fetchremote.php
Feedback is welcome!

See also:

FetchRemote Action

Version 0.7

Note:
JavaWoman has done a huge amount of work improving and debugging the link rewrite engine, which now works almost perfectly.
I hope she won't mind if I post the 0.7 'debugging version' of the code here ;)

What it does

How to use it
Simply add {{fetchremote}} in one of your pages.
You can specify a starting page by adding: {{fetchremote page="HomePage"}}
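
The starting page is picked in a fixed order in the code further down: the built-in default (WikkaDocumentation), then the action's page parameter, then a page request parameter. A minimal standalone sketch of that precedence follows; the helper name resolve_start_page() is an assumption made for illustration and is not part of the action.

```php
<?php
// Hypothetical helper (not part of fetchremote.php) mirroring the order in
// which the action picks its starting page.
function resolve_start_page($actionParam, array $request, $default = 'WikkaDocumentation')
{
    $page = $default;                   // built-in default
    if (isset($actionParam)) {
        $page = $actionParam;           // {{fetchremote page="..."}}
    }
    if (isset($request['page'])) {
        $page = $request['page'];       // a page request parameter overrides both
    }
    return $page;
}

echo resolve_start_page(null, array()), "\n";                             // WikkaDocumentation
echo resolve_start_page('HomePage', array()), "\n";                       // HomePage
echo resolve_start_page('HomePage', array('page' => 'SandBox')), "\n";    // SandBox
```

In other words, a link that carries a page request parameter always wins over whatever is written in the {{fetchremote}} call.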

Notes

Long-term development ideas
The potential utility of such a plugin is pretty large. Just think of scenarios in which central Mother-wikis distribute wiki-formatted content to Child-wikis.
Providing up-to-date documentation is only one of the possible uses of this plugin.
And now, for something completely different
<mode sci-fi="on">
Imagine that the set of patterns used by the rewrite engine to format the local version of the fetched page could be made user-configurable and extended beyond link formatting. One day, we could have a plugin that retrieves content from remote 'non-wikka-powered' wikis, translates the wiki content into Wikka syntax and seamlessly integrates/saves it locally. Sounds exciting, doesn't it? :)
<mode sci-fi="off">
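
One way to picture the idea above is a rule table that maps a pattern to a rewrite callback, so rules can be added or swapped without touching the engine. The sketch below is purely illustrative: the $rules array, apply_rules() and the two MediaWiki-style sample rules are assumptions and exist nowhere in the current code.

```php
<?php
// Purely hypothetical: a user-configurable table of rewrite rules.
// Each rule maps a regex to a callback that returns the replacement text.
$rules = array(
    // MediaWiki-style '''bold''' to Wikka **bold** (must run before the italic rule)
    "/'''(.+?)'''/" => function ($m) { return '**'.$m[1].'**'; },
    // MediaWiki-style ''italic'' to Wikka //italic//
    "/''(.+?)''/"   => function ($m) { return '//'.$m[1].'//'; },
);

// Apply every rule, in order, to the fetched text.
function apply_rules($text, array $rules)
{
    foreach ($rules as $pattern => $callback) {
        $text = preg_replace_callback($pattern, $callback, $text);
    }
    return $text;
}

echo apply_rules("'''Bold''' and ''italic'' text", $rules), "\n";
// prints: **Bold** and //italic// text
```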

The code (actions/fetchremote.php)

Note: I had to modify one line in the code below because it contained two "%" in a row, which broke the code display on this page.

Before testing this code, please remove the space I added between the two "%" characters:

as displayed on this page (with the added spaces):
define('PATTERN_CODE', '% %.*?% %'); # ignore code block

as it should read before use:
define('PATTERN_CODE', '%%.*?%%'); # ignore code block
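
Why the spaces matter: the pattern is meant to find Wikka code blocks, which are delimited by two adjacent percent signs, so the spaced version never matches one. A small check on an assumed sample string, using the same s modifier the action uses so that "." also matches newlines:

```php
<?php
// Pattern as it should read once the extra spaces are removed.
$pattern_code = '%%.*?%%';

// Assumed sample text: a Wikka code block spanning several lines.
$sample = "Intro\n%%(php)\necho 'WikiWord';\n%%\nOutro";

// The 's' modifier lets '.' match newlines, so the whole block is one match.
$found  = preg_match('/'.$pattern_code.'/s', $sample, $m);

// The pattern as displayed on this page (with spaces) finds nothing.
$spaced = preg_match('/% %.*?% %/s', $sample);

var_dump($found === 1);    // bool(true)
var_dump($spaced === 0);   // bool(true)
```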


<?php

/**
 * Connects to a specified Wikka server, fetches a remote page and formats it for local use.
 *
 * This action allows the user to locally browse in a Wikka client content fetched from a
 * remote Wikka server. It displays an error message if the remote page does not exist
 * on the server or if a connection is not available.
 * Once a connection is established, the fetched page is parsed for internal links, which
 * are rewritten as links to fetchable pages, and printed on the screen.
 * Fetched pages can then be safely stored on the Wikka client. If a local version
 * of a fetched page is available, a "see local version" button replaces the default
 * "download" button.
 *
 * A "raw" method must be available on the main Wikka server, in order to
 * produce raw wikka-formatted content with header and footer stripped.
 *
 * @package     Actions
 * @name        FetchRemote
 *
 * @author      {@link http://wikka.jsnx.com/DarTar DarTar}
 * @author      {@link http://wikka.jsnx.com/JavaWoman JavaWoman} - replacing double by single quotes, better patterns
 * @version     0.7
 * @since       Wikka 1.1.X
 *
 * @input       string  $page   optional: Starting page on the main Wikka server;
 *              default: WikkaDocumentation
 *              can be overridden by a $_REQUEST['page'] parameter.
 * @output      prints fetched documentation pages
 *
 * @todo        -CamelCase link rewriting: check regex for consistency with wikka formatters.
 *              -Interwiki link rewriting => don't rewrite! just prevent CamelCase rewriting here
 */



// pattern defines
    // NOTE: (initial) REs for URL taken from wakka.php formatter - same potential problems.. - now adapted since there WERE indeed problems!
// string used to mark "don't replace me" camel words and other strings
define('IGNOREMARKER', '!!!');
define('PATTERN_IGNOREMARKER', '!!!');      # @@@ PHP function to escape for RE?

// patterns to be ignored for rewriting
define('PATTERN_IGNORE', PATTERN_IGNOREMARKER.'.*?'.PATTERN_IGNOREMARKER); # string "marked up" to be ignored
//Note: REMOVE spaces between % % in the following line before using the plugin
define('PATTERN_CODE', '% %.*?% %');            # ignore code block
define('PATTERN_LITERAL', '"".*?""');                                       # ignore Wikka literal
define('PATTERN_ACTION', '{{(?!image).*?}}');                               # ignore action _except_ image
define('PATTERN_ATTRIB', '\b(\w*?\s*)(=\s?"[^\n]*?"|=\s?\'[^\n]*?\')');     # attributes (HTML, action)
#define('PATTERN_URL', '\b[a-z]+:\/\/\S+');                                                         # copied from formatter
define('PATTERN_URL', '[a-z]+:\/\/\S+');                                                            # copied from formatter - adapted
#define('PATTERN_URL2', '^([a-z]+:\/\/\S+?)([^[:alnum:]^\/])?$');                                   # copied from formatter
define('PATTERN_URL2', '\b[a-z]+:\/\/[[:alnum:]][-_[:alnum:]\/@:\.,_\?&;=]+[-_[:alnum:]\/\?&;=]'); # copied from formatter - adapted to recognize more URLs, @@@ not perfect yet
define('PATTERN_INTERWIKI', '\b[A-ZÄÖÜ][A-Za-zÄÖÜßäöü]+[:](?![=_])\S*\b');                          # copied from formatter
define('PATTERN_FORCEDURL', '\[\[(?!")'.PATTERN_URL2.'(\s+(.*?))?\]\]');    # forced link with URL (ignore)         @@@ (?!") still needed??

// regex pattern for forced links: accept "internal pages" (camelwords) on remote server but ignore URLs
define('PATTERN_FORCED', '\[\[(?!")([^\s\/\]]+)(\s+(.*?))?\]\]');           # forced link not with URL (rewrite)    @@@ (?!") still needed??

// regex patterns to recognize a "CamelWord"
#define('PATTERN_CAMELWORD', '[A-Z]+[a-z]+[A-Z][A-Za-z0-9]+');                                      # @@@ make equivalent to formatter (see below)
#define('PATTERN_CAMELWORD', '\b[A-ZÄÖÜ]+[a-zßäöü]+[A-Z0-9ÄÖÜ][A-Za-z0-9ÄÖÜßäöü]*\b');              # copied from formatter but removed brackets
define('PATTERN_CAMELWORD', '[A-ZÄÖÜ]+[a-zßäöü]+[A-Z0-9ÄÖÜ][A-Za-z0-9ÄÖÜßäöü]*');                   # copied from formatter but removed brackets
#define('PATTERN_FREECAMEL', '(\s*)('.PATTERN_CAMELWORD.')');                                       # @@@ not needed? leave for now

// regex pattern to recognize an image link (image links with URLs are left to the formatter)
define('PATTERN_IMGLINK', 'link="('.PATTERN_CAMELWORD.')"');

/* problems solved so far
forced links:
- a forced link like [[MHM]] just disappeared (see CreateNewPage)
- forced links of the form [[WikiName]]s are misinterpreted (mangled result) (example on WikkaBugsResolved "Interwiki is broken")
- some URLs (in forced links) not recognized but should be ignored (see DarTar)
- forced links on NotifyOnChange not recognized at all (caused by the credits in (single) [] ?) => No: solution: single LinkRewrite!
camelwords:
- JsnX not recognised (see WikkaBugsResolved) => incorrect RE
- Words like Mod040fSmartPageTitles not recognized (see WikkaBugsResolved) => incorrect RE
ignores:
- ignore literals ""[[double bracket]]"" or ""WikiWord"" were rewritten when they shouldn't be (see also CreateNewPage)
- ignore code blocks (may contain forced links or WikiWords)
- ignore URLs that contain camelwords => simply ignore URLs
- URL with embedded camelword on its own on a line: URL not recognized (see Mod039fMindMapMod) => error in preg_replace_callback RE
- ignore InterWiki links (see WikiName for an example; better xmp at WikkaBugsResolved "Interwiki is broken")
- code not recognized on LoggedUsersHomepage and RedirectOnLogin => solution: single LinkRewrite!
- literal not recognized on LoggedUsersHomepage (Camel matched first on ""IntraNet"" - why?) => solution: single LinkRewrite!
- interwiki links broken again in single-function rewrite (see WikkaBugsResolved "Interwiki is broken") => clumsy fix with extra function
- OrphanedPages shows error message:
    "Unknown action; an action name can consist only of US-ASCII characters and/or digits." but no page names at all...
    => add ignore for actions
other:
- code blocks may disappear or be broken (see FeedbackAction for an example) => incorrect code block ignore; RE must match over multiple lines
*/


/* outstanding problems
- rewritten image links show up as external links - unavoidable, I think: the image does link to an external URL after rewriting! (see AddingLinks for an example)
*/


/* list of important TEST pages
- CreateNewPage         - forced link without description (test correct RE and matching elements for forced links)
                        - literals that should be ignored (including literal containing URL containing camelword)
- NotifyOnChange        - more forced links (test not getting confused by extra [] around forced links)
- WikkaBugsResolved     - forced links of the form [[WikiName]]s - see "Interwiki is broken"
                        - InterWiki links
                        - camelwords like JsnX and Mod040fSmartPageTitles (test correct RE for camelwords)
- DarTar                - forced links with external URLs (test not rewriting such forced links)
- LoggedUsersHomepage   - literals to be ignored (such as ""IntraNet"") as well as code blocks
- FeedbackAction        - code blocks (containing camelwords, literals and forced links) to be ignored
- FreeMind              - forced link with URL containing underscore (test correct URL RE)
- Mod039fMindMapMod     - lone camelword on one line followed by lone URL with camelword on next line (test URL RE and preg_replace_callback RE)
- OrphanedPages         - action (with camelword!) should be ignored
- AddingLinks           - image actions should NOT be ignored
*/


/* (possible) server-side bugs
- XBUG: problem with googleform on UsingActions => cause: bug in googleform itself! => REPORTED on WikkaBugs
- XBUG? OrphanedPages   - shown directly starts with an "orphan" '12Action!' (does not exist) followed by page names; database problem?
*/


// SET DEFAULTS

$remote_server_root = 'http://wikka.jsnx.com/'; # set remote server root
//$remote_server_root = "http://test/wikka-1.1.5.0/wikka.php?wakka="; # debug server

$defaultpage = 'WikkaDocumentation'; # define default page to be fetched
if (isset($page)) $defaultpage = $page; # pick up action parameter
if (isset($_REQUEST['page'])) $defaultpage = $_REQUEST['page']; # pick up URL parameter
$page = $defaultpage; # ready to roll

// PERFORM REDIRECTIONS

// submitted form button, if any (avoids an undefined-index notice on plain GET requests)
$posted_action = isset($_POST['action']) ? $_POST['action'] : '';

// redirect to main documentation page
if ($posted_action == 'Return to Wikka Documentation') $this->Redirect($this->GetPageTag());

// redirect to Wikka homepage on disconnection
if ($posted_action == 'Disconnect') $this->Redirect($this->GetConfigValue('root_page'));

// switch to local version of the page
if ($posted_action == 'See local version') $this->Redirect($page);

// automatically redirect to local page if it exists
// NOTE: the use of this feature is discouraged since it traps users 'locally'
// and prevents them from accessing recently updated versions of the Wikka documentation
//if ($this->LoadPage($page)) $this->Redirect($page);

// SET HEADER & FORM ELEMENTS

// header style
// to be replaced by a CSS selector in the definitive version
$style = 'text-align: center; margin: 30px 25%; border: 1px dotted #333; background-color: #EEE; padding: 5px;';

// build form chunks
$form_local = '<input type="submit" name="action" value="See local version" />';            # i18n
$form_main = '<input type="submit" name="action" value="Return to Wikka Documentation" />'; # i18n
$form_disconnect = '<input type="submit" name="action" value="Disconnect" />';              # i18n
$form_page = '<input type="hidden" name="page" value="'.$page.'" />';
$form_download = '<input type="submit" name="action" value="Download this page" />';        # i18n


// TRY TO CONNECT
$remote_page = fopen($remote_server_root.$page."/raw", "r");

if (!$remote_page) {

    // NO CONNECTION AVAILABLE
    echo $this->Format('=====Wikka Documentation===== --- Visit the **[[http://wikka.jsnx.com/WikkaDocumentation Wikka Documentation Project]]** --- --- ');
    // if a local version of the starting page is available:
    if ($this->LoadPage($page)) print $this->FormOpen().$form_local.$this->FormClose();

} else {

    // CONNECTION ESTABLISHED

    // fetch raw content of remote page
    $content = '';                                  # initialize before appending
    while (!feof($remote_page)) {
        $content .= fgets($remote_page, 1024);
    }

    if (!$content)
    {
        // missing or empty page: show error message
        $header = 'Sorry, **';
        $header .=  '""<a href="'.$this->Href('','','page='.$page).'">'.$page.'</a>""';
        $header .= '** cannot be found on the [['.$remote_server_root.$page.' Wikka server]]! --- --- ';
        $form = $this->FormOpen().$form_page;
        $form .= ($this->LoadPage($page)) ? $form_local : '';
        $form .= $form_main.$this->FormClose();
    }
    else
    {

        // START LINK-REWRITING ENGINE

        // define callback functions
        // mark strings to be ignored for rewriting
        function MarkIgnore($things)
        {
/* DEBUG - remove later
if ('' != $things[0])
{
    echo '<br/>START MarkIgnore - $things:<pre>';
    print_r($things);
    echo '</pre>';
}
/**/

            $thing = $things[0];
            // ignore things BEFORE looking at forced links or camels
            if (
                // s modifier to match over multiple lines
                // i modifier to make case-insensitive
                    preg_match('/'.PATTERN_CODE.'/s',$thing)                            # ignore code block
                ||  preg_match('/'.PATTERN_LITERAL.'/s',$thing)                         # ignore literals
                ||  preg_match('/'.PATTERN_ACTION.'/is',$thing)                         # ignore actions (keywords are case-insensitive and may be camelword!)
                ||  preg_match('/'.PATTERN_INTERWIKI.'/',$thing)                        # ignore Interwiki links
                )
            {
/* DEBUG - remove later
echo 'CODE, LITERAL or INTERWIKI match: {'.htmlspecialchars($thing).'}<br/>';
/**/

                $output = IGNOREMARKER.$thing.IGNOREMARKER;                             # mark to be ignored
            }
            // ignore attributes except in image (action) links - MUST come before checking URLs
            elseif (preg_match('/'.PATTERN_ATTRIB.'/',$thing,$matches))
            {
/* DEBUG - remove later
echo '<br/>ATTRIB match:<pre>';
print_r($matches);
echo '</pre>';
/**/

                if ('link' != $matches[1])
                {
                    $output = $matches[1].IGNOREMARKER.$matches[2].IGNOREMARKER;
/* DEBUG - remove later
echo 'ATTRIB output: {'.htmlspecialchars($output).'}<br/>';
/**/

                }
                else
                {
                    $output = $thing;
/* DEBUG - remove later
echo 'ATTRIB output in image link: {'.htmlspecialchars($output).'}<br/>';
/**/

                }
            }
            // ignore forced links with URLs and 'free' URLs
            elseif (
                    preg_match('/'.PATTERN_FORCEDURL.'/', $thing)                       # ignore forced links with URLs
                ||  preg_match('/'.PATTERN_URL2.'/', $thing)                            # ignore URLs
                )
            {
/* DEBUG - remove later
if (preg_match('/'.PATTERN_FORCEDURL.'/', $thing)) {
    echo '<br/>FORCEDURL or URL match:<pre>';
    echo htmlspecialchars($thing);
    echo '</pre>';
}
/**/

                $output = IGNOREMARKER.$thing.IGNOREMARKER;                             # mark to be ignored
/* DEBUG - remove later
echo 'REWRITE IGNORE (FORCED) URL - output: {'.htmlentities($output).'}<br/><br/>';
/**/

            }
/* DEBUG - remove later
echo 'IGNORE - output: {'.htmlentities($output).'}<br/><br/>';
/**/

            return $output;
        }

        // rewrite links (unless in a to be ignored string)
        function RewriteLink($things)
        {
/* DEBUG - remove later
if ('' != $things[0])
{
    echo '<br/>START RewriteLink - $things:<pre>';
    print_r($things);
    echo '</pre>';
}
/**/

            global $wakka;
            $thing = $things[0];
            if (preg_match('/'.PATTERN_IGNORE.'/s',$thing))                         # already marked as ignore: nothing to do
            {
/* DEBUG - remove later
echo 'IGNORE match: {'.htmlspecialchars($thing).'}<br/>';
/**/

                $output = $thing;
            }
            // rewrite forced (non-URL) links
            elseif (preg_match('/'.PATTERN_FORCED.'/',$thing,$matches))
            {
/* DEBUG - remove later
echo '<br/>FORCED match:<pre>';
print_r($matches);
echo '</pre>';
/**/

                if (isset($matches[3]))
                    #$linktext = preg_replace('/'.PATTERN_CAMELWORD.'/', IGNOREMARKER."$0".IGNOREMARKER, $matches[3]);
                    $linktext = $matches[3];
                else
                    $linktext = $matches[1];                                            # use name for forced link without a description (like [[MHM]])
                $output = IGNOREMARKER.'""<a href="'.$wakka->Href('','',"page=".$matches[1]).'">'.$linktext.'</a>""'.IGNOREMARKER;
/* DEBUG - remove later
echo 'REWRITE FORCED - output: {'.htmlentities($output).'}<br/><br/>';
/**/

            }
            // rewrite image links -  MUST come before rewriting Camelwords!
            elseif (preg_match('/'.PATTERN_IMGLINK.'/',$thing,$matches))
            {
/* DEBUG - remove later
echo '<br/>IMGLINK match:<pre>';
print_r($matches);
echo '</pre>';
/**/

                $output = 'link="'.$wakka->Href('','',"page=".$matches[1]).'"';
/* DEBUG - remove later
echo 'REWRITE IMGLINK - output: {'.htmlspecialchars($output).'}<br/><br/>';
/**/

            }
            // rewrite Camelwords
            elseif (preg_match('/'.PATTERN_CAMELWORD.'/',$thing,$matches))
            {
/* DEBUG - remove later
echo '<br/>CAMEL match:<pre>';
print_r($matches);
echo '</pre>';
/**/

                #$output = $matches[1].'""<a href="'.$wakka->Href('','',"page=".$matches[2]).'">'.$matches[2].'</a>""'; # freecamel
                $output = '""<a href="'.$wakka->Href('','',"page=".$matches[0]).'">'.$matches[0].'</a>""';              # camelword
/* DEBUG - remove later
echo 'REWRITE CAMEL - output: {'.htmlentities($output).'}<br/><br/>';
/**/

            }
            // nothing to do
            else
            {
                $output = $thing;
            }
            return $output;
        }

        // 1) mark things to be ignored for rewriting (formatter will take care of these when necessary)
        $content = preg_replace_callback('/'.
            PATTERN_CODE.
            '|'.
            PATTERN_LITERAL.
            '|'.
            PATTERN_ACTION.
            '|'.
            PATTERN_INTERWIKI.
            '|'.
            PATTERN_FORCEDURL.
            '|'.
            PATTERN_URL.
            '|'.
            PATTERN_ATTRIB.
            '/s', 'MarkIgnore', $content);

/* DEBUG (!) - remove later
echo '<br/>content before rewriting links:<br/>';
echo '{<pre>'.htmlspecialchars($content).'</pre>}<br/>';
/**/


        // 2) rewrite links (unless to be ignored)
        $content = preg_replace_callback('/'.
            PATTERN_IGNORE.                                                         # needed to be able to skip strings to be ignored
            '|'.
            PATTERN_FORCED.                                                         # rewrite
            '|'.
            PATTERN_IMGLINK.                                                        # rewrite
            '|'.
            PATTERN_CAMELWORD.                                                      # rewrite
            '/s', 'RewriteLink', $content);

/* DEBUG - remove later
echo '<br/>content before cleaning up ignore markers:<br/>';
echo '{<pre>'.htmlspecialchars($content).'</pre>}<br/>';
/**/


        // 3) strip "ignore markers" from content
        $content = str_replace(IGNOREMARKER, '', $content);

/* DEBUG - remove later
echo '<br/>content after cleaning up ignore markers:<br/>';
echo '{<pre>'.htmlspecialchars($content).'</pre>}<br/>';
/**/


        if (isset($_POST['action']) && "Download this page" == $_POST['action'])        # i18n
        {
            // SAVING FETCHED PAGE
            if ($this->LoadPage($page))
            {
                // local page with this name already exists => display error message
                // in the future we might show a form to ask if the local version should be overwritten
                $header = 'Sorry, a page named **[['.$page.']]** already exists on this site! --- ';    # i18n
                $form = $this->FormOpen().$form_main.$form_disconnect.$this->FormClose();
            }
            else
            {
                // local page does not exist => proceed
                // write page to database and display message
                $note = "fetched from the Wikka server";                                # i18n
                $this->SavePage($page, $content, $note);
                $header = 'This page is now available on your site! --- --- ';          # i18n
                $form = $this->FormOpen().$form_page.$form_local.$form_main.$this->FormClose();
            }
        }
        else
        {
            // display default header & form                                            # @@@ i18n!!
            $header  = 'You are currently browsing: **';
            $header .=  '""<a href="'.$this->Href('','','page='.$page).'">'.$page.'</a>""';
            $header .= '** --- from the **[['.$this->GetPageTag().' Wikka Documentation Project]]** --- ';
            $header .= '(fetched from the [['.$remote_server_root.$page.' Wikka server]])';
            $form  = $this->FormOpen().$form_page;
            $form .= ($this->LoadPage($page)) ? $form_local : $form_download;
            $form .= $form_disconnect.$this->FormClose();
        }
    }
/* DEBUG - remove later
echo '<br/>content after defining form:<br/>';
echo '{<pre>'.$content.'</pre>}<br/>';
/**/


    // PRINT HEADER AND CONTENT
    print '<div style="'.$style.'">'.$this->Format($header).$form.'</div>'.$this->Format($content);
}

// CLOSE CONNECTION
if ($remote_page) fclose($remote_page);     # only close a handle that was actually opened
?>


-- DarTar
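
The rewrite engine above works in three passes: mark what must not be touched, rewrite what is left, then strip the markers. A reduced, standalone sketch of that idea follows. It is not the action itself: it handles only literals and ASCII CamelWords, and the names rewrite_camelwords() and SKETCH_MARKER as well as the fetch.php?page= URL are assumptions made up for this illustration.

```php
<?php
// Marker wrapped around text that must not be rewritten (same idea as IGNOREMARKER).
define('SKETCH_MARKER', '!!!');

function rewrite_camelwords($content)
{
    $marked  = SKETCH_MARKER.'.*?'.SKETCH_MARKER;       // an already marked span
    $literal = '"".*?""';                               // Wikka literal
    $camel   = '[A-Z]+[a-z]+[A-Z0-9][A-Za-z0-9]*';      // ASCII-only CamelWord

    // 1) mark literals so the next pass skips them
    $content = preg_replace_callback('/'.$literal.'/s', function ($m) {
        return SKETCH_MARKER.$m[0].SKETCH_MARKER;
    }, $content);

    // 2) rewrite CamelWords; marked spans are matched as a whole and passed through
    $content = preg_replace_callback('/'.$marked.'|'.$camel.'/s', function ($m) use ($marked) {
        if (preg_match('/^'.$marked.'$/s', $m[0])) {
            return $m[0];                               // marked: leave untouched
        }
        // hypothetical target URL, wrapped in a literal so the formatter leaves the HTML alone
        return '""<a href="fetch.php?page='.$m[0].'">'.$m[0].'</a>""';
    }, $content);

    // 3) strip the markers again
    return str_replace(SKETCH_MARKER, '', $content);
}

echo rewrite_camelwords('See HomePage but not ""IntraNet"".'), "\n";
// prints: See ""<a href="fetch.php?page=HomePage">HomePage</a>"" but not ""IntraNet"".
```

Putting the marked-span pattern first in the alternation is what keeps the CamelWord inside the literal from being rewritten: the regex engine consumes the whole marked span before it gets a chance to try the CamelWord pattern inside it.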

The code contains references to HelpInfo which has now disappeared and been replaced by WikkaDocumentation - I haven't updated your code here, but I am updating the copy I'm working on... --JavaWoman
done -- DarTar

Thanks for posting this code, DarTar! (No, I don't mind.) A few notes about testing this, though:
  1. Note my just-added comment: somehow some of the regular expressions (copied from the formatter as noted, sometimes minimally changed) have been altered in transit. Compare with the corresponding code in ./formatters/wakka.php!
  2. You will see a lot of little comment blocks starting with the line "/* DEBUG - remove later". They are intended to trace the inner workings of the rewrite engine while it does its work. Each of these traces can be "turned on" simply by adding */ to the first line so it reads "/* DEBUG - remove later */". Don't do that for all of them at once, or you'd end up with a huge amount of output; rather, pick and choose to concentrate on a particular aspect of the link rewriting. Simply remove the */ from the first line of the block again to suppress the debug output, but leave the lines in place so you can turn them on again later.
  3. Finally, a lot still needs to be done. Most of the work so far was on the actual rewriting guts of the action. There are still matters of code organization, internationalization (preparation) and other things to address, but work on those aspects is pretty futile until the rewrite engine itself works properly.
Have fun testing! --JavaWoman
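
The toggle described in note 2 works because PHP block comments do not nest: a comment opened with /* runs up to the first */ that follows. A standalone sketch, with a made-up sketch_trace() helper and made-up messages, showing one block switched off and one switched on:

```php
<?php
// Hypothetical helper that collects trace lines instead of echoing them.
$trace = array();
function sketch_trace($msg)
{
    global $trace;
    $trace[] = $msg;
}

/* DEBUG - remove later
sketch_trace('off: this call sits inside the comment and never runs');
/**/

/* DEBUG - remove later */
sketch_trace('on: the comment ends on its first line, so this call runs');
/**/

print_r($trace);    // one entry only: the "on" message
```

In the switched-off block the closing /**/ supplies the */ that ends the comment; in the switched-on block the same /**/ is just an empty comment, which is why it can stay in place either way.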

CategoryDevelopmentActions