Most of WikkaWiki's non-content pages (WikiEdit, PageHistoryInfo, etcetera) include robots meta tags that tell web SearchEngines' spiders not to index them. This keeps the databases of the SearchEngines cleaner, at least for your website. This page aims to help make your website more friendly to robots.
To start, you probably don’t want deleted pages to show up in the SearchEngines’ indices. There are a few ways to do this.
Contributed by BarkerJr
This patch adds the robots meta tag to the header of deleted pages. This works well, but some SearchEngines don’t support the tag, so deleted pages may still show up in some indexes. Sending a 404 (see below) works with all SearchEngines, but the page can sometimes display incorrectly due to a “feature” in Internet Explorer (see the note below it).
diff -ur wiki.orig/actions/header.php wiki/actions/header.php
--- wiki.orig/actions/header.php Tue Feb 15 21:47:56 2005
+++ wiki/actions/header.php Tue Feb 15 21:51:43 2005
@@ -9,7 +9,7 @@
 <head>
 <title><?php echo $this->GetWakkaName().": ".$this->PageTitle(); ?></title>
 <base href="<?php echo $site_base ?>" />
-<?php if ($this->GetMethod() != 'show' || $this->page["latest"] == 'N' || $this->page["tag"] == 'SandBox') echo "<meta name=\"robots\" content=\"noindex, nofollow, noarchive\" />\n"; ?>
+<?php if ($this->GetMethod() != 'show' || !$this->page || $this->page["latest"] == 'N' || $this->page["tag"] == 'SandBox') echo "<meta name=\"robots\" content=\"noindex, nofollow, noarchive\" />\n"; ?>
 <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
 <meta name="keywords" content="<?php echo $this->GetConfigValue("meta_keywords") ?>" />
 <meta name="description" content="<?php echo $this->GetConfigValue("meta_description") ?>" />
To install a patch, place it in a file in your wiki's directory and execute:
patch -p1 < filename
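The condition that the patch changes can be read as a small predicate. The sketch below restates it as a standalone function so the cases are easier to see; the function name should_block_robots and its parameters are hypothetical and are not part of WikkaWiki.

```php
<?php
// Hypothetical helper (not part of WikkaWiki) mirroring the patched
// condition in actions/header.php.
// $method is the current handler name, $page is the page record array,
// or null/false when the page does not exist (for example, was deleted).
function should_block_robots($method, $page)
{
    if ($method != 'show') {
        return true;   // edit, history, etc. are not content pages
    }
    if (!$page) {
        return true;   // missing page: the case the patch adds
    }
    // Old revisions and the SandBox are blocked as before.
    return $page['latest'] == 'N' || $page['tag'] == 'SandBox';
}
```

Checking !$page before reading $page['latest'] also avoids indexing into an empty page record.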
Sending 404 Not Found
Contributed by DotMG
Ticket:258
Modify ./handlers/page/show.php like this:
if (!$this->page)
{
	$httpversion = isset($_SERVER["SERVER_PROTOCOL"]) ? $_SERVER["SERVER_PROTOCOL"] : 'HTTP/1.1';
	header("$httpversion 404 Not Found");
	print("<p>This page doesn't exist yet. Maybe you want to <a href=\"".$this->Href("edit")."\">create</a> it?</p></div>");
}
Note: IE requires the response body to reach a minimum number of bytes. If the body is shorter than this limit, IE displays its own default error content instead. Normally, though, the page should display the content we expect (This page doesn't exist yet. Maybe you want to create it?).
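One possible way around the IE behaviour described in the note is to pad short error bodies with an invisible HTML comment. The sketch below is only an illustration under that assumption: the helper names not_found_status_line and pad_error_body are hypothetical, and since the exact byte limit is not given here, it is left as a parameter.

```php
<?php
// Hypothetical helper: builds the same status line as the snippet above.
function not_found_status_line(array $server)
{
    $httpversion = isset($server['SERVER_PROTOCOL']) ? $server['SERVER_PROTOCOL'] : 'HTTP/1.1';
    return "$httpversion 404 Not Found";
}

// Hypothetical helper: pads $body with an HTML comment until it is at
// least $min_bytes long. The caller chooses $min_bytes; this page does
// not state the limit that IE uses.
function pad_error_body($body, $min_bytes)
{
    $missing = $min_bytes - strlen($body);
    if ($missing <= 0) {
        return $body;   // already long enough
    }
    // "<!--" plus "-->" take 7 bytes; fill the rest with spaces.
    return $body . '<!--' . str_repeat(' ', max(0, $missing - 7)) . '-->';
}
```

In a handler this would be used as header(not_found_status_line($_SERVER)); followed by print(pad_error_body($html, $limit));.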
Disabling robots with robots.txt
Contributed by DotMG
The Robots Meta Tag suggested by BarkerJr has the drawback that friendly spiders must load a page before they learn that they may not archive it. Here is another solution, in which robots.txt tells them which pages they aren't allowed to access.
The idea is to use a separate URL beginning with nobot/ for each page that spiders are not allowed to visit. That is, all such links on the site will be changed from http://wikkasite/notallowedtorobots to http://wikkasite/nobot/notallowedtorobots
First: robots.txt should be available at the URL http://wikkasite/robots.txt. Its content should be:
User-Agent: *
Disallow: /nobot
Second: Create another file, nobot.php, next to wikka.php, with the following content:
<?php include ("wikka.php"); ?>
This file will be used if mod_rewrite is disabled.
Third: Modify ./.htaccess like this:
<IfModule mod_rewrite.c>
RewriteEngine on
RewriteCond %{REQUEST_FILENAME} -d
RewriteRule ^(.*/[^\./]*[^/])$ $1/
RewriteRule ^nobot\/(.*)$ wikka.php?wakka=$1 [QSA,L]
RewriteCond %{REQUEST_URI} !=/favicon.ico
RewriteCond %{REQUEST_URI} !=/robots.txt
RewriteRule ^(.*)$ wikka.php?wakka=$1 [QSA,L]
</IfModule>
These rules are used if mod_rewrite is enabled. The line RewriteRule ^nobot\/(.*)$ wikka.php?wakka=$1 [QSA,L] removes the nobot/ from the URL, so http://wikkasite/nobot/HomePage/edit and http://wikkasite/HomePage/edit point to the same location. With mod_rewrite disabled, the other pair, http://wikkasite/nobot.php?wakka=HomePage/edit and http://wikkasite/wikka.php?wakka=HomePage/edit, also point to the same location.
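To see why both URLs end up at the same place, the two RewriteRule patterns can be imitated with PHP regular expressions. This is only a sketch of the mapping: the function rewrite_request is hypothetical, and it ignores the RewriteCond lines (favicon.ico, robots.txt) and the directory rule.

```php
<?php
// Hypothetical sketch: imitates the two RewriteRule lines on a request
// path given without its leading slash, as mod_rewrite sees it in .htaccess.
function rewrite_request($path)
{
    // RewriteRule ^nobot\/(.*)$ wikka.php?wakka=$1 [QSA,L]
    if (preg_match('#^nobot/(.*)$#', $path, $m)) {
        return 'wikka.php?wakka=' . $m[1];
    }
    // RewriteRule ^(.*)$ wikka.php?wakka=$1 [QSA,L]
    return 'wikka.php?wakka=' . $path;
}

// Both forms of the URL are rewritten to the same internal request.
var_dump(rewrite_request('nobot/HomePage/edit') === rewrite_request('HomePage/edit'));
```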
Finally: We modify the MiniHref() and Href() methods, so that for non-archivable pages, the URL used will contain nobot...
function MiniHref($method = "", $tag = "")
{
	if (!$tag = trim($tag)) $tag = $this->tag;
	// If mod_rewrite is enabled, and either the method is not show or the page is listed in $this->config['nobot'],
	// we prepend nobot/ to the link.
	if (((($method != "") && ($method != 'show')) || (stristr($this->config['nobot'], $tag))) && ($this->config["rewrite_mode"]))
		$tag = "nobot/$tag";
	return $tag.($method ? "/".$method : "");
}
// returns the full url to a page/method.
function Href($method = "", $tag = "", $params = "")
{
	$base_url = $this->config["base_url"];
	// If mod_rewrite is disabled, we use nobot.php?wakka=notallowedtorobots instead of wikka.php?wakka=notallowedtorobots.
	// The pattern has to match the whole "wikka.php?wakka=" suffix of base_url.
	if (!$this->config["rewrite_mode"])
	{
		$base_url = preg_replace('/wikka\.php\?wakka=$/', 'nobot.php?wakka=', $base_url);
	}
	$href = $base_url.$this->MiniHref($method, $tag);
	if ($params)
	{
		$href .= ($this->config["rewrite_mode"] ? "?" : "&").$params;
	}
	return $href;
}
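The decision made in MiniHref() can be tried out in isolation. The sketch below restates it as a free function; nobot_mini_href and its parameters are hypothetical, with $nobot_pages standing in for $this->config['nobot'] and $rewrite_mode for $this->config["rewrite_mode"].

```php
<?php
// Hypothetical standalone version of the MiniHref() logic above.
// $nobot_pages is a string listing the page names to hide from robots.
function nobot_mini_href($method, $tag, $nobot_pages, $rewrite_mode)
{
    $is_handler = ($method != '' && $method != 'show');
    // stristr() does a case-insensitive substring search in the list.
    $is_listed = (stristr($nobot_pages, $tag) !== false);
    if (($is_handler || $is_listed) && $rewrite_mode) {
        $tag = "nobot/$tag";
    }
    return $tag . ($method ? '/' . $method : '');
}

echo nobot_mini_href('edit', 'HomePage', '', true), "\n";
echo nobot_mini_href('', 'HomePage', '', true), "\n";
```

Because stristr() matches substrings, a page whose name is contained in a listed name (for example Sand when SandBox is listed) would also get the nobot/ prefix. An exact match against a list of names may be preferable.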
Todo: If wikka is installed in a subdirectory, like http://wikkasite/wikkadir/HomePage, robots.txt must be changed to
Disallow: /wikkadir/nobot
When upgrading a site, bots already know the URL http://wikkasite/HomePage/edit, so we must send a 301 Moved Permanently status if a page is requested with that old URL.
- Wouldn't it be much easier to just add rel="nofollow" to every link that should not be followed by SearchEngines (<a href="revisions" rel="nofollow">revisions</a>), and do that for the edit, history, revisions and referrers links by changing actions/footer.php?
- Links may be followed from another, non-wiki site. You cannot expect other sites to add rel="nofollow". So you also have to add "noindex, nofollow, noarchive" in a robots meta tag.
- The above mentioned nobot mod somehow breaks the loading of the editor images. The server gets trapped in a loop.
- "GET /nobot/3rdparty/plugins/wikiedit/images/images/3rdparty/plugins/wikiedit/images/images/[... the segment "3rdparty/plugins/wikiedit/images/images/" repeats many more times ...]3rdparty/plugins/wikiedit/images/indent.gif HTTP/1.1" 403 1226 "http://wiki.xxxx.de/nobot/HomePage/edit" "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.8.0.9) Gecko/20061206 Firefox/1.5.0.9"
- Any ideas how to fix this? --DaC
Sample robots.txt for Google & Yahoo
Googlebot and Yahoo! Slurp support wildcards in their robots.txt. As I prefer blocking the URL to sending a noindex meta tag (mainly for bandwidth reasons), I am using the robots.txt below to ensure that only the wanted parts are indexed by Google & Yahoo!. Another advantage of this robots.txt is that it prevents duplicate content (e.g. WikkaDocumentation or PasswordForgotten exist a million times on the web, and it makes no sense to let them get indexed).
User-agent: Slurp
Disallow: /MenuConfig
Disallow: /WikiName
Disallow: /WikkaDocumentation
Disallow: /WikkaReleaseNotes
Disallow: /UserSettings
Disallow: /TextSearch
Disallow: /SysInfo
Disallow: /PasswordForgotten
Disallow: /InterWiki
Disallow: /MyPages
Disallow: /MyChanges
Disallow: /FormattingRules
Disallow: /CategoryWiki
Disallow: /nobot
Disallow: /SandBox
Disallow: /*?
Disallow: /*/edit
Disallow: /*/history
Disallow: /*/revisions
Disallow: /*/acls
Disallow: /*/referrers
Disallow: /*/backlinks
Disallow: /*/recentchanges.xml
Disallow: /*/showcode
Disallow: /*/raw
User-agent: Googlebot
Disallow: /MenuConfig
Disallow: /WikiName
Disallow: /WikkaDocumentation
Disallow: /WikkaReleaseNotes
Disallow: /UserSettings
Disallow: /TextSearch
Disallow: /SysInfo
Disallow: /PasswordForgotten
Disallow: /InterWiki
Disallow: /MyPages
Disallow: /MyChanges
Disallow: /FormattingRules
Disallow: /CategoryWiki
Disallow: /nobot
Disallow: /SandBox
Disallow: /*?
Disallow: /*/edit
Disallow: /*/history
Disallow: /*/revisions
Disallow: /*/acls
Disallow: /*/referrers
Disallow: /*/backlinks
Disallow: /*/recentchanges.xml
Disallow: /*/showcode
Disallow: /*/raw
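To check which URLs a wildcard rule covers, the patterns can be translated to regular expressions. The sketch below assumes the commonly documented wildcard behaviour (* matches any run of characters, a trailing $ anchors the end of the URL, and a rule otherwise matches as a prefix). The function disallow_matches is hypothetical, and real crawlers may differ in details.

```php
<?php
// Hypothetical sketch: does a Disallow pattern cover a given URL path?
// Assumes: * matches any characters, a trailing $ anchors the end,
// and the pattern is otherwise matched as a prefix of the path.
function disallow_matches($pattern, $path)
{
    $anchored = (substr($pattern, -1) === '$');
    if ($anchored) {
        $pattern = substr($pattern, 0, -1);
    }
    // Escape regex metacharacters, then turn the escaped * back into .*
    $regex = str_replace('\*', '.*', preg_quote($pattern, '#'));
    return preg_match('#^' . $regex . ($anchored ? '$' : '') . '#', $path) === 1;
}

var_dump(disallow_matches('/*/edit', '/HomePage/edit'));
var_dump(disallow_matches('/*/edit', '/HomePage'));
```

Since rules match as prefixes under this assumption, Disallow: /*/edit would also cover a path such as /HomePage/editor.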
CategoryUserContributions