Google Sitemap support for Wikka
This is a drop-in extension that provides support for Google Sitemaps in Wikka. Priority and frequency can be customized for a specific list of pages. The sitemap can be accessed by appending /sitemap.xml to the full URL of any page.
Note: this is a preliminary implementation based on a previous draft by BarkerJr; improvements are welcome. Tested on Wikka 1.1.6.5.
Sample output
http://nitens.org/taraborelli/home/sitemap.xml
Validation
Validome
Code
Save the following as handlers/page/sitemap.xml.php
<?php
/**
* Generate a {@link https://www.google.com/webmasters/tools/docs/en/protocol.html Google Sitemap} to optimize indexing of the wiki.
*
* @package Handlers
* @subpackage XML
* @version $Id$
* @license http://www.gnu.org/copyleft/gpl.html GNU General Public License
* @filesource
*
* @author {@link http://wikkawiki.org/BarkerJr BarkerJr} initial code
* @author {@link http://wikkawiki.org/DarTar Dario Taraborelli} using Wikka internals, added support for changefreq and priority
*
* @uses Config::$base_url
* @uses Config::$table_prefix
* @todo - Calculate optimal changefreq for each page depending on actual revision history
*/
//------------BEGIN Configuration------------
/* How frequently a page is likely to change. This value provides general information
to search engines and may not correlate exactly to how often they crawl the page.
Valid values are: always, hourly, daily, weekly, monthly, yearly, never
*/
$default_frequency = 'monthly';
$custom_frequency = array(
	'home' => 'weekly',
	'papers' => 'weekly',
	'webcommunities' => 'weekly',
	'latex' => 'daily',
	'cvtex' => 'daily'
);
/* The priority of this URL relative to other URLs on your site. Valid values range from
0.0 to 1.0. This value has no effect on your pages compared to pages on other sites,
and only lets the search engines know which of your pages you deem most important
so they can order the crawl of your pages in the way you would most like.
The default priority of a page is 0.5.
*/
$default_priority = '0.5';
$custom_priority = array(
	'home' => '1.0',
	'papers' => '1.0',
	'webcommunities' => '0.8',
	'latex' => '0.8',
);
//------------END Configuration------------
//initialize
$xml = '';
//build output
$xml .= '<?xml version="1.0" encoding="utf-8"?>'."\n";
$xml .= '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">'."\n";
$pages = $this->Query('SELECT SQL_NO_CACHE tag, time FROM '.$this->config['table_prefix'] . 'pages LEFT JOIN '.$this->config['table_prefix'] . "acls ON page_tag = tag WHERE latest = 'Y' AND (read_acl = '*' OR read_acl IS NULL) ORDER BY time DESC");
while ($row = mysql_fetch_assoc($pages))
{
	$priority = (isset($custom_priority[$row['tag']]))? $custom_priority[$row['tag']] : $default_priority;
	$frequency = (isset($custom_frequency[$row['tag']]))? $custom_frequency[$row['tag']] : $default_frequency;
	$date = date('Y-m-d\TH:i:sO', strtotime($row['time']));
	$xml .= '<url>'."\n";
	$xml .= ' <loc>' . $this->config['base_url'].$row['tag']."</loc>\n";
	$xml .= ' <priority>'.$priority.'</priority>'."\n";
	$xml .= ' <changefreq>'.$frequency.'</changefreq>'."\n";
	$xml .= ' <lastmod>'.substr($date, 0, -2).':'.substr($date, -2)."</lastmod>\n";
	$xml .= '</url>'."\n";
}
$xml .= '</urlset>';
//echo
header('Content-Type: text/xml; charset=utf-8');
echo $xml;
?>
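The lastmod step in the handler above takes the output of date() with the O offset (for example +0100) and inserts a colon so the value matches the W3C datetime form used in sitemaps. A minimal sketch of that step in isolation follows; the function name sitemap_lastmod() is hypothetical and not part of Wikka, and the timezone is pinned to UTC only to make the example reproducible.

```php
<?php
// Assumption: UTC is set here only so the example gives the same result everywhere.
date_default_timezone_set('UTC');

/**
 * Hypothetical helper: convert a MySQL DATETIME string into the
 * W3C datetime form expected in <lastmod> (offset written as +hh:mm).
 */
function sitemap_lastmod($mysql_time)
{
	// 'O' yields the offset without a colon, e.g. +0000
	$date = date('Y-m-d\TH:i:sO', strtotime($mysql_time));
	// Re-insert the colon between the hours and minutes of the offset
	return substr($date, 0, -2) . ':' . substr($date, -2);
}

echo sitemap_lastmod('2007-03-15 10:30:00'), "\n"; // 2007-03-15T10:30:00+00:00
```

On PHP 5 and later, date('c', ...) should produce the same string directly; the substr() approach keeps the handler usable on older versions.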
Discussion
It would be nice to calculate the optimal value for changefreq as a function of the actual history of revisions of a page. As a first approximation, the following query gives all the data one may need:
SELECT SQL_NO_CACHE tag,
	MAX(time) AS latest,
	MIN(time) AS first,
	DATEDIFF(MAX(time), MIN(time)) AS history,
	COUNT(id) AS revisions
FROM wikka_pages
GROUP BY tag
ORDER BY revisions DESC;
Dividing the number of existing revisions by the number of days between the first and the last edit should give an approximate index of how frequently the page has been modified. Unfortunately, this approach cannot distinguish a page that was modified several times within an hour on a single date and then left unchanged for months from a page that has been modified regularly every week or month.
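The ratio described above could be mapped to a changefreq value roughly as follows. This is only a sketch: estimate_changefreq() is a hypothetical helper, and the thresholds (one edit per day, per 7 days, per 30 days) are assumptions rather than anything the sitemap protocol prescribes. The two arguments correspond to the revisions and history columns of the query.

```php
<?php
/**
 * Hypothetical helper: derive a changefreq value from the number of
 * revisions and the number of days between the first and last edit.
 * Thresholds are illustrative assumptions.
 */
function estimate_changefreq($revisions, $history_days, $default = 'monthly')
{
	$revisions = (int) $revisions;
	$history_days = (int) $history_days;
	// With no history span there is nothing to divide by; fall back to the default
	if ($revisions < 1 || $history_days < 1) {
		return $default;
	}
	// Compare revisions/days against 1, 1/7 and 1/30 using integer arithmetic
	if ($revisions >= $history_days) {
		return 'daily';
	}
	if ($revisions * 7 >= $history_days) {
		return 'weekly';
	}
	if ($revisions * 30 >= $history_days) {
		return 'monthly';
	}
	return 'yearly';
}

echo estimate_changefreq(30, 10), "\n";  // daily
echo estimate_changefreq(5, 100), "\n";  // monthly
```

The sketch inherits the limitation noted above: a burst of edits on one day followed by months of silence still raises the average.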
CategoryUserContributions
SELECT SQL_NO_CACHE id, tag,
MAX(time) AS latest,
MIN(time) AS first,
DATEDIFF(MAX(time), MIN(time)) AS history,
COUNT(DISTINCT LEFT(CAST(time AS CHAR(20)), 10)) AS revisions
FROM wikka_pages
GROUP BY tag
ORDER BY revisions DESC;
regards, jens
This
-------8<--------
COUNT(DISTINCT CAST(time AS CHAR(10))) AS revisions
-------8<--------
does the same, but is somewhat smarter: the cast to CHAR(10) implicitly truncates the value to the date part.
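For readers who prefer to see the idea outside SQL, the same "count distinct edit days" logic can be sketched in PHP. The function name count_edit_days() is hypothetical, and it assumes timestamps in the MySQL DATETIME form YYYY-MM-DD HH:MM:SS, so that the first 10 characters are the date.

```php
<?php
/**
 * Hypothetical helper: count the number of distinct calendar days
 * on which a page was edited, given its revision timestamps.
 */
function count_edit_days(array $times)
{
	$days = array();
	foreach ($times as $time) {
		// Keep only the date part, as the CAST/LEFT expressions do in SQL
		$days[substr($time, 0, 10)] = true;
	}
	return count($days);
}

$times = array(
	'2007-01-01 10:00:00',
	'2007-01-01 10:20:00', // same day as above, counted once
	'2007-01-08 09:00:00',
);
echo count_edit_days($times), "\n"; // 2
```

Counting days rather than raw revisions means several edits within one hour weigh the same as a single edit on that date.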
BTW, thanks for the link to your site. Always nice to see Wikka sites that are "in the neighborhood" (well, at least in the same part of the state)...
You can simply add the names of the pages you wish to "hide" to the $custom_frequency array, specifying "never" as the value. Non-public pages (those with an admin-only or registered-user-only read ACL) are automatically excluded from the sitemap. Hope this helps.
Thanks for the quick reply. I suppose I am being picky, but I wanted those pages omitted entirely from the sitemap.
Anyway, the $custom_frequency pointed me in the right direction, and I ended up inserting the following if statement under the while loop, which solved my problem.
if ($custom_frequency[$row['tag']] != 'never')
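A slightly more defensive variant of that check is sketched below. The helper name sitemap_includes() is hypothetical; the only change in behavior is the isset() guard, which avoids an undefined-index notice for pages that are not listed in $custom_frequency.

```php
<?php
/**
 * Hypothetical helper: decide whether a page should appear in the sitemap.
 * Pages whose custom frequency is 'never' are left out entirely.
 */
function sitemap_includes($tag, array $custom_frequency)
{
	// Unlisted pages are always included
	if (!isset($custom_frequency[$tag])) {
		return true;
	}
	return $custom_frequency[$tag] != 'never';
}

$custom_frequency = array(
	'home' => 'weekly',
	'private_notes' => 'never', // example page name, not from the original
);
var_dump(sitemap_includes('home', $custom_frequency));          // bool(true)
var_dump(sitemap_includes('private_notes', $custom_frequency)); // bool(false)
```

Inside the handler, the body of the while loop would then be wrapped in if (sitemap_includes($row['tag'], $custom_frequency)) { ... }.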
Brian,
I didn't notice you were in Garland. Howdy! :)