
Fighting spam in Wikka


As it may have dawned on you by now, spam is getting to be a problem in wikis - both the kind that also plagues many blogs in the form of comment spam (except that in a wiki it can (also) affect page content), and referrer spam. And then there are spambots intent on harvesting email addresses.

Wikka sites are no exception any more (and other WakkaWiki forks seem to be having problems, too).

This page is intended to gather ideas for how to fight spam (of all types) in Wikka, so we can coordinate our efforts and get a spammer-hardened Wikka out there. You can also find some general information about (fighting) wiki spam and what Wikka has already implemented as defense measures.

Spam in Wikka pages

About how to discourage spammers from posting spam links on your pages in the first place, and what to do when your pages have already been spammed.

One issue with the Google redirection and the newer rel="nofollow" is that good sites get hit by this procedure too. Since we can't really tag links on a "trusted user" basis, we have to do it on a trusted-server basis. I use a whitelist in config.php with a list of "good servers":

<?php
// entry for the Wikka configuration array: a regexp matching trusted servers
"serverre" => "/(nontroppo\.org|goodsite\.com|etc)/",
?>


And my Link routine in the main wakka.php (wikka.php) is modified to make use of it:

<?php
$follow = "";    // only set to rel="nofollow" for untrusted links
if (preg_match($this->GetConfigValue("serverre"), $tag))
{
    $url = $tag; // trusted web site, so no need for redirects
    $urlclass = "ext";
}
else
{
    // untrusted: route through Google's redirector and mark as nofollow
    $tag = rawurlencode($tag);
    $url = "http://www.google.com/url?q=".$tag;
    $urlclass = "ext";
    $follow = " rel=\"nofollow\" ";
}
return $url ? "<a ".$follow." class=\"".$urlclass."\" href=\"".$url."\">$text</a>" : $text;
?>


This way, trusted sites get full and unadulterated links, but anything else gets BOTH Google redirection and rel="nofollow" added. The CSS can then contain rules to visually tag those different URLs, so the user can see whether a link is trusted or not (I use advanced generated content - not supported in IE):

a.ext:after, a[rel="nofollow"]:after {
    content: "\00220A";
    text-decoration: none !important;
    font-size: 0.9em;
    color: #888;
    position: relative;
    bottom: 1ex;
}

a[rel="nofollow"]:after {content: "\002209";}
-- IanAndolina

Spam Block for Saving pages
As I was getting a lot of repeat spam from the same domains over and over, I implemented a "link blacklist" in my wiki for comments and edits:

add to edit.php & addcomment.php:
<?php
// keyword spam block: reject the save if the body matches the blacklist regexp
preg_match_all($this->GetConfigValue("spamre"), $body, $out);
if (count($out[0]) >= 1)
{
    $this->SetMessage("Go spam somewhere else. Your links will never get spidered here anyway.");
    $this->redirect($this->href());
    return;
}
?>


config.php
<?php
"spamre" => "/(voip99|zhiliaotuofa|mycv|princeofprussia|imobissimo|valeofglamorganconservatives|68l|8cx|online-deals99)\.(net|cn|com|org)|(phentermine)/m",
?>


Now, what I wanted to do was have an admin-only wiki page where the contents of the spamre regexp could be edited, instead of it being hardwired in config.php - but I never got round to it. That would be the better way to do it: have a function that finds a wiki page and builds a regexp from the keywords admins have added to that page (not all admins may have access to config.php). It is a fairly basic method, but with a couple of vigilant admins it can reduce repeat attacks from spam bots considerably. -- IanAndolina
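A minimal sketch of how such a function might look, written as a method added to wakka.php. It assumes the Wakka-style LoadPage() method and a hypothetical admin-only page named SpamKeywords with one keyword or domain per line - both the page name and the format are illustrative, not existing Wikka conventions:

<?php
// Sketch only (not existing Wikka code): build the spam regexp from the body
// of an admin-maintained wiki page instead of config.php. "SpamKeywords" and
// the one-entry-per-line format are assumptions made for illustration.
function BuildSpamRegex($pagename = "SpamKeywords")
{
    $page = $this->LoadPage($pagename);         // Wakka-style page loader
    if (!$page) return "";                      // no blacklist page, no filtering
    $keywords = array();
    foreach (explode("\n", $page["body"]) as $line)
    {
        $line = trim($line);
        if ($line == "" || $line[0] == "#") continue;   // skip blanks and comments
        $keywords[] = preg_quote($line, "/");           // escape regexp metacharacters
    }
    if (count($keywords) == 0) return "";
    return "/(".implode("|", $keywords).")/im";
}
?>

edit.php and addcomment.php could then call $this->BuildSpamRegex() in place of $this->GetConfigValue("spamre"), so vigilant admins can extend the blacklist without touching config.php.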

User Validation

I like the ascii-based user validation scheme (Captcha) here:

http://www.moztips.com/wiki/index.pcgi?action=edit&page=SandBox

I don't know how to do that in PHP (it is a PHP-based wiki, I believe) - though the more complex image-based solutions are available. This for me is far preferable to locking pages for writing using ACLs - which IMO destroys the very purpose of the wiki. --IanAndolina

[Copied from SuggestionBox] "There's also code around that uses GD & that could be built onto Nils' code that generates a "registration password" automatically and outputs it as a distorted graphic image.....the code is intended to befuddle auto spam registers & thus stop open-registration sites from being hit by spam bots that register themselves as users. Ultimately, as the bots become more sophisticated I think we'll have to use something like that or else sites like this one (with open registration) will be victimized. Here and here are examples of what I mean (I like the simplicity of the first version in the second example). -- GmBowen" .... I think at least we need a system like one of these (or like the one Ian suggests) on the user registration page. mb
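For what it's worth, a rough sketch of the GD approach (all names and dimensions here are illustrative; this is not code from either of the linked examples): generate a random code, store it in the session, and render it as a PNG with some noise that the registration form asks the user to retype.

<?php
// Sketch of a GD-based registration captcha, served as its own script and
// embedded in the registration form with an <img> tag. Assumes GD is available
// and that the registration handler compares input against $_SESSION['captcha_code'].
session_start();

$code = substr(md5(uniqid(rand(), true)), 0, 6);   // random 6-character code
$_SESSION['captcha_code'] = $code;

$img  = imagecreatetruecolor(120, 40);
$bg   = imagecolorallocate($img, 255, 255, 255);
$fg   = imagecolorallocate($img, 60, 60, 60);
$grey = imagecolorallocate($img, 180, 180, 180);
imagefilledrectangle($img, 0, 0, 120, 40, $bg);

// a little visual noise to make automated OCR harder
for ($i = 0; $i < 5; $i++)
{
    imageline($img, rand(0, 120), rand(0, 40), rand(0, 120), rand(0, 40), $grey);
}

// draw each character at a slightly random vertical offset
for ($i = 0; $i < strlen($code); $i++)
{
    imagestring($img, 5, 10 + $i * 18, rand(5, 20), $code[$i], $fg);
}

header("Content-type: image/png");
imagepng($img);
imagedestroy($img);
?>

The registration handler would then refuse to create the account unless the submitted text matches $_SESSION['captcha_code'].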

Spam repair and defense
See also DeleteSpamAction !
1/22/05 - Spam help! I have apparently been attacked by an army of spam bots. Has this happened to anyone else? For now, I am asking for your help with:

Whatever script they used (on multiple machines, no less) could certainly be used against any Wakka-like site with minimal modifications, so something has to be done... I will do what I can to help you guys combat future attacks, as well as implement the new HTML attribute you've probably all heard about. --RichardBerg

-- lock down existing ACLs: "*" (anyone) becomes "+" (registered users only)
UPDATE acls SET comment_acl="+" WHERE comment_acl="*";
UPDATE acls SET write_acl="+" WHERE write_acl="*";



Stopping Spammers getting Google Juice
There is a technique to stop spammers gaining any advantage from spamming, which is to redirect external links so they pass no PageRank to the linked site. Great for defeating the whole purpose of spamming, but it has the disadvantage that good sites lose their Google juice too. Check the comments on that page for more cons. Since I enabled this on the Opera 7 wiki I've noticed spam volume slowly dropping off, but I'm not entirely happy with the price paid. Had you thought about this, maybe as an option in the config? -- IanAndolina
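One way that could look (a sketch only; "external_link_redirect" is a made-up key, not an existing Wikka option): add a switch to config.php and test it in the Link routine before rewriting the URL.

<?php
// Hypothetical config.php entry (not a real Wikka option):
//     "external_link_redirect" => "1",
// In the Link routine, only rewrite external URLs when the switch is on:
if ($this->GetConfigValue("external_link_redirect"))
{
    $url = "http://www.google.com/url?q=".rawurlencode($tag);  // redirected: no PageRank passed
}
else
{
    $url = $tag;    // plain external link, keeps its Google juice
}
?>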


Referrer spam

Spammers sometimes visit wikis and blogs with a tool that sends "bogus" Referer headers containing the sites they want to generate incoming links for. This works on many wikis and blogs because such sites often have a page listing referrers (wikis) or list the referrers to a particular post (blogs). If a search engine indexes such a page, it finds a link to the spammed site, resulting in a higher "score" for that spammed page.

The general solution is to cause such links not to be followed by search engines. The technique outlined below under "Don't let old pages get indexed" already takes care of this for the referrer listings Wikka uses.
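As an extra belt-and-braces measure, a referrer listing could also add rel="nofollow" to each link it prints. A rough sketch (the $referrers array and the markup are illustrative, not the actual Wikka referrers handler):

<?php
// Sketch: print referrer links so search engines neither follow nor credit them.
// $referrers is assumed to be a plain array of URL strings.
foreach ($referrers as $referrer)
{
    $clean = htmlspecialchars($referrer);
    echo "<li><a rel=\"nofollow\" href=\"".$clean."\">".$clean."</a></li>\n";
}
?>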

Email-gathering spambots

Spambots spider websites looking for email addresses to add to their lists (to use themselves, or to sell as targeted lists). A general defense that works well (though not 100%) is to "obfuscate" email addresses so that such spambots don't recognize them.

Obfuscating addresses automatically
Wikka 1.1.6.0 comes with a small action to create an obfuscated email "contact" link for the site administrator. Meanwhile, the formatter will simply turn every email address it recognizes into an email link (with the address also used for the link text) - providing nice fodder for spambots.

What we should have is a function that can turn a given email address into an obfuscated link - this could then be used by both the {{contact}} action and the formatter. It would (then) also enable us to change the obfuscation algorithm inside the function without affecting either the formatter or the contact action, and others could use it in their own extensions as well. --JavaWoman
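A possible shape for such a function (a sketch, not actual Wikka code; numeric HTML entities are just one of several obfuscation schemes that could live inside it):

<?php
// Sketch of a shared helper: encode every character of the mailto: URL (and,
// if no link text is given, of the visible address) as a numeric HTML entity.
// Browsers render it normally, but naive harvesters won't see a plain address.
function ObfuscatedEmailLink($email, $text = "")
{
    $href = "";
    $mailto = "mailto:".$email;
    for ($i = 0; $i < strlen($mailto); $i++)
    {
        $href .= "&#".ord($mailto[$i]).";";
    }
    if ($text == "")
    {
        for ($i = 0; $i < strlen($email); $i++)
        {
            $text .= "&#".ord($email[$i]).";";
        }
    }
    return "<a href=\"".$href."\">".$text."</a>";
}
?>

Both the {{contact}} action and the formatter could then call this one helper, so the obfuscation scheme can be changed in a single place.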


Resolved Suggestions

Spam-defense measures that are already implemented in Wikka.

Don't let old pages get indexed
Extended method implemented as of Wikka 1.1.6.0 (Both the "noarchive" addition and applying it to the Sandbox)

To make absolutely sure old pages don't get archived (irrespective of your robots.txt) - essential to stop WikiSpam from still getting juice from archived pages - why not add meta directives to those pages with something like:
<?php if ($this->GetMethod() != 'show' || $this->page["latest"] == "N")
    echo "<meta name=\"robots\" content=\"noindex, nofollow, noarchive\" />\n<meta name=\"googlebot\" content=\"noarchive, noindex, nofollow\" />\n"; ?>

to header.php. This stops any page viewed with a handler other than show, and any non-current revision, from being indexed, followed, or cached.



Further references

Where to read more about Wiki spam.






CategoryWikka
CategoryDevelopment