Fighting spam in Wikka


As it may have dawned on you by now, spam is getting to be a problem in wiki's - both the type of spam that also plagues many blogs in the form of comment spam (only in a wiki it would (also) affect page content), and referrer spam. And then there are spambots intent on gathering email addresses.

Wikka sites are no exception any more (and other WakkaWiki forks seem to be having problems, too).

This page is intended to gather ideas for how to fight spam (of all types) in Wikka, so we can coordinate our efforts and get a spammer-hardened Wikka out there. You can also find some general information about (fighting) wiki spam and what Wikka has already implemented as defense measures.


Spam in Wikka pages

About how to discourage spammers to post links on spam pages in the first place, and what to do when your pages have been spammed already.

Blocking Agents

Bad Behavior is a set of PHP scripts which prevents spambots from accessing your site by analyzing their actual HTTP requests and comparing them to profiles from known spambots. (quote from the homepage)

Two Suggestions


Content Filter
Wacko wiki has implemented a content filter based on a word/phrase list. I'm not sure how sophisticated it is (it's not a Bayesian filter), but uses a list updated from chongqed.org. Read more about it here. I thought this might contribute to our conversations about spamfighting. --GmBowen
Preliminary list of links to (apparent) content blocking systems in wikis (more as I find them): --JavaWoman

Bayesian Filter: Focus on the content
Many of these suggestions will stop a certain degree of spam, but spammers can easily break these anti-spam measures such as adding random tokens (modern spam bots can already scan a page for form elements and submit all of them). Therefore, I suggest analyzing the content based on what might constitute spam (text frequency, link frequency, blacklist, bayesian filter) and then assigning a score to the post. If the post has over, let's say, a 50% chance for spam, then perhaps email validation, post approval, or a captcha can be used to further validate the user.

I'm particularly supportive of the bayesian filter. For instance, many spam fighting programs today use the bayesian filter (ie. Thunderbird). The bayesian algorithm is adaptive and learning which will work best when used in conjunction with other standard filters. The process might be like this:
  1. The standard filters (ie. blacklist) catches a suspicious post. The post is marked for approval.
  2. The admins will review the post at the post moderation panel. If the post is "ham" then the bayesian filters will automatically adapt to allow future posts that resemble the approved post through. However, if the post is "spam", then the bayesian filter will automatically adapt to block future posts with those certain keywords.

Therefore, a bayesian filter cannot be solely implemented, but rather, it requires admin intervention (to help the filter learn) and other standard filters.

Bayesian filters have been extremely successful in eliminating over 98% of common spam after a few weeks of adaptation.
--MikeXstudios
Adding Random Tokens for Form Submissions?
Ticket:154
Based on this post, I wonder whether providing randomised session tokens for form submission may provide just one more step to impede spambots. Very simple to implement:

wikka.php:
function FormOpen($method = "", $tag = "", $formMethod = "post")
{
    if(!isset($_SESSION['token'])) {
        $token = md5(uniqid(rand(), true));
        $_SESSION['token'] = $token;
    }
    $result = "<form action=\"".$this->Href($method, $tag)."\" method=\"".$formMethod."\"><p>\n";
    $result .= "<input type=\"hidden\" name=\"token\" value=\"".$_SESSION['token']."\" />";
    if (!$this->config["rewrite_mode"]) $result .= "<input type=\"hidden\" name=\"wakka\" value=\"".$this->MiniHref($method, $tag)."\" />\n";
    return $result;
}


and then just wrap edit.php and addcomment.php sections using:
if ($_POST['token'] == $_SESSION['token']) { //form spoof protection

}


I'm definitely no expert on security, and I can see how it can be bypassed, but it does require one more step and adds complexity for spambots to spoof the wiki forms at no cost... --IanAndolina
One issue with the google redirection and newer rel="nofollow" is that good sites also get hit by this procedure. As we can't really tag links on a "trusted user" basis, we have to do that on a trusted server one. I use a whitelist in config.php with a list of "good servers":

"serverre" => "/(nontroppo.org|goodsite.com|etc)/",


And my Link routine in the main wakka.php (wikka.php) is modified to make use of it:

if (preg_match($this->GetConfigValue("serverre"), $tag))
{
    $url = $tag; //trusted web sites so no need for redirects
    $urlclass= "ext";
}
else
{
    $tag = rawurlencode($tag);
    $url = "http://www.google.com/url?q=".$tag;
    $urlclass= "ext";
    $follow = " rel=\"nofollow\" ";
}
return $url ? "<a ".$follow." class=\"".$urlclass."\" href=\"".$url."\">$text</a>" : $text;


This way, trusted sites get full and unadulterated links, but anything else has BOTH google redirection and rel="nofollow" added. The CSS can then contain ways to visually tag those different URLs, so the user can see if a link is trusted or not (I use advanced generated content - not supported in IE):

a.ext:after, a[rel="nofollow"]:after {content:"\00220A";
    text-decoration: none !important;
    font-size: 0.9em;
    color: #888;
    position: relative;
    bottom: 1ex;}

a[rel="nofollow"]:after {content:"\002209";}
-- IanAndolina

Spam Block for Saving pages
As I was getting a lot of repeat spam of the same domains over and over, I implemented a "link blacklist" to my Wiki for comments and edits:

add to edit.php & addcomment.php:
preg_match_all($this->GetConfigValue("spamre"),$body,$out); //keyword spam block
if (count($out[0])>=1)
{
    $this->SetMessage("Go spam somewhere else.  You links will never get spidered here anyway.");
    $this->redirect($this->href());
    return;
}


config.php
"spamre" => "/(voip99|zhiliaotuofa|mycv|princeofprussia|imobissimo|valeofglamorganconservatives|68l|8cx|online-deals99).(net|cn|com|org)|(phentermine)/m",


Now, what I wanted to do was have an admin only wiki page, where the contents of the spamre regexp could be edited, instead of being hardwired in config.php - but never got round to it. But this would be the better way to do it - have a function that finds a wiki page and builds a regexp from the keywords added by admins to that wiki page (not all of whom may have access to config.php). It is a fairly basic method - but with a couple of vigilant admins can reduce repeat attacks from spam bots considerably. -- IanAndolina

User Validation
I like the ascii-based user validation scheme (Captcha) here:

http://www.moztips.com/wiki/index.pcgi?action=edit&page=SandBox

I don't know how to do that in PHP (it is a PHP based wiki I believe) - though the more complex image based solutions are available. This for me is far prefereable to locking pages for writing using ACLs - which IMO destroys the very purpose of the wiki. --IanAndolina

[Copied from SuggestionBox] "There's also code around that uses GD & that could be built onto Nils' code that generates a "registration password" automatically and outputs it as a distorted graphic image.....the code is intended to befuddle auto spam registers & thus stop open-registration sites from being hit by spam bots that register themselves as users. Ultimately, as the bots become more sophisticated I think we'll have to use something like that or else sites like this one (with open registration) will be victimized. Here and here are examples of what I mean (I like the simplicity of the first version in the second example). -- GmBowen" .... I think at least we need a system like one of these (or like the one Ian suggests) on the user registration page. mb
Spam repair and defense
See also DeleteSpamAction !
1/22/05 - Spam help! I have apparently been attacked by an army of spam bots. Has this happened to anyone else? For now, I am asking for your help with:
Whatever script they used (on multiple machines, no less) could certainly be used against any Wakka-like site with minimal modifications, so something has to be done...I will do what I can to help you guys combat future attacks as well as implement the new HTML attribute you've probably all heard about. --RichardBerg

Banning users
just so it doesn't get "lost", I'm copying a few comments from another page here. --JW
So, what to do? Banning by IP is indeed fraught with the risk of banning innocent users since with a lot of large ISPs the IP addresses are assigned round-robin and may be different even between subsequent requests (i.e. the request for embedded images may each come from a different IP address, and from a different address than that for the page itself).

Possibly creating and storing a kind of "signature" consisting of not IP address but other request header elements, including user agent string but also accept headers, maybe combining that with a whole IP block rather than a single address might give us some sort of handle. But you'd need to actually store that and watch it for a while before you can tell how reliable (or not) that might be. --JavaWoman


Stopping Spammers getting Google Juice
There is a technique to stop spammers from gaining any advantage of spamming, which is to redirect external links to stop them from affecting their PageRank. Great to stop the whole purpose of spamming, but this has the disadvantage that good sites lose their google juice too. Check the comments out on that page for more cons. I've noticed since I enabled this on the Opera 7 wiki that slowly spam volume has dropped out, but I'm not entirely happy at the price paid. Had you thought about this, maybe have it as an option during config? -- IanAndolina


Referrer spam

Spammers sometimes visit Wikis and blogs with a tool with "bogus" referer headers containing the sites they want to generate incoming links for - this works on many wikis and blogs since such sites often have a page listing referrers (wikis) or list referrers to a particular post (blogs). If a Search engine indexes such a page, it would find a link to the spammed site, resulting in a higher "score" for that spammed page.
The general solution is to cause such links not to be followed by search engines. The technique outlined below under "Don't let old pages get indexed" already takes care of this for the referrer listings Wikka uses.

I am trying this technique to detect which referrer is a spam and which isn't :
1) Add a field named remote_addr to referrers table.
2) Modify LogReferrer() method and add set remote_addr = $_SERVER['REMOTE_ADDR'];
3) Change LoadReferrers() to returns only records where the field remote_addr is blank
4) Add this code to header action :
<link rel="stylesheet" href="<?php echo $this->Href("wikka.css", "pseudodir"); ?>" type="text/css" />

5) Create a file named wikka.css.php and put it at ./handler/page. This file will update the table referrers and set remote_addr to blank if remote_addr is the same as $_SERVER['REMOTE_ADDR'] and time is greater than "now() plus six minutes" ... The scripts will return a css file, with header no-cache, must-revalidate, expired, random etag, ... so that it will be requested each time a page is requested. --DotMG

Explanation : The difference between a spambot and a real user is that a spambot just loads a page, and it doesn' t analyse its content, so, with a spambot, all css files linked within the document won't be loaded.

Email-gathering spambots

Spambots spider websites looking for email addresses to add to the list (to use, or to sell as a targeted list). A general defense that works well (though not 100%) is to "obfuscate" email addresses so such spambots don't recognize them.
I would like to use an offensive attack against spams. My wikka would generate some false email address and present them as
<a href="mailto:unexistingemail@thesnake.us" class="email">unexistingemail@thesnake.us</a>
, and somewhere in css files, you will find
.email {display: none;}
so that the false email won't be visible by human visitors of the site. The domain of the false email address will be either a domain name of a spam/porn site (tax them bandwidth), or a non existing domain. --DotMG
Obfuscating addresses automatically
Wikka 1.1.6.0 comes with a small action to create an obfuscated email "contact" link for the site administrator. Meanwhile, the formatter will simply turn every email address it recognizes into an email link (with the address also used for the link text) - providing nice fodder for spambots.

What we should have is a function that can turn a given email address into an obfuscated link - this could then be used by both the {{contact}} action and the formatter. It would (then) also enable use to change the obfuscating algorithm inside the fuction without affecting either the formatter or the contact action any more, and others can use this in their own extensions as well. --JavaWoman


Resolved Suggestions

Spam-defense measures that are already implemented in Wikka.
Don't let old pages get indexed
Extended method implemented as of Wikka 1.1.6.0 (Both the "noarchive" addition and applying it to the Sandbox)

To make absolutely sure old pages don't get archived (irrespective of your robots.txt) - essential to stopping WikiSpam from still getting juice from archived pages, why not make sure to add meta directives to those pages by adding something like:
<?php if ($this->GetMethod() != 'show' || $this->page["latest"] == "N") echo "<meta name=\"robots\" content=\"noindex, nofollow, noarchive\" />\n<meta name=\"googlebot\" content=\"noarchive, noindex, nofollow\">\n";?>

to header.php. This stops pages with handlers other than show or non current pages from any kind of archiving/cacheing.



Further references

Where to read more about Wiki spam.


CategoryWikka CategoryDevelopmentArchitecture
There are 11 comments on this page. [Show comments]
Valid XHTML :: Valid CSS: :: Powered by WikkaWiki