WikkaSpamFighting:Wikka

Revision [5134]

This is an old revision of WikkaSpamFighting made by GmBowen on 2005-01-24 19:29:03.

Fighting spam in Wikka

see also:

HideReferrers

RemovingUsers

WikkaAndEmail

As it may have dawned on you by now, spam is getting to be a problem in wiki's - both the type of spam that also plagues many blogs in the form of comment spam (only in a wiki it woudl (also) affect page content), and referrer spam. And then there are spambots intent on gathering email addresses.

Wikka sites are no exception any more (and other WakkaWiki forks seem to be having problems, too).

This page is intended to gather ideas for how to fight spam (of all types) in Wikka, so we can coordinate our efforts and get a spammer-hardened Wikka out there. You can also find some general information about (fighting) wiki spam and what Wikka has already implemented as defense measures.

Spam in Wikka pages

About how to discourage spammers to post links on spam pages in the first place, and what to do when your pages have been spammed already.

Refining Redirection / nofollow modification for links

One issue with the google redirection and newer rel="nofollow" is that good sites also get hit by this procedure. As we can't really tag links on a "trusted user" basis, we have to do that on a trusted server one. I use a whitelist in config.php with a list of "good servers":

<?php
"serverre" => "/(nontroppo.org|goodsite.com|etc)/",
?>

And my Link routine in the main wakka.php (wikka.php) is modified to make use of it:

<?php
if (preg_match($this->GetConfigValue("serverre"), $tag))
{
$url = $tag; //trusted web sites so no need for redirects
$urlclass= "ext";
}
else
{
$tag = rawurlencode($tag);
$url = "http://www.google.com/url?q=".$tag;
$urlclass= "ext";
$follow = " rel=\"nofollow\" ";
}
return $url ? "<a ".$follow." class=\"".$urlclass."\" href=\"".$url."\">$text</a>" : $text;
?>

This way, trusted sites get full and unadulterated links, but anything else has BOTH google redirection and rel="nofollow" added. The CSS can then contain ways to visually tag those different URLs, so the user can see if a link is trusted or not (I use advanced generated content - not supported in IE):

a.ext:after, a[rel="nofollow"]:after {content:"\00220A";
text-decoration: none !important;
font-size: 0.9em;
color: #888;
position: relative;
bottom: 1ex;}

a[rel="nofollow"]:after {content:"\002209";}

-- IanAndolina

Spam Block for Saving pages

As I was getting a lot of repeat spam of the same domains over and over, I implemented a "link blacklist" to my Wiki for comments and edits:

add to edit.php & addcomment.php:

<?php preg_match_all($this->GetConfigValue("spamre"),$body,$out); //keyword spam block
if (count($out[0])>=1)
{
$this->SetMessage("Go spam somewhere else. You links will never get spidered here anyway.");
$this->redirect($this->href());
return;
}?>

config.php

Now, what I wanted to do was have an admin only wiki page, where the contents of the spamre regexp could be edited, instead of being hardwired in config.php - but never got round to it. But this would be the better way to do it - have a function that finds a wiki page and builds a regexp from the keywords added by admins to that wiki page (not all of whom may have access to config.php). It is a fairly basic method - but with a couple of vigilant admins can reduce repeat attacks from spam bots considerably. -- IanAndolina

User Validation

I like the ascii-based user validation scheme (Captcha) here:

http://www.moztips.com/wiki/index.pcgi?action=edit&page=SandBox

I don't know how to do that in PHP (it is a PHP based wiki I believe) - though the more complex image based solutions are available. This for me is far prefereable to locking pages for writing using ACLs - which IMO destroys the very purpose of the wiki. --IanAndolina

[Copied from SuggestionBox] "There's also code around that uses GD & that could be built onto Nils' code that generates a "registration password" automatically and outputs it as a distorted graphic image.....the code is intended to befuddle auto spam registers & thus stop open-registration sites from being hit by spam bots that register themselves as users. Ultimately, as the bots become more sophisticated I think we'll have to use something like that or else sites like this one (with open registration) will be victimized. Here and here are examples of what I mean (I like the simplicity of the first version in the second example). -- GmBowen" .... I think at least we need a system like one of these (or like the one Ian suggests) on the user registration page. mb

Spam repair and defense

See also DeleteSpamAction !
1/22/05 - Spam help! I have apparently been attacked by an army of spam bots. Has this happened to anyone else? For now, I am asking for your help with:

a SQL command that will delete all of these edits

a SQL command that will change all of my ACLs to '+' for writing and commenting (I've modified the config file but that only affects new pages AFAIK)

Whatever script they used (on multiple machines, no less) could certainly be used against any Wakka-like site with minimal modifications, so something has to be done...I will do what I can to help you guys combat future attacks as well as implement the new HTML attribute you've probably all heard about. --RichardBerg

Richard: here's the sql to update all your current ACLs (I'm using mysql 4.0.22):

UPDATE acls SET comment_acl="+" WHERE comment_acl="*";
UPDATE acls SET write_acl="+" WHERE write_acl="*";

You'll need to change the table name (acls) to match whatever your table is named. Give me a few to look at the pages table and your site and I should have the sql for removing the edits. :) -- MovieLady

Since Richard has already changed his default ACLs in the configuration, that would apply to any page that did not have ACLs different from the original default (not merely new pages!); your SQL code should take care of any pages that had ACLs different from the original default (because only those would have a record in the ACLs table).
See also JsnX's suggestion about "Clearing ACLs" on the SuggestionBox which explains how this mechanism works. Thanks, MovieLady! --JavaWoman

Correct. Both statements will change only the entries that had the default ACL from his config file in that field. (What the statements are looking for can, of course be changed, as can what the field is being set to. I used it when I went back and changed my default ACLs on all pages that had ACLs to disallow all banned users from writing or commenting after adding ACLsWithUserGroups.) --MovieLady

There is a relevant link to an action at wikini for removing additions by particular IP's or users at CommunityNotes.--GmBowen

Thanks for the link! I've translated and made minor changes to the code, and posted everything to DeleteSpamAction. He's got a very good starting point, I think. One could adapt the code (fairly easily) to allow you to look at all the revisions on a page instead of by user/IP and then delete the histories you don't want to keep, for whatever reason. --MovieLady

Stopping Spammers getting Google Juice

There is a technique to stop spammers from gaining any advantage of spamming, which is to redirect external links to stop them from affecting their PageRank. Great to stop the whole purpose of spamming, but this has the disadvantage that good sites lose their google juice too. Check the comments out on that page for more cons. I've noticed since I enabled this on the Opera 7 wiki that slowly spam volume has dropped out, but I'm not entirely happy at the price paid. Had you thought about this, maybe have it as an option during config? -- IanAndolina

Good point, Ian. I had thought about this, after having seen several Wikis and blogs that use the Google redirection... I do think it should be configurable though - not every Wiki installation may want to do this (in fact, some may welcome external links as long as spam is repaired fast enough). --JavaWoman

I asked an export for SEO and he replied that it should be enough to use a simple internal redirect (e.g. exit.php?url=...) to create this effect. He also said that it might be helpful to disallow any spider access to that file (robots.txt). -- ReimerStegelmann

Unfortunately, search engine robots these days mostly do follow URLs with parameters, and an "internal redirect" done that way would be retrieved by a bot; HTTP redirects are followed, too (which is what you'd have to use with that "internal redirect" method). Meta redirects mostly aren't but you cannot apply this as a general "redirect external links" (especially not since you cannot have any URL parameters in a meta redirect - and you want to allow all valid external links, merely have them not count towards page rank in search engines, mostly Google). Excluding a single file with robots.txt won't work since all of Wikka runs off the single wikka.php file. The Google redirect method gets around all of that (at least for Google's ranking mechanism - which is what spammers are mostly targeting). --JavaWoman

They follow, but that is not the point of spam. The main target of a spammer is the reach a high ranking in search engines. They post links which linktext contains important keywords (e.g. Keyword1 keyword2 http://domain.tld). So, if you enter keyword1 oder keyword2 to a search engine, you will see the homepage of the spammer. By using a simple redirect, spiders will follow the link, but they give a fuck about the keywords and so the spammer gives a fuck about the link.

Exactly - and using the Google redirect prevents the target page from getting a higher ranking from incoming (spam) links because it won't be counted at all. :) --JavaWoman

Yeah, but you don't need Google to make this happen. A simple internal redirect is enough and looks better than a Google-Redirect ;)

Nope, because an internal redirect will be followed by Google and still count for page rank - that's the problem; the Google redirect prevents this. --JavaWoman

I talked to Abakus, a German SEO expert and he said it does not count. There is no difference between an internal redirect oder a Google redirect. Keywords of the link (s. above) only count for the redirect site and not for link behind the redirect. And well, why should a spider follow an internal link (via exit.php?url=...), but not a Google redirect?

A spider will follow any redirect, whether it's local or through an internal redirect. Never mind the keywords, it's still a link into the spammed site; with a local redirect that won't make any difference, but with the Google redirect Google knows to not count it as an incoming link. It's not (just) about keywords but about Page Rank (PR) - and PR is highly dependent on incoming links (and where they come from). That much we know. But no one except some Google employees knows the exact algorithm that determines PR - not even Abakus ;-) --JavaWoman

Maybe the solution is here.

If a user is not registered, to all external links he creates on the wiki will be added the attribute rel="nofollow".

This technique is now adopted by Google, Yahoo and MSN. --DotMG

Thanks, DotMG! This is great news - I had seen this technique being discussed as a proposed possible solution but had missed the news the proposal has actually been adopted now. (Should we worry about Altavista? Probably not too much - these SEs are the ones spammers will target primarily.) One possible hole I can see is that a spammer might write a script to quickly register and then post on a number of pages - but scripted registrations can be defended against with other means. Nothing will probably provide a 100% solution but this is a big step in the right direction. --JavaWoman

Referrer spam

Spammers sometimes visit Wikis and blogs with a tool with "bogus" referer headers containing the sites they want to generate incoming links for - this works on many wikis and blogs since such sites often have a page listing referrers (wikis) or list referrers to a particular post (blogs). If a Search engine indexes such a page, it would find a link to the spammed site, resulting in a higher "score" for that spammed page.

The general solution is to cause such links not to be followed by search engines. The technique outlined below under "Don't let old pages get indexed" already takes care of this for the referrer listings Wikka uses.

Email-gathering spambots

Spambots spider websites looking for email addresses to add to the list (to use, or to sell as a targeted list). A general defense that works well (though not 100%) is to "obfuscate" email addresses so such spambots don't recognize them.

Obfuscating addresses automatically

Wikka 1.1.6.0 comes with a small action to create an obfuscated email "contact" link for the site administrator. Meanwhile, the formatter will simply turn every email address it recognizes into an email link (with the address also used for the link text) - providing nice fodder for spambots.

What we should have is a function that can turn a given email address into an obfuscated link - this could then be used by both the {{contact}} action and the formatter. It would (then) also enable use to change the obfuscating algorithm inside the fuction without affecting either the formatter or the contact action any more, and others can use this in their own extensions as well. --JavaWoman

Resolved Suggestions

Spam-defense measures that are already implemented in Wikka.

Don't let old pages get indexed

Extended method implemented as of Wikka 1.1.6.0 (Both the "noarchive" addition and applying it to the Sandbox)

To make absolutely sure old pages don't get archived (irrespective of your robots.txt) - essential to stopping WikiSpam from still getting juice from archived pages, why not make sure to add meta directives to those pages by adding something like:

<?php if ($this->GetMethod() != 'show' || $this->page["latest"] == "N") echo "<meta name=\"robots\" content=\"noindex, nofollow, noarchive\" />\n<meta name=\"googlebot\" content=\"noarchive, noindex, nofollow\">\n";?>

to header.php. This stops pages with handlers other than show or non current pages from any kind of archiving/cacheing.

Ian, thanks for the suggestion. Wikka has had something similar to this in place since the first release. See Mod033bRobotIndexing. But your suggestion expands the idea and adds the latest page check, "noarchive", and the googlebot part--which seem like good ideas. I'll add this to the upcoming release. By the way, when are you going to switch your site over to Wikka? ;) -- JsnX

Yes, nice idea. But the googlebot part is actually redundant, Google obeys the robots meta directives. (And that second meta tag isn't valid XHTML - it's unclosed.) I suggest we merely add the "noarchive". Apart from that, it would also be nice to stop indexing etc. from the SandBox page. --JavaWoman

The latest page check is important because wiki spammers don't really care if you delete their spam, as long as their links sit on an old archived page waiting to be indexed. The added googlebot directive (thanks for spotting typo btw) is just extra paranoia on my part :). And you are all doing an excellent job with Wikka - the only reason I haven't switched is that quite a lot on my Wakka is heavily customised and I don't have the time to redo that - especially as lots of pages would break without re-jigging of e.g. SafeHTML (my BookMarklets page for example). If I have time, I will eventually migrate...! -- IanAndolina

Further references

Where to read more about Wiki spam.

Meatball — WikiSpam

C2.com — WikiSpam

Goggle Ending Comment Spam [sic]

chongqed.org

chongqed.org blacklist - use this dynamically

Submit a wiki spammer - All your page ranks are belong to us!

Wacko Wiki - SPAM

ReferrerSpam - A Wakka page about preventing referrer spam ... which page is now spammed (ouch!)

CategoryWikka
CategoryDevelopment

Wikka : WikkaSpamFighting

Revision [5134]