WikkaSpamFighting:Wikka

Revision [11150]

This is an old revision of WikkaSpamFighting made by DarTar on 2005-09-22 07:42:13.

Fighting spam in Wikka

see also:

HideReferrers

RemovingUsers

WikkaAndEmail

DeleteSpamAction

AdvancedReferrersHandler

SecurityModules

As it may have dawned on you by now, spam is getting to be a problem in wikis - both the type of spam that also plagues many blogs in the form of comment spam (only in a wiki it would (also) affect page content), and referrer spam. And then there are spambots intent on gathering email addresses.

Wikka sites are no exception any more (and other WakkaWiki forks seem to be having problems, too).

This page is intended to gather ideas for how to fight spam (of all types) in Wikka, so we can coordinate our efforts and get a spammer-hardened Wikka out there. You can also find some general information about (fighting) wiki spam and what Wikka has already implemented as defense measures.

Spam in Wikka pages

How to discourage spammers from posting links on your pages in the first place, and what to do when your pages have already been spammed.

Blocking Agents

Bad Behavior is a set of PHP scripts which prevents spambots from accessing your site by analyzing their actual HTTP requests and comparing them to profiles from known spambots. (quote from the homepage)

copied to BadBehavior. --NilsLindenberg

Two Suggestions

See also http://wiki.chongqed.org/Wikka.

Content Filter

Wacko wiki has implemented a content filter based on a word/phrase list. I'm not sure how sophisticated it is (it's not a Bayesian filter), but it uses a list updated from chongqed.org. Read more about it here. I thought this might contribute to our conversations about spamfighting. --GmBowen

Mike, I see using the blacklist from chongqed.org mentioned as an option but I don't see any reference that this is what actually has been implemented - merely that some content filtering has been implemented. The best way to use chongqed.org is to use their blacklist dynamically. --JavaWoman

Preliminary list of links to (apparent) content blocking systems in wikis (more as I find them):

BadContent (MoinMoin)

--JavaWoman

Bayesian Filter: Focus on the content

Many of these suggestions will stop a certain degree of spam, but spammers can easily break these anti-spam measures such as adding random tokens (modern spam bots can already scan a page for form elements and submit all of them). Therefore, I suggest analyzing the content based on what might constitute spam (text frequency, link frequency, blacklist, bayesian filter) and then assigning a score to the post. If the post has over, let's say, a 50% chance for spam, then perhaps email validation, post approval, or a captcha can be used to further validate the user.
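The scoring idea above could be sketched roughly as follows. This is only an illustration: the function names, the weights and the 0.5 threshold are assumptions, and a real filter would need tuning against actual spam.

```php
<?php
// Hypothetical scoring sketch: names, weights and the threshold are assumptions.

/**
 * Return a spam score between 0.0 and 1.0 for a submitted text.
 *
 * @param string   $body      the submitted page or comment text
 * @param string[] $blacklist substrings that are known to appear in spam
 */
function spamScore(string $body, array $blacklist): float
{
    $words = preg_split('/\s+/', trim($body), -1, PREG_SPLIT_NO_EMPTY);
    $wordCount = max(count($words), 1);

    // Link frequency: number of URLs relative to the number of words.
    $linkCount = preg_match_all('#https?://\S+#i', $body);
    $linkRatio = min($linkCount / $wordCount, 1.0);

    // Blacklist: a single hit counts fully.
    $blacklisted = 0.0;
    foreach ($blacklist as $term) {
        if ($term !== '' && stripos($body, $term) !== false) {
            $blacklisted = 1.0;
            break;
        }
    }

    // Weighted sum; the weights are arbitrary illustration values.
    return min(0.5 * $linkRatio + 0.6 * $blacklisted, 1.0);
}

/** Posts above the threshold would go on to a captcha, email check or moderation. */
function needsExtraValidation(string $body, array $blacklist, float $threshold = 0.5): bool
{
    return spamScore($body, $blacklist) > $threshold;
}
```

A post that passes would be saved directly; one that does not would be held for one of the extra validation steps mentioned above.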

I'm particularly supportive of the bayesian filter. For instance, many spam fighting programs today use a bayesian filter (e.g. Thunderbird). The bayesian algorithm is adaptive and learns, so it works best when used in conjunction with other standard filters. The process might be like this:

The standard filters (e.g. a blacklist) catch a suspicious post. The post is marked for approval.

The admins will review the post at the post moderation panel. If the post is "ham" then the bayesian filters will automatically adapt to allow future posts that resemble the approved post through. However, if the post is "spam", then the bayesian filter will automatically adapt to block future posts with those certain keywords.

Therefore, a bayesian filter cannot be solely implemented, but rather, it requires admin intervention (to help the filter learn) and other standard filters.

Bayesian filters have been extremely successful in eliminating over 98% of common spam after a few weeks of adaptation.
--MikeXstudios
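To make the train-and-adapt loop above concrete, here is a very small naive Bayes sketch. It is not the filter mentioned in this discussion; the class name, method names and smoothing choices are assumptions made for illustration.

```php
<?php
// Minimal naive Bayes sketch; TinyBayes and its methods are hypothetical.

final class TinyBayes
{
    /** @var array<string, array<string, int>> token counts per label */
    private array $counts = ['spam' => [], 'ham' => []];
    /** @var array<string, int> number of trained posts per label */
    private array $docs = ['spam' => 0, 'ham' => 0];

    private function tokens(string $text): array
    {
        return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    }

    /** Called when an admin approves ("ham") or rejects ("spam") a post. */
    public function train(string $text, string $label): void
    {
        if (!isset($this->docs[$label])) {
            throw new InvalidArgumentException('label must be "spam" or "ham"');
        }
        $this->docs[$label]++;
        foreach ($this->tokens($text) as $t) {
            $this->counts[$label][$t] = ($this->counts[$label][$t] ?? 0) + 1;
        }
    }

    /** Probability (0..1) that the text is spam, using add-one smoothing. */
    public function spamProbability(string $text): float
    {
        $totalDocs = $this->docs['spam'] + $this->docs['ham'];
        if ($totalDocs === 0) {
            return 0.5; // nothing learned yet
        }
        // Vocabulary size: distinct tokens seen in either class.
        $vocab = count($this->counts['spam'] + $this->counts['ham']);
        $log = [];
        foreach (['spam', 'ham'] as $label) {
            $total = array_sum($this->counts[$label]);
            // Smoothed prior so an untrained class never yields log(0).
            $log[$label] = log(($this->docs[$label] + 1) / ($totalDocs + 2));
            foreach ($this->tokens($text) as $t) {
                $n = $this->counts[$label][$t] ?? 0;
                $log[$label] += log(($n + 1) / ($total + $vocab + 1));
            }
        }
        // Convert the two log scores back to a normalized probability.
        return 1.0 / (1.0 + exp($log['ham'] - $log['spam']));
    }
}
```

In the moderation flow described above, train() would be called each time an admin marks a held post as ham or spam, and spamProbability() would be consulted for new posts.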

Nice idea Mike - but do you know of a Bayesian filter implementation in PHP that could be easily integrated with Wikka? Preferably "lightweight" too, as we don't want an anti-spam solution to be more weighty than Wikka itself. --JavaWoman

In fact, I do :). I was playing around with this small Bayesian filter I found last week. It's around 55KB unzipped. Implementing it in Wikka would be less than easy however. --MikeXstudios

Adding Random Tokens for Form Submissions?

Based on this post, I wonder whether providing randomised session tokens for form submission may provide just one more step to impede spambots. Very simple to implement:

wikka.php:

function FormOpen($method = "", $tag = "", $formMethod = "post")
{
    // Create one random token per session the first time a form is opened
    if(!isset($_SESSION['token'])) {
        $token = md5(uniqid(rand(), true));
        $_SESSION['token'] = $token;
    }
    $result = "<form action=\"".$this->Href($method, $tag)."\" method=\"".$formMethod."\"><p>\n";
    // Send the token along as a hidden field so it comes back with the POST
    $result .= "<input type=\"hidden\" name=\"token\" value=\"".$_SESSION['token']."\" />";
    if (!$this->config["rewrite_mode"]) $result .= "<input type=\"hidden\" name=\"wakka\" value=\"".$this->MiniHref($method, $tag)."\" />\n";
    return $result;
}

and then just wrap edit.php and addcomment.php sections using:

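if (isset($_POST['token'], $_SESSION['token']) && $_POST['token'] == $_SESSION['token']) { //form spoof protection; the isset() check also rejects requests where no token was ever issued
    // ... existing edit / addcomment handling goes here ...
}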

I'm definitely no expert on security, and I can see how it can be bypassed, but it does require one more step and adds complexity for spambots to spoof the wiki forms at no cost... --IanAndolina

Good point, Ian. I had been thinking about a similar approach (I have a plugin for SquirrelMail installed that essentially does the same thing for the login dialog) - but reading through the comments on Chris Shiflett's article it's clear that this is no more than another little hurdle easily overcome by the script-writing link spammer (GET the page with the script first, read the token and use that in the scripted POST). That said, it is another hurdle that may deter at least some naive spammers - and with very little code. So, nice. --JavaWoman (who suddenly realizes her SquirrelMail isn't as secure as she thought it was - but at least has built in other layers of security)
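For reference, the token check discussed above can be sketched in isolation. The function names are hypothetical, and the session and POST data are passed in as arrays so the logic can be tried outside a web request; random_bytes() and hash_equals() are swapped in here for the md5/uniqid pair and the plain comparison. As noted, this does not stop a bot that first fetches the form.

```php
<?php
// Sketch only: getFormToken() and isValidFormToken() are hypothetical helpers.

/** Return the form token for this session, creating one if needed. */
function getFormToken(array &$session): string
{
    if (!isset($session['token'])) {
        // random_bytes() is a cryptographically secure source (PHP 7+).
        $session['token'] = bin2hex(random_bytes(16));
    }
    return $session['token'];
}

/** True only if a token was issued and the posted value matches it. */
function isValidFormToken(array $session, array $post): bool
{
    if (!isset($session['token'], $post['token']) || !is_string($post['token'])) {
        return false; // no token issued, or none submitted
    }
    // hash_equals() compares the two strings in constant time.
    return hash_equals($session['token'], $post['token']);
}
```

In a handler one would call getFormToken($_SESSION) when building the form and isValidFormToken($_SESSION, $_POST) before saving.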

Refining Redirection / nofollow modification for links

One issue with the google redirection and newer rel="nofollow" is that good sites also get hit by this procedure. As we can't really tag links on a "trusted user" basis, we have to do that on a trusted server one. I use a whitelist in config.php with a list of "good servers":

"serverre" => "/(nontroppo.org|goodsite.com|etc)/",

And my Link routine in the main wakka.php (wikka.php) is modified to make use of it:

$follow = ""; // default: no rel attribute (avoids an undefined variable for trusted sites)
if (preg_match($this->GetConfigValue("serverre"), $tag))
{
    $url = $tag; //trusted web sites so no need for redirects
    $urlclass = "ext";
}
else
{
    $tag = rawurlencode($tag);
    $url = "http://www.google.com/url?q=".$tag;
    $urlclass = "ext";
    $follow = " rel=\"nofollow\" ";
}
return $url ? "<a ".$follow." class=\"".$urlclass."\" href=\"".$url."\">$text</a>" : $text;

This way, trusted sites get full and unadulterated links, but anything else has BOTH google redirection and rel="nofollow" added. The CSS can then contain ways to visually tag those different URLs, so the user can see if a link is trusted or not (I use advanced generated content - not supported in IE):

a.ext:after, a[rel="nofollow"]:after {content:"\00220A";
text-decoration: none !important;
font-size: 0.9em;
color: #888;
position: relative;
bottom: 1ex;}

a[rel="nofollow"]:after {content:"\002209";}

-- IanAndolina
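One caveat with a pattern like the "serverre" example: it matches anywhere in the URL and the unescaped dots match any character, so a trusted name appearing in a query string would also pass. A stricter check could compare only the host part. The sketch below assumes a plain list of trusted hosts instead of a regexp; the function name and the list format are illustrative, not part of Wikka.

```php
<?php
// Sketch under assumptions: isTrustedUrl() and the host list are illustrative.

/**
 * True if the URL's host is a trusted host or a subdomain of one.
 *
 * @param string[] $trustedHosts e.g. ['nontroppo.org', 'goodsite.com']
 */
function isTrustedUrl(string $url, array $trustedHosts): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return false; // not an absolute URL with a host
    }
    $host = strtolower($host);
    foreach ($trustedHosts as $trusted) {
        $suffix = '.' . strtolower($trusted);
        // Exact host match, or the host ends in ".trusted.tld".
        if ($host === strtolower($trusted) || substr($host, -strlen($suffix)) === $suffix) {
            return true;
        }
    }
    return false;
}
```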

Spam Block for Saving pages

As I was getting a lot of repeat spam of the same domains over and over, I implemented a "link blacklist" to my Wiki for comments and edits:

add to edit.php & addcomment.php:

preg_match_all($this->GetConfigValue("spamre"),$body,$out); //keyword spam block
if (count($out[0])>=1)
{
    $this->SetMessage("Go spam somewhere else. Your links will never get spidered here anyway.");
    $this->redirect($this->href());
    return;
}

config.php

Now, what I wanted to do was have an admin-only wiki page where the contents of the spamre regexp could be edited, instead of being hardwired in config.php - but I never got round to it. This would be the better way to do it: have a function that finds a wiki page and builds a regexp from the keywords added by admins to that wiki page (not all of whom may have access to config.php). It is a fairly basic method - but with a couple of vigilant admins it can reduce repeat attacks from spam bots considerably. -- IanAndolina
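The page-to-regexp function described above might look something like this. It is a sketch: buildSpamRegex() is a hypothetical helper, the one-term-per-line format and the "#" comment convention are assumptions, and loading the page body from the wiki is left out.

```php
<?php
// Sketch only: buildSpamRegex() is hypothetical; the page format is assumed.

/**
 * Build a case-insensitive regexp from a page body that lists one
 * keyword or domain per line. Returns null when the list is empty.
 */
function buildSpamRegex(string $pageBody): ?string
{
    $terms = [];
    foreach (preg_split('/\R/', $pageBody) as $line) {
        $line = trim($line);
        if ($line === '' || $line[0] === '#') {
            continue; // skip blank lines and comment lines
        }
        // preg_quote() keeps admin-entered text from being read as regexp syntax.
        $terms[] = preg_quote($line, '/');
    }
    return $terms ? '/(' . implode('|', $terms) . ')/i' : null;
}
```

The result could then be used in place of the configured spamre value; when it is null, the spam check would simply be skipped.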

User Validation

I like the ascii-based user validation scheme (Captcha) here:

http://www.moztips.com/wiki/index.pcgi?action=edit&page=SandBox

I don't know how to do that in PHP (it is a PHP based wiki I believe) - though the more complex image based solutions are available. This for me is far preferable to locking pages for writing using ACLs - which IMO destroys the very purpose of the wiki. --IanAndolina

Whether simply an image or an image turned into "ASCII art", an image-based CAPTCHA that does not provide an alternative is inaccessible for people with visual impairments. Read this story ("Oops, Google Did it Again") to see why we really should not use anything like that. --JavaWoman

[Copied from SuggestionBox] "There's also code around that uses GD & that could be built onto Nils' code that generates a "registration password" automatically and outputs it as a distorted graphic image.....the code is intended to befuddle auto spam registers & thus stop open-registration sites from being hit by spam bots that register themselves as users. Ultimately, as the bots become more sophisticated I think we'll have to use something like that or else sites like this one (with open registration) will be victimized. Here and here are examples of what I mean (I like the simplicity of the first version in the second example). -- GmBowen" .... I think at least we need a system like one of these (or like the one Ian suggests) on the user registration page. mb

[also copied from SuggestionBox] Yes, "Captcha" is an old trick - it will keep out some bots (but not all) and it will keep many people out, too, like those who are visually handicapped (not just people who are totally blind - being colorblind may be enough to be defeated by such tricks). Add a sound equivalent? Well, there are people who are deaf and blind. Are we going to deny them access to our wikis? I'm not in favor of making Wikka inaccessible when we should be working towards making it more accessible. --JavaWoman.

It was for this reason I suggested the ascii captcha - it uses much larger letters AND it is perfect for the colour-blind. Interestingly, it is probably also better at fooling image recognition algorithms: an artificial observer will find it much harder to parse ascii-art, as the elements have high contrast and conflicting orientations AND are unjoined. Thus the first steps of object classification are greatly impeded. I do agree that captchas can reduce accessibility, but an ascii-based recognition pattern reduces some of the major obstacles I saw in the image-based ones. --IanAndolina

Ian, I looked at that Moztips link you mentioned - and even with my relatively good eyes I found the contrast way too low. And whether you use an image or "ascii art" - screen readers used by visually handicapped people cannot make anything of it - it's only marginally better than an image in that it may defeat OCR utilities used by some spambots, but I don't see it as any more accessible, it could even be less accessible, depending on contrast in the image. --JavaWoman

But spam-flooded wikis are also inaccessible....we only have to look at wikka's parent to see that....so some sort of more elaborate system is needed to enhance overall accessibility. In the large, I see the overall job of developing wikka as a matter of providing a tool-kit that each wikka-owner can make the decisions around. Providing a registration system that (optionally) uses both a registration code as provided by the wikka-owner by email *or* some sort of visual validation system (I like the character size of captcha, and this one)...

("this" one is quite horrible - I saw three glyphs, two the same that I could not guess were a 3 or a B and one in the middle that I could not decide was a 4 or an A. The only thing that can be said for it is that (at least the one sample I looked at) has high contrast... which doesn't help if the characters are not recognizable themselves -JW)

What I liked about it was that it didn't use GD (as others do, except captcha), so would be usable to all of our users, but rather pulled on a series of image files.....so the graphics could be easily changed to a series that was more universally readable (thereby addressing your concerns)....I also noticed that some of the characters were difficult to discern. That also means, of course, that a bot could be set to read the source image file name...so a potential security shortfall, although code could be written to (a) pull the image from an mysql database instead of a file & (b) have a "random" image file name assigned to it as it was read into the html page. --GmBowen

... at the same time so a user could use either for registration would likely cover most bases for an open registration system, one with just the latter for teachers etc. that want one for their classes. --GmBowen

I agree captcha could be an option (but it would be an extension, not Wikka "core"). Remember though that many people really hate solutions like this - they'll just leave (even if they are already part of a community) and refuse to post any more. A more stringent registration procedure (requiring email confirmation, and recording user's IP) would be less invasive and sufficient to keep all but the most hardened spambots away; it would also ensure that people actually have a valid email address to use when they lose their password... Combine that with a "banning" option and even a spambot clever enough to sign up once with a valid email address can be kept out once discovered. --JavaWoman

Wouldn't it have to be automatic email confirmation though (for high-use sites)? Anyways, in the scenarios I work with (since we're discussing accessibility) that's a less desirable option. There are ethical issues for schools/teachers with "requiring" kids to have an email address (especially yahoo, hotmail, etc) because of the amount of adult-oriented spam they receive. [That's part of the reason I'm working on the private message system to be embedded in wikka.] Image verification thereby offers advantages that email verification does not, at least for the communities I'll be using this in. As far as "hating" procedures goes.....I dislike email verification (because of the possibility of being added to spam lists) far more than image verification. --GmBowen

Another Captcha-type technique here (and links to the Pear class to generate ascii-based captchas): http://www.phpguru.org/static/TextualNumbers.html. I have to say, some technique modified from this, using multiple <span> elements with display states defined in the (separate) CSS file, seems quite powerful to me. A spambot would then have to parse the CSS too and work out what is the real number and what is "noise"...

Spam repair and defense

See also DeleteSpamAction !
1/22/05 - Spam help! I have apparently been attacked by an army of spam bots. Has this happened to anyone else? For now, I am asking for your help with:

a SQL command that will delete all of these edits

a SQL command that will change all of my ACLs to '+' for writing and commenting (I've modified the config file but that only affects new pages AFAIK)

Whatever script they used (on multiple machines, no less) could certainly be used against any Wakka-like site with minimal modifications, so something has to be done...I will do what I can to help you guys combat future attacks as well as implement the new HTML attribute you've probably all heard about. --RichardBerg

Richard: here's the sql to update all your current ACLs (I'm using mysql 4.0.22):

UPDATE acls SET comment_acl="+" WHERE comment_acl="*";
UPDATE acls SET write_acl="+" WHERE write_acl="*";

You'll need to change the table name (acls) to match whatever your table is named. Give me a few to look at the pages table and your site and I should have the sql for removing the edits. :) -- MovieLady

Since Richard has already changed his default ACLs in the configuration, that would apply to any page that did not have ACLs different from the original default (not merely new pages!); your SQL code should take care of any pages that had ACLs different from the original default (because only those would have a record in the ACLs table).
See also JsnX's suggestion about "Clearing ACLs" on the SuggestionBox which explains how this mechanism works. Thanks, MovieLady! --JavaWoman

Correct. Both statements will change only the entries that had the default ACL from his config file in that field. (What the statements are looking for can, of course be changed, as can what the field is being set to. I used it when I went back and changed my default ACLs on all pages that had ACLs to disallow all banned users from writing or commenting after adding ACLsWithUserGroups.) --MovieLady

There is a relevant link to an action at wikini for removing additions by particular IP's or users at CommunityNotes.--GmBowen

Thanks for the link! I've translated and made minor changes to the code, and posted everything to DeleteSpamAction. He's got a very good starting point, I think. One could adapt the code (fairly easily) to allow you to look at all the revisions on a page instead of by user/IP and then delete the histories you don't want to keep, for whatever reason. --MovieLady

Banning users

just so it doesn't get "lost", I'm copying a few comments from another page here. --JW

removed a spam comment - but it should be removed from the database as well!

-- JavaWoman (2005-02-25 08:37:47)

re-removed a spam comment

-- DarTar (2005-02-25 11:53:39)

re-re-removed a spam comment. Grrr! Similar/same content, obviously the same spammer. Do we have any means to "ban" a spammer??

-- JavaWoman (2005-02-28 06:43:2)

Since it's not a registered user and some of us (was it you?) were arguing against banning by IP I don't see how else we might ban a spammer.
This reminds me BTW (and should remind JsnX) that we need a more powerful blacklisting system for the referrers. In many cases URL spammers come from the same 2nd level domain:

spammer1.domain.com

spammer2.domain.com

spammer3.domain.com

etc.

deleting them manually is a horrible fuss (I did it once, I won't do it again), so we should give an option to blacklist all spammers from the same 2nd level domain.

-- DarTar (2005-02-28 11:31:04)
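Blacklisting by 2nd level domain, as suggested above, could be sketched like this. The function names are hypothetical, and simply taking the last two host labels is a naive assumption: it gives the wrong answer for suffixes such as "co.uk" and for referrers that use a bare IP address.

```php
<?php
// Naive sketch: secondLevelDomain() and isBlacklistedReferrer() are hypothetical.

/** Return the last two host labels of a referrer URL, or null if it has no host. */
function secondLevelDomain(string $referrer): ?string
{
    $host = parse_url($referrer, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return null;
    }
    $labels = explode('.', strtolower($host));
    return implode('.', array_slice($labels, -2));
}

/** True if the referrer's second-level domain is on the blacklist. */
function isBlacklistedReferrer(string $referrer, array $blacklist): bool
{
    $domain = secondLevelDomain($referrer);
    return $domain !== null && in_array($domain, $blacklist, true);
}
```

With "domain.com" on the list, spammer1.domain.com, spammer2.domain.com and spammer3.domain.com would all be rejected by a single entry.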

So, what to do? Banning by IP is indeed fraught with the risk of banning innocent users since with a lot of large ISPs the IP addresses are assigned round-robin and may be different even between subsequent requests (i.e. the request for embedded images may each come from a different IP address, and from a different address than that for the page itself).

Possibly creating and storing a kind of "signature" consisting of not IP address but other request header elements, including user agent string but also accept headers, maybe combining that with a whole IP block rather than a single address might give us some sort of handle. But you'd need to actually store that and watch it for a while before you can tell how reliable (or not) that might be. --JavaWoman
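A "signature" of this kind could be computed along these lines. This is a sketch only: the function name, the choice of headers and the /24 block size are assumptions, IPv6 addresses are not handled, and how reliable such a signature is would still have to be observed in practice, as noted above.

```php
<?php
// Sketch only: requestSignature() and the header selection are assumptions.

/** Build a signature from request headers plus the /24 block of the IP. */
function requestSignature(array $server): string
{
    $ip = $server['REMOTE_ADDR'] ?? '';
    // Keep the first three octets so changing addresses within one block still match.
    $parts = explode('.', $ip);
    $block = count($parts) === 4 ? implode('.', array_slice($parts, 0, 3)) : $ip;

    $fields = [
        $block,
        $server['HTTP_USER_AGENT'] ?? '',
        $server['HTTP_ACCEPT'] ?? '',
        $server['HTTP_ACCEPT_LANGUAGE'] ?? '',
        $server['HTTP_ACCEPT_ENCODING'] ?? '',
    ];
    return sha1(implode("\n", $fields));
}
```

Calling requestSignature($_SERVER) on each save and storing the result next to the revision would give the data needed to judge whether the signature is stable enough to ban on.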

Stopping Spammers getting Google Juice

There is a technique to stop spammers from gaining any advantage of spamming, which is to redirect external links to stop them from affecting their PageRank. Great to stop the whole purpose of spamming, but this has the disadvantage that good sites lose their google juice too. Check the comments out on that page for more cons. I've noticed since I enabled this on the Opera 7 wiki that slowly spam volume has dropped out, but I'm not entirely happy at the price paid. Had you thought about this, maybe have it as an option during config? -- IanAndolina

Good point, Ian. I had thought about this, after having seen several Wikis and blogs that use the Google redirection... I do think it should be configurable though - not every Wiki installation may want to do this (in fact, some may welcome external links as long as spam is repaired fast enough). --JavaWoman

I asked an SEO expert and he replied that it should be enough to use a simple internal redirect (e.g. exit.php?url=...) to create this effect. He also said that it might be helpful to disallow any spider access to that file (robots.txt). -- ReimerStegelmann

Unfortunately, search engine robots these days mostly do follow URLs with parameters, and an "internal redirect" done that way would be retrieved by a bot; HTTP redirects are followed, too (which is what you'd have to use with that "internal redirect" method). Meta redirects mostly aren't but you cannot apply this as a general "redirect external links" (especially not since you cannot have any URL parameters in a meta redirect - and you want to allow all valid external links, merely have them not count towards page rank in search engines, mostly Google). Excluding a single file with robots.txt won't work since all of Wikka runs off the single wikka.php file. The Google redirect method gets around all of that (at least for Google's ranking mechanism - which is what spammers are mostly targeting). --JavaWoman

They follow, but that is not the point of spam. The main target of a spammer is to reach a high ranking in search engines. They post links whose link text contains important keywords (e.g. Keyword1 keyword2 http://domain.tld). So, if you enter keyword1 or keyword2 into a search engine, you will see the homepage of the spammer. By using a simple redirect, spiders will follow the link, but they don't give a fuck about the keywords and so the spammer doesn't give a fuck about the link.

Exactly - and using the Google redirect prevents the target page from getting a higher ranking from incoming (spam) links because it won't be counted at all. :) --JavaWoman

Yeah, but you don't need Google to make this happen. A simple internal redirect is enough and looks better than a Google-Redirect ;)

Nope, because an internal redirect will be followed by Google and still count for page rank - that's the problem; the Google redirect prevents this. --JavaWoman

I talked to Abakus, a German SEO expert, and he said it does not count. There is no difference between an internal redirect and a Google redirect. Keywords of the link (see above) only count for the redirect site and not for the link behind the redirect. And well, why should a spider follow an internal link (via exit.php?url=...), but not a Google redirect?

A spider will follow any redirect, whether it's a local (internal) redirect or one through Google. Never mind the keywords, it's still a link into the spammed site; with a local redirect that won't make any difference, but with the Google redirect Google knows to not count it as an incoming link. It's not (just) about keywords but about Page Rank (PR) - and PR is highly dependent on incoming links (and where they come from). That much we know. But no one except some Google employees knows the exact algorithm that determines PR - not even Abakus ;-) --JavaWoman

Maybe the solution is here.

If a user is not registered, the attribute rel="nofollow" is added to all external links he creates on the wiki.

This technique is now adopted by Google, Yahoo and MSN. --DotMG

Thanks, DotMG! This is great news - I had seen this technique being discussed as a proposed possible solution but had missed the news the proposal has actually been adopted now. (Should we worry about Altavista? Probably not too much - these SEs are the ones spammers will target primarily.) One possible hole I can see is that a spammer might write a script to quickly register and then post on a number of pages - but scripted registrations can be defended against with other means. Nothing will probably provide a 100% solution but this is a big step in the right direction. --JavaWoman
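The rule proposed above (nofollow for links added by unregistered users) might be sketched as follows. The function and its parameters are hypothetical and do not reflect Wikka's actual Link() method; it assumes the wiki can tell whether the author of the revision was registered.

```php
<?php
// Sketch only: renderExternalLink() is hypothetical, not Wikka's Link() method.

/** Render an external link, adding rel="nofollow" for unregistered authors. */
function renderExternalLink(string $url, string $text, bool $authorIsRegistered): string
{
    $rel = $authorIsRegistered ? '' : ' rel="nofollow"';
    // Escape both values so user input cannot break out of the markup.
    return '<a class="ext"' . $rel
        . ' href="' . htmlspecialchars($url, ENT_QUOTES, 'UTF-8') . '">'
        . htmlspecialchars($text, ENT_QUOTES, 'UTF-8') . '</a>';
}
```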

Referrer spam

Spammers sometimes visit Wikis and blogs with a tool with "bogus" referer headers containing the sites they want to generate incoming links for - this works on many wikis and blogs since such sites often have a page listing referrers (wikis) or list referrers to a particular post (blogs). If a Search engine indexes such a page, it would find a link to the spammed site, resulting in a higher "score" for that spammed page.

The general solution is to cause such links not to be followed by search engines. The technique outlined below under "Don't let old pages get indexed" already takes care of this for the referrer listings Wikka uses.

I am trying this technique to detect which referrers are spam and which aren't:
1) Add a field named remote_addr to the referrers table.
2) Modify the LogReferrer() method to also set remote_addr = $_SERVER['REMOTE_ADDR'];
3) Change LoadReferrers() to return only records where the field remote_addr is blank
4) Add this code to the header action:

<link rel="stylesheet" href="<?php echo $this->Href("wikka.css", "pseudodir"); ?>" type="text/css" />

5) Create a file named wikka.css.php and put it at ./handler/page. This file will update the table referrers and set remote_addr to blank if remote_addr is the same as $_SERVER['REMOTE_ADDR'] and time is greater than "now() plus six minutes" ... The scripts will return a css file, with header no-cache, must-revalidate, expired, random etag, ... so that it will be requested each time a page is requested. --DotMG

Explanation: The difference between a spambot and a real user is that a spambot just loads a page and doesn't analyse its content, so with a spambot none of the CSS files linked within the document will be loaded.

That's not a reliable method since people who use a text-only browser will probably not load any CSS (or images for that matter) either. Looking at a combination of browsing behavior (what's loaded and what not) and user agent string could make this a little more reliable but still not 100%. --JavaWoman

Email-gathering spambots

Spambots spider websites looking for email addresses to add to the list (to use, or to sell as a targeted list). A general defense that works well (though not 100%) is to "obfuscate" email addresses so such spambots don't recognize them.

I would like to go on the offensive against spam. My wikka would generate some false email addresses and present them as

<a href="mailto:[email protected]" class="email">[email protected]</a>

, and somewhere in css files, you will find

.email {display: none;}

so that the false email won't be visible to human visitors of the site. The domain of the false email address will be either a domain name of a spam/porn site (to tax their bandwidth), or a non-existent domain. --DotMG

Obfuscating addresses automatically

Wikka 1.1.6.0 comes with a small action to create an obfuscated email "contact" link for the site administrator. Meanwhile, the formatter will simply turn every email address it recognizes into an email link (with the address also used for the link text) - providing nice fodder for spambots.

What we should have is a function that can turn a given email address into an obfuscated link - this could then be used by both the {{contact}} action and the formatter. It would (then) also enable us to change the obfuscating algorithm inside the function without affecting either the formatter or the contact action any more, and others can use this in their own extensions as well. --JavaWoman
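Such a shared function might look roughly like this. It is a sketch under assumptions: obfuscateEmailLink() and entityEncode() are hypothetical names, and character-reference encoding is only one possible obfuscation (not necessarily what the {{contact}} action uses); a harvester that decodes HTML entities will still find the address.

```php
<?php
// Sketch only: both function names are hypothetical; ASCII addresses assumed.

/** Encode every byte of an ASCII string as a decimal HTML character reference. */
function entityEncode(string $s): string
{
    if ($s === '') {
        return '';
    }
    $out = '';
    foreach (str_split($s) as $ch) {
        $out .= '&#' . ord($ch) . ';';
    }
    return $out;
}

/** Turn an email address into a mailto link with no literal address in the markup. */
function obfuscateEmailLink(string $email, ?string $text = null): string
{
    $href = entityEncode('mailto:' . $email);
    // Without custom link text, the (encoded) address itself is shown.
    $label = $text !== null ? htmlspecialchars($text, ENT_QUOTES, 'UTF-8') : entityEncode($email);
    return '<a href="' . $href . '">' . $label . '</a>';
}
```

Because the algorithm lives in one place, both the formatter and the contact action could call obfuscateEmailLink() and pick up any later change to the encoding.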

Resolved Suggestions

Spam-defense measures that are already implemented in Wikka.

Don't let old pages get indexed

Extended method implemented as of Wikka 1.1.6.0 (Both the "noarchive" addition and applying it to the Sandbox)

To make absolutely sure old pages don't get archived (irrespective of your robots.txt) - essential to stopping WikiSpam from still getting juice from archived pages, why not make sure to add meta directives to those pages by adding something like:

<?php if ($this->GetMethod() != 'show' || $this->page["latest"] == "N") echo "<meta name=\"robots\" content=\"noindex, nofollow, noarchive\" />\n<meta name=\"googlebot\" content=\"noarchive, noindex, nofollow\">\n";?>

to header.php. This stops pages with handlers other than show, or non-current pages, from any kind of archiving/caching.

Ian, thanks for the suggestion. Wikka has had something similar to this in place since the first release. See Mod033bRobotIndexing. But your suggestion expands the idea and adds the latest page check, "noarchive", and the googlebot part--which seem like good ideas. I'll add this to the upcoming release. By the way, when are you going to switch your site over to Wikka? ;) -- JsnX

Yes, nice idea. But the googlebot part is actually redundant, Google obeys the robots meta directives. (And that second meta tag isn't valid XHTML - it's unclosed.) I suggest we merely add the "noarchive". Apart from that, it would also be nice to stop indexing etc. from the SandBox page. --JavaWoman

The latest page check is important because wiki spammers don't really care if you delete their spam, as long as their links sit on an old archived page waiting to be indexed. The added googlebot directive (thanks for spotting typo btw) is just extra paranoia on my part :). And you are all doing an excellent job with Wikka - the only reason I haven't switched is that quite a lot on my Wakka is heavily customised and I don't have the time to redo that - especially as lots of pages would break without re-jigging of e.g. SafeHTML (my BookMarklets page for example). If I have time, I will eventually migrate...! -- IanAndolina