Revision [5047]
This is an old revision of WikkaSpamFighting made by JavaWoman on 2005-01-24 10:26:18.
Fighting spam in Wikka
As it may have dawned on you by now, spam is getting to be a problem in wiki's - both the type of spam that also plagues many blogs in the form of comment spam (only in a wiki it woudl (also) affect page content), and referrer spam. And then there are spambots intent on gathering email addresses.
Wikka sites are no exception any more (and other WakkaWiki forks seem to be having problems, too)
This page is intended to gather ideas for how to fight spam (of all types) in Wikka, so we can coordinate our efforts and get a spammer-hardened Wikka out there.
Spam in Wikka pages
Stopping Spammers getting Google Juice
There is a technique to stop spammers from gaining any advantage of spamming, which is to redirect external links to stop them from affecting their PageRank. Great to stop the whole purpose of spamming, but this has the disadvantage that good sites lose their google juice too. Check the comments out on that page for more cons. I've noticed since I enabled this on the Opera 7 wiki that slowly spam volume has dropped out, but I'm not entirely happy at the price paid. Had you thought about this, maybe have it as an option during config? -- IanAndolina- Good point, Ian. I had thought about this, after having seen several Wikis and blogs that use the Google redirection... I do think it should be configurable though - not every Wiki installation may want to do this (in fact, some may welcome external links as long as spam is repaired fast enough). --JavaWoman
- I asked an export for SEO and he replied that it should be enough to use a simple internal redirect (e.g. exit.php?url=...) to create this effect. He also said that it might be helpful to disallow any spider access to that file (robots.txt). -- ReimerStegelmann
- Unfortunately, search engine robots these days mostly do follow URLs with parameters, and an "internal redirect" done that way would be retrieved by a bot; HTTP redirects are followed, too (which is what you'd have to use with that "internal redirect" method). Meta redirects mostly aren't but you cannot apply this as a general "redirect external links" (especially not since you cannot have any URL parameters in a meta redirect - and you want to allow all valid external links, merely have them not count towards page rank in search engines, mostly Google). Excluding a single file with robots.txt won't work since all of Wikka runs off the single wikka.php file. The Google redirect method gets around all of that (at least for Google's ranking mechanism - which is what spammers are mostly targeting). --JavaWoman
- They follow, but that is not the point of spam. The main target of a spammer is the reach a high ranking in search engines. They post links which linktext contains important keywords (e.g. Keyword1 keyword2 http://domain.tld). So, if you enter keyword1 oder keyword2 to a search engine, you will see the homepage of the spammer. By using a simple redirect, spiders will follow the link, but they give a fuck about the keywords and so the spammer gives a fuck about the link.
- Exactly - and using the Google redirect prevents the target page from getting a higher ranking from incoming (spam) links because it won't be counted at all. :) --JavaWoman
- Yeah, but you don't need Google to make this happen. A simple internal redirect is enough and looks better than a Google-Redirect ;)
- Nope, because an internal redirect will be followed by Google and still count for page rank - that's the problem; the Google redirect prevents this. --JavaWoman
- I talked to Abakus, a German SEO expert and he said it does not count. There is no difference between an internal redirect oder a Google redirect. Keywords of the link (s. above) only count for the redirect site and not for link behind the redirect. And well, why should a spider follow an internal link (via exit.php?url=...), but not a Google redirect?
- A spider will follow any redirect, whether it's local or through an internal redirect. Never mind the keywords, it's still a link into the spammed site; with a local redirect that won't make any difference, but with the Google redirect Google knows to not count it as an incoming link. It's not (just) about keywords but about Page Rank (PR) - and PR is highly dependent on incoming links (and where they come from). That much we know. But no one except some Google employees knows the exact algorithm that determines PR - not even Abakus ;-) --JavaWoman
- Maybe the solution is here.
- If a user is not registered, to all external links he creates on the wiki will be added the attribute rel="nofollow".
- This technique is now adopted by Google, Yahoo and MSN. --DotMG
- Thanks, DotMG! This is great news - I had seen this technique being discussed as a proposed possible solution but had missed the news the proposal has actually been adopted now. (Should we worry about Altavista? Probably not too much - these SEs are the ones spammers will target primarily.) One possible hole I can see is that a spammer might write a script to quickly register and then post on a number of pages - but scripted registrations can be defended against with other means. Nothing will probably provide a 100% solution but this is a big step in the right direction. --JavaWoman
Referrer spam
Email-gathering spambots
Resolved Suggestions
Don't let old pages get indexed
To make absolutely sure old pages don't get archived (irrespective of your robots.txt) - essential to stopping WikiSpam from still getting juice from archived pages, why not make sure to add meta directives to those pages by adding something like:<?php if ($this->GetMethod() != 'show' || $this->page["latest"] == "N") echo "<meta name=\"robots\" content=\"noindex, nofollow, noarchive\" />\n<meta name=\"googlebot\" content=\"noarchive, noindex, nofollow\">\n";?>
to header.php. This stops pages with handlers other than show or non current pages from any kind of archiving/cacheing.
- Ian, thanks for the suggestion. Wikka has had something similar to this in place since the first release. See Mod033bRobotIndexing. But your suggestion expands the idea and adds the latest page check, "noarchive", and the googlebot part--which seem like good ideas. I'll add this to the upcoming release. By the way, when are you going to switch your site over to Wikka? ;) -- JsnX
- Yes, nice idea. But the googlebot part is actually redundant, Google obeys the robots meta directives. (And that second meta tag isn't valid XHTML - it's unclosed.) I suggest we merely add the "noarchive". Apart from that, it would also be nice to stop indexing etc. from the SandBox page. --JavaWoman
- The latest page check is important because wiki spammers don't really care if you delete their spam, as long as their links sit on an old archived page waiting to be indexed. The added googlebot directive (thanks for spotting typo btw) is just extra paranoia on my part :). And you are all doing an excellent job with Wikka - the only reason I haven't switched is that quite a lot on my Wakka is heavily customised and I don't have the time to redo that - especially as lots of pages would break without re-jigging of e.g. SafeHTML (my BookMarklets page for example). If I have time, I will eventually migrate...! -- IanAndolina
- Both the "noarchive" addition and applying it to the Sandbox as well as old pages will be in Wikka version 1.1.6.0 - as you can see in the HomePage public beta! --JavaWoman
CategoryWikka
CategoryDevelopment