WikkaSpamFighting:Wikka

Revision [5052]

This is an old revision of WikkaSpamFighting made by JavaWoman on 2005-01-24 10:39:37.

Fighting spam in Wikka

As it may have dawned on you by now, spam is getting to be a problem in wiki's - both the type of spam that also plagues many blogs in the form of comment spam (only in a wiki it woudl (also) affect page content), and referrer spam. And then there are spambots intent on gathering email addresses.

Wikka sites are no exception any more (and other WakkaWiki forks seem to be having problems, too)

This page is intended to gather ideas for how to fight spam (of all types) in Wikka, so we can coordinate our efforts and get a spammer-hardened Wikka out there.

Spam in Wikka pages

Stopping Spammers getting Google Juice

There is a technique to stop spammers from gaining any advantage of spamming, which is to redirect external links to stop them from affecting their PageRank. Great to stop the whole purpose of spamming, but this has the disadvantage that good sites lose their google juice too. Check the comments out on that page for more cons. I've noticed since I enabled this on the Opera 7 wiki that slowly spam volume has dropped out, but I'm not entirely happy at the price paid. Had you thought about this, maybe have it as an option during config? -- IanAndolina

Good point, Ian. I had thought about this, after having seen several Wikis and blogs that use the Google redirection... I do think it should be configurable though - not every Wiki installation may want to do this (in fact, some may welcome external links as long as spam is repaired fast enough). --JavaWoman

I asked an export for SEO and he replied that it should be enough to use a simple internal redirect (e.g. exit.php?url=...) to create this effect. He also said that it might be helpful to disallow any spider access to that file (robots.txt). -- ReimerStegelmann

Unfortunately, search engine robots these days mostly do follow URLs with parameters, and an "internal redirect" done that way would be retrieved by a bot; HTTP redirects are followed, too (which is what you'd have to use with that "internal redirect" method). Meta redirects mostly aren't but you cannot apply this as a general "redirect external links" (especially not since you cannot have any URL parameters in a meta redirect - and you want to allow all valid external links, merely have them not count towards page rank in search engines, mostly Google). Excluding a single file with robots.txt won't work since all of Wikka runs off the single wikka.php file. The Google redirect method gets around all of that (at least for Google's ranking mechanism - which is what spammers are mostly targeting). --JavaWoman

They follow, but that is not the point of spam. The main target of a spammer is the reach a high ranking in search engines. They post links which linktext contains important keywords (e.g. Keyword1 keyword2 http://domain.tld). So, if you enter keyword1 oder keyword2 to a search engine, you will see the homepage of the spammer. By using a simple redirect, spiders will follow the link, but they give a fuck about the keywords and so the spammer gives a fuck about the link.

Exactly - and using the Google redirect prevents the target page from getting a higher ranking from incoming (spam) links because it won't be counted at all. :) --JavaWoman

Yeah, but you don't need Google to make this happen. A simple internal redirect is enough and looks better than a Google-Redirect ;)

Nope, because an internal redirect will be followed by Google and still count for page rank - that's the problem; the Google redirect prevents this. --JavaWoman

I talked to Abakus, a German SEO expert and he said it does not count. There is no difference between an internal redirect oder a Google redirect. Keywords of the link (s. above) only count for the redirect site and not for link behind the redirect. And well, why should a spider follow an internal link (via exit.php?url=...), but not a Google redirect?

A spider will follow any redirect, whether it's local or through an internal redirect. Never mind the keywords, it's still a link into the spammed site; with a local redirect that won't make any difference, but with the Google redirect Google knows to not count it as an incoming link. It's not (just) about keywords but about Page Rank (PR) - and PR is highly dependent on incoming links (and where they come from). That much we know. But no one except some Google employees knows the exact algorithm that determines PR - not even Abakus ;-) --JavaWoman

Maybe the solution is here.

If a user is not registered, to all external links he creates on the wiki will be added the attribute rel="nofollow".

This technique is now adopted by Google, Yahoo and MSN. --DotMG

Thanks, DotMG! This is great news - I had seen this technique being discussed as a proposed possible solution but had missed the news the proposal has actually been adopted now. (Should we worry about Altavista? Probably not too much - these SEs are the ones spammers will target primarily.) One possible hole I can see is that a spammer might write a script to quickly register and then post on a number of pages - but scripted registrations can be defended against with other means. Nothing will probably provide a 100% solution but this is a big step in the right direction. --JavaWoman

Spam repair and defense

1/22/05 - Spam help! I have apparently been attacked by an army of spam bots. Has this happened to anyone else? For now, I am asking for your help with:

a SQL command that will delete all of these edits

a SQL command that will change all of my ACLs to '+' for writing and commenting (I've modified the config file but that only affects new pages AFAIK)

Whatever script they used (on multiple machines, no less) could certainly be used against any Wakka-like site with minimal modifications, so something has to be done...I will do what I can to help you guys combat future attacks as well as implement the new HTML attribute you've probably all heard about. --RichardBerg

Richard: here's the sql to update all your current ACLs (I'm using mysql 4.0.22):

UPDATE acls SET comment_acl="+" WHERE comment_acl="*";
UPDATE acls SET write_acl="+" WHERE write_acl="*";

You'll need to change the table name (acls) to match whatever your table is named. Give me a few to look at the pages table and your site and I should have the sql for removing the edits. :) -- MovieLady

Since Richard has already changed his default ACLs in the configuration, that would apply to any page that did not have ACLs different from the original default (not merely new pages!); your SQL code should take care of any pages that had ACLs different from the original default (because only those would have a record in the ACLs table).
See also JsnX's suggestion about "Clearing ACLs" on the SuggestionBox which explains how this mechanism works. Thanks, MovieLady! --JavaWoman

Correct. Both statements will change only the entries that had the default ACL from his config file in that field. (What the statements are looking for can, of course be changed, as can what the field is being set to. I used it when I went back and changed my default ACLs on all pages that had ACLs to disallow all banned users from writing or commenting after adding ACLsWithUserGroups.) --MovieLady

There is a relevant link to an action at wikini for removing additions by particular IP's or users at CommunityNotes.--GmBowen

Thanks for the link! I've translated and made minor changes to the code, and posted everything to DeleteSpamAction. He's got a very good starting point, I think. One could adapt the code (fairly easily) to allow you to look at all the revisions on a page instead of by user/IP and then delete the histories you don't want to keep, for whatever reason. --MovieLady

to header.php. This stops pages with handlers other than show or non current pages from any kind of archiving/cacheing.

Ian, thanks for the suggestion. Wikka has had something similar to this in place since the first release. See Mod033bRobotIndexing. But your suggestion expands the idea and adds the latest page check, "noarchive", and the googlebot part--which seem like good ideas. I'll add this to the upcoming release. By the way, when are you going to switch your site over to Wikka? ;) -- JsnX

Yes, nice idea. But the googlebot part is actually redundant, Google obeys the robots meta directives. (And that second meta tag isn't valid XHTML - it's unclosed.) I suggest we merely add the "noarchive". Apart from that, it would also be nice to stop indexing etc. from the SandBox page. --JavaWoman

The latest page check is important because wiki spammers don't really care if you delete their spam, as long as their links sit on an old archived page waiting to be indexed. The added googlebot directive (thanks for spotting typo btw) is just extra paranoia on my part :). And you are all doing an excellent job with Wikka - the only reason I haven't switched is that quite a lot on my Wakka is heavily customised and I don't have the time to redo that - especially as lots of pages would break without re-jigging of e.g. SafeHTML (my BookMarklets page for example). If I have time, I will eventually migrate...! -- IanAndolina

Both the "noarchive" addition and applying it to the Sandbox as well as old pages will be in Wikka version 1.1.6.0 - as you can see in the HomePage public beta! --JavaWoman

Further references

Meatball — WikiSpam
C2.com — WikiSpam

CategoryWikka
CategoryDevelopment

Wikka : WikkaSpamFighting

Revision [5052]

Fighting spam in Wikka

Spam in Wikka pages

Stopping Spammers getting Google Juice

Spam repair and defense

Referrer spam

Email-gathering spambots

Resolved Suggestions

Don't let old pages get indexed

Further references