Revision history for WikkaSpamFighting


Revision [23505]

Last edited on 2016-05-20 07:38:48 by BrianKoontz [Replaces old-style internal links with new pipe-split links.]
Additions:
Bad Behavior is a set of PHP scripts which prevents spambots from accessing your site by analyzing their actual HTTP requests and comparing them to profiles from known spambots. (quote from the [[http://www.ioerror.us/software/bad-behavior/ | homepage]])
Wacko wiki has implemented a content filter based on a word/phrase list. I'm not sure how sophisticated it is (it's not a Bayesian filter), but it uses a list updated from ++chongqed.org++. Read more about it [[http://wackowiki.com/SPAM?v=1dlz | here]]. I thought this might contribute to our conversations about spamfighting. --GmBowen
~-[[http://esw.w3.org/topic/BadContent | BadContent]] (//MoinMoin//)
~~&In fact, I do :). I was playing around with [[http://www.phpgeek.com/pragmacms/index.php?layout=main&cslot_1=14 | this small Bayesian filter I found last week]]. It's around 55KB unzipped. Implementing it in Wikka would be less-than easy however. --MikeXstudios
Based on [[http://shiflett.org/archive/96 | this post]], I wonder whether providing randomised session tokens for form submission may provide just one more step to impede spambots. Very simple to implement:
~&Good point, Ian. I had been thinking about a similar approach (I have a plugin for SquirrelMail installed that essentially does the same thing for the login dialog) - but reading through the comments on [[http://shiflett.org/archive/96 | Chris Shiflett's article]] it's clear that this is no more than another little hurdle easily overcome by the script-writing link spammer (GET the page with the script first, read the token and use that in the scripted POST). That said, it //is// another hurdle that may deter at least some naive spammers - and with very little code. So, nice. --JavaWoman (who suddenly realizes her Squirrelmail isn't as secure as she thought it was - but at least has built in other layers of security)
I don't know how to do that in PHP (it is a PHP-based wiki I believe) - though the [[http://www.google.com/search?q=captcha+php&sourceid=opera&num=0&ie=utf-8&oe=utf-8 | more complex image-based solutions]] are available. This for me is **far** preferable to locking pages for writing using ACLs - which IMO destroys the very purpose of the wiki. --IanAndolina
~&Whether simply an image or an image turned into "ASCII art", an image-based CAPTCHA that does not provide an alternative is **inaccessible** for people with visual impairments. Read [[http://blog.carrolltech.org/archives/24 | this story]] ("Oops, Google Did it Again") to see why we really should **not** use anything like that. --JavaWoman
[Copied from SuggestionBox] "There's also code around that uses GD & that could be built onto Nils' code that generates a "registration password" automatically and outputs it as a distorted graphic image.....the code is intended to befuddle auto spam registers & thus stop open-registration sites from being hit by spam bots that register themselves as users. Ultimately, as the bots become more sophisticated I think we'll have to use something like that or else sites like this one (with open registration) will be victimized. [[http://www.bluecopia.com/form.php | Here]] and [[http://www.horobey.com/demos/codegen/ | here]] are examples of what I mean (I like the simplicity of the first version in the second example). -- GmBowen" .... I think at least we need a system like one of these (or like the one Ian suggests) on the user registration page. mb
~&But spam-flooded wiki's are also **in**accessible....we only have to look at wikka's parent to see that....so //some// sort of more elaborate system is needed to enhance overall accessibility. In the large, I see the overall job of developing wikka as a matter of providing a tool-kit that each wikka-owner can make the decisions around. Providing a registration system that (optionally) uses both a registration code as provided by the wikka-owner by email *or* some sort of visual validation system (I like the character size of captcha, and [[http://www.horobey.com/demos/codegen/v1/humancheck.php | this]] one)...
1/22/05 - Spam help! I have apparently been attacked by an army of [[http://www.richardberg.net/RecentChanges | spam bots]]. Has this happened to anyone else? For now, I am asking for your help with:
Whatever script they used (on multiple machines, no less) could certainly be used against any Wakka-like site with minimal modifications, so something has to be done...I will do what I can to help you guys combat future attacks as well as implement the new [[http://www.google.com/googleblog/2005/01/preventing-comment-spam.html | HTML attribute]] you've probably all heard about. --RichardBerg
[[http://simon.incutio.com/archive/2004/05/11/approved | There is a technique]] to stop spammers from gaining any advantage from spamming, which is to redirect external links to stop them from affecting the spammers' PageRank. Great for defeating the whole purpose of spamming, but it has the disadvantage that good sites lose their google juice too. Check out the comments on that page for more cons. I've noticed since I enabled this on the Opera 7 wiki that spam volume has slowly dropped off, but I'm not entirely happy with the price paid. Had you thought about this - maybe have it as an option during config? -- IanAndolina
~~~~&They follow, but that is not the point of spam. The main target of a spammer is to reach a high ranking in search engines. They post links whose link text contains important keywords (e.g. [[Keyword1 | keyword2 http://domain.tld]]). So, if you enter keyword1 or keyword2 into a search engine, you will see the homepage of the spammer. By using a simple redirect, spiders will follow the link, but they don't care about the keywords, and so the spammer doesn't care about the link.
~&Maybe the solution is [[http://www.google.com/googleblog/2005/01/preventing-comment-spam.html | here]].
~-[[http://www.usemod.com/cgi-bin/mb.pl?WikiSpam | Meatball — WikiSpam]]
~-[[http://c2.com/cgi/wiki?WikiSpam | C2.com — WikiSpam]]
~-[[http://chongq.blogspot.com/ | Goggle Ending Comment Spam]] [sic]
~-[[http://wackowiki.com/SPAM/ | Wacko Wiki - SPAM]]
~-[[MoinMoin:AntiSpamGlobalSolution]] - see also [[http://esw.w3.org/topic/WikiSpam | WikiSpam]]""<!--
~-[[http://www.bluestack.org/ReferrerSpam | ReferrerSpam]] - A Wakka page about preventing referrer spam ... which page is now spammed (ouch!) - page seems to have disappeared now ....-->""
~-[[http://the.taoofmac.com.nyud.net:8090/space/HOWTO/Block%20Spam%20Referrers%20in%20PHP | HOWTO: Block Spam Referrers in PHP]] - a method to block entire blocks of IP addresses (written for PhpWiki but easily adaptable)
~-[[http://www.theregister.co.uk/2005/01/31/link_spamer_interview/ | Interview with a Link-Spammer]]
~-[[http://software.newsforge.com/article.pl?sid=05/06/21/1641223 | Stemming the menace of wiki spamming]] - nice article by Rob Sutherland though it misses referrer spam. Some very clueful but also clueless comments.
~-[[http://www.phpguru.org/static/TextualNumbers.html | Using Numbers Converted to Text for a Captcha]]
Deletions:
Bad Behavior is a set of PHP scripts which prevents spambots from accessing your site by analyzing their actual HTTP requests and comparing them to profiles from known spambots. (quote from the [[http://www.ioerror.us/software/bad-behavior/ homepage]])
Wacko wiki has implemented a content filter based on a word/phrase list. I'm not sure how sophisticated it is (it's not a Bayesian filter), but it uses a list updated from ++chongqed.org++. Read more about it [[http://wackowiki.com/SPAM?v=1dlz here]]. I thought this might contribute to our conversations about spamfighting. --GmBowen
~-[[http://esw.w3.org/topic/BadContent BadContent]] (//MoinMoin//)
~~&In fact, I do :). I was playing around with [[http://www.phpgeek.com/pragmacms/index.php?layout=main&cslot_1=14 this small Bayesian filter I found last week]]. It's around 55KB unzipped. Implementing it in Wikka would be less-than easy however. --MikeXstudios
Based on [[http://shiflett.org/archive/96 this post]], I wonder whether providing randomised session tokens for form submission may provide just one more step to impede spambots. Very simple to implement:
~&Good point, Ian. I had been thinking about a similar approach (I have a plugin for SquirrelMail installed that essentially does the same thing for the login dialog) - but reading through the comments on [[http://shiflett.org/archive/96 Chris Shiflett's article]] it's clear that this is no more than another little hurdle easily overcome by the script-writing link spammer (GET the page with the script first, read the token and use that in the scripted POST). That said, it //is// another hurdle that may deter at least some naive spammers - and with very little code. So, nice. --JavaWoman (who suddenly realizes her Squirrelmail isn't as secure as she thought it was - but at least has built in other layers of security)
I don't know how to do that in PHP (it is a PHP-based wiki I believe) - though the [[http://www.google.com/search?q=captcha+php&sourceid=opera&num=0&ie=utf-8&oe=utf-8 more complex image-based solutions]] are available. This for me is **far** preferable to locking pages for writing using ACLs - which IMO destroys the very purpose of the wiki. --IanAndolina
~&Whether simply an image or an image turned into "ASCII art", an image-based CAPTCHA that does not provide an alternative is **inaccessible** for people with visual impairments. Read [[http://blog.carrolltech.org/archives/24 this story]] ("Oops, Google Did it Again") to see why we really should **not** use anything like that. --JavaWoman
[Copied from SuggestionBox] "There's also code around that uses GD & that could be built onto Nils' code that generates a "registration password" automatically and outputs it as a distorted graphic image.....the code is intended to befuddle auto spam registers & thus stop open-registration sites from being hit by spam bots that register themselves as users. Ultimately, as the bots become more sophisticated I think we'll have to use something like that or else sites like this one (with open registration) will be victimized. [[http://www.bluecopia.com/form.php Here]] and [[http://www.horobey.com/demos/codegen/ here]] are examples of what I mean (I like the simplicity of the first version in the second example). -- GmBowen" .... I think at least we need a system like one of these (or like the one Ian suggests) on the user registration page. mb
~&But spam-flooded wiki's are also **in**accessible....we only have to look at wikka's parent to see that....so //some// sort of more elaborate system is needed to enhance overall accessibility. In the large, I see the overall job of developing wikka as a matter of providing a tool-kit that each wikka-owner can make the decisions around. Providing a registration system that (optionally) uses both a registration code as provided by the wikka-owner by email *or* some sort of visual validation system (I like the character size of captcha, and [[http://www.horobey.com/demos/codegen/v1/humancheck.php this]] one)...
1/22/05 - Spam help! I have apparently been attacked by an army of [[http://www.richardberg.net/RecentChanges spam bots]]. Has this happened to anyone else? For now, I am asking for your help with:
Whatever script they used (on multiple machines, no less) could certainly be used against any Wakka-like site with minimal modifications, so something has to be done...I will do what I can to help you guys combat future attacks as well as implement the new [[http://www.google.com/googleblog/2005/01/preventing-comment-spam.html HTML attribute]] you've probably all heard about. --RichardBerg
[[http://simon.incutio.com/archive/2004/05/11/approved There is a technique]] to stop spammers from gaining any advantage from spamming, which is to redirect external links to stop them from affecting the spammers' PageRank. Great for defeating the whole purpose of spamming, but it has the disadvantage that good sites lose their google juice too. Check out the comments on that page for more cons. I've noticed since I enabled this on the Opera 7 wiki that spam volume has slowly dropped off, but I'm not entirely happy with the price paid. Had you thought about this - maybe have it as an option during config? -- IanAndolina
~~~~&They follow, but that is not the point of spam. The main target of a spammer is to reach a high ranking in search engines. They post links whose link text contains important keywords (e.g. [[Keyword1 keyword2 http://domain.tld]]). So, if you enter keyword1 or keyword2 into a search engine, you will see the homepage of the spammer. By using a simple redirect, spiders will follow the link, but they don't care about the keywords, and so the spammer doesn't care about the link.
~&Maybe the solution is [[http://www.google.com/googleblog/2005/01/preventing-comment-spam.html here]].
~-[[http://www.usemod.com/cgi-bin/mb.pl?WikiSpam Meatball — WikiSpam]]
~-[[http://c2.com/cgi/wiki?WikiSpam C2.com — WikiSpam]]
~-[[http://chongq.blogspot.com/ Goggle Ending Comment Spam]] [sic]
~-[[http://wackowiki.com/SPAM/ Wacko Wiki - SPAM]]
~-[[MoinMoin:AntiSpamGlobalSolution]] - see also [[http://esw.w3.org/topic/WikiSpam WikiSpam]]""<!--
~-[[http://www.bluestack.org/ReferrerSpam ReferrerSpam]] - A Wakka page about preventing referrer spam ... which page is now spammed (ouch!) - page seems to have disappeared now ....-->""
~-[[http://the.taoofmac.com.nyud.net:8090/space/HOWTO/Block%20Spam%20Referrers%20in%20PHP HOWTO: Block Spam Referrers in PHP]] - a method to block entire blocks of IP addresses (written for PhpWiki but easily adaptable)
~-[[http://www.theregister.co.uk/2005/01/31/link_spamer_interview/ Interview with a Link-Spammer]]
~-[[http://software.newsforge.com/article.pl?sid=05/06/21/1641223 Stemming the menace of wiki spamming]] - nice article by Rob Sutherland though it misses referrer spam. Some very clueful but also clueless comments.
~-[[http://www.phpguru.org/static/TextualNumbers.html Using Numbers Converted to Text for a Captcha]]


Revision [21690]

Edited on 2012-02-26 20:48:58 by BrianKoontz [removed dead links]
Additions:
Wacko wiki has implemented a content filter based on a word/phrase list. I'm not sure how sophisticated it is (it's not a Bayesian filter), but it uses a list updated from ++chongqed.org++. Read more about it [[http://wackowiki.com/SPAM?v=1dlz here]]. I thought this might contribute to our conversations about spamfighting. --GmBowen
~&Mike, I see using the blacklist from ++chongqed.org++ mentioned as an option but I don't see any reference that this is what actually has been implemented - merely that //some// content filtering has been implemented. The best way to use ++chongqed.org++ is to use their blacklist dynamically. --JavaWoman
Deletions:
Wacko wiki has implemented a content filter based on a word/phrase list. I'm not sure how sophisticated it is (it's not a Bayesian filter), but it uses a list updated from chongqed.org. Read more about it [[http://wackowiki.com/SPAM?v=1dlz here]]. I thought this might contribute to our conversations about spamfighting. --GmBowen
~&Mike, I see using the blacklist from chongqed.org mentioned as an option but I don't see any reference that this is what actually has been implemented - merely that //some// content filtering has been implemented. The best way to use chongqed.org is to use their blacklist dynamically. --JavaWoman
~-[[http://chongqed.org/ chongqed.org]]
~-[[http://blacklist.chongqed.org/ chongqed.org blacklist]] - use this dynamically
~-[[http://chongqed.org/submit.html Submit a wiki spammer]] - All your page ranks are belong to us!
~-[[http://wiki.chongqed.org//WikiSpam chongqed.org wiki: WikiSpam]] - A nice overview of various approaches to prevent or fight wiki spam, with links to discussions on other wikis


Revision [21689]

Edited on 2012-02-26 20:41:25 by BrianKoontz [removed dead link]
Deletions:
See also http://wiki.chongqed.org/Wikka.


Revision [20240]

Edited on 2008-09-13 11:53:12 by NilsLindenberg [link update]
Additions:
~- EditModeration


Revision [18806]

Edited on 2008-01-28 00:12:36 by NilsLindenberg [Modified links pointing to docs server]

No Differences

Revision [13626]

Edited on 2006-03-27 10:47:33 by NilsLindenberg [adding link to trac ticket]
Additions:
// [[Ticket:154]] //


Revision [11151]

Edited on 2005-09-22 07:42:37 by DarTar [link]
Additions:
~-SecurityModules
Deletions:
~SecurityModules


Revision [11150]

Edited on 2005-09-22 07:42:13 by DarTar [link]
Additions:
~SecurityModules


Revision [9748]

Edited on 2005-07-04 07:41:41 by JavaWoman [adding item about banning users, plus a few other things]
Additions:
~-AdvancedReferrersHandler
==Banning users==
//just so it doesn't get "lost", I'm copying a few comments from another page here. --JW//
~&removed a spam comment - but it should be removed from the database as well!
~&-- JavaWoman (2005-02-25 08:37:47)---
~&re-removed a spam comment
~&-- DarTar (2005-02-25 11:53:39)---
~&re-re-removed a spam comment. Grrr! Similar/same content, obviously the same spammer. Do we have any means to "ban" a spammer??
~&-- JavaWoman (2005-02-28 06:43:2)---
~&Since it's not a registered user and some of us (was it you?) were arguing against banning by IP I don't see how else we might ban a spammer. ---This reminds me BTW (and should remind JsnX) that we need a more powerful blacklisting system for the referrers. In many cases URL spammers come from the same 2nd level domain:
~&---
~&spammer1.domain.com
~&spammer2.domain.com
~&spammer3.domain.com
~&etc.
~&---
~&deleting them manually is a horrible fuss (I did it once, I won't do it again), so we should give an option to blacklist all spammers from the same 2nd level domain.
~&-- DarTar (2005-02-28 11:31:04)
So, what to do? Banning by IP is indeed fraught with the risk of banning innocent users since with a lot of large ISPs the IP addresses are assigned round-robin and may be different even between subsequent requests (i.e. the request for embedded images may each come from a different IP address, and from a different address than that for the page itself).
Possibly creating and storing a kind of "signature" consisting **not** of the IP address but of other request header elements, including the user agent string but also the accept headers, perhaps combined with a whole IP **block** rather than a single address, might give us some sort of handle. But you'd need to actually store that and watch it for a while before you can tell how reliable (or not) it might be. --JavaWoman
~&That's not a reliable method since people who use a text-only browser will probably not load any CSS (or images for that matter) either. Looking at a combination of browsing behavior (what's loaded and what not) and user agent string could make this a little more reliable but still not 100%. --JavaWoman
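A minimal sketch of what such a request "signature" could look like - the choice of headers, the /24 block granularity and the function name are only assumptions for illustration, not an agreed design:
%%(php)
// Hypothetical helper: hash a "signature" from the IP block plus request header data.
// Assumes an IPv4 REMOTE_ADDR; which headers to include is an open question.
function RequestSignature()
{
    $block = implode('.', array_slice(explode('.', $_SERVER['REMOTE_ADDR']), 0, 3)); // e.g. "192.0.2"
    $parts = array(
        $block,
        isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '',
        isset($_SERVER['HTTP_ACCEPT']) ? $_SERVER['HTTP_ACCEPT'] : '',
        isset($_SERVER['HTTP_ACCEPT_LANGUAGE']) ? $_SERVER['HTTP_ACCEPT_LANGUAGE'] : ''
    );
    return md5(implode('|', $parts)); // store and watch over time to judge how reliable it is
}
%%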


Revision [9726]

Edited on 2005-07-01 15:36:00 by JavaWoman [layout, new article]
Additions:
===Two Suggestions===
Explanation: The difference between a spambot and a real user is that a spambot just loads a page and doesn't analyse its content, so, with a spambot, none of the CSS files linked within the document will be loaded.

~-[[http://software.newsforge.com/article.pl?sid=05/06/21/1641223 Stemming the menace of wiki spamming]] - nice article by Rob Sutherland though it misses referrer spam. Some very clueful but also clueless comments.
Deletions:
===Two Suggestion===
Explanation: The difference between a spambot and a real user is that a spambot just loads a page and doesn't analyse its content, so, with a spambot, none of the CSS files linked within the document will be loaded.


Revision [8851]

Edited on 2005-06-06 10:03:02 by NilsLindenberg [link to BadBehavior]
Additions:
Bad Behavior is a set of PHP scripts which prevents spambots from accessing your site by analyzing their actual HTTP requests and comparing them to profiles from known spambots. (quote from the [[http://www.ioerror.us/software/bad-behavior/ homepage]])
~&copied to BadBehavior. --NilsLindenberg
Deletions:
Bad Behavior is a set of PHP scripts which prevents spambots from accessing your site by analyzing their actual HTTP requests and comparing them to profiles from known spambots. It goes far beyond User-Agent and Referer, however. Bad Behavior is available for several PHP-based software packages, and also can be integrated in seconds into any PHP script. (quote from the [[http://www.ioerror.us/software/bad-behavior/ homepage]])
~&Surely worth testing. --NilsLindenberg


Revision [8849]

Edited on 2005-06-06 07:42:15 by NilsLindenberg [added "bad behavior" suggestion]
Additions:
===Blocking Agents===
Bad Behavior is a set of PHP scripts which prevents spambots from accessing your site by analyzing their actual HTTP requests and comparing them to profiles from known spambots. It goes far beyond User-Agent and Referer, however. Bad Behavior is available for several PHP-based software packages, and also can be integrated in seconds into any PHP script. (quote from the [[http://www.ioerror.us/software/bad-behavior/ homepage]])
~&Surely worth testing. --NilsLindenberg


Revision [8664]

Edited on 2005-05-29 09:51:15 by JavaWoman [move to subcategory]
Additions:
CategoryWikka CategoryDevelopmentArchitecture
Deletions:
CategoryWikka
CategoryDevelopment


Revision [8529]

Edited on 2005-05-28 08:37:30 by JavaWoman [adding reference links]
Additions:
~-[[MoinMoin:AntiSpamGlobalSolution]] - see also [[http://esw.w3.org/topic/WikiSpam WikiSpam]]""<!--
Deletions:
""<!--


Revision [8528]

Edited on 2005-05-28 07:06:11 by JavaWoman [starting link list to Wiki anti-spam content filtering]
Additions:
~-[[http://esw.w3.org/topic/BadContent BadContent]] (//MoinMoin//)
Deletions:
~-[[http://esw.w3.org/topic/BadContent BadContent]]


Revision [8527]

Edited on 2005-05-28 07:03:49 by JavaWoman [starting link list to Wiki anti-spam content filtering]
Additions:
===Two Suggestion===
See also http://wiki.chongqed.org/Wikka.
Preliminary list of links to (apparent) content blocking systems in wikis (more as I find them):
~-[[http://esw.w3.org/topic/BadContent BadContent]]
--JavaWoman
Deletions:
==Two Suggestion==
see http://wiki.chongqed.org/Wikka.


Revision [7965]

Edited on 2005-05-09 10:19:02 by JavaWoman [no CAPTCHA!]
Additions:
~&Whether simply an image or an image turned into "ASCII art", an image-based CAPTCHA that does not provide an alternative is **inaccessible** for people with visual impairments. Read [[http://blog.carrolltech.org/archives/24 this story]] ("Oops, Google Did it Again") to see why we really should **not** use anything like that. --JavaWoman


Revision [7961]

Edited on 2005-05-09 07:42:37 by NilsLindenberg [added link]
Additions:
==Two Suggestion==
see http://wiki.chongqed.org/Wikka.


Revision [7647]

Edited on 2005-04-26 18:35:29 by JavaWoman [adding link]
Additions:
~-[[http://wiki.chongqed.org//WikiSpam chongqed.org wiki: WikiSpam]] - A nice overview of various approaches to prevent or fight wiki spam, with links to discussions on other wikis


Revision [7162]

Edited on 2005-04-09 07:26:22 by JavaWoman [link disappeared]
Additions:
""<!--
~-[[http://www.bluestack.org/ReferrerSpam ReferrerSpam]] - A Wakka page about preventing referrer spam ... which page is now spammed (ouch!) - page seems to have disappeared now ....-->""
Deletions:
~-[[http://www.bluestack.org/ReferrerSpam ReferrerSpam]] - A Wakka page about preventing referrer spam ... which page is now spammed (ouch!)


Revision [7161]

Edited on 2005-04-09 07:23:42 by JavaWoman [adding link]
Additions:
~-[[http://the.taoofmac.com.nyud.net:8090/space/HOWTO/Block%20Spam%20Referrers%20in%20PHP HOWTO: Block Spam Referrers in PHP]] - a method to block entire blocks of IP addresses (written for PhpWiki but easily adaptable)


Revision [6589]

Edited on 2005-03-08 09:03:12 by DotMG [See Referrer Spam and email_gathering spambots section]
Additions:
I am trying this technique to detect which referrer is spam and which isn't:
1) Add a field named remote_addr to the referrers table.
2) Modify the LogReferrer() method so that it also sets remote_addr = $_SERVER['REMOTE_ADDR'];
3) Change LoadReferrers() to return only records where the field remote_addr is blank.
4) Add this code to the header action: %%<link rel="stylesheet" href="<?php echo $this->Href("wikka.css", "pseudodir"); ?>" type="text/css" />%%
5) Create a file named wikka.css.php and put it at ./handler/page. This file will update the referrers table and set remote_addr to blank if remote_addr matches $_SERVER['REMOTE_ADDR'] and the referrer was logged within the last six minutes. The script will return a CSS file with no-cache, must-revalidate and expired headers and a random ETag, so that it is requested each time a page is requested. --DotMG
Explanation: The difference between a spambot and a real user is that a spambot just loads a page and doesn't analyse its content, so, with a spambot, none of the CSS files linked within the document will be loaded.
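A rough sketch of what step 5 might look like - the query, the table prefix handling and the header details are assumptions based on the description above, not tested code:
%%(php)
<?php
// Hypothetical sketch of ./handler/page/wikka.css.php: blank remote_addr for referrers
// logged by the same IP within the last six minutes, then serve a non-cacheable stylesheet.
$this->Query("UPDATE ".$this->config["table_prefix"]."referrers".
    " SET remote_addr = ''".
    " WHERE remote_addr = '".mysql_real_escape_string($_SERVER["REMOTE_ADDR"])."'".
    " AND time > DATE_SUB(NOW(), INTERVAL 6 MINUTE)");

// Headers force the browser to re-request the stylesheet on every page view.
header("Content-Type: text/css");
header("Cache-Control: no-cache, must-revalidate");
header("Expires: Thu, 01 Jan 1970 00:00:00 GMT");
header('ETag: "'.md5(uniqid(rand(), true)).'"');
echo "/* stylesheet served by wikka.css.php */\n";
?>
%%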
I would like to use an offensive attack against spam. My wikka would generate some false email addresses and present them as %%<a href="mailto:unexistingemail@thesnake.us" class="email">unexistingemail@thesnake.us</a>%%, and somewhere in the CSS files you will find
%%(css) .email {display: none;}%% so that the false emails won't be visible to human visitors of the site. The domain of the false email address will be either the domain name of a spam/porn site (to tax their bandwidth) or a non-existing domain. --DotMG
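A tiny sketch of the decoy-address idea - the helper name and the placeholder domain are made up:
%%(php)
// Hypothetical helper: emit a mailto link only harvesters will pick up,
// since .email {display: none;} hides it from human visitors.
function DecoyEmailLink()
{
    $address = 'user'.rand(1000, 9999).'@spamtrap.invalid'; // non-existent address
    return '<a href="mailto:'.$address.'" class="email">'.$address.'</a>';
}
%%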


Revision [6008]

Edited on 2005-02-15 19:02:21 by IanAndolina [Updated captcha section with a link to new article on a captcha-like technique of number<->text entr]
Additions:
~& Another Captcha-type technique here (and links to the Pear class to generate ascii-based captchas): http://www.phpguru.org/static/TextualNumbers.html — I have to say, a technique modified from this which uses multiple <span>s with display states defined in the (separate) CSS file seems quite powerful to me. A spambot would then have to parse the CSS too and work out which is the real number and which is "noise"...
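A rough sketch of that <span> idea - the class names, the session storage and the function name are assumptions for illustration only; the external stylesheet would contain e.g. .c-hide {display: none;} and .c-show {display: inline;}:
%%(php)
// Hypothetical sketch: real digits get class "c-show", decoy digits get class "c-hide";
// only the external CSS reveals which spans are visible, so a bot must parse the CSS too.
// Assumes an active session (session_start() already called).
function SpanCaptcha($length = 5)
{
    $real = '';
    $html = '';
    for ($i = 0; $i < $length; $i++)
    {
        $digit = rand(0, 9);
        $real .= $digit;
        $html .= '<span class="c-show">'.$digit.'</span>';
        $html .= '<span class="c-hide">'.rand(0, 9).'</span>'; // noise a bot has to filter out
    }
    $_SESSION['captcha'] = $real; // compare against the user's answer on submit
    return $html;
}
%%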
~-[[http://www.usemod.com/cgi-bin/mb.pl?WikiSpam Meatball — WikiSpam]]
~-[[http://c2.com/cgi/wiki?WikiSpam C2.com — WikiSpam]]
~-[[http://www.phpguru.org/static/TextualNumbers.html Using Numbers Converted to Text for a Captcha]]
Deletions:
~-[[http://www.usemod.com/cgi-bin/mb.pl?WikiSpam Meatball — WikiSpam]]
~-[[http://c2.com/cgi/wiki?WikiSpam C2.com — WikiSpam]]


Revision [5728]

Edited on 2005-02-07 10:11:58 by JavaWoman [comment about Wacko's supposed content filter]
Additions:
~&Mike, I see using the blacklist from chongqed.org mentioned as an option but I don't see any reference that this is what actually has been implemented - merely that //some// content filtering has been implemented. The best way to use chongqed.org is to use their blacklist dynamically. --JavaWoman


Revision [5713]

Edited on 2005-02-07 06:30:33 by GmBowen [content filter @ wacko wiki (tpyo)]
Additions:
==Content Filter==
Deletions:
====Content Filter====


Revision [5712]

Edited on 2005-02-07 06:29:48 by GmBowen [content filter @ wackowiki]
Additions:
=====Fighting spam in Wikka=====

>>**see also:**
~-HideReferrers
~-RemovingUsers
~-WikkaAndEmail
~-DeleteSpamAction
>>As it may have dawned on you by now, spam is getting to be a problem in wiki's - both the type of spam that also plagues many blogs in the form of comment spam (only in a wiki it would (also) affect page content), and referrer spam. And then there are spambots intent on gathering email addresses.

Wikka sites are no exception any more (and other WakkaWiki forks seem to be having problems, too).

This page is intended to gather ideas for how to fight spam (of all types) in Wikka, so we can coordinate our efforts and get a spammer-hardened Wikka out there. You can also find some general information about (fighting) wiki spam and what Wikka has already implemented as defense measures.

====Spam in Wikka pages====
~//About how to discourage spammers from posting spam links on pages in the first place, and what to do when your pages have been spammed already.//

====Content Filter====
Wacko wiki has implemented a content filter based on a word/phrase list. I'm not sure how sophisticated it is (it's not a Bayesian filter), but it uses a list updated from chongqed.org. Read more about it [[http://wackowiki.com/SPAM?v=1dlz here]]. I thought this might contribute to our conversations about spamfighting. --GmBowen

==Bayesian Filter: Focus on the content==
Many of these suggestions will stop a certain degree of spam, but spammers can easily break anti-spam measures such as adding random tokens (modern spam bots can already scan a page for form elements and submit all of them). Therefore, I suggest analyzing the content for what might constitute spam (text frequency, link frequency, blacklist, Bayesian filter) and then assigning a score to the post. If the post has over, let's say, a 50% chance of being spam, then perhaps email validation, post approval, or a captcha can be used to further validate the user.

I'm particularly supportive of the Bayesian filter. For instance, many spam-fighting programs today use a Bayesian filter (e.g. Thunderbird). The Bayesian algorithm is adaptive and learning, which works best when used in conjunction with other standard filters. The process might be like this:
1) The standard filters (e.g. a blacklist) catch a suspicious post. The post is marked for approval.
2) The admins review the post at the post moderation panel. If the post is "ham" then the Bayesian filter will automatically adapt to allow future posts that resemble the approved post through. However, if the post is "spam", then the Bayesian filter will automatically adapt to block future posts with those keywords.

Therefore, a Bayesian filter cannot be implemented on its own; rather, it requires admin intervention (to help the filter learn) and other standard filters.

Bayesian filters have been extremely successful in eliminating over 98% of common spam after a few weeks of adaptation.
--MikeXstudios
~&Nice idea Mike - but do you know of a Bayesian filter implementation in PHP that could be easily integrated with Wikka? Preferably "lightweight" too, as we don't want an anti-spam solution to be more weighty than Wikka itself. --JavaWoman
~~&In fact, I do :). I was playing around with [[http://www.phpgeek.com/pragmacms/index.php?layout=main&cslot_1=14 this small Bayesian filter I found last week]]. It's around 55KB unzipped. Implementing it in Wikka would be less-than easy however. --MikeXstudios
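As a very rough illustration of the scoring idea (this is a toy sketch, not the filter linked above; the function name is made up, and the per-word counts would be maintained by the admin moderation step described above):
%%(php)
// Toy naive-Bayes sketch: score a post from per-word spam/ham counts.
// $spamCounts/$hamCounts map words to counts; $totalSpam/$totalHam are message totals.
function SpamProbability($body, $spamCounts, $hamCounts, $totalSpam, $totalHam)
{
    $score = log(max($totalSpam, 1)) - log(max($totalHam, 1)); // prior log-odds
    foreach (preg_split('/\W+/', strtolower($body), -1, PREG_SPLIT_NO_EMPTY) as $word)
    {
        $s = (isset($spamCounts[$word]) ? $spamCounts[$word] : 0) + 1; // Laplace smoothing
        $h = (isset($hamCounts[$word])  ? $hamCounts[$word]  : 0) + 1;
        $score += log($s / ($totalSpam + 2)) - log($h / ($totalHam + 2));
    }
    return 1 / (1 + exp(-$score)); // above 0.5: queue the post for moderation
}
%%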

==Adding Random Tokens for Form Submissions?==
Based on [[http://shiflett.org/archive/96 this post]], I wonder whether providing randomised session tokens for form submission may provide just one more step to impede spambots. Very simple to implement:

wikka.php:
%%(php)function FormOpen($method = "", $tag = "", $formMethod = "post")
{
if(!isset($_SESSION['token'])) {
$token = md5(uniqid(rand(), true));
$_SESSION['token'] = $token;
}
$result = "<form action=\"".$this->Href($method, $tag)."\" method=\"".$formMethod."\"><p>\n";
$result .= "<input type=\"hidden\" name=\"token\" value=\"".$_SESSION['token']."\" />";
if (!$this->config["rewrite_mode"]) $result .= "<input type=\"hidden\" name=\"wakka\" value=\"".$this->MiniHref($method, $tag)."\" />\n";
return $result;
}%%

and then just wrap edit.php and addcomment.php sections using:
%%(php)if ($_POST['token'] == $_SESSION['token']) { //form spoof protection

}%%

I'm definitely no expert on security, and I can see how it can be bypassed, but it does require one more step and adds complexity for spambots to spoof the wiki forms at no cost... --IanAndolina
~&Good point, Ian. I had been thinking about a similar approach (I have a plugin for SquirrelMail installed that essentially does the same thing for the login dialog) - but reading through the comments on [[http://shiflett.org/archive/96 Chris Shiflett's article]] it's clear that this is no more than another little hurdle easily overcome by the script-writing link spammer (GET the page with the script first, read the token and use that in the scripted POST). That said, it //is// another hurdle that may deter at least some naive spammers - and with very little code. So, nice. --JavaWoman (who suddenly realizes her Squirrelmail isn't as secure as she thought it was - but at least has built in other layers of security)


==Refining Redirection / nofollow modification for links==
One issue with the google redirection and newer rel="nofollow" is that good sites also get hit by this procedure. As we can't really tag links on a "trusted user" basis, we have to do that on a trusted server one. I use a whitelist in config.php with a list of "good servers":

%%(php)"serverre" => "/(nontroppo.org|goodsite.com|etc)/",%%

And my Link routine in the main wakka.php (wikka.php) is modified to make use of it:

%%(php)if (preg_match($this->GetConfigValue("serverre"), $tag))
{
$url = $tag; //trusted web sites so no need for redirects
$urlclass= "ext";
}
else
{
$tag = rawurlencode($tag);
$url = "http://www.google.com/url?q=".$tag;
$urlclass= "ext";
$follow = " rel=\"nofollow\" ";
}
return $url ? "<a ".$follow." class=\"".$urlclass."\" href=\"".$url."\">$text</a>" : $text;%%

This way, trusted sites get full and unadulterated links, but anything else has BOTH google redirection and rel="nofollow" added. The CSS can then contain ways to visually tag those different URLs, so the user can see if a link is trusted or not (I use advanced generated content - not supported in IE):

%%(css)
a.ext:after, a[rel="nofollow"]:after {content:"\00220A";
text-decoration: none !important;
font-size: 0.9em;
color: #888;
position: relative;
bottom: 1ex;}

a[rel="nofollow"]:after {content:"\002209";}
%% -- IanAndolina

==Spam Block for Saving pages==
As I was getting a lot of repeat spam of the same domains over and over, I implemented a "link blacklist" to my Wiki for comments and edits:

add to edit.php & addcomment.php:
%%(php)preg_match_all($this->GetConfigValue("spamre"),$body,$out); //keyword spam block
if (count($out[0])>=1)
{
$this->SetMessage("Go spam somewhere else. You links will never get spidered here anyway.");
$this->redirect($this->href());
return;
}%%

config.php
%%(php)"spamre" => "/(voip99|zhiliaotuofa|mycv|princeofprussia|imobissimo|valeofglamorganconservatives|68l|8cx|online-deals99).(net|cn|com|org)|(phentermine)/m",%%

Now, what I wanted to do was have an //admin only// wiki page where the contents of the spamre regexp could be edited, instead of being hardwired in config.php - but I never got round to it. **But** this would be the better way to do it - have a function that finds a wiki page and builds a regexp from the keywords added by admins to that wiki page (not all of whom may have access to config.php). It is a fairly basic method - but with a couple of vigilant admins it can reduce repeat attacks from spam bots considerably. -- IanAndolina
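A rough sketch of that idea - the page name ""SpamBlackList"" is made up, and the fallback to the existing config value is an assumption:
%%(php)
// Hypothetical sketch: build the spam regexp from an admin-only wiki page instead of config.php.
// Assumes the usual LoadPage()/GetConfigValue() methods; one keyword per line, "#" starts a comment.
function BuildSpamRegexp()
{
    $page = $this->LoadPage('SpamBlackList');
    if (!$page) return $this->GetConfigValue('spamre'); // fall back to the config value
    $keywords = array();
    foreach (preg_split('/[\r\n]+/', $page['body']) as $line)
    {
        $line = trim($line);
        if ($line != '' && $line[0] != '#') $keywords[] = preg_quote($line, '/');
    }
    return $keywords ? '/('.implode('|', $keywords).')/im' : $this->GetConfigValue('spamre');
}
%%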

==User Validation==

I like the ascii-based user validation scheme (Captcha) here:

http://www.moztips.com/wiki/index.pcgi?action=edit&page=SandBox

I don't know how to do that in PHP (it is a PHP-based wiki I believe) - though the [[http://www.google.com/search?q=captcha+php&sourceid=opera&num=0&ie=utf-8&oe=utf-8 more complex image-based solutions]] are available. This for me is **far** preferable to locking pages for writing using ACLs - which IMO destroys the very purpose of the wiki. --IanAndolina

[Copied from SuggestionBox] "There's also code around that uses GD & that could be built onto Nils' code that generates a "registration password" automatically and outputs it as a distorted graphic image.....the code is intended to befuddle auto spam registers & thus stop open-registration sites from being hit by spam bots that register themselves as users. Ultimately, as the bots become more sophisticated I think we'll have to use something like that or else sites like this one (with open registration) will be victimized. [[http://www.bluecopia.com/form.php Here]] and [[http://www.horobey.com/demos/codegen/ here]] are examples of what I mean (I like the simplicity of the first version in the second example). -- GmBowen" .... I think at least we need a system like one of these (or like the one Ian suggests) on the user registration page. mb
~&[also copied from SuggestionBox] Yes, "Captcha" is an old trick - it will keep out some bots (but not all) and it will keep many people out, too, like those who are visually handicapped (not just people who are totally blind - being colorblind may be enough to be defeated by such tricks). Add a sound equivalent? Well, there are people who are deaf **and** blind. Are we going to deny them access to our wikis? I'm not in favor of making Wikka **in**accessible when we should be working towards making it **more** accessible. --JavaWoman.
~~&It was for this reason I suggested the ascii captcha - it uses much larger letters AND it is perfect for the colour-blind. Interestingly, it probably also is better in terms of fooling image recognition algorithms: the artificial observer will find it much harder to parse ascii-art as the elements have high contrast, conflicting orientations AND are unjoined. Thus the first steps towards object classification are greatly impeded. I do agree that captchas can reduce accessibility, but an ascii-based recognition pattern reduces some of the major obstacles I saw in the image-based ones. --IanAndolina
~~~&Ian, I looked at that Moztips link you mentioned - and even with my relatively good eyes I found the contrast **way** too low. And whether you use an image or "ascii art" - screen readers used by visually handicapped people cannot make anything of it - it's only marginally better than an image in that it may defeat OCR utilities used by some spambots, but I don't see it as any more accessible; it could even be less accessible, depending on the contrast in the image. --JavaWoman
~&But spam-flooded wiki's are also **in**accessible....we only have to look at wikka's parent to see that....so //some// sort of more elaborate system is needed to enhance overall accessibility. In the large, I see the overall job of developing wikka as a matter of providing a tool-kit that each wikka-owner can make the decisions around. Providing a registration system that (optionally) uses both a registration code as provided by the wikka-owner by email *or* some sort of visual validation system (I like the character size of captcha, and [[http://www.horobey.com/demos/codegen/v1/humancheck.php this]] one)...
~~&("this" one is quite horrible - I saw three glyphs, two the same that I could not guess were a 3 or a B and one in the middle that I could not decide was a 4 or an A. The only thing that can be said for it is that (at least the one sample I looked at) has high contrast... which doesn't help if the characters are not recognizable themselves -JW)
~~~&What I liked about it was that it didn't use GD (as others do, except captcha) but rather pulled from a series of image files, so it would be usable by all of our users.....so the graphics could be easily changed to a series that was more universally readable (thereby addressing your concerns)....I also noticed that some of the characters were difficult to discern. That also means, of course, that a bot could be set to read the source image file name...so a potential security shortfall, although code could be written to (a) pull the image from a MySQL database instead of a file & (b) have a "random" image file name assigned to it as it was read into the html page. --GmBowen
~&... at the __same__ time so a user could use either for registration would likely cover most bases for an open registration system, one with just the latter for teachers etc. that want one for their classes. --GmBowen
~~&I agree captcha **could** be an option (but it would be an extension, not Wikka "core"). Remember though that many people really hate solutions like this - they'll just leave (even if they are already part of a community) and refuse to post any more. A more stringent registration procedure (requiring email confirmation, and recording the user's IP) would be less invasive and sufficient to keep all but the most hardened spambots away; it would also ensure that people actually have a valid email address to use when they lose their password... Combine that with a "banning" option and even a spambot clever enough to sign up once with a valid email address can be kept out once discovered. --JavaWoman
~~~&Wouldn't it have to be automatic email confirmation though (for high-use sites)? Anyway, in the scenarios I work with (since we're discussing accessibility) that's a less desirable option. There are ethical issues for schools/teachers with "requiring" kids to have an email address (especially yahoo, hotmail, etc) because of the amount of adult-oriented spam they receive. [That's part of the reason I'm working on the private message system to be embedded in wikka.] Image verification thereby offers advantages that email verification does not, at least for the communities I'll be using this in. As far as "hating" procedures goes.....I dislike email verification (because of the possibility of being added to spam lists) **far** more than image verification. --GmBowen

==Spam repair and defense==
//See also DeleteSpamAction !//
1/22/05 - Spam help! I have apparently been attacked by an army of [[http://www.richardberg.net/RecentChanges spam bots]]. Has this happened to anyone else? For now, I am asking for your help with:
- a SQL command that will delete all of these edits
- a SQL command that will change all of my ACLs to '+' for writing and commenting (I've modified the config file but that only affects new pages AFAIK)

Whatever script they used (on multiple machines, no less) could certainly be used against any Wakka-like site with minimal modifications, so something has to be done...I will do what I can to help you guys combat future attacks as well as implement the new [[http://www.google.com/googleblog/2005/01/preventing-comment-spam.html HTML attribute]] you've probably all heard about. --RichardBerg

~&Richard: here's the sql to update all your current ACLs (I'm using mysql 4.0.22):
~~%%(sql) UPDATE acls SET comment_acl="+" WHERE comment_acl="*";
UPDATE acls SET write_acl="+" WHERE write_acl="*"; %%
~&You'll need to change the table name (acls) to match whatever your table is named. Give me a few to look at the pages table and your site and I should have the sql for removing the edits. :) -- MovieLady
~~&Since Richard has //already// changed his default ACLs in the configuration, that would apply to any page that did not have ACLs different from the **original** default (not merely new pages!); your SQL code should take care of any pages that had ACLs //different// from the original default (because only those would have a record in the ACLs table). --- See also JsnX's suggestion about "Clearing ACLs" on the SuggestionBox which explains how this mechanism works. Thanks, MovieLady! --JavaWoman
~~~& Correct. Both statements will change only the entries that had the default ACL from his config file in that field. (What the statements are looking for can, of course be changed, as can what the field is being set to. I used it when I went back and changed my default ACLs on all pages that had ACLs to disallow all banned users from writing or commenting after adding ACLsWithUserGroups.) --MovieLady

~&There is a relevant link to an action at wikini for removing additions by particular IP's or users at CommunityNotes.--GmBowen
~~& Thanks for the link! I've translated and made minor changes to the code, and posted everything to DeleteSpamAction. He's got a very good starting point, I think. One could adapt the code (fairly easily) to allow you to look at all the revisions on a page instead of by user/IP and then delete the histories you don't want to keep, for whatever reason. --MovieLady

==Stopping Spammers getting Google Juice==
[[http://simon.incutio.com/archive/2004/05/11/approved There is a technique]] to stop spammers from gaining any advantage from spamming, which is to redirect external links to stop them from affecting the spammers' PageRank. Great for defeating the whole purpose of spamming, but it has the disadvantage that good sites lose their google juice too. Check out the comments on that page for more cons. I've noticed since I enabled this on the Opera 7 wiki that spam volume has slowly dropped off, but I'm not entirely happy with the price paid. Had you thought about this - maybe have it as an option during config? -- IanAndolina

~&Good point, Ian. I had thought about this, after having seen several Wikis and blogs that use the Google redirection... I do think it should be configurable though - not every Wiki installation may want to do this (in fact, some may welcome external links as long as spam is repaired fast enough). --JavaWoman
~~&I asked an SEO expert and he replied that it should be enough to use a simple internal redirect (e.g. exit.php?url=...) to create this effect. He also said that it might be helpful to disallow any spider access to that file (robots.txt). -- ReimerStegelmann
~~~&Unfortunately, search engine robots these days mostly **do** follow URLs with parameters, and an "internal redirect" done that way would be retrieved by a bot; HTTP redirects are followed, too (which is what you'd have to use with that "internal redirect" method). **Meta** redirects mostly aren't but you cannot apply this as a general "redirect external links" (especially not since you cannot have any URL parameters in a meta redirect - and you want to allow all //valid// external links, merely have them not count towards page rank in search engines, mostly Google). Excluding a single file with robots.txt won't work since all of Wikka runs off the single wikka.php file. The Google redirect method gets around all of that (at least for Google's ranking mechanism - which is what spammers are mostly targeting). --JavaWoman
~~~~&They follow, but that is not the point of spam. The main target of a spammer is to reach a high ranking in search engines. They post links whose link text contains important keywords (e.g. [[Keyword1 keyword2 http://domain.tld]]). So, if you enter keyword1 or keyword2 into a search engine, you will see the homepage of the spammer. By using a simple redirect, spiders will follow the link, but they don't care about the keywords, and so the spammer doesn't care about the link.
~~~~~&Exactly - and using the Google redirect **prevents** the target page from getting a higher ranking from incoming (spam) links because it won't be counted at all. :) --JavaWoman
~~~~~~&Yeah, but you don't need Google to make this happen. A simple internal redirect is enough and looks better than a Google-Redirect ;)
~~~~~~~&Nope, because an internal redirect //will// be followed by Google and //still// count for page rank - that's the problem; the Google redirect prevents this. --JavaWoman
~~~~~~~~&I talked to Abakus, a German SEO expert, and he said it does not count. There is no difference between an internal redirect and a Google redirect. The keywords of the link (see above) only count for the redirect site and not for the link behind the redirect. And well, why should a spider follow an internal link (via exit.php?url=...), but not a Google redirect?
~~~~~~~~~& A spider will follow any redirect, whether it's local or through an internal redirect. Never mind the keywords, it's still a link //into// the spammed site; with a local redirect that won't make any difference, but with the Google redirect Google knows to **not** count it as an incoming link. It's not (just) about keywords but about **Page Rank** (PR) - and PR is highly dependent on incoming links (and where they come from). That much we know. But no one except some Google employees knows the exact algorithm that determines PR - not even Abakus ;-) --JavaWoman
~&Maybe the solution is [[http://www.google.com/googleblog/2005/01/preventing-comment-spam.html here]].
~&If a user is not registered, the attribute rel="nofollow" will be added to all external links he creates on the wiki.
~&This technique is now adopted by Google, Yahoo and MSN. --DotMG
~~&Thanks, DotMG! This is great news - I had seen this technique being discussed as a proposed possible solution but had missed the news the proposal has actually been adopted now. (Should we worry about Altavista? Probably not too much - these SEs are the ones spammers will target primarily.) One possible hole I can see is that a spammer might write a script to quickly register and then post on a number of pages - but scripted registrations can be defended against with other means. Nothing will probably provide a 100% solution but this is a big step in the right direction. --JavaWoman

====Referrer spam====
~//Spammers sometimes visit Wikis and blogs with a tool with "bogus" referer headers containing the sites they want to generate incoming links for - this works on many wikis and blogs since such sites often have a page listing referrers (wikis) or list referrers to a particular post (blogs). If a Search engine indexes such a page, it would find a link to the spammed site, resulting in a higher "score" for that spammed page.//

The general solution is to cause such links **not** to be followed by search engines. The technique outlined below under "Don't let old pages get indexed" already takes care of this for the referrer listings Wikka uses.

====Email-gathering spambots====
~//Spambots spider websites looking for email addresses to add to the list (to use, or to sell as a targeted list). A general defense that works well (though not 100%) is to "obfuscate" email addresses so such spambots don't recognize them.//

==Obfuscating addresses automatically==
Wikka 1.1.6.0 comes with a small action to create an obfuscated email "contact" link for the site administrator. Meanwhile, the formatter will simply turn every email address //it// recognizes into an email link (with the address also used for the link text) - providing nice fodder for spambots.

What we should have is a function that can turn a given email address into an obfuscated link - this could then be used by both the ""{{contact}}"" action and the formatter. It would (then) also enable us to change the obfuscating algorithm inside the function without affecting either the formatter or the contact action any more, and others could use it in their own extensions as well. --JavaWoman
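A minimal sketch of such a function - the name and the numeric-entity scheme are just placeholders for whatever obfuscation is eventually chosen:
%%(php)
// Hypothetical helper usable by both the {{contact}} action and the formatter:
// encode every character of the address as a numeric entity so harvesters that
// don't decode entities miss it. The function name is made up.
function ObfuscatedEmailLink($address, $text = '')
{
    $encoded = '';
    for ($i = 0; $i < strlen($address); $i++)
    {
        $encoded .= '&#'.ord($address[$i]).';';
    }
    if ($text == '') $text = $encoded;
    return '<a href="mailto:'.$encoded.'" class="email">'.$text.'</a>';
}
%%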

----
====Resolved Suggestions====
~//Spam-defense measures that are already implemented in Wikka.//

==Don't let old pages get indexed==
//Extended method implemented as of Wikka 1.1.6.0 (Both the "noarchive" addition and applying it to the Sandbox)//

To make absolutely sure old pages don't get archived (irrespective of your robots.txt) - essential to stop WikiSpam from still getting juice from archived pages - why not add meta directives to those pages with something like:
%%(php)
<?php if ($this->GetMethod() != 'show' || $this->page["latest"] == "N") echo "<meta name=\"robots\" content=\"noindex, nofollow, noarchive\" />\n<meta name=\"googlebot\" content=\"noarchive, noindex, nofollow\">\n";?>%%
to header.php. This stops pages with handlers other than show, or non-current pages, from any kind of archiving/caching.

~&Ian, thanks for the suggestion. Wikka has had something similar to this in place since the first release. See Mod033bRobotIndexing. But your suggestion expands the idea and adds the latest page check, "noarchive", and the googlebot part--which seem like good ideas. I'll add this to the upcoming release. By the way, when are you going to switch your site over to Wikka? ;) -- JsnX
~~&Yes, nice idea. But the googlebot part is actually redundant, Google obeys the robots meta directives. (And that second meta tag isn't valid XHTML - it's unclosed.) I suggest we merely add the "noarchive". Apart from that, it would also be nice to stop indexing etc. from the SandBox page. --JavaWoman
~~~&The latest page check is important because wiki spammers don't really care if you delete their spam, as long as their links sit on an old archived page waiting to be indexed. The added googlebot directive (thanks for spotting typo btw) is just **extra paranoia on my part** :). And you are all doing an **excellent** job with Wikka - the only reason I haven't switched is that quite a lot on my Wakka is heavily customised and I don't have the time to redo that - especially as lots of pages would break without re-jigging of e.g. SafeHTML (my BookMarklets page for example). If I have time, I will eventually migrate...! -- IanAndolina

----
====Further references====
~//Where to read more about Wiki spam.//

~-[[http://www.usemod.com/cgi-bin/mb.pl?WikiSpam Meatball — WikiSpam]]
~-[[http://c2.com/cgi/wiki?WikiSpam C2.com — WikiSpam]]

~-[[http://chongq.blogspot.com/ Goggle Ending Comment Spam]] [sic]
~-[[http://chongqed.org/ chongqed.org]]
~-[[http://blacklist.chongqed.org/ chongqed.org blacklist]] - use this dynamically
~-[[http://chongqed.org/submit.html Submit a wiki spammer]] - All your page ranks are belong to us!

~-[[http://wackowiki.com/SPAM/ Wacko Wiki - SPAM]]

~-[[http://www.bluestack.org/ReferrerSpam ReferrerSpam]] - A Wakka page about preventing referrer spam ... which page is now spammed (ouch!)

~-[[http://www.theregister.co.uk/2005/01/31/link_spamer_interview/ Interview with a Link-Spammer]]

----
CategoryWikka
Deletions:
=====Fighting spam in Wikka=====

>>**see also:**
~-HideReferrers
~-RemovingUsers
~-WikkaAndEmail
~-DeleteSpamAction
>>As it may have dawned on you by now, spam is getting to be a problem in wiki's - both the type of spam that also plagues many blogs in the form of comment spam (only in a wiki it would (also) affect page content), and referrer spam. And then there are spambots intent on gathering email addresses.

Wikka sites are no exception any more (and other WakkaWiki forks seem to be having problems, too).

This page is intended to gather ideas for how to fight spam (of all types) in Wikka, so we can coordinate our efforts and get a spammer-hardened Wikka out there. You can also find some general information about (fighting) wiki spam and what Wikka has already implemented as defense measures.

====Spam in Wikka pages====
~//About how to discourage spammers from posting spam links on pages in the first place, and what to do when your pages have been spammed already.//

==Bayesian Filter: Focus on the content==
Many of these suggestions will stop a certain degree of spam, but spammers can easily break anti-spam measures such as adding random tokens (modern spam bots can already scan a page for form elements and submit all of them). Therefore, I suggest analyzing the content for what might constitute spam (text frequency, link frequency, blacklist, Bayesian filter) and then assigning a score to the post. If the post has over, let's say, a 50% chance of being spam, then perhaps email validation, post approval, or a captcha can be used to further validate the user.

I'm particularly supportive of the Bayesian filter. For instance, many spam-fighting programs today use a Bayesian filter (e.g. Thunderbird). The Bayesian algorithm is adaptive and learning, which works best when used in conjunction with other standard filters. The process might be like this:
1) The standard filters (e.g. a blacklist) catch a suspicious post. The post is marked for approval.
2) The admins review the post at the post moderation panel. If the post is "ham" then the Bayesian filter will automatically adapt to allow future posts that resemble the approved post through. However, if the post is "spam", then the Bayesian filter will automatically adapt to block future posts with those keywords.

Therefore, a Bayesian filter cannot be implemented on its own; rather, it requires admin intervention (to help the filter learn) and other standard filters.

Bayesian filters have been extremely successful in eliminating over 98% of common spam after a few weeks of adaptation.
--MikeXstudios
~&Nice idea Mike - but do you know of a Bayesian filter implementation in PHP that could be easily integrated with Wikka? Preferably "lightweight" too, as we don't want an anti-spam solution to be more weighty than Wikka itself. --JavaWoman
~~&In fact, I do :). I was playing around with [[http://www.phpgeek.com/pragmacms/index.php?layout=main&cslot_1=14 this small Bayesian filter I found last week]]. It's around 55KB unzipped. Implementing it in Wikka would be less-than easy however. --MikeXstudios

==Adding Random Tokens for Form Submissions?==
Based on [[http://shiflett.org/archive/96 this post]], I wonder whether providing randomised session tokens for form submission may provide just one more step to impede spambots. Very simple to implement:

wikka.php:
%%(php)function FormOpen($method = "", $tag = "", $formMethod = "post")
{
	// generate one token per session (assumes the PHP session is already started)
	if (!isset($_SESSION['token'])) {
		$token = md5(uniqid(rand(), true));
		$_SESSION['token'] = $token;
	}
	$result = "<form action=\"".$this->Href($method, $tag)."\" method=\"".$formMethod."\"><p>\n";
	// embed the token as a hidden field so it is sent back with the POST
	$result .= "<input type=\"hidden\" name=\"token\" value=\"".$_SESSION['token']."\" />";
	if (!$this->config["rewrite_mode"]) $result .= "<input type=\"hidden\" name=\"wakka\" value=\"".$this->MiniHref($method, $tag)."\" />\n";
	return $result;
}%%

and then just wrap edit.php and addcomment.php sections using:
%%(php)if (isset($_POST['token']) && $_POST['token'] == $_SESSION['token']) { //form spoof protection
	// ... existing edit.php / addcomment.php code goes here ...
}%%

I'm definitely no expert on security, and I can see how it can be bypassed, but it does require one more step and adds complexity for spambots to spoof the wiki forms at no cost... --IanAndolina
~&Good point, Ian. I had been thinking about a similar approach (I have a plugin for SquirrelMail installed that essentially does the same thing for the login dialog) - but reading through the comments on [[http://shiflett.org/archive/96 Chris Shiflett's article]] it's clear that this is no more than another little hurdle easily overcome by the script-writing link spammer (GET the page with the script first, read the token and use that in the scripted POST). That said, it //is// another hurdle that may deter at least some naive spammers - and with very little code. So, nice. --JavaWoman (who suddenly realizes her Squirrelmail isn't as secure as she thought it was - but at least has built in other layers of security)
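
One small way to raise that hurdle a little further - a hedged sketch only, not part of Ian's patch - is to make the token single-use, so a token harvested with a scripted GET cannot be replayed indefinitely:

%%(php)// sketch: discard the token after a successful check so it cannot be reused;
// FormOpen() above will then issue a fresh token on the next page load
if (isset($_POST['token']) && $_POST['token'] === $_SESSION['token'])
{
	unset($_SESSION['token']); // single-use token
	// ... existing edit.php / addcomment.php code goes here ...
}
else
{
	$this->redirect($this->href()); // bounce the spoofed post back to the page
}%%

The trade-off is that a user editing in two browser tabs at once could trip the check, so this would have to stay optional.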


==Refining Redirection / nofollow modification for links==
One issue with the Google redirection and the newer rel="nofollow" is that good sites also get hit by this procedure. As we can't really tag links on a "trusted user" basis, we have to do it on a trusted-server basis. I use a whitelist in config.php with a list of "good servers":

%%(php)"serverre" => "/(nontroppo.org|goodsite.com|etc)/",%%

And my Link routine in the main wakka.php (wikka.php) is modified to make use of it:

%%(php)$follow = ""; // no rel="nofollow" for trusted sites
if (preg_match($this->GetConfigValue("serverre"), $tag))
{
	$url = $tag; //trusted web sites so no need for redirects
	$urlclass = "ext";
}
else
{
	$tag = rawurlencode($tag);
	$url = "http://www.google.com/url?q=".$tag; // untrusted: send through the Google redirect
	$urlclass = "ext";
	$follow = " rel=\"nofollow\" ";
}
return $url ? "<a ".$follow." class=\"".$urlclass."\" href=\"".$url."\">$text</a>" : $text;%%

This way, trusted sites get full and unadulterated links, but anything else has BOTH google redirection and rel="nofollow" added. The CSS can then contain ways to visually tag those different URLs, so the user can see if a link is trusted or not (I use advanced generated content - not supported in IE):

%%(css)
a.ext:after, a[rel="nofollow"]:after {content:"\00220A";
text-decoration: none !important;
font-size: 0.9em;
color: #888;
position: relative;
bottom: 1ex;}

a[rel="nofollow"]:after {content:"\002209";}
%% -- IanAndolina

==Spam Block for Saving pages==
As I was getting a lot of repeat spam from the same domains over and over, I implemented a "link blacklist" in my Wiki for comments and edits:

add to edit.php & addcomment.php:
%%(php)preg_match_all($this->GetConfigValue("spamre"),$body,$out); //keyword spam block
if (count($out[0]) >= 1)
{
	// at least one blacklisted domain or keyword matched: refuse the save
	$this->SetMessage("Go spam somewhere else. Your links will never get spidered here anyway.");
	$this->redirect($this->href());
	return;
}%%

config.php
%%(php)"spamre" => "/(voip99|zhiliaotuofa|mycv|princeofprussia|imobissimo|valeofglamorganconservatives|68l|8cx|online-deals99).(net|cn|com|org)|(phentermine)/m",%%

Now, what I wanted to do was have an //admin only// wiki page, where the contents of the spamre regexp could be edited, instead of being hardwired in config.php - but never got round to it. **But** this would be the better way to do it - have a function that finds a wiki page and builds a regexp from the keywords added by admins to that wiki page (not all of whom may have access to config.php). It is a fairly basic method - but with a couple of vigilant admins it can reduce repeat attacks from spam bots considerably. -- IanAndolina
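
A rough sketch of what such a function might look like, assuming Wikka's LoadPage() method and a hypothetical admin-only page called SpamKeywords holding one keyword or domain per line (both the function and the page name are invented for illustration):

%%(php)// Sketch: build the spam regexp from an admin-only wiki page instead of config.php.
// "SpamKeywords" is an invented page name; LoadPage() is assumed to return the page record.
function BuildSpamRegex($wakka)
{
	$page = $wakka->LoadPage("SpamKeywords");
	if (!$page) return "";
	$keywords = array();
	foreach (preg_split("/[\r\n]+/", $page["body"], -1, PREG_SPLIT_NO_EMPTY) as $line)
	{
		$line = trim($line);
		if ($line != "") $keywords[] = preg_quote($line, "/");
	}
	if (count($keywords) == 0) return "";
	return "/(".implode("|", $keywords).")/im";
}

// in edit.php / addcomment.php, instead of GetConfigValue("spamre"):
// $spamre = BuildSpamRegex($this);
// if ($spamre != "" && preg_match($spamre, $body)) { /* refuse the save, as above */ }%%

The write ACL on that page would of course have to be restricted to admins, or the keyword list itself becomes a spam target.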

==User Validation==

I like the ascii-based user validation scheme (Captcha) here:

http://www.moztips.com/wiki/index.pcgi?action=edit&page=SandBox

I don't know how to do that in PHP (it is a PHP-based wiki I believe) - though the [[http://www.google.com/search?q=captcha+php&sourceid=opera&num=0&ie=utf-8&oe=utf-8 more complex image based solutions]] are available. This for me is **far** preferable to locking pages for writing using ACLs - which IMO destroys the very purpose of the wiki. --IanAndolina

[Copied from SuggestionBox] "There's also code around that uses GD & that could be built onto Nils' code that generates a "registration password" automatically and outputs it as a distorted graphic image.....the code is intended to befuddle auto spam registers & thus stop open-registration sites from being hit by spam bots that register themselves as users. Ultimately, as the bots become more sophisticated I think we'll have to use something like that or else sites like this one (with open registration) will be victimized. [[http://www.bluecopia.com/form.php Here]] and [[http://www.horobey.com/demos/codegen/ here]] are examples of what I mean (I like the simplicity of the first version in the second example). -- GmBowen" .... I think at least we need a system like one of these (or like the one Ian suggests) on the user registration page. mb
~&[also copied from SuggestionBox] Yes, "Captcha" is an old trick - it will keep out some bots (but not all) and it will keep many people out, too, like those who are visually handicapped (not just people who are totally blind - being colorblind may be enough to be defeated by such tricks). Add a sound equivalent? Well, there are people who are deaf **and** blind. Are we going to deny them access to our wikis? I'm not in favor of making Wikka **in**accessible when we should be working towards making it **more** accessible. --JavaWoman.
~~&It was for this reason I suggested the ascii captcha - it uses much larger letters AND it is perfect for the colour-blind. Interestingly, it probably also is better in terms of fooling image recognition algorithms: an artificial observer will find it much harder to parse ascii-art, as elements have high contrast, conflicting orientations AND are unjoined. Thus the first step of object classification is greatly impeded. I do agree that captchas can reduce accessibility, but an ascii-based recognition pattern reduces some of the major obstacles I saw in the image-based ones. --IanAndolina
~~~&Ian, I looked at that Moztips link you mentioned - and even with my relatively good eyes I found the contrast **way** too low. And whether you use an image or "ascii art" - screen readers used by visually handicapped people cannot make anything of it - it's only marginally better than an image in that it may defeat OCR utilities used by some spambots, but I don't see it as any more accessible, it could even be less accessible, depending on contrast in the image. --JavaWoman
~&But spam-flooded wikis are also **in**accessible....we only have to look at wikka's parent to see that....so //some// sort of more elaborate system is needed to enhance overall accessibility. In the large, I see the overall job of developing wikka as a matter of providing a tool-kit that each wikka-owner can make the decisions around. Providing a registration system that (optionally) uses both a registration code as provided by the wikka-owner by email *or* some sort of visual validation system (I like the character size of captcha, and [[http://www.horobey.com/demos/codegen/v1/humancheck.php this]] one)...
~~&("this" one is quite horrible - I saw three glyphs, two the same that I could not guess were a 3 or a B and one in the middle that I could not decide was a 4 or an A. The only thing that can be said for it is that (at least the one sample I looked at) has high contrast... which doesn't help if the characters are not recognizable themselves -JW)
~~~&What I liked about it was that it didn't use GD (as others do, except captcha), so would be usable to all of our users, but rather pulled on a series of image files.....so the graphics could be easily changed to a series that was more universally readable (thereby addressing your concerns)....I also noticed that some of the characters were difficult to discern. That also means, of course, that a bot could be set to read the source image file name...so a potential security shortfall, although code could be written to (a) pull the image from an mysql database instead of a file & (b) have a "random" image file name assigned to it as it was read into the html page. --GmBowen
~&... at the __same__ time so a user could use either for registration would likely cover most bases for an open registration system, one with just the latter for teachers etc. that want one for their classes. --GmBowen
~~&I agree captcha **could** be an option (but it would be an extension, not Wikka "core"). Remember though that many people really hate solutions like this - they'll just leave (even if they are already part of a community) and refuse to post any more. A more stringent registration procedure (requiring email confirmation, and recording user's IP) would be less invasive and sufficient to keep all but the most hardened spambots away; it would also ensure that people actually have a valid email address to use when they lose their password... Combine that with a "banning" option and even a spambot clever enough to sign up once with a valid email address can be kept out once discovered. --JavaWoman
~~~&Wouldn't it have to be automatic email confirmation though (for high-use sites)? Anyways, in the scenarios I work with (since we're discussing accessibility) that's a less desirable option. There are ethical issues for schools/teachers with "requiring" kids to have an email address (especially yahoo, hotmail, etc) because of the amount of adult-oriented spam they receive. [That's part of the reason I'm working on the private message system to be embedded in wikka.] Image verification thereby offers advantages that email verification does not, at least for the communities I'll be using this in. As far as "hating" procedures goes.....I dislike email verification (because of the possibility of being added to spam lists) **far** more than image verification. --GmBowen

==Spam repair and defense==
//See also DeleteSpamAction !//
1/22/05 - Spam help! I have apparently been attacked by an army of [[http://www.richardberg.net/RecentChanges spam bots]]. Has this happened to anyone else? For now, I am asking for your help with:
- a SQL command that will delete all of these edits
- a SQL command that will change all of my ACLs to '+' for writing and commenting (I've modified the config file but that only affects new pages AFAIK)

Whatever script they used (on multiple machines, no less) could certainly be used against any Wakka-like site with minimal modifications, so something has to be done...I will do what I can to help you guys combat future attacks as well as implement the new [[http://www.google.com/googleblog/2005/01/preventing-comment-spam.html HTML attribute]] you've probably all heard about. --RichardBerg

~&Richard: here's the sql to update all your current ACLs (I'm using mysql 4.0.22):
~~%%(sql) UPDATE acls SET comment_acl="+" WHERE comment_acl="*";
UPDATE acls SET write_acl="+" WHERE write_acl="*"; %%
~&You'll need to change the table name (acls) to match whatever your table is named. Give me a few to look at the pages table and your site and I should have the sql for removing the edits. :) -- MovieLady
~~&Since Richard has //already// changed his default ACLs in the configuration, that would apply to any page that did not have ACLs different from the **original** default (not merely new pages!); your SQL code should take care of any pages that had ACLs //different// from the original default (because only those would have a record in the ACLs table). --- See also JsnX's suggestion about "Clearing ACLs" on the SuggestionBox which explains how this mechanism works. Thanks, MovieLady! --JavaWoman
~~~& Correct. Both statements will change only the entries that had the default ACL from his config file in that field. (What the statements are looking for can, of course be changed, as can what the field is being set to. I used it when I went back and changed my default ACLs on all pages that had ACLs to disallow all banned users from writing or commenting after adding ACLsWithUserGroups.) --MovieLady

~&There is a relevant link to an action at wikini for removing additions by particular IP's or users at CommunityNotes.--GmBowen
~~& Thanks for the link! I've translated and made minor changes to the code, and posted everything to DeleteSpamAction. He's got a very good starting point, I think. One could adapt the code (fairly easily) to allow you to look at all the revisions on a page instead of by user/IP and then delete the histories you don't want to keep, for whatever reason. --MovieLady

==Stopping Spammers getting Google Juice==
[[http://simon.incutio.com/archive/2004/05/11/approved There is a technique]] to stop spammers from gaining any advantage from spamming, which is to redirect external links so that they stop affecting PageRank. Great for defeating the whole purpose of spamming, but it has the disadvantage that good sites lose their google juice too. Check out the comments on that page for more cons. Since I enabled this on the Opera 7 wiki, the spam volume has slowly dropped off, but I'm not entirely happy with the price paid. Had you thought about this, maybe have it as an option during config? -- IanAndolina

~&Good point, Ian. I had thought about this, after having seen several Wikis and blogs that use the Google redirection... I do think it should be configurable though - not every Wiki installation may want to do this (in fact, some may welcome external links as long as spam is repaired fast enough). --JavaWoman
~~&I asked an SEO expert and he replied that it should be enough to use a simple internal redirect (e.g. exit.php?url=...) to create this effect. He also said that it might be helpful to disallow any spider access to that file (robots.txt). -- ReimerStegelmann
~~~&Unfortunately, search engine robots these days mostly **do** follow URLs with parameters, and an "internal redirect" done that way would be retrieved by a bot; HTTP redirects are followed, too (which is what you'd have to use with that "internal redirect" method). **Meta** redirects mostly aren't but you cannot apply this as a general "redirect external links" (especially not since you cannot have any URL parameters in a meta redirect - and you want to allow all //valid// external links, merely have them not count towards page rank in search engines, mostly Google). Excluding a single file with robots.txt won't work since all of Wikka runs off the single wikka.php file. The Google redirect method gets around all of that (at least for Google's ranking mechanism - which is what spammers are mostly targeting). --JavaWoman
~~~~&They follow, but that is not the point of spam. The main goal of a spammer is to reach a high ranking in search engines. They post links whose link text contains important keywords (e.g. [[Keyword1 keyword2 http://domain.tld]]). So, if you enter keyword1 or keyword2 into a search engine, you will see the homepage of the spammer. By using a simple redirect, spiders will follow the link, but the keywords no longer count for the spammer's page - and so the spammer no longer cares about the link.
~~~~~&Exactly - and using the Google redirect **prevents** the target page from getting a higher ranking from incoming (spam) links because it won't be counted at all. :) --JavaWoman
~~~~~~&Yeah, but you don't need Google to make this happen. A simple internal redirect is enough and looks better than a Google-Redirect ;)
~~~~~~~&Nope, because an internal redirect //will// be followed by Google and //still// count for page rank - that's the problem; the Google redirect prevents this. --JavaWoman
~~~~~~~~&I talked to Abakus, a German SEO expert, and he said it does not count. There is no difference between an internal redirect and a Google redirect. The keywords of the link (see above) only count for the redirect page and not for the link behind the redirect. And anyway, why should a spider follow an internal redirect (via exit.php?url=...), but not a Google redirect?
~~~~~~~~~& A spider will follow any redirect, whether it's a local one or the Google one. Never mind the keywords, it's still a link //into// the spammed site; with a local redirect that won't make any difference, but with the Google redirect Google knows to **not** count it as an incoming link. It's not (just) about keywords but about **Page Rank** (PR) - and PR is highly dependent on incoming links (and where they come from). That much we know. But no one except some Google employees knows the exact algorithm that determines PR - not even Abakus ;-) --JavaWoman
~&Maybe the solution is [[http://www.google.com/googleblog/2005/01/preventing-comment-spam.html here]].
~&If a user is not registered, all external links he creates on the wiki will get the attribute rel="nofollow" added.
~&This technique is now adopted by Google, Yahoo and MSN. --DotMG
~~&Thanks, DotMG! This is great news - I had seen this technique being discussed as a proposed possible solution but had missed the news the proposal has actually been adopted now. (Should we worry about Altavista? Probably not too much - these SEs are the ones spammers will target primarily.) One possible hole I can see is that a spammer might write a script to quickly register and then post on a number of pages - but scripted registrations can be defended against with other means. Nothing will probably provide a 100% solution but this is a big step in the right direction. --JavaWoman
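
Deciding this per link author would require tracking who added each link. A simpler approximation - a sketch only, and not quite DotMG's rule - is to add rel="nofollow" to every external link whenever the page is rendered for a visitor who is not logged in, which covers search-engine bots, the only readers whose view matters for ranking. This assumes Wikka's GetUser() method returns the logged-in user (if any) and is an illustration, not the shipped implementation:

%%(php)// sketch: inside the Link routine, when building an external link
$follow = "";
if (!$this->GetUser()) // anonymous request (e.g. a search-engine bot): add nofollow
{
	$follow = " rel=\"nofollow\" ";
}
return "<a".$follow." class=\"ext\" href=\"".$url."\">$text</a>";%%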

====Referrer spam====
~//Spammers sometimes visit Wikis and blogs with a tool with "bogus" referer headers containing the sites they want to generate incoming links for - this works on many wikis and blogs since such sites often have a page listing referrers (wikis) or list referrers to a particular post (blogs). If a Search engine indexes such a page, it would find a link to the spammed site, resulting in a higher "score" for that spammed page.//

The general solution is to cause such links **not** to be followed by search engines. The technique outlined below under "Don't let old pages get indexed" already takes care of this for the referrer listings Wikka uses.

====Email-gathering spambots====
~//Spambots spider websites looking for email addresses to add to the list (to use, or to sell as a targeted list). A general defense that works well (though not 100%) is to "obfuscate" email addresses so such spambots don't recognize them.//

==Obfuscating addresses automatically==
Wikka 1.1.6.0 comes with a small action to create an obfuscated email "contact" link for the site administrator. Meanwhile, the formatter will simply turn every email address //it// recognizes into an email link (with the address also used for the link text) - providing nice fodder for spambots.

What we should have is a function that can turn a given email address into an obfuscated link - this could then be used by both the ""{{contact}}"" action and the formatter. It would (then) also enable us to change the obfuscating algorithm inside the function without affecting either the formatter or the contact action, and others could use this in their own extensions as well. --JavaWoman
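
A minimal sketch of what such a function might look like, using entity-encoding of every character as the obfuscation step (the function name is invented and this is not the eventual Wikka implementation):

%%(php)// Hypothetical helper: turn an address into an entity-encoded mailto: link.
// Encoding every character defeats the simpler address harvesters (though not all of them).
function ObfuscatedMailLink($address, $text = "")
{
	$encoded = "";
	for ($i = 0; $i < strlen($address); $i++)
	{
		$encoded .= "&#".ord($address[$i]).";";
	}
	if ($text == "") $text = $encoded; // reuse the obfuscated address as the link text
	return "<a href=\"mailto:".$encoded."\">".$text."</a>";
}

// both the contact action and the formatter could then call:
// echo ObfuscatedMailLink("user@example.com");%%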

----
====Resolved Suggestions====
~//Spam-defense measures that are already implemented in Wikka.//

==Don't let old pages get indexed==
//Extended method implemented as of Wikka 1.1.6.0 (Both the "noarchive" addition and applying it to the Sandbox)//

To make absolutely sure old pages don't get archived (irrespective of your robots.txt) - essential to stop WikiSpam from still getting juice from archived pages - why not add meta directives to those pages with something like:
%%(php)
<?php if ($this->GetMethod() != 'show' || $this->page["latest"] == "N") echo "<meta name=\"robots\" content=\"noindex, nofollow, noarchive\" />\n<meta name=\"googlebot\" content=\"noarchive, noindex, nofollow\">\n";?>%%
to header.php. This stops pages served by handlers other than show, as well as non-current page revisions, from any kind of archiving/caching.

~&Ian, thanks for the suggestion. Wikka has had something similar to this in place since the first release. See Mod033bRobotIndexing. But your suggestion expands the idea and adds the latest page check, "noarchive", and the googlebot part--which seem like good ideas. I'll add this to the upcoming release. By the way, when are you going to switch your site over to Wikka? ;) -- JsnX
~~&Yes, nice idea. But the googlebot part is actually redundant, Google obeys the robots meta directives. (And that second meta tag isn't valid XHTML - it's unclosed.) I suggest we merely add the "noarchive". Apart from that, it would also be nice to stop indexing etc. from the SandBox page. --JavaWoman
~~~&The latest page check is important because wiki spammers don't really care if you delete their spam, as long as their links sit on an old archived page waiting to be indexed. The added googlebot directive (thanks for spotting typo btw) is just **extra paranoia on my part** :). And you are all doing an **excellent** job with Wikka - the only reason I haven't switched is that quite a lot on my Wakka is heavily customised and I don't have the time to redo that - especially as lots of pages would break without re-jigging of e.g. SafeHTML (my BookMarklets page for example). If I have time, I will eventually migrate...! -- IanAndolina

----
====Further references====
~//Where to read more about Wiki spam.//

~-[[http://www.usemod.com/cgi-bin/mb.pl?WikiSpam Meatball — WikiSpam]]
~-[[http://c2.com/cgi/wiki?WikiSpam C2.com — WikiSpam]]

~-[[http://chongq.blogspot.com/ Goggle Ending Comment Spam]] [sic]
~-[[http://chongqed.org/ chongqed.org]]
~-[[http://blacklist.chongqed.org/ chongqed.org blacklist]] - use this dynamically
~-[[http://chongqed.org/submit.html Submit a wiki spammer]] - All your page ranks are belong to us!

~-[[http://wackowiki.com/SPAM/ Wacko Wiki - SPAM]]

~-[[http://www.bluestack.org/ReferrerSpam ReferrerSpam]] - A Wakka page about preventing referrer spam ... which page is now spammed (ouch!)

~-[[http://www.theregister.co.uk/2005/01/31/link_spamer_interview/ Interview with a Link-Spammer]]

----
CategoryWikka


Revision [5695]

Edited on 2005-02-06 22:06:52 by MikeXstudios [added my name to comment]
Additions:
~~&In fact, I do :). I was playing around with [[http://www.phpgeek.com/pragmacms/index.php?layout=main&cslot_1=14 this small Bayesian filter I found last week]]. It's around 55KB unzipped. Implementing it in Wikka would be less-than easy however. --MikeXstudios
Deletions:
~~&In fact, I do :). I was playing around with [[http://www.phpgeek.com/pragmacms/index.php?layout=main&cslot_1=14 this small Bayesian filter I found last week]]. It's around 55KB unzipped. Implementing it in Wikka would be less-than easy however.


Revision [5694]

Edited on 2005-02-06 22:02:43 by MikeXstudios [added my name to comment]
Additions:
~~&In fact, I do :). I was playing around with [[http://www.phpgeek.com/pragmacms/index.php?layout=main&cslot_1=14 this small Bayesian filter I found last week]]. It's around 55KB unzipped. Implementing it in Wikka would be less-than easy however.


Revision [5688]

Edited on 2005-02-06 17:34:25 by JavaWoman [layout]
Additions:
~&Nice idea Mike - but do you know of a Bayesian filter implementation in PHP that could be easily integrated with Wikka? Preferably "lightweight" too, as we don't want an anti-spam solution to be more weighty than Wikka itself. --JavaWoman
Deletions:
-&Nice idea Mike - but do you know of a Bayesian filter implementation in PHP that could be easily integrated with Wikka? Preferably "lightweight" too, as we don't want an anti-spam solution to be more weighty than Wikka itself. --JavaWoman


Revision [5687]

Edited on 2005-02-06 17:33:56 by JavaWoman [layout]
Additions:
-&Nice idea Mike - but do you know of a Bayesian filter implementation in PHP that could be easily integrated with Wikka? Preferably "lightweight" too, as we don't want an anti-spam solution to be more weighty than Wikka itself. --JavaWoman
Deletions:
-~Nice idea Mike - but do you know of a Bayesian filter implementation in PHP that could be easily integrated with Wikka? Preferably "lightweight" too, as we don't want an anti-spam solution to be more weighty than Wikka itself. --JavaWoman


Revision [5686]

Edited on 2005-02-06 17:33:12 by JavaWoman [comment on Bayesian filter idea]
Additions:
-~Nice idea Mike - but do you know of a Bayesian filter implementation in PHP that could be easily integrated with Wikka? Preferably "lightweight" too, as we don't want an anti-spam solution to be more weighty than Wikka itself. --JavaWoman
~&Good point, Ian. I had been thinking about a similar approach (I have a plugin for SquirrelMail installed that essentially does the same thing for the login dialog) - but reading through the comments on [[http://shiflett.org/archive/96 Chris Shiflett's article]] it's clear that this is no more than another little hurdle easily overcome by the script-writing link spammer (GET the page with the script first, read the token and use that in the scripted POST). That said, it //is// another hurdle that may deter at least some naive spammers - and with very little code. So, nice. --JavaWoman (who suddenly realizes her Squirrelmail isn't as secure as she thought it was - but at least has built in other layers of security)
Deletions:
~_Good point, Ian. I had been thinking about a similar approach (I have a plugin for SquirrelMail installed that essentially does the same thing for the login dialog) - but reading through the comments on [[http://shiflett.org/archive/96 Chris Shiflett's article]] it's clear that this is no more than another little hurdle easily overcome by the script-writing link spammer (GET the page with the script first, read the token and use that in the scripted POST). That said, it //is// another hurdle that may deter at least some naive spammers - and with very little code. So, nice. --JavaWoman (who suddenly realizes her Squirrelmail isn't as secure as she thought it was - but at least has built in other layers of security)


Revision [5685]

Edited on 2005-02-06 17:29:07 by JavaWoman [reply to secret token idea]
Additions:
~_Good point, Ian. I had been thinking about a similar approach (I have a plugin for SquirrelMail installed that essentially does the same thing for the login dialog) - but reading through the comments on [[http://shiflett.org/archive/96 Chris Shiflett's article]] it's clear that this is no more than another little hurdle easily overcome by the script-writing link spammer (GET the page with the script first, read the token and use that in the scripted POST). That said, it //is// another hurdle that may deter at least some naive spammers - and with very little code. So, nice. --JavaWoman (who suddenly realizes her Squirrelmail isn't as secure as she thought it was - but at least has built in other layers of security)


Revision [5681]

Edited on 2005-02-06 14:57:49 by MikeXstudios [reply to secret token idea]
Additions:
==Bayesian Filter: Focus on the content==
Many of these suggestions will stop a certain degree of spam, but spammers can easily break these anti-spam measures such as adding random tokens (modern spam bots can already scan a page for form elements and submit all of them). Therefore, I suggest analyzing the content based on what might constitute spam (text frequency, link frequency, blacklist, bayesian filter) and then assigning a score to the post. If the post has over, let's say, a 50% chance for spam, then perhaps email validation, post approval, or a captcha can be used to further validate the user.
I'm particularly supportive of the bayesian filter. For instance, many spam fighting programs today use the bayesian filter (ie. Thunderbird). The bayesian algorithm is adaptive and learning which will work best when used in conjunction with other standard filters. The process might be like this:
1) The standard filters (ie. blacklist) catches a suspicious post. The post is marked for approval.
2) The admins will review the post at the post moderation panel. If the post is "ham" then the bayesian filters will automatically adapt to allow future posts that resemble the approved post through. However, if the post is "spam", then the bayesian filter will automatically adapt to block future posts with those certain keywords.
Therefore, a bayesian filter cannot be solely implemented, but rather, it requires admin intervention (to help the filter learn) and other standard filters.
Bayesian filters have been extremely successful in eliminating over 98% of common spam after a few weeks of adaptation.
--MikeXstudios
~-[[http://www.usemod.com/cgi-bin/mb.pl?WikiSpam Meatball — WikiSpam]]
~-[[http://c2.com/cgi/wiki?WikiSpam C2.com — WikiSpam]]
Deletions:
~-[[http://www.usemod.com/cgi-bin/mb.pl?WikiSpam Meatball — WikiSpam]]
~-[[http://c2.com/cgi/wiki?WikiSpam C2.com — WikiSpam]]


Revision [5677]

Edited on 2005-02-06 11:41:31 by IanAndolina [Randomised token submission for forms suggestion]
Additions:
==Adding Random Tokens for Form Submissions?==
Based on [[http://shiflett.org/archive/96 this post]], I wonder whether providing randomised session tokens for form submission may provide just one more step to impede spambots. Very simple to implement:
wikka.php:
%%(php)function FormOpen($method = "", $tag = "", $formMethod = "post")
if(!isset($_SESSION['token'])) {
$token = md5(uniqid(rand(), true));
$_SESSION['token'] = $token;
}
$result = "<form action=\"".$this->Href($method, $tag)."\" method=\"".$formMethod."\"><p>\n";
$result .= "<input type=\"hidden\" name=\"token\" value=\"".$_SESSION['token']."\" />";
if (!$this->config["rewrite_mode"]) $result .= "<input type=\"hidden\" name=\"wakka\" value=\"".$this->MiniHref($method, $tag)."\" />\n";
return $result;
}%%
and then just wrap edit.php and addcomment.php sections using:
%%(php)if ($_POST['token'] == $_SESSION['token']) { //form spoof protection
}%%
I'm definitely no expert on security, and I can see how it can be bypassed, but it does require one more step and adds complexity for spambots to spoof the wiki forms at no cost... --IanAndolina
%%(php)"serverre" => "/(nontroppo.org|goodsite.com|etc)/",%%
%%(php)if (preg_match($this->GetConfigValue("serverre"), $tag))
return $url ? "<a ".$follow." class=\"".$urlclass."\" href=\"".$url."\">$text</a>" : $text;%%
%%(php)preg_match_all($this->GetConfigValue("spamre"),$body,$out); //keyword spam block
}%%
%%(php)"spamre" => "/(voip99|zhiliaotuofa|mycv|princeofprussia|imobissimo|valeofglamorganconservatives|68l|8cx|online-deals99).(net|cn|com|org)|(phentermine)/m",%%
~-[[http://www.usemod.com/cgi-bin/mb.pl?WikiSpam Meatball — WikiSpam]]
~-[[http://c2.com/cgi/wiki?WikiSpam C2.com — WikiSpam]]
Deletions:
<?php
"serverre" => "/(nontroppo.org|goodsite.com|etc)/",
?>
%%
<?php
if (preg_match($this->GetConfigValue("serverre"), $tag))
return $url ? "<a ".$follow." class=\"".$urlclass."\" href=\"".$url."\">$text</a>" : $text;
?>
%%
<?php preg_match_all($this->GetConfigValue("spamre"),$body,$out); //keyword spam block
}?>
%%
<?php
"spamre" => "/(voip99|zhiliaotuofa|mycv|princeofprussia|imobissimo|valeofglamorganconservatives|68l|8cx|online-deals99).(net|cn|com|org)|(phentermine)/m",
?>
%%
~-[[http://www.usemod.com/cgi-bin/mb.pl?WikiSpam Meatball — WikiSpam]]
~-[[http://c2.com/cgi/wiki?WikiSpam C2.com — WikiSpam]]


Revision [5628]

Edited on 2005-02-04 18:49:13 by JavaWoman [let's start reading the interview on page 1 :)]
Additions:
~-[[http://www.theregister.co.uk/2005/01/31/link_spamer_interview/ Interview with a Link-Spammer]]
Deletions:
~-[[http://www.theregister.co.uk/2005/01/31/link_spamer_interview/page2.html Interview with a Link-Spammer]]


Revision [5584]

Edited on 2005-02-04 15:06:27 by NilsLindenberg [added link to interview wit ha link-spammer]
Additions:
~-[[http://www.theregister.co.uk/2005/01/31/link_spamer_interview/page2.html Interview with a Link-Spammer]]


Revision [5168]

Edited on 2005-01-25 13:54:18 by GmBowen [reply to JW]
Additions:
~~~&What I liked about it was that it didn't use GD (as others do, except captcha), so would be usable to all of our users, but rather pulled on a series of image files.....so the graphics could be easily changed to a series that was more universally readable (thereby addressing your concerns)....I also noticed that some of the characters were difficult to discern. That also means, of course, that a bot could be set to read the source image file name...so a potential security shortfall, although code could be written to (a) pull the image from an mysql database instead of a file & (b) have a "random" image file name assigned to it as it was read into the html page. --GmBowen
~~~&Wouldn't it have to be automatic email confirmation though (for high-use sites)? Anyways, in the scenarios I work with (since we're discussing accessability) that's a less desirable option. There are ethical issues for schools/teachers with "requiring" kids to have an email address (especially yahoo, hotmail, etc) because of the amount of adult-oriented spam they receive. [That's part of the reason I'm working on the private message system to be embedded in wikka. Image verification thereby offers advantages that email verification does not, at least for the communities I'll be using this in. As far as "hating" procedures goes.....I dislike email verification (because of the possibility of being added to spam lists) **far** more than image verification. --GmBowen


Revision [5159]

Edited on 2005-01-25 09:15:28 by JavaWoman [a few captcha replies]
Additions:
~~~&Ian, I looked at that Moztips link you mentioned - and even with my relatively good eyes I found the contrast **way** too low. And whether you use an image or "ascii art" - screen readers used by visually handicapped people cannot make anything of it - it's only marginally better than an image in that it may defeat OCR utilities used by some spambots, but I don't see it as any more accessible, it could even be less accessible, depending on contrast in the image. --JavaWoman
~&But spam-flooded wiki's are also **in**accessible....we only have to look at wikka's parent to see that....so //some// sort of more elaborate system is needed to enhance overall accessibility. In the large, I see the overall job of developing wikka as a matter of providing a tool-kit that each wikka-owner can make the decisions around. Providing a registration system that (optionally) uses both a registration code as provided by the wikka-owner by email *or* some sort of visual validation system (I like the character size of captcha, and [[http://www.horobey.com/demos/codegen/v1/humancheck.php this]] one)...
~~&("this" one is quite horrible - I saw three glyphs, two the same that I could not guess were a 3 or a B and one in the middle that I could not decide was a 4 or an A. The only thing that can be said for it is that (at least the one sample I looked at) has high contrast... which doesn't help if the characters are not recognizable themselves -JW)
~&... at the __same__ time so a user could use either for registration would likely cover most bases for an open registration system, one with just the latter for teachers etc. that want one for their classes. --GmBowen
~~&I agree captcha **could** be an option (but it would be an extension, not Wikka "core"). Remember though that many people really hate solutions like this - they'll just leave (even if they are already part of a community) and refuse to post any more. A more stringent registration procedure (requiring email confirmation, and recording user's IP) would be be less invasive and sufficient to keep all but the most hardened spambots away; it would also ensure that people actually have a valid email address to use when they lose their password... Combine that with a "banning" option and even a spambot clever enough to sign up once with a valid email address can be kept out once discovered. --JavaWoman
Deletions:
~&But spam-flooded wiki's are also **in**accessible....we only have to look at wikka's parent to see that....so //some// sort of more elaborate system is needed to enhance overall accessibility. In the large, I see the overall job of developing wikka as a matter of providing a tool-kit that each wikka-owner can make the decisions around. Providing a registration system that (optionally) uses both a registration code as provided by the wikka-owner by email *or* some sort of visual validation system (I like the character size of captcha, and [[http://www.horobey.com/demos/codegen/v1/humancheck.php this]] one) at the __same__ time so a user could use either for registration would likely cover most bases for an open registration system, one with just the latter for teachers etc. that want one for their classes. --GmBowen


Revision [5142]

Edited on 2005-01-24 20:52:21 by GmBowen [furthering the conversation on obfuscating registration]
Additions:
~&But spam-flooded wiki's are also **in**accessible....we only have to look at wikka's parent to see that....so //some// sort of more elaborate system is needed to enhance overall accessibility. In the large, I see the overall job of developing wikka as a matter of providing a tool-kit that each wikka-owner can make the decisions around. Providing a registration system that (optionally) uses both a registration code as provided by the wikka-owner by email *or* some sort of visual validation system (I like the character size of captcha, and [[http://www.horobey.com/demos/codegen/v1/humancheck.php this]] one) at the __same__ time so a user could use either for registration would likely cover most bases for an open registration system, one with just the latter for teachers etc. that want one for their classes. --GmBowen


Revision [5141]

Edited on 2005-01-24 19:59:09 by JavaWoman [DeleteSpamAction added to see also box]
Additions:
>>As it may have dawned on you by now, spam is getting to be a problem in wiki's - both the type of spam that also plagues many blogs in the form of comment spam (only in a wiki it would (also) affect page content), and referrer spam. And then there are spambots intent on gathering email addresses.
Deletions:
>>As it may have dawned on you by now, spam is getting to be a problem in wiki's - both the type of spam that also plagues many blogs in the form of comment spam (only in a wiki it woudl (also) affect page content), and referrer spam. And then there are spambots intent on gathering email addresses.


Revision [5140]

Edited on 2005-01-24 19:58:28 by JavaWoman [DeleteSpamAction added to see also box]
Additions:
~-DeleteSpamAction


Revision [5139]

Edited on 2005-01-24 19:56:30 by JavaWoman [structure (headings)]
Additions:
====Referrer spam====
====Email-gathering spambots====
~-[[http://www.usemod.com/cgi-bin/mb.pl?WikiSpam Meatball — WikiSpam]]
~-[[http://c2.com/cgi/wiki?WikiSpam C2.com — WikiSpam]]
Deletions:
==Referrer spam==
===Email-gathering spambots===
~-[[http://www.usemod.com/cgi-bin/mb.pl?WikiSpam Meatball — WikiSpam]]
~-[[http://c2.com/cgi/wiki?WikiSpam C2.com — WikiSpam]]


Revision [5138]

Edited on 2005-01-24 19:47:46 by IanAndolina [reply to javawoman re. captchas]
Additions:
~~&It was for this reason I suggested the ascii captcha - it uses much larger letters AND it is perfect for the colour-blind. Interestingly, it probably also is better in terms of fooling image recognition algorithms: an artificial observer will find it much harder to parse ascii-art, as elements have high contrast, conflicting orientations AND are unjoined. Thus the first step of object classification is greatly impeded. I do agree that captchas can reduce accessibility, but an ascii-based recognition pattern reduces some of the major obstacles I saw in the image-based ones. --IanAndolina
~-[[http://www.usemod.com/cgi-bin/mb.pl?WikiSpam Meatball — WikiSpam]]
~-[[http://c2.com/cgi/wiki?WikiSpam C2.com — WikiSpam]]
Deletions:
~-[[http://www.usemod.com/cgi-bin/mb.pl?WikiSpam Meatball — WikiSpam]]
~-[[http://c2.com/cgi/wiki?WikiSpam C2.com — WikiSpam]]


Revision [5137]

Edited on 2005-01-24 19:39:09 by JavaWoman [captcha response]
Additions:
~&[also copied from SuggestionBox] Yes, "Captcha" is an old trick - it will keep out some bots (but not all) and it will keep many people out, too, like those who are visually handicapped (not just people who are totally blind - being colorblind may be enough to be defeated by such tricks). Add a sound equivalent? Well, there are people who are deaf **and** blind. Are we going to deny them access to our wikis? I'm not in favor of making Wikka **in**accessible when we should be working towards making it **more** accessible. --JavaWoman.


Revision [5134]

Edited on 2005-01-24 19:29:03 by GmBowen [adding links to Ian's suggestion for user validation]
Additions:
[Copied from SuggestionBox] "There's also code around that uses GD & that could be built onto Nils' code that generates a "registration password" automatically and outputs it as a distorted graphic image.....the code is intended to befuddle auto spam registers & thus stop open-registration sites from being hit by spam bots that register themselves as users. Ultimately, as the bots become more sophisticated I think we'll have to use something like that or else sites like this one (with open registration) will be victimized. [[http://www.bluecopia.com/form.php Here]] and [[http://www.horobey.com/demos/codegen/ here]] are examples of what I mean (I like the simplicity of the first version in the second example). -- GmBowen" .... I think at least we need a system like one of these (or like the one Ian suggests) on the user registration page. mb
~-[[http://www.usemod.com/cgi-bin/mb.pl?WikiSpam Meatball — WikiSpam]]
~-[[http://c2.com/cgi/wiki?WikiSpam C2.com — WikiSpam]]
Deletions:
~-[[http://www.usemod.com/cgi-bin/mb.pl?WikiSpam Meatball — WikiSpam]]
~-[[http://c2.com/cgi/wiki?WikiSpam C2.com — WikiSpam]]


Revision [5131]

Edited on 2005-01-24 19:08:32 by IanAndolina [added 3 more thoughts / suggestions]
Additions:
==Refining Redirection / nofollow modification for links==
One issue with the google redirection and newer rel="nofollow" is that good sites also get hit by this procedure. As we can't really tag links on a "trusted user" basis, we have to do that on a trusted server one. I use a whitelist in config.php with a list of "good servers":
<?php
"serverre" => "/(nontroppo.org|goodsite.com|etc)/",
?>
%%
And my Link routine in the main wakka.php (wikka.php) is modified to make use of it:
<?php
if (preg_match($this->GetConfigValue("serverre"), $tag))
{
$url = $tag; //trusted web sites so no need for redirects
$urlclass= "ext";
}
else
{
$tag = rawurlencode($tag);
$url = "http://www.google.com/url?q=".$tag;
$urlclass= "ext";
$follow = " rel=\"nofollow\" ";
}
return $url ? "<a ".$follow." class=\"".$urlclass."\" href=\"".$url."\">$text</a>" : $text;
?>
%%
This way, trusted sites get full and unadulterated links, but anything else has BOTH google redirection and rel="nofollow" added. The CSS can then contain ways to visually tag those different URLs, so the user can see if a link is trusted or not (I use advanced generated content - not supported in IE):
%%(css)
a.ext:after, a[rel="nofollow"]:after {content:"\00220A";
text-decoration: none !important;
font-size: 0.9em;
color: #888;
position: relative;
bottom: 1ex;}
a[rel="nofollow"]:after {content:"\002209";}
%% -- IanAndolina
==Spam Block for Saving pages==
As I was getting a lot of repeat spam of the same domains over and over, I implemented a "link blacklist" to my Wiki for comments and edits:
add to edit.php & addcomment.php:
<?php preg_match_all($this->GetConfigValue("spamre"),$body,$out); //keyword spam block
if (count($out[0])>=1)
{
$this->SetMessage("Go spam somewhere else. You links will never get spidered here anyway.");
$this->redirect($this->href());
return;
}?>
%%
config.php
<?php
"spamre" => "/(voip99|zhiliaotuofa|mycv|princeofprussia|imobissimo|valeofglamorganconservatives|68l|8cx|online-deals99).(net|cn|com|org)|(phentermine)/m",
?>
%%
Now, what I wanted to do was have an //admin only// wiki page, where the contents of the spamre regexp could be edited, instead of being hardwired in config.php - but never got round to it. **But** this would be the better way to do it - have a function that finds a wiki page and builds a regexp from the keywords added by admins to that wiki page (not all of whom may have access to config.php). It is a fairly basic method - but with a couple of vigilant admins can reduce repeat attacks from spam bots considerably. -- IanAndolina
==User Validation==
I like the ascii-based user validation scheme (Captcha) here:
http://www.moztips.com/wiki/index.pcgi?action=edit&page=SandBox
I don't know how to do that in PHP (it is a PHP-based wiki I believe) - though the [[http://www.google.com/search?q=captcha+php&sourceid=opera&num=0&ie=utf-8&oe=utf-8 more complex image based solutions]] are available. This for me is **far** preferable to locking pages for writing using ACLs - which IMO destroys the very purpose of the wiki. --IanAndolina
==Referrer spam==
===Email-gathering spambots===
~-[[http://www.usemod.com/cgi-bin/mb.pl?WikiSpam Meatball — WikiSpam]]
~-[[http://c2.com/cgi/wiki?WikiSpam C2.com — WikiSpam]]
Deletions:
====Referrer spam====
====Email-gathering spambots====
~-[[http://www.usemod.com/cgi-bin/mb.pl?WikiSpam Meatball — WikiSpam]]
~-[[http://c2.com/cgi/wiki?WikiSpam C2.com — WikiSpam]]


Revision [5128]

Edited on 2005-01-24 18:24:23 by JavaWoman [adding references to related pages]
Additions:
>>**see also:**
~-HideReferrers
~-RemovingUsers
~-WikkaAndEmail
>>As it may have dawned on you by now, spam is getting to be a problem in wiki's - both the type of spam that also plagues many blogs in the form of comment spam (only in a wiki it woudl (also) affect page content), and referrer spam. And then there are spambots intent on gathering email addresses.
Wikka sites are no exception any more (and other WakkaWiki forks seem to be having problems, too).
Deletions:
As it may have dawned on you by now, spam is getting to be a problem in wiki's - both the type of spam that also plagues many blogs in the form of comment spam (only in a wiki it woudl (also) affect page content), and referrer spam. And then there are spambots intent on gathering email addresses.
Wikka sites are no exception any more (and other WakkaWiki forks seem to be having problems, too)


Revision [5124]

Edited on 2005-01-24 17:13:55 by JavaWoman [adding more links from my collection]
Additions:
~-[[http://chongq.blogspot.com/ Goggle Ending Comment Spam]] [sic]
~-[[http://chongqed.org/ chongqed.org]]
~-[[http://blacklist.chongqed.org/ chongqed.org blacklist]] - use this dynamically
~-[[http://chongqed.org/submit.html Submit a wiki spammer]] - All your page ranks are belong to us!
~-[[http://wackowiki.com/SPAM/ Wacko Wiki - SPAM]]
~-[[http://www.bluestack.org/ReferrerSpam ReferrerSpam]] - A Wakka page about preventing referrer spam ... which page is now spammed (ouch!)


Revision [5056]

Edited on 2005-01-24 11:12:04 by JavaWoman [adding link]
Additions:
//See also DeleteSpamAction !//


Revision [5055]

Edited on 2005-01-24 11:06:04 by JavaWoman [re-arranging stuff]

No Differences

Revision [5054]

Edited on 2005-01-24 11:04:56 by JavaWoman [adding more content and a suggestion]
Additions:
This page is intended to gather ideas for how to fight spam (of all types) in Wikka, so we can coordinate our efforts and get a spammer-hardened Wikka out there. You can also find some general information about (fighting) wiki spam and what Wikka has already implemented as defense measures.
~//About how to discourage spammers to post links on spam pages in the first place, and what to do when your pages have been spammed already.//
~//Spammers sometimes visit Wikis and blogs with a tool with "bogus" referer headers containing the sites they want to generate incoming links for - this works on many wikis and blogs since such sites often have a page listing referrers (wikis) or list referrers to a particular post (blogs). If a Search engine indexes such a page, it would find a link to the spammed site, resulting in a higher "score" for that spammed page.//
The general solution is to cause such links **not** to be followed by search engines. The technique outlined below under "Don't let old pages get indexed" already takes care of this for the referrer listings Wikka uses.
~//Spambots spider websites looking for email addresses to add to the list (to use, or to sell as a targeted list). A general defense that works well (though not 100%) is to "obfuscate" email addresses so such spambots don't recognize them.//
==Obfuscating addresses automatically==
Wikka 1.1.6.0 comes with a small action to create an obfuscated email "contact" link for the site administrator. Meanwhile, the formatter will simply turn every email address //it// recognizes into an email link (with the address also used for the link text) - providing nice fodder for spambots.
What we should have is a function that can turn a given email address into an obfuscated link - this could then be used by both the ""{{contact}}"" action and the formatter. It would (then) also enable use to change the obfuscating algorithm inside the fuction without affecting either the formatter or the contact action any more, and others can use this in their own extensions as well. --JavaWoman
~//Spam-defense measures that are already implemented in Wikka.//
//Extended method implemented as of Wikka 1.1.6.0 (Both the "noarchive" addition and applying it to the Sandbox)//
====Further references====
~//Where to read more about Wiki spam.//
~-[[http://www.usemod.com/cgi-bin/mb.pl?WikiSpam Meatball — WikiSpam]]
~-[[http://c2.com/cgi/wiki?WikiSpam C2.com — WikiSpam]]
Deletions:
This page is intended to gather ideas for how to fight spam (of all types) in Wikka, so we can coordinate our efforts and get a spammer-hardened Wikka out there.
~&Both the "noarchive" addition and applying it to the Sandbox as well as old pages will be in Wikka version 1.1.6.0 - as you can see in the [[HomePage public beta]]! --JavaWoman
===Further references===
[[http://www.usemod.com/cgi-bin/mb.pl?WikiSpam Meatball — WikiSpam]]
[[http://c2.com/cgi/wiki?WikiSpam C2.com — WikiSpam]]


Revision [5052]

Edited on 2005-01-24 10:39:37 by JavaWoman [adding "spam repair and defense"]
Additions:
==Spam repair and defense==
1/22/05 - Spam help! I have apparently been attacked by an army of [[http://www.richardberg.net/RecentChanges spam bots]]. Has this happened to anyone else? For now, I am asking for your help with:
- a SQL command that will delete all of these edits
- a SQL command that will change all of my ACLs to '+' for writing and commenting (I've modified the config file but that only affects new pages AFAIK)
Whatever script they used (on multiple machines, no less) could certainly be used against any Wakka-like site with minimal modifications, so something has to be done...I will do what I can to help you guys combat future attacks as well as implement the new [[http://www.google.com/googleblog/2005/01/preventing-comment-spam.html HTML attribute]] you've probably all heard about. --RichardBerg
~&Richard: here's the sql to update all your current ACLs (I'm using mysql 4.0.22):
~~%%(sql) UPDATE acls SET comment_acl="+" WHERE comment_acl="*";
UPDATE acls SET write_acl="+" WHERE write_acl="*"; %%
~&You'll need to change the table name (acls) to match whatever your table is named. Give me a few to look at the pages table and your site and I should have the sql for removing the edits. :) -- MovieLady
~~&Since Richard has //already// changed his default ACLs in the configuration, that would apply to any page that did not have ACLs different from the **original** default (not merely new pages!); your SQL code should take care of any pages that had ACLs //different// from the original default (because only those would have a record in the ACLs table). --- See also JsnX's suggestion about "Clearing ACLs" on the SuggestionBox which explains how this mechanism works. Thanks, MovieLady! --JavaWoman
~~~& Correct. Both statements will change only the entries that had the default ACL from his config file in that field. (What the statements are looking for can, of course be changed, as can what the field is being set to. I used it when I went back and changed my default ACLs on all pages that had ACLs to disallow all banned users from writing or commenting after adding ACLsWithUserGroups.) --MovieLady
~&There is a relevant link to an action at wikini for removing additions by particular IP's or users at CommunityNotes.--GmBowen
~~& Thanks for the link! I've translated and made minor changes to the code, and posted everything to DeleteSpamAction. He's got a very good starting point, I think. One could adapt the code (fairly easily) to allow you to look at all the revisions on a page instead of by user/IP and then delete the histories you don't want to keep, for whatever reason. --MovieLady


Revision [5048]

Edited on 2005-01-24 10:27:04 by DarTar [Adding external links]
Additions:
===Further references===
[[http://www.usemod.com/cgi-bin/mb.pl?WikiSpam Meatball — WikiSpam]]
[[http://c2.com/cgi/wiki?WikiSpam C2.com — WikiSpam]]


Revision [5047]

Edited on 2005-01-24 10:26:18 by JavaWoman [adding some content from SuggestionBox]
Additions:
==Stopping Spammers getting Google Juice==
[[http://simon.incutio.com/archive/2004/05/11/approved There is a technique]] to stop spammers from gaining any advantage of spamming, which is to redirect external links to stop them from affecting their PageRank. Great to stop the whole purpose of spamming, but this has the disadvantage that good sites lose their google juice too. Check the comments out on that page for more cons. I've noticed since I enabled this on the Opera 7 wiki that slowly spam volume has dropped out, but I'm not entirely happy at the price paid. Had you thought about this, maybe have it as an option during config? -- IanAndolina
~&Good point, Ian. I had thought about this, after having seen several Wikis and blogs that use the Google redirection... I do think it should be configurable though - not every Wiki installation may want to do this (in fact, some may welcome external links as long as spam is repaired fast enough). --JavaWoman
~~&I asked an SEO expert and he replied that it should be enough to use a simple internal redirect (e.g. exit.php?url=...) to create this effect. He also said that it might be helpful to disallow any spider access to that file (robots.txt). -- ReimerStegelmann
~~~&Unfortunately, search engine robots these days mostly **do** follow URLs with parameters, and an "internal redirect" done that way would be retrieved by a bot; HTTP redirects are followed, too (which is what you'd have to use with that "internal redirect" method). **Meta** redirects mostly aren't but you cannot apply this as a general "redirect external links" (especially not since you cannot have any URL parameters in a meta redirect - and you want to allow all //valid// external links, merely have them not count towards page rank in search engines, mostly Google). Excluding a single file with robots.txt won't work since all of Wikka runs off the single wikka.php file. The Google redirect method gets around all of that (at least for Google's ranking mechanism - which is what spammers are mostly targeting). --JavaWoman
~~~~&They follow, but that is not the point of spam. The main goal of a spammer is to reach a high ranking in search engines. They post links whose link text contains important keywords (e.g. [[Keyword1 keyword2 http://domain.tld]]). So, if you enter keyword1 or keyword2 into a search engine, you will see the homepage of the spammer. By using a simple redirect, spiders will follow the link, but the keywords no longer count for the spammer's page - and so the spammer no longer cares about the link.
~~~~~&Exactly - and using the Google redirect **prevents** the target page from getting a higher ranking from incoming (spam) links because it won't be counted at all. :) --JavaWoman
~~~~~~&Yeah, but you don't need Google to make this happen. A simple internal redirect is enough, and it looks better than a Google redirect ;)
~~~~~~~&Nope, because an internal redirect //will// be followed by Google and //still// count for page rank - that's the problem; the Google redirect prevents this. --JavaWoman
~~~~~~~~&I talked to Abakus, a German SEO expert, and he said it does not count. There is no difference between an internal redirect and a Google redirect. The keywords of the link (see above) only count for the redirecting site, not for the site behind the redirect. And anyway, why should a spider follow an internal redirect (via exit.php?url=...) but not a Google redirect?
~~~~~~~~~& A spider will follow any redirect, whether it's a direct link or one that goes through an internal redirect. Never mind the keywords: it's still a link //into// the spammed site. With a local redirect that makes no difference, but with the Google redirect Google knows **not** to count it as an incoming link. It's not (just) about keywords but about **Page Rank** (PR) - and PR is highly dependent on incoming links (and where they come from). That much we know. But no one except some Google employees knows the exact algorithm that determines PR - not even Abakus ;-) --JavaWoman
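To make the thread above concrete, here is a rough sketch of routing external links through a redirect at render time. The Google redirect form (www.google.com/url?q=...) is the one under discussion; an internal exit.php?url=... could be substituted. The function name and markup are illustrative only and are not Wikka's actual link formatter.
%%(php)
<?php
// Illustrative sketch: wrap an external URL in a redirect so the target page
// does not earn PageRank from the link. Swap $redirector for your own
// exit.php?url= if you prefer an internal redirect over the Google one.
function RedirectedExternalLink($url, $text)
{
	$redirector = 'http://www.google.com/url?q=';
	$href = $redirector.urlencode($url);
	return '<a class="ext" href="'.htmlspecialchars($href).'">'.htmlspecialchars($text).'</a>';
}

// Example: echo RedirectedExternalLink('http://example.com/', 'example site');
?>
%%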
~&Maybe the solution is [[http://www.google.com/googleblog/2005/01/preventing-comment-spam.html here]].
~&If a user is not registered, every external link they create on the wiki gets the attribute rel="nofollow" added to it.
~&This technique is now adopted by Google, Yahoo and MSN. --DotMG
~~&Thanks, DotMG! This is great news - I had seen this technique discussed as a proposed solution but had missed the news that it has now actually been adopted. (Should we worry about Altavista? Probably not too much - these three are the search engines spammers primarily target.) One possible hole I can see is that a spammer might write a script to quickly register and then post on a number of pages - but scripted registrations can be defended against by other means. Probably nothing will provide a 100% solution, but this is a big step in the right direction. --JavaWoman
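As a rough illustration of the rel="nofollow" idea, something along these lines could be applied wherever the wiki turns an external URL into an anchor tag. The $isRegisteredUser flag stands in for whatever user check the wiki already performs (e.g. Wikka's GetUser()); the function name is made up for this example.
%%(php)
<?php
// Illustrative sketch: add rel="nofollow" to external links created by
// unregistered users so search engines ignore them for ranking purposes.
function ExternalLink($url, $text, $isRegisteredUser)
{
	$rel = $isRegisteredUser ? '' : ' rel="nofollow"';
	return '<a class="ext"'.$rel.' href="'.htmlspecialchars($url).'">'.htmlspecialchars($text).'</a>';
}
?>
%%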
====Resolved Suggestions====
==Don't let old pages get indexed==
To make absolutely sure old pages don't get archived (irrespective of your robots.txt) - essential to stop WikiSpam from still getting juice from archived pages - why not add meta directives to those pages with something like:
%%(php)
<?php if ($this->GetMethod() != 'show' || $this->page["latest"] == "N") echo "<meta name=\"robots\" content=\"noindex, nofollow, noarchive\" />\n<meta name=\"googlebot\" content=\"noarchive, noindex, nofollow\">\n";?>%%
to header.php. This stops pages served with handlers other than show, as well as non-current page versions, from any kind of archiving/caching.
~&Ian, thanks for the suggestion. Wikka has had something similar to this in place since the first release. See Mod033bRobotIndexing. But your suggestion expands the idea and adds the latest page check, "noarchive", and the googlebot part--which seem like good ideas. I'll add this to the upcoming release. By the way, when are you going to switch your site over to Wikka? ;) -- JsnX
~~&Yes, nice idea. But the googlebot part is actually redundant, Google obeys the robots meta directives. (And that second meta tag isn't valid XHTML - it's unclosed.) I suggest we merely add the "noarchive". Apart from that, it would also be nice to stop indexing etc. from the SandBox page. --JavaWoman
~~~&The latest page check is important because wiki spammers don't really care if you delete their spam, as long as their links sit on an old archived page waiting to be indexed. The added googlebot directive (thanks for spotting the typo, btw) is just **extra paranoia on my part** :). And you are all doing an **excellent** job with Wikka - the only reason I haven't switched is that quite a lot of my Wakka is heavily customised and I don't have the time to redo that - especially as lots of pages would break without re-jigging e.g. SafeHTML (my BookMarklets page for example). If I have time, I will eventually migrate...! -- IanAndolina
~&Both the "noarchive" addition and applying it to the SandBox as well as old pages will be in Wikka version 1.1.6.0 - as you can see in the [[HomePage | public beta]]! --JavaWoman
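For reference, a cleaned-up sketch of the header.php check discussed above might look like the following. It closes the meta tag so the output stays valid XHTML, drops the redundant googlebot tag, keeps "noarchive", and also covers the SandBox page. It assumes the core GetMethod() and GetPageTag() helpers and is only a sketch, not necessarily what shipped in 1.1.6.0.
%%(php)
<?php
// Sketch for header.php: emit robots directives for anything that is not the
// current "show" view of a page, and for the SandBox, so those pages are
// neither indexed, followed nor archived.
if ($this->GetMethod() != 'show'
	|| $this->page['latest'] == 'N'
	|| $this->GetPageTag() == 'SandBox')
{
	echo "<meta name=\"robots\" content=\"noindex, nofollow, noarchive\" />\n";
}
?>
%%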


Revision [5046]

The oldest known version of this page was created on 2005-01-24 10:13:11 by JavaWoman [adding some content from SuggestionBox]