Pagename validation

Last edited by MovieLady:
Modified links pointing to docs server
Mon, 28 Jan 2008 00:13 UTC




I am opening this page to discuss problems with pagename validation and the underlying regular expressions needed to validate and format both camelcase and forced links.


Current pattern for valid pagetags

$validtag = "/^[A-Z,a-z,ÄÖÜ,ßäöü]+[A-Z,a-z,0-9,ÄÖÜ,ßäöü]*$/s";


Some considerations off the cuff:

Apart from a possible "German" origin, I never understood the bias here toward allowing German characters but not the non-ASCII characters used in other languages. That said, I don't think an RE should look for a "word" but merely a "string-consisting-of-letters-and-digits-and-starting-with-a-letter". By using a hex encoding inside the RE for "letters" we would also make this encoding-independent, thus not limiting to ISO-8859-1 (why not a Turkish Wiki with Turkish page and user names?).
I don't know, I'm a little uncomfortable with the idea of allowing any kind of character in a WikiName. AndreaRossato pointed out that a Pagetag and a WikiName should only contain ASCII characters. The question of pagenames in different charsets cannot be addressed IMO without taking some decisions concerning multilanguage support and UTF-8 encoding. Or am I misunderstanding your proposal? -- DarTar
Well, of course Wikka (and Wakka before it, obviously) already allows non-7-bit-ASCII page names, and quite obviously intentionally. I'm a bit doubtful about a string like 'ÄrgerMich' not being allowed in something like http://wikka.jsnx.com/ÄrgerMich since this will be rewritten into a query anyway, and a query string can contain anything. In other words, even if a user pastes the string into a browser address bar, it should still work. (URL-encoded, it is of course correct anyway, whichever format is used.)
But I'm not proposing to use "any kind of character" - just letters and digits; only taking a clue from the German bias and extending it to anything that can be written in any 8-bit ISO-8859 character set. No UTF-8 needed, it would be completely transparent as long as a character encoding is set (as it should be). It won't allow Chinese (yet), but it will allow Turkish and Icelandic.
Just as the many Wakka/Wikka-based Wikis out there with non-camelcase names are an argument for not imposing camelcase for a valid page name, the Wikis out there using German, Russian, and other non-ASCII characters in page names are an argument for not limiting names to 7-bit US-ASCII. --JavaWoman
Also, the commas in that RE are puzzling - do we allow a Wiki name to start with or contain a comma? I think not - and in that case they should go. (See my latest addition to WikkaBugs on this phenomenon of commas in REs!)
Another thing I find a bit strange is that this RE requires that a tag starts with two letters, and may be followed by any number of letters and digits - why not start with a single letter and require at least two alphanumeric characters?
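
To illustrate the point about commas: inside a character class a comma is an ordinary character, not a separator. The sketch below compares the ASCII part of the current pattern with a comma-free version. The variable names and sample strings are made up for this illustration and are not taken from the Wikka source.

```php
<?php
// ASCII part of the current pattern. Inside [...] the comma is a literal
// character, so it becomes part of the allowed set.
$withCommas = '/^[A-Z,a-z]+[A-Z,a-z,0-9]*$/';

// The same character classes with the commas removed.
$withoutCommas = '/^[A-Za-z]+[A-Za-z0-9]*$/';

// Hypothetical sample tags.
$candidates = ['HomePage', 'Home,Page', ',,,'];

$results = [];
foreach ($candidates as $tag) {
    $results[$tag] = [
        'with_commas'    => preg_match($withCommas, $tag) === 1,
        'without_commas' => preg_match($withoutCommas, $tag) === 1,
    ];
    printf(
        "%-10s with commas: %s, without commas: %s\n",
        $tag,
        var_export($results[$tag]['with_commas'], true),
        var_export($results[$tag]['without_commas'], true)
    );
}
```

With the commas in place, both 'Home,Page' and ',,,' are accepted as tags; with the commas removed, only 'HomePage' is.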

Building on that, let's first set up some RE building blocks:
define('PATTERN_LCLETTER', 'a-z\xe0-\xf6\xf8-\xff');
define('PATTERN_UCLETTER', 'A-Z\xc0-\xd6\xd8-\xdf');
define('PATTERN_LETTER', PATTERN_LCLETTER.PATTERN_UCLETTER);
define('PATTERN_DIGIT', '0-9');

Now we can use those to build an expression for a valid tag:
$validtag = '/^['.PATTERN_LETTER.']['.PATTERN_LETTER.PATTERN_DIGIT.']+$/';
Note I've also discarded the 's' modifier: if we need to match something that is a string without any whitespace, we don't need to treat multiple lines as a single one. --JavaWoman
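
A quick way to try the proposed expression is sketched below. It assumes the tags arrive as ISO-8859-1 bytes, so the non-ASCII sample is written with an explicit byte escape. The sample tags and the $checks array are hypothetical and exist only for this illustration.

```php
<?php
// Building blocks as proposed above. Single quotes keep \xe0 etc. as
// literal text, so it is PCRE (not PHP) that reads them as byte values.
define('PATTERN_LCLETTER', 'a-z\xe0-\xf6\xf8-\xff');
define('PATTERN_UCLETTER', 'A-Z\xc0-\xd6\xd8-\xdf');
define('PATTERN_LETTER', PATTERN_LCLETTER.PATTERN_UCLETTER);
define('PATTERN_DIGIT', '0-9');

// One letter followed by at least one letter or digit.
$validtag = '/^['.PATTERN_LETTER.']['.PATTERN_LETTER.PATTERN_DIGIT.']+$/';

// Hypothetical samples: label => tag as ISO-8859-1 bytes.
// "\xC4" (double quotes) is the single byte for A-umlaut in ISO-8859-1.
$samples = [
    'plain camelcase'     => 'HomePage',
    'leading A-umlaut'    => "\xC4rgerMich",
    'letter plus digit'   => 'A1',
    'single letter'       => 'A',
    'leading digit'       => '9Lives',
    'contains underscore' => 'Page_Name',
    'contains space'      => 'Home Page',
];

$checks = [];
foreach ($samples as $label => $tag) {
    $checks[$label] = preg_match($validtag, $tag) === 1;
    printf("%-20s %s\n", $label, $checks[$label] ? 'valid' : 'invalid');
}
```

The first three samples are accepted. The single letter is rejected because the second character class requires at least one more character, and the last three are rejected because of the leading digit, the underscore, and the space.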

I have now implemented JavaWoman's define patterns and it is working (http://nontroppo.org/wiki/ØýêTeßær), though it is important to make sure utf8_encode and utf8_decode are used in various RegEx's, e.g. in the formatter:
        // wiki links!
        else if (preg_match("/^[".UCLETTER."]+[".LCLETTER."]+[".UCLETTER.DIGITS."][".ALLCHARS."]*$/", utf8_decode($thing)))
        {
            return $wakka->Link($thing);
        }

The problem is then that actions such as PageIndex will not sort such a page correctly; it goes under numerics instead: http://nontroppo.org/wiki/PageIndex. This order comes from MySQL and I haven't tried re-sorting it (though I imagine PHP is pretty poor at sorting utf8!). I wonder what else may break? --IanAndolina
Did you try using collate in your select statement for the PageIndex? This should be a relatively easy (I think, but don't quote me, as I haven't done anything with charsets in MySQL yet) way to fix that problem by using a defined constant for the character sets supported by MySQL, and therefore providing more consistent and flexible support for international users. :) --MovieLady
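
The collation suggestion could look roughly like the sketch below. Everything specific here is an assumption: the helper name page_index_query() is invented, the table name wikka_pages and the columns tag and latest are guesses at the schema, and latin1_german1_ci only applies if the tag column really uses the latin1 character set and holds ISO-8859-1 bytes. If UTF-8 bytes have been stored in a latin1 column, a COLLATE clause alone will probably not fix the order.

```php
<?php
// Hypothetical helper: builds a PageIndex-style query with an explicit
// collation on the sort column. It only builds the SQL text and does not
// talk to the database.
function page_index_query($table, $collation)
{
    // Table and collation names are plain identifiers. Anything else is
    // rejected so the values can be placed in the SQL text safely.
    if (!preg_match('/^[A-Za-z0-9_]+\z/', $table)) {
        throw new InvalidArgumentException("Invalid table name: $table");
    }
    if (!preg_match('/^[a-z0-9_]+\z/', $collation)) {
        throw new InvalidArgumentException("Invalid collation name: $collation");
    }

    // Assumed schema: one row per page revision, latest = 'Y' marks the
    // current revision, tag holds the page name.
    return "SELECT tag FROM `$table` WHERE latest = 'Y' "
         . "ORDER BY tag COLLATE $collation";
}

$sql = page_index_query('wikka_pages', 'latin1_german1_ci');
echo $sql, "\n";
```

Keeping the collation in one configurable place would let a localized site pick the collation that fits its language without touching the action code.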

References:
Uniform Resource Identifiers (URI): Generic Syntax
 




Current pattern for valid usernames

JavaWoman pointed out that Wikka currently restricts valid usernames to camelcase-formatted WikiNames. Is this consistent with the fact that we actually do allow valid pagetags in forced links beyond the camelcase format? And what about special characters in usernames?


Using the patterns outlined above should fix this. :) --JavaWoman
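
As a rough sketch of the difference for usernames, the code below checks a few names against a camelcase WikiName expression and against the relaxed valid-tag expression. The camelcase expression is modeled on the formatter snippet quoted above, under the assumption that its ALLCHARS class means letters and digits. The function name_checks() and the sample names are hypothetical.

```php
<?php
// Building blocks as proposed above.
define('PATTERN_LCLETTER', 'a-z\xe0-\xf6\xf8-\xff');
define('PATTERN_UCLETTER', 'A-Z\xc0-\xd6\xd8-\xdf');
define('PATTERN_LETTER', PATTERN_LCLETTER.PATTERN_UCLETTER);
define('PATTERN_DIGIT', '0-9');

// Camelcase WikiName: uppercase run, lowercase run, one uppercase letter
// or digit, then any letters and digits (assumed meaning of ALLCHARS).
$wikiname = '/^['.PATTERN_UCLETTER.']+['.PATTERN_LCLETTER.']+'
          . '['.PATTERN_UCLETTER.PATTERN_DIGIT.']'
          . '['.PATTERN_LETTER.PATTERN_DIGIT.']*$/';

// Relaxed valid tag: a letter followed by at least one letter or digit.
$validtag = '/^['.PATTERN_LETTER.']['.PATTERN_LETTER.PATTERN_DIGIT.']+$/';

// Hypothetical helper: reports which of the two expressions accept a name.
function name_checks($name, $wikiname, $validtag)
{
    return [
        'wikiname' => preg_match($wikiname, $name) === 1,
        'validtag' => preg_match($validtag, $name) === 1,
    ];
}

// Sample names as ISO-8859-1 bytes; "\xD8" is O-with-stroke.
$names = ['JavaWoman', 'javawoman', "\xD8rsted2", 'Kraa_K'];
$report = [];
foreach ($names as $name) {
    $report[$name] = name_checks($name, $wikiname, $validtag);
    printf(
        "wikiname: %-5s validtag: %s\n",
        var_export($report[$name]['wikiname'], true),
        var_export($report[$name]['validtag'], true)
    );
}
```

A name like 'javawoman' fails the camelcase check but passes as a valid tag, which is the gap between username validation and forced links described above. A name containing an underscore fails both.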




CategoryDevelopmentCore CategoryRegex
Comments
Comment by TimoK
2005-03-28 00:37:54
Whichever direction you go, would it be possible to make the regex part of wakkaConfig, so it can be used more easily in actions etc. (and to allow simple tweaking in one place if someone wants to allow more or fewer characters)?
Comment by DarTar
2005-03-28 10:42:58
TimoK, we are actually planning to centralize page validation functions and RE libraries, so they can be easily modified/customized across the whole Wikka installation.
Comment by TimoK
2005-03-28 11:05:16
That's very good. :) I would still like the RE to be in the wakkaConfig, or alternatively a method isValidPageName(). It might sometimes be useful in actions to check whether a given string _could_ be a pagename without actually creating the page or checking its existence.
Comment by cynebeald.kosire.net
2005-11-11 09:00:01
About the sorting problem: MySQL is able to return properly sorted results (even with respect to the different characters) as long as the correct character set and collation are set for the field. This might not be an option on "international" sites, but it's OK if you have a localized site.
Comment by 212.254.172.236
2006-03-27 05:03:58
Hello, I don't understand whether, and how, it is possible to have page names starting with digits...
Thanks, KraaK
Comment by 134.76.60.193
2006-03-27 05:17:06
Not in the current version. But we are planning to make it possible again in one of the next releases.
Comment by AlanBiocca
2006-07-13 21:13:32
Forgive me if this is not an appropriate suggestion, but is there any possibility that underscores could be allowed in ValidPageNames?
Comment by DarTar
2006-07-15 09:06:11
Alan, see my comments on your userpage.