Revision history for ValidPageNames
Revision [18879]
Last edited on 2008-01-28 00:13:01 by MovieLady [Modified links pointing to docs server]
Additions:
JavaWoman pointed out that Wikka currently restricts valid usernames to camelcase-formatted [[Docs:WikiName WikiNames]]. Is this consistent with the fact that we actually do allow valid pagetags in forced links beyond the camelcase format? And what about special characters in usernames?
Deletions:
Revision [14516]
Edited on 2006-06-09 19:35:33 by MovieLady [suggestion for IanAndolina on charset sorting]
Additions:
~ ''Did you try using [[http://dev.mysql.com/doc/refman/4.1/en/charset-collate.html collate]] in your select statement for the PageIndex? This should be a relatively easy (I think, but don't quote me, as I haven't done anything with charsets in MySQL yet) way to fix that problem by using a defined constant for the [[http://dev.mysql.com/doc/refman/4.1/en/charset.html character sets]] supported by MySQL, and therefore providing more consistent and flexible support for international users. :) --MovieLady''
Additions:
>>**See also**
[[RegexLibrary Regular Expression Library]]
>>
Additions:
CategoryDevelopmentCore CategoryRegex
Deletions:
Revision [6266]
Edited on 2005-02-23 14:01:09 by IanAndolina [Added my experiences of expanding WikiName character support, works but more is needed...]
Additions:
===== Pagename validation =====
{{lastedit}}
I am opening this page to discuss problems related to pagename validation and the underlying regular expressions needed to validate and format both camelcase and forced links.
----
== Current pattern for valid pagetags ==
%%$validtag = "/^[A-Z,a-z,ÄÖÜ,ßäöü]+[A-Z,a-z,0-9,ÄÖÜ,ßäöü]*$/s";%%
Some considerations off the cuff:
~-The German //eszett// (ß) can't appear at the beginning of a word in any language, so we might drop it from the first character class.
~-If we are to allow accented characters in valid page tags (are we?), we should also consider allowing other characters, for instance èéêëñç, that are part of the extended ASCII charset (ISO-8859-1).
~-We should prevent non-escaped URIs from being parsed as pagetags, or at least encode them before applying a validator: http://wikka.jsnx.com/ÄrgerMich is correctly encoded, but what if a user pastes this URL directly in the address field of a browser?
~''Apart from a possible "German" origin, I never understood the bias here to allowing German characters but not non-ASCII characters used in other languages. That said, I don't think an RE should look for a "word" but merely a "string-consisting-of-letters-and-digits-and-starting-with-a-letter". By using a hex encoding inside the RE for "letters" we would also make this encoding-independent, thus not limiting to ISO-8859-1 (why not a Turkish Wiki with Turkish page (and user) names?).''
~~I don't know, I'm a little uncomfortable with the idea of allowing //any// kind of character in a WikiName. AndreaRossato pointed out that a Pagetag and a WikiName should only contain ASCII characters. The question of pagenames in different charsets cannot be addressed IMO without taking some decisions concerning multilanguage support and UTF-8 encoding. Or am I misunderstanding your proposal? -- DarTar
~~~''Well, of course Wikka (and Wakka before it, obviously) //already// allows non-7-bit-ASCII page names - quite obviously intentionally. I'm a bit doubtful about a string like 'ÄrgerMich' not being allowed in something like http://wikka.jsnx.com/ÄrgerMich since this will be rewritten into a query anyway - and a query string can contain anything... In other words, even if a user pastes the string into a browser address bar, it should still //work//. (URL-encoded, it is of course correct anyway, whichever format is used.)
~~~But I'm not proposing to use "any kind of character" - just letters and digits; only taking a clue from the German bias and extending it to anything that can be written in any 8-bit ISO-8859 character set. No UTF-8 needed, it would be completely transparent as long as a character encoding is set (as it should be). It won't allow Chinese (yet), but it will allow Turkish and Icelandic.
~~~Just as the fact that there are already many Wakka/Wikka-based Wikis out there with non-camelcase names is an argument for not imposing camelcase for a valid page name, the fact that there are also Wikis out there using German, Russian, and whatnot, including in page names, is an argument for not limiting to 7-bit US-ASCII. --JavaWoman''
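To make the URL-encoding point concrete, here is a minimal sketch of how the same page name percent-encodes under two different character encodings. The byte strings are written out by hand, so the result does not depend on how this page itself is stored; nothing here is taken from the Wikka code base.

```php
<?php
// The same page name as raw bytes in two encodings.
$latin1 = "\xC4rgerMich";      // "ÄrgerMich" in ISO-8859-1 (Ä is one byte)
$utf8   = "\xC3\x84rgerMich";  // "ÄrgerMich" in UTF-8 (Ä is two bytes)

echo rawurlencode($latin1), "\n"; // %C4rgerMich
echo rawurlencode($utf8), "\n";   // %C3%84rgerMich

// Decoding gives back exactly the bytes that were encoded.
var_dump(rawurldecode('%C4rgerMich') === $latin1); // bool(true)
```

Either form is a valid URL; which one a browser sends for a pasted, unescaped address depends on the browser, so a validator probably has to know which encoding it is looking at.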
~''Also, the commas in that RE are puzzling - do we allow a Wiki name to start with or contain a comma? I think not - and in that case they should go. (See my latest addition to WikkaBugs on this phenomenon of commas in REs!)
~Another thing I find a bit strange is that this RE requires that a tag starts with **two** letters, and may be followed by any number of letters and digits - why not start with a single letter and require at least two alphanumeric characters?''
~''Building on that, let's first set up some RE building blocks:'' %%(php)define('PATTERN_LCLETTER', 'a-z\xe0-\xf6\xf8-\xff');
define('PATTERN_UCLETTER', 'A-Z\xc0-\xd6\xd8-\xdf');
define('PATTERN_LETTER', PATTERN_LCLETTER.PATTERN_UCLETTER);
define('PATTERN_DIGIT', '0-9');%%
~''Now we can use those to build an expression for a valid tag:'' %%(php)$validtag = '/^['.PATTERN_LETTER.']['.PATTERN_LETTER.PATTERN_DIGIT.']+$/';%% ''Note I've also discarded the 's' modifier: if we need to match something that is a string without any whitespace, we don't need to treat multiple lines as a single one. --JavaWoman''
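As a rough check of the proposed expression, the sketch below runs a few sample names through it and, for comparison, through an ASCII-only reduction of the older comma-laden pattern. The helper ##is_valid_tag()## and the variable ##$oldstyle## are made up for this illustration and are not part of Wikka; the subjects are assumed to be ISO-8859-1 bytes.

```php
<?php
// Building blocks as quoted above (single-byte ranges, ISO-8859-1 layout).
define('PATTERN_LCLETTER', 'a-z\xe0-\xf6\xf8-\xff');
define('PATTERN_UCLETTER', 'A-Z\xc0-\xd6\xd8-\xdf');
define('PATTERN_LETTER', PATTERN_LCLETTER.PATTERN_UCLETTER);
define('PATTERN_DIGIT', '0-9');

$validtag = '/^['.PATTERN_LETTER.']['.PATTERN_LETTER.PATTERN_DIGIT.']+$/';

// ASCII-only reduction of the older pattern, kept only to show the comma effect.
$oldstyle = '/^[A-Z,a-z]+[A-Z,a-z,0-9]*$/';

// Hypothetical helper, not part of Wikka: true when $tag matches $pattern.
function is_valid_tag(string $tag, string $pattern): bool
{
    return preg_match($pattern, $tag) === 1;
}

// Subjects are written as ISO-8859-1 bytes so the file encoding does not matter.
var_dump(is_valid_tag('HomePage', $validtag));          // bool(true)
var_dump(is_valid_tag("BelleFran\xe7aise", $validtag)); // bool(true)
var_dump(is_valid_tag("Ni\xf1aHermosa", $validtag));    // bool(true)
var_dump(is_valid_tag('9Lives', $validtag));            // bool(false): starts with a digit
var_dump(is_valid_tag('A', $validtag));                 // bool(false): needs at least two characters
var_dump(is_valid_tag('Home,Page', $validtag));         // bool(false)
var_dump(is_valid_tag('Home,Page', $oldstyle));         // bool(true): a comma inside [...] is a literal
```

The last two lines show why the commas matter: inside a character class a comma is an ordinary character, not a separator.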
I have now implemented JavaWoman's ##define## patterns and they work ([[http://nontroppo.org/wiki/ØýêTeßær]]), though it is important to make sure ##utf8_encode## and ##utf8_decode## are used around the various regexes, e.g. in the formatter:
%%(php) // wiki links!
else if (preg_match("/^[".UCLETTER."]+[".LCLETTER."]+[".UCLETTER.DIGITS."][".ALLCHARS."]*$/", utf8_decode($thing)))
{
return $wakka->Link($thing);
}%%
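A small sketch of why the decoding step matters, assuming the single-byte letter ranges quoted earlier on this page: a UTF-8 encoded name contains continuation bytes (0x80 to 0xBF) that none of those ranges cover, so it only matches once it has been converted to one byte per character. The variable names are invented for the illustration.

```php
<?php
// Single-byte letter class assembled from the ranges quoted earlier on this page.
$letter   = 'a-z\xe0-\xf6\xf8-\xffA-Z\xc0-\xd6\xd8-\xdf';
$validtag = '/^['.$letter.']['.$letter.'0-9]+$/';

$latin1 = "\xD8ye";     // "Øye" in ISO-8859-1: one byte per character
$utf8   = "\xC3\x98ye"; // "Øye" in UTF-8: Ø is the two bytes C3 98

var_dump(preg_match($validtag, $latin1)); // int(1)
var_dump(preg_match($validtag, $utf8));   // int(0): byte 0x98 is outside every range
```

Note that ##utf8_decode## and ##utf8_encode## are deprecated as of PHP 8.2, so on a current PHP a different conversion routine would be needed; which one suits Wikka is left open here.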
The problem is then that actions such as PageIndex do not sort such a page correctly; it ends up under the numeric entries instead: http://nontroppo.org/wiki/PageIndex - this order comes from MySQL and I haven't tried re-sorting it (though I imagine PHP is pretty poor at sorting UTF-8!). I wonder what else may break? --IanAndolina
<<References:
~[[http://www.faqs.org/rfcs/rfc2396.html RFC2396]]
~Uniform Resource Identifiers (URI): Generic Syntax
<<::c::
----
== Current pattern for valid usernames ==
JavaWoman pointed out that Wikka currently restricts valid usernames to camelcase-formatted [[WikiName WikiNames]]. Is this consistent with the fact that we actually do allow valid pagetags in forced links beyond the camelcase format? And what about special characters in usernames?
~-DarTar is an allowed username and it is correctly parsed as a link.
~-SchönesMädchen is an allowed username (you can actually register with this name) and is parsed as a link, even if you force it as ""[[SchönesMädchen]]"": [[SchönesMädchen]].
~-Because of the currently used validation pattern, French users are discriminated against while German users aren't :) - SchönesMädchen is allowed (with the above restrictions) while BelleFrançaise and NiñaHermosa aren't (look BTW at the incorrect WikiName segmentation produced by the //cedilla//). On the other hand, they produce inconsistent links if you force them as ""[[BelleFrançaise]]"" and ""[[NiñaHermosa]]"": [[BelleFrançaise]] [[NiñaHermosa]]. IMO this should be fixed as soon as possible.
~''Using the patterns outlined above should fix this. :) --JavaWoman''
----
Deletions:
{{lastedit}}
I open this page to discuss problems related to pagename validation and the underlying regex that are needed to validate and format both camelcase and forced links.
----
== Current pattern for valid pagetags ==
%%$validtag = "/^[A-Z,a-z,ÄÖÜ,ßäöü]+[A-Z,a-z,0-9,ÄÖÜ,ßäöü]*$/s";%%
Some considerations off the cuff:
~-The German //eszed// (ß) can't appear at the beginning of a word in any language, so we might drop it from the first character class.
~-If we are to allow accented characters in valid page tags (are we?), we should consider allowing also other characters like for instance èéêëñç that are part of the extended ASCII charset (iso-8859-1).
~-We should prevent non-escaped URIs to be parsed as pagetags or at least encode them before applying a validator: http://wikka.jsnx.com/ÄrgerMich is correctly encoded, but what if a user pastes this URL directly in the address field of a browser?
~''Apart from a possible "German" origin, I never understood the bias here to allowing German characters but not non-ASCII characters used in other languages. That said, I don't think an RE should look for a "word" but merely a "string-consisting-of-letters-and-digits-and-starting-with-a-letter". By using a hex encoding inside the RE for "letters" we would also make this encoding-independent, thus not limiting to ISO-8859-1 (why not a Turkish Wiki with Turkish page (and user) names?).''
~~I don't know, I'm a little uncomfortable with the idea of allowing //any// kind of character in a WikiName. AndreaRossato pointed out that a Pagetag and a WikiName should only contain ASCII characters. The question of pagenames in different charsets cannot be addressed IMO without taking some decisions concerning multilanguage support and UTF-8 encoding. Or am I misunderstanding your proposal? -- DarTar
~~~''Well, of course Wikka (and Wakka before it, obviously) //already// allows non-7-bit-ASCII page names - quite obviously intentionally. I'm a bit doubtful about a string like 'ÄrgerMich' not being alllowed in something like http://wikka.jsnx.com/ÄrgerMich since this will be rewritten into a query anyway - and a query string can contain anything... In other words, even if a user pastes the string into a browser address bar, it should still //work//. (URL-encoded, it is of course correct anyway, whichever format is used.)
~~~But I'm not proposing to use "any kind of character" - just letters and digits; only taking a clue from the German bias and extending it to anything that can be written in any 8-bit ISO-8859 character set. No UTF-8 needed, it would be completely transparent as long as a character encoding is set (as it should be). It won't allow Chinese (yet), but it will allow Turkish and Icelandic.
~~~Just as the fact that there are already many Wakka/Wikka-based Wikis out there with non-camelcase names is an argument for not imposing camelcase for a valid page name, the fact that there are also Wikis out there using German, Russian, and what not including in page names is an argument for not limiting to 7-bit US-ASCII. --JavaWoman''
~''Also, the commas in that RE are puzzling - do we allow a Wiki name to start with or contain a comma? I think not - and in that case they should go. (See my latest addition to WikkaBugs on this phenomenon of commas in REs!)
~Another thing I find a bit strange is that this RE requires that a tag starts with **two** letters, and may be followed by any number of letters and digits - why not start with a single letter and require at least two alphanumeric characters?''
~''Building on that, let's first set up some RE building blocks:'' %%(php)define('PATTERN_LCLETTER', 'a-z\xdf-\xf6\xf8-\xff');
define('PATTERN_UCLETTER', 'A-Z\xc0-\xd6\xd8-\xdf');
define('PATTERN_LETTER', PATTERN_LCLETTER.PATTERN_UCLETTER);
define('PATTERN_DIGIT', '0-9');%%
~''Now we can use those to build an expression for a valid tag:'' %%(php)$validtag = '/^['.PATTERN_LETTER.']['.PATTERN_LETTER.PATTERN_DIGIT.']+$/';%% ''Note I've also discarded the 's' modifier: if we need to match something that is a string without any whitespace, we don't need to treat multiple lines as a single one.
~ --JavaWoman''
<<References:
~[[http://www.faqs.org/rfcs/rfc2396.html RFC2396]]
~Uniform Resource Identifiers (URI): Generic Syntax
<<::c::
----
== Current pattern for valid usernames ==
JavaWoman pointed out that Wikka currently restricts valid usernames to camelcase-formatted [[WikiName WikiNames]]. Is this consistent with the fact that we actually do allow valid pagetags in forced links beyond the camelcase format? And what about special characters in usernames?
~-DarTar is an allowed username and it is correctly parsed as a link.
~-SchönesMädchen is an allowed username (you can actually register with this name) and is parsed as a link, also if you force it as ""[[SchönesMädchen]]"": [[SchönesMädchen]] .
~-Because of the currently used validation pattern, French users are discriminated while German users aren't :) - SchönesMädchen is allowed (with the above restrictions) while BelleFrançaise or NiñaHermosa aren't (look BTW at the incorrect WikiName segmentation produced by the //cedille//). On the other hand, they produce inconsistent links if you force them as ""[[BelleFrançaise]]"" and ""[[NiñaHermosa]]"": [[BelleFrançaise]] [[NiñaHermosa]]). This should be IMO fixed as soon as possible.
~''Using the patterns outlined above should fix this. :) --JavaWoman''
----
Revision [3272]
Edited on 2004-12-15 22:34:18 by DarTar [Moving link formatter discussion to WantedFormatters]
Additions:
===== Pagename validation =====
Deletions:
Revision [3271]
Edited on 2004-12-15 22:33:15 by DarTar [Moving link formatter discussion to WantedFormatters]
Deletions:
I think that the current forced link formatter should be improved to allow //GET parameters//, //anchors// and //titles// to be parsed as part of valid internal links.
For example it would be nice if we could not only use forced links like:
~##""[[HomePage Internal forced link]]""##
or
~##""[[http://www.google.com External forced link]]""##
but also the following:
~-Forced internal link with URL parameter
~~##""[[HomePage (? "par1=ba,par2=bo") Internal forced link]]""##
~~=> http://wikka.jsnx.com/HomePage?par1=ba&par2=bo
~-Forced internal link with anchor
~~##""[[HomePage (# "this") Internal forced link]]""##
~~=> http://wikka.jsnx.com/HomePage#this
~-Forced internal link with Title
~~##""[[HomePage (§ "This is a link to the HomePage") Internal forced link]]""##
But I don't have a clue how to modify the current formatter so that it passes all this to the ##Link()## function.
~''I like this idea very much, especially being able to add a title. A few remarks, no particular order:
~~-The paragraph sign § is not present on many keyboards (though probably on yours); I propose to use an exclamation mark instead.
~~~-Good point --DarTar
~~-Adding a title would also be useful for (forced) external links, not just internal ones; could use the same syntax, of course.
~~~-Good point as well --DarTar
~~-How to combine these various options? Each in a separate pair of brackets, all in a single pair of brackets together? I have a preference for the latter but haven't looked at any implications for the Formatter yet.
~~-To combine query parameters they should be separated with ##&amp;##, not a single ##&## which is invalid in HTML--- --JavaWoman
~~~-This has to be done by the formatter, not by the user. -- DarTar
~~~~-Ah, but when you give an example of what it //should// result in, that example should show what this responsible formatter would do. ;-) Currently it has a single ##&## which is clearly incorrect. --JW''
-- DarTar
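For illustration only, here is a sketch of what a formatter-side helper for the proposed syntax might do once the tag, the comma-separated parameter string and the anchor have been extracted from the forced link. ##build_forced_link_url()## is a hypothetical function, not part of Wikka, and the extraction step itself is not shown. It joins the parameters with the entity form of the ampersand so that the result can be placed inside an HTML ##href## attribute.

```php
<?php
// Hypothetical helper, not part of Wikka: assembles the target URL of a forced
// internal link from its already-extracted parts.
function build_forced_link_url(string $base, string $tag, string $params = '', string $anchor = ''): string
{
    $url = $base . rawurlencode($tag);
    if ($params !== '') {
        $pairs = [];
        // The proposal writes parameters as "par1=ba,par2=bo".
        foreach (explode(',', $params) as $pair) {
            [$name, $value] = array_pad(explode('=', $pair, 2), 2, '');
            $pairs[] = rawurlencode(trim($name)) . '=' . rawurlencode(trim($value));
        }
        // Entity form of the ampersand, so the URL is valid inside an HTML attribute.
        $url .= '?' . implode('&amp;', $pairs);
    }
    if ($anchor !== '') {
        $url .= '#' . rawurlencode($anchor);
    }
    return $url;
}

echo build_forced_link_url('http://wikka.jsnx.com/', 'HomePage', 'par1=ba,par2=bo'), "\n";
// http://wikka.jsnx.com/HomePage?par1=ba&amp;par2=bo
echo build_forced_link_url('http://wikka.jsnx.com/', 'HomePage', '', 'this'), "\n";
// http://wikka.jsnx.com/HomePage#this
```

Doing the escaping in the helper keeps it out of the user's hands, which is the division of labour suggested in the discussion above.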
Revision [2794]
Edited on 2004-12-03 13:20:36 by JavaWoman [replies to DarTar, reference to comma bug, ypots]
Additions:
~''Apart from a possible "German" origin, I never understood the bias here to allowing German characters but not non-ASCII characters used in other languages. That said, I don't think an RE should look for a "word" but merely a "string-consisting-of-letters-and-digits-and-starting-with-a-letter". By using a hex encoding inside the RE for "letters" we would also make this encoding-independent, thus not limiting to ISO-8859-1 (why not a Turkish Wiki with Turkish page (and user) names?).''
~~I don't know, I'm a little uncomfortable with the idea of allowing //any// kind of character in a WikiName. AndreaRossato pointed out that a Pagetag and a WikiName should only contain ASCII characters. The question of pagenames in different charsets cannot be addressed IMO without taking some decisions concerning multilanguage support and UTF-8 encoding. Or am I misunderstanding your proposal? -- DarTar
~~~''Well, of course Wikka (and Wakka before it, obviously) //already// allows non-7-bit-ASCII page names - quite obviously intentionally. I'm a bit doubtful about a string like 'ÄrgerMich' not being alllowed in something like http://wikka.jsnx.com/ÄrgerMich since this will be rewritten into a query anyway - and a query string can contain anything... In other words, even if a user pastes the string into a browser address bar, it should still //work//. (URL-encoded, it is of course correct anyway, whichever format is used.)
~~~But I'm not proposing to use "any kind of character" - just letters and digits; only taking a clue from the German bias and extending it to anything that can be written in any 8-bit ISO-8859 character set. No UTF-8 needed, it would be completely transparent as long as a character encoding is set (as it should be). It won't allow Chinese (yet), but it will allow Turkish and Icelandic.
~~~Just as the fact that there are already many Wakka/Wikka-based Wikis out there with non-camelcase names is an argument for not imposing camelcase for a valid page name, the fact that there are also Wikis out there using German, Russian, and what not including in page names is an argument for not limiting to 7-bit US-ASCII. --JavaWoman''
~''Also, the commas in that RE are puzzling - do we allow a Wiki name to start with or contain a comma? I think not - and in that case they should go. (See my latest addition to WikkaBugs on this phenomenon of commas in REs!)
~~-Adding a title would also be useful for (forced) external links, not just internal ones; could use the same syntax, of course.
~~-To combine query parameters they should be separated with ##&amp;##, not a single ##&## which is invalid in HTML--- --JavaWoman
~~~~-Ah, but when you give an example of what it //should// result in, that example should show what this responsible formatter would do. ;-) Currently it has a single ##&## which is clearly incorrect. --JW''
Deletions:
~~I don't know, I'm a little uncomfortable with the idea of allowing //any// kind of character in a WikiName. AndreaRossato pointed out that a Pagetag and a WikiName should only contain ASCII characters. The question of pagenames in different charsets cannot be addressed IMO without taking some decisions concerning multilanguage support and UTF-8 encoding. Or am I misunderstanding your proposal? -- DarTar
~Also, the commas in that RE are puzzling - do we allow a Wiki name to start with or contain a comma? I think not - and in that case they should go.
~~-Adding a title would also be usefult for (forced) external links, not just internal ones; could use teh same syntax, of course.
~~-To combine query parameters they should be separated with ##&amp;##, not a single ##&## which is invalid in HTML--- --JavaWoman''
Additions:
~~I don't know, I'm a little uncomfortable with the idea of allowing //any// kind of character in a WikiName. AndreaRossato pointed out that a Pagetag and a WikiName should only contain ASCII characters. The question of pagenames in different charsets cannot be addressed IMO without taking some decisions concerning multilanguage support and UTF-8 encoding. Or am I misunderstanding your proposal? -- DarTar
~~~-Good point --DarTar
~~~-Good point as well --DarTar
~~~-This has to be done by the formatter, not by the user. -- DarTar
Additions:
~''I like this idea very much, especially being able to add a title. A few remarks, no particular order:
~~-The paragraph sign § is not present on many keyboards (though probably on yours); I propose to use an exclamation mark instead.
~~-Adding a title would also be usefult for (forced) external links, not just internal ones; could use teh same syntax, of course.
~~-How to combine these various options? Each in a separate pair of brackets, all in a single pair of brackets together? I have a preference for the latter but haven't looked at any implications for the Formatter yet.
~~-To combine query parameters they should be separated with ##&amp;##, not a single ##&## which is invalid in HTML--- --JavaWoman''
Additions:
~''Using the patterns outlined above should fix this. :) --JavaWoman''
Additions:
~''Building on that, let's first set up some RE building blocks:'' %%(php)define('PATTERN_LCLETTER', 'a-z\xdf-\xf6\xf8-\xff');
define('PATTERN_UCLETTER', 'A-Z\xc0-\xd6\xd8-\xdf');
define('PATTERN_LETTER', PATTERN_LCLETTER.PATTERN_UCLETTER);
define('PATTERN_DIGIT', '0-9');%%
~''Now we can use those to build an expression for a valid tag:'' %%(php)$validtag = '/^['.PATTERN_LETTER.']['.PATTERN_LETTER.PATTERN_DIGIT.']+$/';%% ''Note I've also discarded the 's' modifier: if we need to match something that is a string without any whitespace, we don't need to treat multiple lines as a single one.
Deletions:
define('UCLETTER', 'A-Z\xc0-\xd6\xd8-\xdf');
define('LETTER', LCLETTER.UCLETTER);
define('DIGIT', '0-9');%%
~''Now we can use those to build an expression for a valid tag:'' %%(php)$validtag = '/^['.LETTER.']['.LETTER.DIGIT.']+$/';%% ''Note I've also discarded the 's' modifier: if we need to match something that is a string without any whitespace, we don't need to treat multiple lines as a single one.
Additions:
~''Apart from a possible "German" origin, I never understood the bias here to allowing German characters but not non-ASCII characters used in other languages. That said, I don't think an RE should look for a "word" but merely a "string-consisting-of-letters-and-digits-and-starting-with-a-letter". By using a hex encoding inside the RE for "letters" we would also make this encoding-independent, thus not limiting to ISO-8859-1 (why not a Turkish Wiki with Turkish page (and user) names?).
~Another thing I find a bit strange is that this RE requires that a tag starts with **two** letters, and may be followed by any number of letters and digits - why not start with a single letter and require at least two alphanumeric characters?''
~''Building on that, let's first set up some RE building blocks:'' %%(php)define('LCLETTER', 'a-z\xdf-\xf6\xf8-\xff');
define('UCLETTER', 'A-Z\xc0-\xd6\xd8-\xdf');
define('LETTER', LCLETTER.UCLETTER);
define('DIGIT', '0-9');%%
~''Now we can use those to build an expression for a valid tag:'' %%(php)$validtag = '/^['.LETTER.']['.LETTER.DIGIT.']+$/';%% ''Note I've also discarded the 's' modifier: if we need to match something that is a string without any whitespace, we don't need to treat multiple lines as a single one.
~ --JavaWoman''
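The building blocks proposed above can be tried out with a short script. This is only a sketch: ##is_valid_tag()## is a hypothetical helper, not a Wikka function, and the sample names are written as ISO-8859-1 byte strings (an assumption about the wiki's encoding) so that the result does not depend on how the script file itself is saved.
%%(php)
<?php
// RE building blocks as proposed above (byte ranges are ISO-8859-1 values)
define('LCLETTER', 'a-z\xdf-\xf6\xf8-\xff');
define('UCLETTER', 'A-Z\xc0-\xd6\xd8-\xdf');
define('LETTER', LCLETTER.UCLETTER);
define('DIGIT', '0-9');

// One letter, followed by at least one letter or digit
$validtag = '/^['.LETTER.']['.LETTER.DIGIT.']+$/';

// Hypothetical helper (not part of Wikka): true if $tag matches $pattern
function is_valid_tag($tag, $pattern)
{
    return preg_match($pattern, $tag) === 1;
}

// Labels are for display only; values are ISO-8859-1 byte strings
$samples = array(
    'HomePage'       => 'HomePage',
    'SchönesMädchen' => "Sch\xf6nesM\xe4dchen",
    'BelleFrançaise' => "BelleFran\xe7aise",
    'NiñaHermosa'    => "Ni\xf1aHermosa",
    '1Page'          => '1Page',      // starts with a digit
    'Home,Page'      => 'Home,Page',  // contains a comma
    'A'              => 'A',          // too short: needs two characters
);

$results = array();
foreach ($samples as $label => $tag) {
    $results[$label] = is_valid_tag($tag, $validtag);
    echo $label, ': ', ($results[$label] ? 'valid' : 'invalid'), "\n";
}
%%
With these ranges the first four names are accepted and the last three are rejected. The pattern is matched byte by byte (no ##u## modifier), so it only behaves this way for single-byte encodings.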
Deletions:
~Give me a moment and I'll come up with an alternative RE to match what I propose... --JavaWoman''
Additions:
~''Apart from a possible "German" origin, I never understood the bias here to allowing German characters but not non-ASCII characters used in other languages. That said, I don't think an RE should look for a "word" but merely a "string-consisting-of-letters-and-digits-and-starting-with-a-letter". By using a hex encoding inside the RE for "letters" we would also make this encoding-independent, thus not limiting to ISO-8859-1 (why not a Turkish Wiki with Turkish page (and user) names?).
~Also, the commas in that RE are puzzling - do we allow a Wiki name to start with or contain a comma? I think not - and in that case they should go.
~Give me a moment and I'll come up with an alternative RE to match what I propose... --JavaWoman''
Additions:
~-SchönesMädchen is an allowed username (you can actually register with this name) and is parsed as a link, also if you force it as ""[[SchönesMädchen]]"": [[SchönesMädchen]] .
Deletions:
Additions:
JavaWoman pointed out that Wikka currently restricts valid usernames to camelcase-formatted [[WikiName WikiNames]]. Is this consistent with the fact that we actually do allow valid pagetags in forced links beyond the camelcase format? And what about special characters in usernames?
~-DarTar is an allowed username and it is correctly parsed as a link.
~-SchönesMädchen is an allowed username (you can actually register with this name) but won't be parsed as a link, not even if you force it as ""[[SchönesMädchen]]"": [[SchönesMädchen]] .
~-Because of the currently used validation pattern, French and Spanish users are discriminated against while German users aren't :) - SchönesMädchen is allowed (with the above restrictions) while BelleFrançaise or NiñaHermosa aren't (look BTW at the incorrect WikiName segmentation produced by the //cedilla//). On the other hand, they produce inconsistent links if you force them as ""[[BelleFrançaise]]"" and ""[[NiñaHermosa]]"": [[BelleFrançaise]] [[NiñaHermosa]]. IMO this should be fixed as soon as possible.
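The different treatment of these names can be reproduced against the current pagetag pattern. The sketch below assumes ISO-8859-1 and therefore writes the literal characters ÄÖÜßäöü of the original pattern as hex escapes; the variable names are illustrative only.
%%(php)
<?php
// Current pattern, with ÄÖÜ and ßäöü written as ISO-8859-1 hex escapes (assumption)
$oldtag = '/^[A-Z,a-z,\xc4\xd6\xdc,\xdf\xe4\xf6\xfc]+[A-Z,a-z,0-9,\xc4\xd6\xdc,\xdf\xe4\xf6\xfc]*$/s';

// Sample names as ISO-8859-1 byte strings
$german  = preg_match($oldtag, "Sch\xf6nesM\xe4dchen") === 1; // ö and ä are in the class
$french  = preg_match($oldtag, "BelleFran\xe7aise") === 1;    // ç is not
$spanish = preg_match($oldtag, "Ni\xf1aHermosa") === 1;       // ñ is not
$comma   = preg_match($oldtag, 'Home,Page') === 1;            // the commas are literal

var_dump($german, $french, $spanish, $comma);
%%
The German name matches, the French and Spanish names do not, and a name containing a comma is accepted because the commas inside the brackets are ordinary class members.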
Deletions:
Additions:
== Current pattern for valid usernames ==
JavaWoman pointed out that Wikka currently restricts valid usernames to camelcase pagetags. Is this consistent with the fact that we actually do allow valid pagetags in forced links beyond the camelcase format?
Additions:
~-We should prevent non-escaped URIs from being parsed as pagetags, or at least encode them before applying a validator: http://wikka.jsnx.com/ÄrgerMich is correctly encoded, but what if a user pastes this URL directly into the address field of a browser?
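A minimal sketch of the encode-then-validate idea, assuming an ISO-8859-1 wiki where "Ä" is the single byte 0xC4. The pattern and variable names are illustrative, not Wikka's actual request handling, and a browser may send a different byte sequence (for example UTF-8) that this sketch does not cover.
%%(php)
<?php
// "ÄrgerMich" as an ISO-8859-1 byte string (assumption about the wiki encoding)
$tag = "\xc4rgerMich";

// Percent-encode the tag before using it in a URL path
$encoded = rawurlencode($tag);

// On the way in: decode the path segment first, then validate the result
$decoded = rawurldecode($encoded);

// Illustrative pattern: a letter followed by at least one letter or digit
$letters  = 'A-Za-z\xc0-\xd6\xd8-\xf6\xf8-\xff';
$validtag = '/^['.$letters.']['.$letters.'0-9]+$/';
$isValid  = preg_match($validtag, $decoded) === 1;

echo $encoded, "\n";   // %C4rgerMich
var_dump($isValid);    // bool(true)
%%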
Deletions:
Additions:
~-We should prevent non-escaped URIs from being parsed as pagetags, or at least encode them before applying a validator: try http://wikka.jsnx.com/ÄrgerMich.
Deletions:
Additions:
<<References:
~[[http://www.faqs.org/rfcs/rfc2396.html RFC2396]]
~Uniform Resource Identifiers (URI): Generic Syntax
<<::c::
Additions:
~-If we are to allow accented characters in valid page tags (are we?), we should also consider allowing other characters such as èéêëñç that are part of the extended ASCII charset (ISO-8859-1).
Deletions:
Additions:
~-The German //eszett// (ß) can't appear at the beginning of a word in any language, so we might drop it from the first character class.
~-The current validation pattern does not properly handle words beginning with //umlaut//, try with http://wikka.jsnx.com/Ärger/edit
~-If we are to allow accented characters in valid page tags (are we?), we should consider allowing also other characters like for instance èéêëñç that are part of the extended ASCII charset.
Deletions:
~The current validation pattern does not properly handle words beginning with //umlaut//, try with http://wikka.jsnx.com/Ärger/edit
~If we are to allow accented characters in valid page tags (are we?), we should consider allowing also other characters like for instance èéêëñç that are part of the extended ASCII charset.
Additions:
%%$validtag = "/^[A-Z,a-z,ÄÖÜ,ßäöü]+[A-Z,a-z,0-9,ÄÖÜ,ßäöü]*$/s";%%
Deletions:
Additions:
===== Pagename validation and link formatters =====
== Current pattern for valid pagetags ==
%%$validtag = "/^[A-Z,a-z,ÄÖÜ,ßäöü]+[A-Z,a-z,0-9,ÄÖÜ,ßäöü]*$/s"%%
Some considerations off the cuff:
~The German //eszett// (ß) can't appear at the beginning of a word in any language, so we might drop it from the first character class.
~The current validation pattern does not properly handle words beginning with //umlaut//, try with http://wikka.jsnx.com/Ärger/edit
~If we are to allow accented characters in valid page tags (are we?), we should consider allowing also other characters like for instance èéêëñç that are part of the extended ASCII charset.
Deletions:
Revision [2747]
Edited on 2004-12-02 10:26:43 by DarTar [New page + proposal on improvement of forced links]Additions:
~~##""[[HomePage (? "par1=ba,par2=bo") Internal forced link]]""##
~~=> http://wikka.jsnx.com/HomePage?par1=ba&par2=bo
~~##""[[HomePage (# "this") Internal forced link]]""##
~~##""[[HomePage (§ "This is a link to the HomePage") Internal forced link]]""##
Deletions:
~~=> http://wikka.jsnx.com/HomePage?par1=test&par2=bo
~~##""[[HomePage (# this) Internal forced link]]""##
~~##""[[HomePage (title "This is a link to the HomePage") Internal forced link]]""##