Character substitutions

This is the coordination page for: Character substitutions in Tiki

The code started in r43471.

Background

In Tiki, page names are case insensitive. "Commit Code", "commit code" and "COMMIT CODE" are all equivalent.
sylvieg: Are you sure? MySQL is case insensitive, but I do not think Postgres is. I am unable to find a place in the code where a strtolower is done.
Jyhem: I believe Tiki does not enforce this; your database does. If I create a MySQL database with "utf8_unicode_ci", it behaves as Marc describes. When I create the database with "utf8_bin", it does not: "test" and "TEST" are different pages.
This makes things simpler for end users and has never, AFAIK, been reported as an issue on the wishlist. There is also a substitution for Microsoft Word special characters.
sylvieg: So far, all I have found in the code is a UTF-8 normalization, nothing else.
alain_desilets: I think character substitutions make sense for Latin languages. But what do we do with Chinese, for example?
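
Since the comments above suggest that case insensitivity currently comes from the database collation rather than from Tiki itself, here is a minimal sketch of what an application-level comparison key could look like. It is written in Python for brevity and is not Tiki code; `page_key`, `create_page` and the in-memory `pages` dictionary are hypothetical names used only for illustration.

```python
import unicodedata

def page_key(name: str) -> str:
    """Return a comparison key that does not depend on the database collation."""
    # NFC first, so composed and decomposed spellings of a letter compare equal,
    # then casefold() for a Unicode-aware lower-casing.
    return unicodedata.normalize("NFC", name).casefold()

pages = {}  # hypothetical store: comparison key -> name as first created

def create_page(name: str) -> str:
    """Create the page unless an equivalent name exists; return the stored name."""
    return pages.setdefault(page_key(name), name)

print(create_page("Commit Code"))  # Commit Code
print(create_page("COMMIT CODE"))  # Commit Code (the existing page is reused)
```

With a key like this, "test" and "TEST" would resolve to the same page whether the database uses "utf8_unicode_ci" or "utf8_bin". It does nothing for scripts without letter case, which is the open question for Chinese.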

What about accents? spaces? underscores? plus? slash? etc.

Do we really want a page "Déjà vu" and a different page "Deja vu"? Probably not. Thus, we should identify characters which would serve as aliases. Let's determine this together.

Since wiki page names should avoid special characters, we should consider using character substitutions in page names (a instead of à, etc.) and use the description field for the exact format. Using the description is also useful for very long wiki page names.
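
As a rough illustration of the "a instead of à" idea, the following Python sketch strips diacritics by Unicode decomposition and keeps the exact spelling for the description field. The function names are hypothetical and this is only one possible approach, not what Tiki implements.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Replace accented letters with their unaccented base letters."""
    # NFD splits "é" into "e" plus a combining accent; the accent is then dropped.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def name_and_description(title: str):
    """Return (page name without diacritics, description holding the exact title)."""
    return strip_diacritics(title), title

print(name_and_description("Déjà vu"))  # ('Deja vu', 'Déjà vu')
print(strip_diacritics("mošt"))         # most
```

Letters that have no decomposition (for example "ø" or "ß") pass through unchanged, so a real substitution table would need explicit entries for them. The second call also shows that two different words can collapse into one name.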

Objectives

Easy for the end user
Clean URLs (avoid %20, and similar)
Robust
Handle cases where there is a desire for similar but different page names, where current behavior is a feature, not a usability bug.
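
To make the "clean URLs" objective concrete, this small Python sketch shows which page names survive percent-encoding untouched. It relies only on the standard library's `urllib.parse.quote`; the page names are examples, and the hypothetical helper `url_form` exists only for this illustration.

```python
from urllib.parse import quote

def url_form(name: str) -> str:
    """Percent-encode everything except RFC 3986 unreserved characters."""
    # Unreserved characters are letters, digits, "-", ".", "_" and "~".
    return quote(name, safe="")

for name in ["Commit Code", "Commit-Code", "Visual C++", "D\u00e9j\u00e0 vu"]:
    print(f"{name!r} -> {url_form(name)}")
# 'Commit Code' -> Commit%20Code
# 'Commit-Code' -> Commit-Code
# 'Visual C++' -> Visual%20C%2B%2B
# 'Déjà vu' -> D%C3%A9j%C3%A0%20vu
```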

Questions

How does Wikipedia do it?
- http://en.wikipedia.org/wiki/Help:Page_name
Should it be aliases, redirect or substitutions?
- Aliases: all work
- Redirect: all work, but you are redirected to the cleaner URL, for nicer copy-pasting
- Substitutions: when you create a page with a special character, Tiki swaps for another character.
What is universally accepted in URLs? (without conversion to %20 or similar) (This is standardized in RFC 3986)
What about languages like Arabic and Mandarin?
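
The difference between aliases, redirects and substitutions in the question above can be sketched as follows. This is a toy model in Python under stated assumptions, not a proposal for the actual implementation: `clean`, `create` and `request` are hypothetical helpers, and the only "special character" handled is the space.

```python
def clean(name: str) -> str:
    """Hypothetical clean form: spaces become hyphens."""
    return name.replace(" ", "-")

def create(pages: dict, name: str, mode: str) -> str:
    """Store a page under its clean key and return the name that is kept."""
    key = clean(name)
    # Only "substitution" rewrites the name itself at creation time.
    pages[key] = key if mode == "substitution" else name
    return pages[key]

def request(pages: dict, requested: str, mode: str):
    """Return (action, name) for an incoming page request."""
    key = clean(requested)
    if mode == "substitution":
        # Only the swapped spelling exists.
        return ("serve", requested) if requested in pages else ("missing", requested)
    if key not in pages:
        return ("missing", requested)
    if mode == "redirect" and requested != key:
        # Every spelling works, but the visitor lands on the clean URL.
        return ("redirect", key)
    # Alias: every spelling works and the URL is left as typed.
    return ("serve", requested)

for mode in ("alias", "redirect", "substitution"):
    pages = {}
    kept = create(pages, "Commit Code", mode)
    print(mode, repr(kept), request(pages, "Commit Code", mode))
# alias 'Commit Code' ('serve', 'Commit Code')
# redirect 'Commit Code' ('redirect', 'Commit-Code')
# substitution 'Commit-Code' ('missing', 'Commit Code')
```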

Suggested alias/substitutions

Character	Could/Should be	But
a-z and A-Z	No special handling
Parentheses ( ( or ) )	No special handling
All characters with diacritics (à, ç, é, î...)	Equivalent without diacritics (a, c, e, i...)	luci: I see a problem with ignoring accents. For example, in Czech "Most" and "mošt" are two very different things (a bridge and a cider). Just my 2 Czech crowns... Chealer: French (and presumably most languages with diacritics) is the same.
Space ( )	hyphen (-)
Plus (+)	hyphen (-)	This is problematic. If I create a "Visual C++" page, I do not want it named "Visual C--". And this kind of issue probably applies to all substitutions.
Apostrophe (')	hyphen (-)
Colon (:)	hyphen (-)
SemiColon (;)	hyphen (-)
Slash (/)	hyphen (-)
Backslash (\)	hyphen (-)
Pipe (|)
Ampersand	(&)	So no substitution?
At (@)	should find an email address if we just search for prefix or suffix	What?
number sign	( # )	So no substitution?
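
To see the "Visual C++" objection in practice, here is a short Python sketch that applies the hyphen substitutions proposed in the table. The mapping is built only from the rows that name a hyphen as the target; `TO_HYPHEN` and `substitute` are hypothetical names.

```python
# Characters the table proposes to replace with a hyphen:
# space, plus, apostrophe, colon, semicolon, slash and backslash.
TO_HYPHEN = str.maketrans({ch: "-" for ch in " +':;/\\"})

def substitute(name: str) -> str:
    """Apply the proposed character substitutions to a page name."""
    return name.translate(TO_HYPHEN)

print(substitute("Visual C++"))                    # Visual-C--
print(substitute("Visual C--"))                    # Visual-C--
print(substitute("AC/DC") == substitute("AC-DC"))  # True
```

Any many-to-one mapping of this kind makes distinct names collide, so it would have to be paired with a rule for what happens when the substituted name already exists.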

Discussion on #wiki about # in pagenames

12:21 ThomasWaldmann: how are different wiki markups handling # char in page names vs. # char for separation of a target anchor?
12:22 ThomasWaldmann: e.g. [[Problem #1]] vs. [[LongPage#someanchor]]
12:23 kensanata: As far as Oddmuse is concerned, such a name would be illegal.
12:25 ThomasWaldmann: i had a glance into the creole spec, but they don't seem to address that problem
12:26 ThomasWaldmann: kensanata: so now way to have # in a pagename somehow?
12:26 ThomasWaldmann: s/now/no/
12:26 ThomasWaldmann: what does oddmuse do with foo#bar#baz as target?
12:27 kensanata: That's page foo with anchor bar#baz and handled by the browser.
12:28 ThomasWaldmann had the vague idea of just splitting the rightmost # part away as anchor
12:29 ThomasWaldmann: so it could be [[Problem #1#]], the anchor would be empty in that case and the #1 would be escaped as usual for pagenames
12:29 kensanata: I think the CGI script doesn't get anything to the right of the first # — the browser handles it.
12:30 kensanata: You would have to use foo%23bar#baz — then the script gets to see foo#bar as a parameter where as the browser handles the jumping to the anchor named baz.
12:31 ThomasWaldmann: that's what I mean with escaping
12:32 kensanata: Ah. Now I see what nefarious short cuts you're trying to do...
12:32 ThomasWaldmann: 😀
12:33 kensanata: [[Problem #1]] — [[I love C#]] — [[C# Problem #1]] — [[C# Problems#1]]
12:33 ThomasWaldmann: it's maybe not that nefarious as # is invalid in a name afair
12:33 kensanata: I'd say only the last one of my examples involves a named anchor.
12:33 kensanata: So context (preceding whitespace) must be important, I think.
12:34 ThomasWaldmann: yes, but that is introducing language semantics
12:34 kensanata: Indeed, and we are talking about semantics, no?
12:34 kensanata: 😊
12:34 ThomasWaldmann: of links, but not of english
12:34 kensanata: But... but...
12:35 kensanata: There is one symbol (#) and two ways to interpret it. What does a # mean? This is semantics.
12:35 ThomasWaldmann: also, what about the page about cpp instructions, like #define? 😊
12:36 ThomasWaldmann: yeah, that is the semantics of how to create the link wanted
12:36 kensanata: I guess all I'm saying is that figuring out whether a # is intended to be part of a pagename or not is a tricky problem.
12:36 ThomasWaldmann: but assuming that there must be a blank before the # is just for the usual spoken english usage of #
12:37 kensanata: Oh, I do agree that my proposed heuristic is less than perfect.
12:37 kensanata: Which is why Oddmuse doesn't try to solve this problem. 😊
12:37 ThomasWaldmann: yes, it's tricky, that's why I am asking 😊
12:38 MartinCleaver has joined
12:38 ThomasWaldmann: for moin I have to change something, as for interwiki it currently "thinks" that foo#bar is a pagename, while for internal page links it thinks it is a page with an anchor
12:40 ThomasWaldmann tries to find the C# page on WP
12:40 ThomasWaldmann: For technical reasons, C# redirects here. For uses of C#, see C-sharp.
12:40 ThomasWaldmann: here == C
12:41 ThomasWaldmann: ok, so a # at the end of the pagename is obviously not possible with MediaWiki
12:48 ThomasWaldmann: TheSheep: does hatta allow # in pagenames?
12:50 TheSheep: no
21:37 marclaporte: ThomasWaldmann: I have no answer for you but some related stuff we are struggling with : http://dev.tiki.org/Character+substitutions
22:08 ThomasWaldmann: marclaporte: thanks
22:09 ThomasWaldmann: btw, we had that space > _ thing and later we had to remove it again which was a pita
22:10 ThomasWaldmann: nowadays I try to do as little "magic" as possible
22:11 marclaporte: can you tell me more about space > _ thing ?
22:11 ThomasWaldmann: well, I did a similar thing a while ago as you have on that page
22:12 ThomasWaldmann: i also had the idea "lets look how mw handles it"
22:12 ThomasWaldmann: then I introduced " " > "_" (and "_" > "_") so that URLs are prettier
22:12 ThomasWaldmann: no %20
22:13 ThomasWaldmann: later I had to map filesystem files into the wiki page namespace
22:13 ThomasWaldmann: kind of "virtual pages", a file browser within the wiki gui
22:14 ThomasWaldmann: of course there are files like foo_bar.txt, but also "foo bar.txt"
22:14 ThomasWaldmann: and the names are given
22:14 ThomasWaldmann: in the worst case, you could even have "foo_bar baz.txt"
22:15 ThomasWaldmann: so confusing (unifying) " " and "_" was not really helpful
22:16 ThomasWaldmann: thus I removed that again. it was a pita because after throwing that together, you can not automatically decide what it was or should be, thus the conversion had a manual step with the user deciding whether he wants a " " or "_"
22:17 ThomasWaldmann: (and even without that "fs browser", we had to revert that because soon we'll have unified wiki pages and file items)
22:19 ThomasWaldmann: the idea of prettier urls became a bit less important recently btw, because some browsers nowadays render those urls prettier and transform %20 to space themselves
22:19 ThomasWaldmann: it can be a problem for c&p though if you do it wrong
22:19 marclaporte: I think I noticed this in FF3
22:20 ThomasWaldmann: yeah
22:20 ThomasWaldmann: some mac browser did that before
22:20 ThomasWaldmann: and if you really want _, you just have to use _ 😊
22:21 ThomasWaldmann: btw, TheSheep had a good idea on #moin:
22:22 ThomasWaldmann: because [[target#anchor]] is a bit problematic with targets including #,
22:22 ThomasWaldmann: [[target|label|#anchor]] seems to be the best long term solution
22:23 ThomasWaldmann: that 3rd segment of link syntax is already used by us for:
22:23 ThomasWaldmann: foo=bar (will get into <a> tag as attr)
22:23 ThomasWaldmann: &key=val (will get into query string)
22:23 ThomasWaldmann: #name (fragment, new)
22:24 ThomasWaldmann: I found that having this separate is not only cleaner, but that if you handle it together with the page name,
22:25 ThomasWaldmann: things can go wrong if you mix it with a query string:
22:26 ThomasWaldmann: .../pagename#fragment?key=val is incorrect, rfc says it must be .../pagename?
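
The behaviour discussed in the log can be checked with a short Python sketch. The first function uses the standard library's `urlsplit`, which cuts the fragment at the first "#" in the same way kensanata describes for browsers; the second is a hypothetical implementation of the "split at the rightmost #" idea. The host `example.org` and both function names are assumptions made for the example.

```python
from urllib.parse import urlsplit, unquote

def page_query_anchor(url: str):
    """Split a URL into (decoded path, query, fragment)."""
    parts = urlsplit(url)
    return unquote(parts.path), parts.query, parts.fragment

print(page_query_anchor("http://example.org/foo#bar#baz"))
# ('/foo', '', 'bar#baz')
print(page_query_anchor("http://example.org/foo%23bar#baz"))
# ('/foo#bar', '', 'baz')
print(page_query_anchor("http://example.org/pagename#fragment?key=val"))
# ('/pagename', '', 'fragment?key=val')

def split_rightmost(target: str):
    """Treat only the text after the last '#' of a link target as the anchor."""
    page, sep, anchor = target.rpartition("#")
    return (page, anchor) if sep else (target, None)

print(split_rightmost("Problem #1#"))          # ('Problem #1', '')
print(split_rightmost("LongPage#someanchor"))  # ('LongPage', 'someanchor')
print(split_rightmost("Problem #1"))           # ('Problem ', '1')
```

The last call shows why the trailing "#" would be mandatory under that scheme: without it, "#1" is read as an anchor. The third URL shows that a "?" placed after the "#" ends up inside the fragment rather than in the query string.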

Other comments & questions

sylvieg: As a first step, what about working on the "like pages" that are proposed when editing a page? The current "like pages" suggestion is very poor, I think: it only checks for 'contains a word in common'. Why not keep a normalized form of the page name in the database (obtained by a replacement pattern defined in admin)...
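
A minimal Python sketch of sylvieg's suggestion, assuming the admin-defined replacement pattern is a regular expression that removes everything except letters and digits. `ADMIN_PATTERN`, `normalized` and `build_index` are hypothetical names, and a dictionary stands in for the database column.

```python
import re
from collections import defaultdict

# Hypothetical admin setting: characters matching this pattern are dropped.
ADMIN_PATTERN = re.compile(r"[^a-z0-9]+")

def normalized(name: str) -> str:
    """Return the normalized form that would be stored next to the page name."""
    return ADMIN_PATTERN.sub("", name.lower())

def build_index(page_names):
    """Group existing page names by their normalized form."""
    index = defaultdict(list)
    for name in page_names:
        index[normalized(name)].append(name)
    return index

index = build_index(["Commit Code", "commit-code", "Commit_Code", "Committers"])
print(index[normalized("COMMIT CODE")])
# ['Commit Code', 'commit-code', 'Commit_Code']
```

Looking up the normalized form of the name being edited returns the pages that differ only in case or punctuation, which is a tighter notion of "like pages" than sharing a word.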

Question?
Should hyphen (-) and underscore (_) be aliases?
Period (.) -> conflicts with ShortURLs, yet it's common to want a page name with one. What to do?
What about dollar ($) sign?

Anchors in automatically generated table of contents

For example: http://dev.tiki.org/EditUIRevamp#Preview_amp_history
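
One plausible explanation for the "_amp_" in that anchor, offered purely as an assumption, is that the heading is HTML-escaped before non-alphanumeric characters are replaced with underscores. The Python sketch below reproduces the effect; the heading text "Preview & history" is a guess, and neither function reflects how Tiki actually builds anchors.

```python
import html
import re

def anchor_escape_first(heading: str) -> str:
    """Escape HTML first, then replace each non-alphanumeric run with '_'."""
    return re.sub(r"[^A-Za-z0-9]+", "_", html.escape(heading))

def anchor_plain(heading: str) -> str:
    """Replace each non-alphanumeric run with '_' on the raw heading text."""
    return re.sub(r"[^A-Za-z0-9]+", "_", heading)

print(anchor_escape_first("Preview & history"))  # Preview_amp_history
print(anchor_plain("Preview & history"))         # Preview_history
```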

Usernames

Also, usernames should follow similar, if not the same guidelines.
Sylvie: The username filter was hardcoded in 2.x; it is now a parameter in 3.0.

admin->username pattern - the default is:

/^[ '\-_a-zA-Z0-9@\.]*$/

Gmail prevents certain characters to avoid confusion between two users. Perhaps we should do the same?

In Facebook and Gmail, joesmith is the same user as joe.smith
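
As an illustration, the Python sketch below checks names against the default pattern quoted above (without the PHP delimiters) and adds a hypothetical comparison key that ignores periods, which is the Gmail-style behaviour just described. `is_valid_username` and `username_key` are assumed names, not Tiki functions.

```python
import re

# The default username pattern; fullmatch() stands in for the ^ and $ anchors.
USERNAME_PATTERN = re.compile(r"[ '\-_a-zA-Z0-9@\.]*")

def is_valid_username(name: str) -> bool:
    """Return True if every character of the name is allowed by the pattern."""
    return USERNAME_PATTERN.fullmatch(name) is not None

def username_key(name: str) -> str:
    """Hypothetical key for detecting confusable names: periods are ignored."""
    return name.replace(".", "")

print(is_valid_username("joe.smith"))  # True
print(is_valid_username("joe+smith"))  # False
print(is_valid_username("Déjà"))       # False
print(username_key("joe.smith") == username_key("joesmith"))  # True
```

Because the pattern ends in "*", it also accepts an empty string, so a minimum length would have to be enforced elsewhere.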

Search engine

Maybe this should be used for search engine results as well. Searching "Déjà" should find "Deja" and vice-versa.
When I search "event", I would like to find this page: http://profiles.tiki.org/Event_Management_System
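
Both wishes can be met by folding the query and the page name the same way before comparing words. The following Python sketch is one way to do that, assuming that underscores count as word separators; `fold`, `tokens` and `matches` are hypothetical helpers and not part of any Tiki search backend.

```python
import re
import unicodedata

def fold(text: str) -> str:
    """Lower-case the text and remove diacritics."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()

def tokens(text: str) -> set:
    """Split folded text into words; '_' is treated as a separator."""
    return {t for t in re.split(r"[\W_]+", fold(text)) if t}

def matches(query: str, page_name: str) -> bool:
    """Return True if every word of the query occurs in the page name."""
    return tokens(query) <= tokens(page_name)

print(matches("event", "Event_Management_System"))  # True
print(matches("Déjà", "Deja vu"))                   # True
print(matches("Deja", "Déjà vu"))                   # True
```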

Wiki Link Format

Controls recognition of Wiki links using the two-parenthesis Wiki link syntax ((page name)).

The three choices in tiki-admin.php?page=wiki are complete, latin and English.

Related info

http://en.wikipedia.org/wiki/Punctuation
http://wiki-translation.com/ (This is increasingly important as we improve Tiki i18n features)
http://ca.php.net/strtr (scroll down) pieces of PHP code that rewrite strings into a URL-suitable form.
