This is the coordination page for: Character substitutions in Tiki
The code started here:
In Tiki, page names are case insensitive. Commit Code, Commit Code and Commit Code are all equivalent.
sylvieg: Are you sure? - in mysql as case insensitive - but in postgres, I do not think so. I am enable to find in the code a place where the strlower is done
Jyhem: I believe Tiki does not enforce this. Your database does. If I create a mySQL database with "utf8_unicode_ci", it is like Marc describes. When I create the database with "utf8_bin", then it is not true: "test" and "TEST" are different pages.
This makes things simpler for end users and has never, AFAIK, been reported as an issue on the wishlist. There is also a substitution for Microsoft Word Special characters.
sylvieg: so far I found in the case there is an utf8 normilization, nothing else
alain_desilets: I think character substitutions makes sense for latin languages. But what do we do with Chinese for example?
What about accents? spaces? underscores? plus? slash? etc.
Do we really want a page "Déjà vu" and a different page "Deja vu"? Probably not. Thus, we should identify characters which would serve as aliases. Let's determine this together.
Since wiki page names should avoid special characters, we should consider using character substitutions in page names (a instead of à, etc.) and use the description field for the exact format. Using the description is also useful for very long wiki page names.
- Easy for the end user
- Clean URLs (avoid %20, and similar)
- Handle cases where there is a desire for similar but different page names, where current behavior is a feature, not a usability bug.
- How does Wikipedia do it?
- Should it be aliases, redirect or substitutions?
- Aliases: all work
- Redirect: all work, but you are redirected to the cleaner URL, for nicer copy-pasting
- Substitutions: when you create a page with a special character, Tiki swaps for another character.
- What is universally accepted in URLs? (without conversion to %20 or similar) (This is standardized in RFC 3986)
- What about languages like Arabic and Mandarin?
|a-z and A-Z||No special handling|
|parenthesis ( ( ) or ( ) )||No special handling|
|All accented characters (é à î...)||Non-accented form (aei...)||I see a problem ignoring accents there ! For example in Czech language: "Most" and "mošt" are two very different things (a bridge and a cider) Just my 2 Czech crowns... — luci|
|Space ( )||underscore (_)|
|Plus (+)||underscore (_)|
|Apostrophe (')||underscore (_)|
|Colon (:)||underscore (_)|
|SemiColon (;)||underscore (_)|
|At (@)||should find an email address if we just search for prefix or suffix|
|number sign||( # )|
Discussion on #wiki about # in pagenames[+]
Other comments & questions
sylvieg: and what about in a first step, work on the like pages that are proposed when editing a page. The like pages proposition is very poor , I think for the moment - it is only 'contains a word in common' - why not keeping a normalized form of the pagename in the database (obtained by a replacement pattern defined in admin)...
Should hypen (-) and underscore (_) be aliases?
comma (.) -> conflicts with ShortURLs, yet, it's common to want a page name with one. What to do?
What about dollar ($) sign?
Anchors in automatically generated table of contents
For example: http://dev.tiki.org/EditUIRevamp#Preview_amp_history
Also, usernames should follow similar, if not the same guidelines.
Sylvie: username has a filter hardcoded in 2.x - now a param in 3.0
Gmail prevents certain characters to avoid confusion between two users. Perhaps we should do the same?
In Facebook and Gmail, joesmith is the same user as joe.smith
Maybe this should be used for search engine results as well. Searching "Déjà" should find "Deja" and vice-versa.
When I search "event", I would like to find this page: http://profiles.tiki.org/Event_Management_System
Tags can already get messy with synonyms, plurals, typos, etc.
Some people use commas (,) in tags entry box instead of spaces and they are recorded.
Sylvie: tag have already a normalisation + lower reduction in settings
Wiki Link Format
Controls recognition of Wiki links using the two parenthesis Wiki link syntax page name.
3 choices in tiki-admin.php?page=wiki are complete, latin and English
Please see related issue reported by cellvia2
http://wiki-translation.com/ (This is increasingly important as we improve Tiki i18n features)
http://ca.php.net/strtr (scroll down) pieces of PHP code that rewrite strings into a URL-suitable form.
Related wish list items
- Wiki page names and links, plus (+) simulates a space( ), but a space( ) doesn't simulate a plus (+)
- Quick Edit module should detect and warn about special characters in page names
- period (.) in page names conflicts with Short URLs rewrite rules
- Equivalent characters for page linking, backlinking, searching, etc (ex.: space, underscore, period)