Search Dilemma

Note: Unified Index is the feature introduced in Tiki7 (formerly known as Unified Search).

Background

There are a few issues with Unified Index with Zend_Lucene such as:


The usual solution/workaround is just to revert to the classic MySQL Full Text Search implementation at tiki-searchresults.php

However, there are a few cases where this doesn't resolve the issue, such as http://tiki.org/forumthread47305 and a regression reported from 6.x to 10.x/11.x: it is no longer possible to associate two articles in different languages. To reproduce:

  1. Create an article in EN
  2. Create an article in FR
  3. Go back to the EN article in view mode and try to associate it with a FR article

A list of untranslated articles in FR should appear, but the list is empty. This works only when Unified Index is activated.

These are just symptoms of a larger, long-term dilemma in Tiki, one that could get worse if we don't have a plan.

There have almost always been two search engines in Tiki: MySQL Full Text Search (tiki-searchresults.php) and Tiki Search (tiki-searchindex.php). In Tiki7, Tiki Search was upgraded to Unified Index (think of it as an abstraction layer which can handle several storage types). At first, it used Zend_Search_Lucene; then ElasticSearch was added and, more recently, MySQL Full Text Search (but through Unified Index). See more info at Unified Index Comparison.
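
To make the "abstraction layer" idea concrete, here is a minimal Python sketch. It is not Tiki's actual code (Tiki is written in PHP), and the names `SearchBackend`, `InMemoryBackend` and `UnifiedIndex` are hypothetical, chosen for illustration only. The point is that callers only talk to the layer, so the storage engine behind it can be swapped.

```python
from abc import ABC, abstractmethod


class SearchBackend(ABC):
    """Hypothetical storage interface; these are not Tiki's real class names."""

    @abstractmethod
    def add(self, doc_id, fields):
        """Store one document, given as a dict of field name -> text."""

    @abstractmethod
    def search(self, term):
        """Return the ids of the documents matching the term."""


class InMemoryBackend(SearchBackend):
    """Toy stand-in for an engine such as Lucene, ElasticSearch or MySQL FTS."""

    def __init__(self):
        self._docs = {}

    def add(self, doc_id, fields):
        self._docs[doc_id] = fields

    def search(self, term):
        needle = term.lower()
        # Naive substring match over every field, for illustration only.
        return sorted(
            doc_id
            for doc_id, fields in self._docs.items()
            if any(needle in value.lower() for value in fields.values())
        )


class UnifiedIndex:
    """Callers use this layer only; the engine behind it can be replaced."""

    def __init__(self, backend):
        self.backend = backend

    def index(self, doc_id, fields):
        self.backend.add(doc_id, fields)

    def find(self, term):
        return self.backend.search(term)


index = UnifiedIndex(InMemoryBackend())
index.index("wiki:HomePage", {"title": "HomePage", "contents": "Welcome to Tiki"})
index.index("forum:12", {"title": "Search question", "contents": "Lucene permissions"})
print(index.find("lucene"))
```

Adding another engine would then mean writing one more `SearchBackend` subclass, without touching the calling code.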

Duplication in Tiki is normally avoided so that we all converge our efforts. However, each approach (tiki-searchresults.php and tiki-searchindex.php) offers benefits that the other can't, and thus both were kept.

Unified Index has several benefits

  • Speed
  • Pagination works well (which is not possible with legacy search)
  • Weighting of fields
  • More relevant results
  • Designed for searching throughout various Tiki features
    • SQL code to do the same thing is much more complex
  • We have 3 search back-ends now, and could easily add more (ex.: MongoDB or Sphinx) with minimal overhead.
  • Projects like Tutela, CartoGraf, etc. can't exist without Unified Index. And projects like these are funding a big part of the R & D in Tiki.


And it's not just plain search. It's also all the listings, with permission checking, perspective filtering, etc.

Types of users

Small sites on shared hosting

Objectives:

  • Works on shared hosting
  • No cron job needed
  • No issues with file/folder permissions
  • Features that used to work, continue to work

Big projects

Objectives:

  • Works with InnoDB
  • Better search results
  • Faster search results

Options

Keep both tiki-searchresults.php and tiki-searchindex.php

Benefits

  • tiki-searchresults.php is the simplest for small sites (doesn't require indexing, etc.)
  • tiki-searchresults.php has been used for a long time and its limitations are known but livable/accepted

Disadvantages

If we pick this

  • We need a way to make Unified Index truly optional
  • We should set the defaults to be as robust as possible for entry-level shared hosting

Migrate all to Unified Index, but have MySQL FTS as the default


Unified Index is not optional. It's always on. Zend_Search_Lucene is better but there can be some issues with permissions and performance so people would need to configure it.

Benefits

  • Unifies the code base
  • Can work with InnoDB because only the index table has to be in MyISAM
    • Thus, we could make InnoDB the new default in Tiki
  • OK (not great) for shared hosting


Disadvantages

  • When sites reach a certain size, they need to change to Zend_Lucene or ElasticSearch (which is also somewhat true of the option above)

If we pick this

  • We should set the defaults to be as robust as possible for entry-level shared hosting
  • We should make re-indexing automatic, à la poor man's cron job
  • tiki-searchresults.php (and search arguments) should redirect to tiki-searchindex.php (or vice versa)
  • Try to find a way where one failure in the indexing process doesn't stop everything. It should log and skip the offending entry.
  • Admin panel should report stats of last reindex attempt, success or failure, RAM used, time elapsed, if failure: what is last line of error log
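
The "log and skip" and "report stats" items above could look roughly like the following Python sketch. It is an illustration only, not Tiki code: `rebuild_index`, `index_one` and `fake_index_one` are hypothetical names, and `tracemalloc` only measures Python-level allocations, so it stands in for the "RAM used" figure an admin panel would show.

```python
import logging
import time
import tracemalloc

logger = logging.getLogger("reindex")


def rebuild_index(entries, index_one):
    """Index every entry; log and skip failures instead of aborting the run."""
    stats = {"indexed": 0, "skipped": 0, "last_error": None}
    tracemalloc.start()
    started = time.monotonic()
    for entry_id, payload in entries:
        try:
            index_one(entry_id, payload)
            stats["indexed"] += 1
        except Exception as exc:
            # One bad entry must not stop the whole rebuild.
            stats["skipped"] += 1
            stats["last_error"] = f"{entry_id}: {exc}"
            logger.warning("Skipping %s: %s", entry_id, exc)
    stats["elapsed_seconds"] = time.monotonic() - started
    _, stats["peak_bytes"] = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return stats


def fake_index_one(entry_id, payload):
    """Hypothetical indexer that fails on entries without content."""
    if payload is None:
        raise ValueError("empty payload")


report = rebuild_index(
    [("wiki:1", "text"), ("tracker:7", None), ("forum:3", "text")],
    fake_index_one,
)
print(report["indexed"], report["skipped"], report["last_error"])
```

The returned dictionary holds what the admin panel item asks for: counts of successes and failures, time elapsed, peak memory, and the last error encountered.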

It's important to understand why the cron job is required. It is not inevitable.

  • Quite a few places in Tiki do not trigger an index update after changing the data. This dates back to the prior Tiki index implementation, which had hooks in the code. The rebel code simply needs to be tracked down.
    • This is an added bonus to improve Zend_Lucene and ElasticSearch. Renaming pages is one I have noticed.
  • Permissions are indexed with the data to allow fast filtering. When the view privileges are modified, a rebuild is required, as a full incremental update would take more time anyway.
    • Could we detect this and indicate in tiki-admin.php that a re-index is due?
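
One possible way to detect that a re-index is due, sketched in Python under stated assumptions: keep a fingerprint of the view permissions taken at the last rebuild and compare it with the current one. The function names and the shape of the permission data (group name mapped to a list of permission names) are hypothetical, not Tiki's actual schema.

```python
import hashlib
import json


def permission_fingerprint(permissions):
    """Stable hash of the permission data, independent of dict key order."""
    canonical = json.dumps(permissions, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def reindex_is_due(current_permissions, fingerprint_at_last_rebuild):
    """True when permissions changed since the index was last rebuilt."""
    return permission_fingerprint(current_permissions) != fingerprint_at_last_rebuild


# Assumed data shape: group name -> list of permission names.
perms_at_rebuild = {"Anonymous": ["tiki_p_view"], "Registered": ["tiki_p_view", "tiki_p_edit"]}
stored_fingerprint = permission_fingerprint(perms_at_rebuild)

# Later, an admin removes the view permission from Anonymous.
perms_now = {"Anonymous": [], "Registered": ["tiki_p_view", "tiki_p_edit"]}

print(reindex_is_due(perms_at_rebuild, stored_fingerprint))
print(reindex_is_due(perms_now, stored_fingerprint))
```

The admin panel would only need to store one short string per rebuild and could show a "re-index due" notice whenever the comparison fails.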

Questions

  • MySQL FTS doesn't permit field boosting. Would there be a way to put the data twice in the index? (especially wiki page names)
    • Not without compromising search performance as we would need to re-extract the data from the original source.
    • The root cause of the issue is that we can match on multiple fields, but the ranking only applies to one. There is no way around this limitation.
  • Is it possible to convert an index from MySQL to Zend_Lucene? (if so, it could help with time-outs)
    • It would not give a significant boost, so it is not worth the hassle. The time spent fetching the data is the same for all engines. What varies is the time it takes to index it and whether this is done as a background process.
  • Could there be a way to indicate to the admin that the site is becoming too big, and that they should change from MySQL FTS -> Zend_Lucene -> ElasticSearch?
    • There is no heuristic for this as all servers behave differently.
  • What is the approx. maximum number of tracker fields for Unified Index with MySQL FTS?
    • There is a limit of 64K characters on VARCHAR columns per row. All fields are created as TEXT. When a direct hit (=) or initial (LIKE 'Foobar%') match is required, the fields are converted to VARCHAR(300) and indexed. How many fields you can have is more a question of how complex the queries are within your site.
    • There is also a hard limit at 4K fields in a table. Given some field types may create multiple entries in the index, there would be a limit around 2K tracker fields, considering the other features in Tiki.
      • 2k tracker fields is quite high. It seems OK to me to have a requirement to use Zend_Lucene at this level
    • There is yet another limit at 64 indexes per table. However, that one is not as critical, as we can simply skip the indexing. With the dynamic indexing, we can assume that the most critical indexes will be hit first and that the last ones in are less important. Queries might simply be slower; however, it's likely that MySQL will simply use another index in the query for an initial filtering.
  • How come number of results is different? (plugin index exclusion?)
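
The "simply skip the indexing" behaviour described for the 64-index limit can be sketched as follows. This is a hypothetical Python illustration, not the actual implementation: `DynamicIndexPlanner` and the field names are invented, and the demo uses a limit of 2 to keep it short.

```python
MAX_INDEXES_PER_TABLE = 64  # the per-table limit quoted in the answer above


class DynamicIndexPlanner:
    """Grant indexes on first use until the per-table limit is reached."""

    def __init__(self, limit=MAX_INDEXES_PER_TABLE):
        self.limit = limit
        self.indexed_fields = []  # kept in order of first request

    def request_index(self, field):
        """Return True if the field is (or becomes) indexed, False if skipped."""
        if field in self.indexed_fields:
            return True
        if len(self.indexed_fields) >= self.limit:
            # Limit reached: skip the index; the query still runs, only slower.
            return False
        self.indexed_fields.append(field)
        return True


planner = DynamicIndexPlanner(limit=2)
requests = ["tracker_field_a", "tracker_field_b", "tracker_field_a", "tracker_field_c"]
results = [planner.request_index(field) for field in requests]
print(results)
```

Because indexes are granted in the order fields are first queried, the fields queried earliest get indexes and later ones fall back to unindexed queries, which matches the assumption stated above.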

Should InnoDB become the new default (instead of MyISAM)?

  • Can InnoDB be used for all the data but we keep MyISAM just for the index?


See also: Unified Index with MySQL Full Text Search

Proposal for Tiki 13

Remove tiki-searchresults.php in Tiki13 and focus all efforts on tiki-searchindex.php (with three different engines, and MySQL Full Text Search as the default, when available).
Accept: 3, Undecided: 1, Reject: 1
  • amette
  • pascalstjean
  • jonnybradley
  • marclaporte
  • Jyhem

I feel that users should adopt better features when they are confident that the feature is better. Removing features for reasons no user cares about (such as nicer code), in the hope that this will magically make the new feature better, has been tried before (tracker revamp, default search as FTS, comments revamp). It does not focus energies: it splits them between bleeding-edge devs on an under-tested development release and real-world customers who stick to old LTS versions, for which they pay other consultants for improvements and fixes.
A brand new feature cannot be immediately better; it first needs a chance to be extensively tested against real-life scenarios (not benchmarks or proofs of concept). And that's unlikely to happen if people are not confident they can revert to the working solution at the first sign of trouble.

Jean-Marc Libs: Can you take 20 minutes to test Unified Index with MySQL against your biggest/most complex sites and report issues and missing features? -> Unified Index with MySQL Full Text Search. Thanks!
