Note: Unified Index is the feature introduced in Tiki7 (formerly known as Unified Search).
Background
There are a few issues with Unified Index with Zend_Lucene such as:
- Fail to save edited wiki pages or articles
- Zend_Search_Lucene doesn't work on certain servers
- Search index error
- Make Unified Index optional
The usual solution/workaround is just to revert to the classic MySQL Full Text Search implementation at tiki-searchresults.php
However, there are a few cases where this doesn't resolve the issue, such as http://tiki.org/forumthread47305 and a regression reported from 6.x to 10.x/11.x: it is no longer possible to associate two articles in different languages. To reproduce: (1) create an article in EN; (2) create an article in FR; (3) go back to the EN article in view mode and try to associate it with an FR article. A list of untranslated FR articles should appear, but it is empty. This works only when Unified Index is activated.
These are just symptoms of a larger, long-term dilemma in Tiki, one that could get worse if we don't have a plan.
There have almost always been two search engines in Tiki: MySQL Full Text Search (tiki-searchresults.php) and Tiki Search (tiki-searchindex.php). In Tiki7, Tiki Search was upgraded to Unified Index (think of it as an abstraction layer which can handle several storage types). At first it used Zend_Search_Lucene, then ElasticSearch was added and, more recently, MySQL Full Text Search (but through Unified Index). See more info at Unified Index Comparison.
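To make the "abstraction layer" idea concrete, here is a minimal sketch in Python (not Tiki's PHP code). All names (SearchBackend, InMemoryBackend, UnifiedIndex) are hypothetical, and the toy backend only stands in for a real engine such as Zend_Search_Lucene, ElasticSearch or MySQL Full Text Search.

```python
from typing import Protocol


class SearchBackend(Protocol):
    """What any storage engine must offer (hypothetical interface)."""

    def add(self, doc_id: str, text: str) -> None: ...
    def search(self, term: str) -> list: ...


class InMemoryBackend:
    """Toy backend standing in for a real engine."""

    def __init__(self) -> None:
        self._docs: dict = {}

    def add(self, doc_id: str, text: str) -> None:
        self._docs[doc_id] = text.lower()

    def search(self, term: str) -> list:
        needle = term.lower()
        # Naive substring match; a real engine would tokenize and rank.
        return sorted(d for d, t in self._docs.items() if needle in t)


class UnifiedIndex:
    """Facade: callers never see which backend stores the index."""

    def __init__(self, backend: SearchBackend) -> None:
        self._backend = backend

    def add(self, object_type: str, object_id: str, text: str) -> None:
        self._backend.add(f"{object_type}:{object_id}", text)

    def search(self, term: str) -> list:
        return self._backend.search(term)
```

Swapping the engine then means passing a different backend object to UnifiedIndex; the calling code stays the same.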
Duplication in Tiki is normally avoided so that we all converge our efforts. However, each approach (tiki-searchresults.php and tiki-searchindex.php) offered benefits that the other couldn't, and thus both were kept.
Unified Index has several benefits
- Speed
- Pagination works well (which is not possible with legacy search)
- Weighting of fields
- More relevant results
- Designed for searching throughout various Tiki features
- SQL code to do the same thing is much more complex
- We have 3 search back-ends now, and could easily add more (ex.: MongoDB or Sphinx) with minimal overhead.
- Projects like Tutela, CartoGraf, etc. can't exist without Unified Index. And projects like these fund a big part of the R & D in Tiki.
And it's not just plain search. It's also all the listings, with permission checking, perspective filtering, etc.
Types of users
Small sites on shared hosting
Objectives:
- Works on shared hosting
- No cron job needed
- No issues with file/folder permissions
- Features that used to work, continue to work
Big projects
Objectives:
- Works with InnoDB
- Better search results
- Faster search results
Options
Keep both tiki-searchresults.php and tiki-searchindex.php
Benefits
- tiki-searchresults.php is the simplest for small sites (doesn't require indexing, etc.)
- tiki-searchresults.php has been used for a long time and its limitations are known but livable/accepted
Disadvantages
- Some features won't work
- And the number of features that don't work will progressively increase
- Every time features are revamped (like the big work in Tiki13 for notifications), they will depend on Unified Index. Ex.:
- When sites reach a certain size or need advanced features, they need to change to Unified Index
- Code duplication, which is an added maintenance burden and security risk
- tiki-searchresults.php and tiki-searchindex.php don't check permissions the same way
- Adding things like "Additional search options if results are not good" is duplicate work
- Additional complexity of extra options, such as FT vs Tiki search problem
If we pick this
- We need a way to make Unified Index truly optional
- We should set the defaults to be as robust as possible for entry-level shared hosting
Migrate all to Unified Index, but have MySQL FTS as the default
Unified Index is not optional. It's always on. Zend_Search_Lucene is better, but there can be some issues with permissions and performance, so people would need to configure it.
Benefits
- Unifies the code base
- Can work with InnoDB because only the index table has to be in MyISAM
- Thus, we could make InnoDB the new default in Tiki
- OK (not great) for shared hosting
Disadvantages
- When sites reach a certain size, they need to change to Zend_Lucene or ElasticSearch (which is also somewhat true of the option above)
If we pick this
- We should set the defaults to be as robust as possible for entry-level shared hosting
- We should make re-indexing automatic, à la poor man's cron job
- tiki-searchresults.php (and search arguments) should redirect to tiki-searchindex.php (or vice versa)
- Try to find a way where one failure in the indexing process doesn't stop everything. It should log and skip the offending entry.
- The admin panel should report stats for the last reindex attempt: success or failure, RAM used, time elapsed and, on failure, the last line of the error log
- Quite a few places in Tiki do not trigger an index update after changing the data. This dates back to the prior Tiki index implementation, which had hooks in the code. The rebel code simply needs to be tracked down.
- Fixing this is an added bonus, as it also improves Zend_Lucene and ElasticSearch. Renaming pages is one case I have noticed.
- Permissions are indexed with the data to allow fast filtering. When the view privileges are modified, a rebuild is required, as a full incremental update would take more time anyway.
- Could we detect this and indicate in tiki-admin.php that a re-index is due?
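The "log and skip the offending entry" and "report stats of the last reindex attempt" items above can be sketched as follows. This is an illustration in Python, not Tiki's PHP indexer: rebuild_index, ReindexReport and the index_one callback are hypothetical names, and RAM usage is left out for brevity.

```python
import logging
import time
from dataclasses import dataclass, field
from typing import Callable, Iterable, Optional

logger = logging.getLogger("reindex")


@dataclass
class ReindexReport:
    """Summary of one rebuild attempt, suitable for an admin panel."""

    indexed: int = 0
    skipped: int = 0
    elapsed_seconds: float = 0.0
    last_error: Optional[str] = None
    skipped_ids: list = field(default_factory=list)

    @property
    def success(self) -> bool:
        return self.skipped == 0


def rebuild_index(
    entries: Iterable[tuple], index_one: Callable[[object], None]
) -> ReindexReport:
    """Index every (entry_id, payload) pair; log and skip any that fail."""
    report = ReindexReport()
    started = time.monotonic()
    for entry_id, payload in entries:
        try:
            index_one(payload)
            report.indexed += 1
        except Exception as exc:  # one bad entry must not stop the rebuild
            logger.warning("skipping %s: %s", entry_id, exc)
            report.skipped += 1
            report.skipped_ids.append(entry_id)
            report.last_error = f"{entry_id}: {exc}"
    report.elapsed_seconds = time.monotonic() - started
    return report
```

The returned report carries what the admin panel would display: the counts, the elapsed time and the last error message.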
Questions
- MySQL FTS doesn't permit field boosting. Would there be a way to put the data twice in the index? (especially wiki page names)
- Not without compromising search performance as we would need to re-extract the data from the original source.
- The root cause of the issue is that we can match on multiple fields, but the ranking only applies to one. There is no way around this limitation.
- Is it possible to convert an index from MySQL to Zend_Lucene? (if so, could help with time-outs)
- It would not give a significant boost and is not worth the hassle. The time spent fetching the data is the same for all engines. What varies is the time it takes to index it and whether this is done as a background process.
- Could there be a way to indicate to the admin that the site is becoming too big, and that they should change from MySQL FTS -> Zend_Lucene -> ElasticSearch?
- There is no heuristic for this as all servers behave differently.
- What is the approx. maximum number of tracker fields for Unified Index with MySQL FTS?
- There is a limit of 64K characters of VARCHAR per row. All fields are created as TEXT. When a direct hit (=) or initial (LIKE 'Foobar%') match is required, the fields are converted to VARCHAR(300) and indexed. How many fields you can have is more a question of how complex the queries are within your site.
- There is also a hard limit of 4K fields in a table. Given that some field types may create multiple entries in the index, there would be a limit of around 2K tracker fields, considering the other features in Tiki.
- 2K tracker fields is quite high. It seems OK to me to have a requirement to use Zend_Lucene at this level.
- There is yet another limit of 64 indexes per table. However, that one is not as critical, as we can simply skip the indexing. With dynamic indexing, we can assume that the most critical indexes will be hit first and that the last ones in are less important. Queries might simply be slower; however, it's likely that MySQL will use another index in the query for an initial filtering.
- How come number of results is different? (plugin index exclusion?)
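The dynamic indexing behaviour described in the answers above (index a field the first time an exact or prefix match needs it, and skip once the 64-index cap is reached) can be sketched like this. It is a Python illustration under stated assumptions, not the actual Tiki code: DynamicIndexPlanner and request_index are hypothetical names, and no SQL is issued.

```python
from dataclasses import dataclass, field

MAX_INDEXES_PER_TABLE = 64  # limit quoted in the discussion above


@dataclass
class DynamicIndexPlanner:
    """Tracks which fields got an index, in the order queries asked for them."""

    max_indexes: int = MAX_INDEXES_PER_TABLE
    indexed_fields: list = field(default_factory=list)

    def request_index(self, field_name: str) -> bool:
        """Return True if the field is (now) indexed, False if the cap was hit."""
        if field_name in self.indexed_fields:
            return True
        if len(self.indexed_fields) >= self.max_indexes:
            # Skip: the query still runs, just without a dedicated index.
            return False
        self.indexed_fields.append(field_name)
        return True
```

Because indexes are granted first come, first served, the fields that queries hit earliest get indexed, which matches the assumption that the last ones in are less important.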
Should InnoDB become the new default (instead of MyISAM)?
- Can InnoDB be used for all the data but we keep MyISAM just for the index?
Examples of Unified Index with MySQL Full Text Search
See also: Unified Index with MySQL Full Text Search
Proposal for Tiki 13
Accept | Undecided | Reject |
---|---|---|
3 | 1 | 1 |
I feel that users should adopt better features when they are confident that the feature is better. Removing features for reasons no user cares about, such as "the code is nicer", in the hope that it will magically make the new feature better has been tried before (tracker revamp, default search as FTS, comments revamp). It does not focus energies: it splits them between bleeding-edge devs on an under-tested development release and real-world customers who stick to old LTS versions, on which they pay other consultants for improvements and fixes.
A brand new feature cannot be immediately better before it has a chance of being extensively tested against real-life scenarios (not benchmarks or proof of concepts). And that's unlikely to happen if people are not confident they can revert to the working solution at the first sign of trouble.
@Jyhem: Can you take 20 minutes to test Unified Index with MySQL against your biggest/most complex sites and report issues and missing features? -> Unified Index with MySQL Full Text Search. Thanks!