The query engine module is central to the entire system. Its job is to implement the semantics of the metamodel and expose them to the rest of the system. It is used by almost every module, in many contexts.
Ultimately, it almost singlehandedly implements search, collection handling, as well as hierarchical navigation (everything but display).
The data actually indexed for full-text search is the metamodel. There is no full-text search on any other data. The metamodel does, however, make provision for concatenating elements of the native format that are not part of the metamodel, so that keyword searches can match them. This also supports queries over protocols like SQI, which may, for example, know the LOM field names and target the text content of those fields.
List of critical functionality
- Provide an OpenSearch description document, including
- Allowable values for the Parameters using the Parameters extension, and an effective way to retrieve vocabulary values.
- Provides a publicly accessible, standard interface to the query engine.
- Large library support
- Allows easy integration from remote sites.
- A specific use case is Profweb, whose resource list is generated directly from API calls to Eurêka.
- Avoids reinventing the wheel
- Any OpenSearch-compatible software (such as Firefox) is able to consume an OpenSearch description document and allow at least simple keyword searching.
- Here is an OpenSearch validator:
- Allow all the parameters necessary to define a collection.
- See the Eurêka example of supported parameters: http://eureka.ntic.org/search.php?action=help
- Provide search snippets (a short summary of the page with the relevant search terms highlighted)
- Provide scoring of search results
- Support AND, OR and NOT exact field matches.
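As an illustration of the OpenSearch description document mentioned above, here is a minimal sketch that generates one with Python's standard library. It only covers the core OpenSearch 1.1 elements (ShortName, Description, Url); the Parameters extension is not shown. The function name `build_description`, the `example.org` URL and the result MIME type are assumptions made for this example, not part of the actual system.

```python
import xml.etree.ElementTree as ET

# Namespace of the OpenSearch 1.1 description format.
OS_NS = "http://a9.com/-/spec/opensearch/1.1/"


def build_description(short_name, description, template,
                      result_type="application/atom+xml"):
    """Return a minimal OpenSearch description document as a string.

    `template` is an OpenSearch URL template such as
    "http://example.org/search?q={searchTerms}" (hypothetical endpoint).
    """
    # Serialize the OpenSearch namespace as the default namespace.
    ET.register_namespace("", OS_NS)
    root = ET.Element(f"{{{OS_NS}}}OpenSearchDescription")
    ET.SubElement(root, f"{{{OS_NS}}}ShortName").text = short_name
    ET.SubElement(root, f"{{{OS_NS}}}Description").text = description
    ET.SubElement(
        root,
        f"{{{OS_NS}}}Url",
        {"type": result_type, "template": template},
    )
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_description(
        "Demo",
        "Keyword search over the metadata repository",
        "http://example.org/search?q={searchTerms}&lang={language?}",
    ))
```

A client such as a browser search box would substitute `{searchTerms}` with the user's keywords; parameters ending in `?` are optional in the OpenSearch template syntax.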
List of desirable functionality
- Faceted search
- Word completion
- Mainly for keyword search, but searches on some text fields in structured data, such as author or organization, could also benefit.
- Word completion data would be generated from the metadata in the system, not from historical user queries.
- For Vocabularies, a similar functionality will be implemented in the form of a "phrase wheel". See Architecture/Modules/Vocabulary
- Allows suggesting different search terms to the user. The best-known example is Google suggesting a correction for a spelling error in the search terms.
- Search for similar resources (textually, not using metadata), this would help implement user story #4
- This asks the search engine to find similar documents based on the document's overall textual content. An example in Google is clicking on the small magnifying glass to the right of a single search result, and then clicking on "Similar" at the bottom of the popup.
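The word completion item above states that completion data would come from the metadata in the system rather than from past user queries. A minimal sketch of that idea, assuming the metadata values are available as plain strings, is a sorted word list with prefix lookup. The function names `build_completion_index` and `complete` are hypothetical, and a real implementation would more likely rely on the search backend's own term dictionary.

```python
import bisect
import re


def build_completion_index(metadata_strings):
    """Collect the distinct lowercase words found in metadata values.

    The result is a sorted list, so that all words sharing a prefix
    are contiguous and can be located with a binary search.
    """
    words = set()
    for text in metadata_strings:
        words.update(re.findall(r"\w+", text.lower()))
    return sorted(words)


def complete(index, prefix, limit=5):
    """Return up to `limit` indexed words starting with `prefix`."""
    prefix = prefix.lower()
    start = bisect.bisect_left(index, prefix)
    results = []
    for word in index[start:]:
        if len(results) >= limit or not word.startswith(prefix):
            break
        results.append(word)
    return results


if __name__ == "__main__":
    index = build_completion_index([
        "Introduction to Physics",
        "Physical chemistry",
        "Philosophy",
    ])
    print(complete(index, "phy"))
```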
ORI-OAI and Eurêka both want improvements over their current search systems, but have both hit roadblocks:
- Eurêka has excellent support for vocabulary inference and collection handling. It exploits the labels and descriptions of vocabularies in keyword searches. Unfortunately, it has hit the limit of relational database technology for vocabulary inference, and has poor support for modern search UI paradigms (faceted search, word completion, suggestions), as well as less than stellar ranking.
- ORI-OAI is quite the opposite. It makes extensive use of Solr to provide good and steadily improving support for modern search UI features, but it has hit a modeling limit imposed by Lucene's document-oriented data model. It has no direct support for vocabulary inference, and no way to support vocabularies whose identifiers are unsuitable for keyword searching.
The chosen approach to implement the search is to use customized full-text models in Mulgara.
Mulgara already supports full-text search using full-text models. The problem, though, is that the current version uses only one index, which is English-aware. This provides poor support for multilingual searches.
Because of that, Mulgara should be customized to use more than one index: one for each language (or category of languages), and possibly a default index that would be language-independent (i.e., no lemmatization). Metadata would be indexed whenever a metadata record is created or modified. Strings should be indexed using the indexer that matches their associated language (when provided). Strings would also be indexed in the default index.
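The indexing scheme described above can be sketched with a toy in-memory model. This is not Mulgara or Lucene code: the class `MultilingualIndex`, the language codes and the suffix-stripping "analyzers" are all hypothetical stand-ins, chosen only to show each string going into its language-specific index and into the language-independent default index.

```python
import re
from collections import defaultdict

DEFAULT = "default"


def _strip_suffixes(suffixes):
    """Build a deliberately naive analyzer that strips one known suffix."""
    def analyze(token):
        for suffix in suffixes:
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                return token[:-len(suffix)]
        return token
    return analyze


# Hypothetical analyzers; real ones would perform proper lemmatization.
ANALYZERS = {
    "en": _strip_suffixes(("ing", "s")),
    "fr": _strip_suffixes(("ment", "s")),
    DEFAULT: lambda token: token,  # language-independent: no lemmatization
}


class MultilingualIndex:
    def __init__(self):
        # One inverted index per language: term -> set of record ids.
        self.indexes = {lang: defaultdict(set) for lang in ANALYZERS}

    def add(self, record_id, text, lang=None):
        """Index `text` in the default index and, if known, in its language index."""
        targets = [DEFAULT]
        if lang in ANALYZERS and lang != DEFAULT:
            targets.append(lang)
        for token in re.findall(r"\w+", text.lower()):
            for target in targets:
                self.indexes[target][ANALYZERS[target](token)].add(record_id)

    def search(self, term, lang=DEFAULT):
        """Look up a single term in the index for `lang` (default index if unknown)."""
        if lang not in ANALYZERS:
            lang = DEFAULT
        key = ANALYZERS[lang](term.lower())
        return set(self.indexes[lang].get(key, set()))


if __name__ == "__main__":
    idx = MultilingualIndex()
    idx.add("r1", "Learning resources", "en")
    idx.add("r2", "Ressources pédagogiques", "fr")
    print(idx.search("resource", "en"))   # matches via the English analyzer
    print(idx.search("resource"))         # default index: exact terms only
```

Strings without a language tag only reach the default index, which is why the default index remains useful as a fallback.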
When a query is performed, the language of the Web interface is used as the default language of the query. This is needed so that the system can pick the right index to run the query against. For example, interpreting a query in English and running it against an index built with a German analyzer could yield many false positives because of erroneous lemmatization.
In addition, the query is also run against the default index in case this assumption is wrong, since there is no way to be certain that the language of the Web UI is the language in which the query was written. The results of both queries are then merged in a way, still to be defined, that takes scoring and ranking into account.
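The merge strategy is left open above. One possible approach, shown here purely as an assumption, is to down-weight the scores from the default index and keep the best score per record. The function `merge_results` and the `default_weight` value are hypothetical, and the sketch assumes scores from both indexes are on a comparable scale, which would have to be verified for the actual engine.

```python
def merge_results(language_hits, default_hits, default_weight=0.5):
    """Merge two result sets given as {record_id: score} dictionaries.

    Scores from the default index are multiplied by `default_weight`
    (an arbitrary value for this sketch), then the highest score per
    record is kept. Returns (record_id, score) pairs, best first;
    ties are broken by record id to keep the order deterministic.
    """
    merged = dict(language_hits)
    for record_id, score in default_hits.items():
        weighted = score * default_weight
        merged[record_id] = max(merged.get(record_id, 0.0), weighted)
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    language_hits = {"a": 0.9, "b": 0.4}
    default_hits = {"b": 1.0, "c": 0.6}
    for record_id, score in merge_results(language_hits, default_hits):
        print(record_id, score)
```

Favoring the language-specific index reflects the assumption that the Web UI language is usually the right guess, while still letting strong matches from the default index surface.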
For advanced search, it should be possible for a user to specify the language of the query to obtain the best results.