Full Text Index

(3Q21)


This article provides information on configuring and using eXist-db's full text index.

Introduction

The full text index module is based on Apache Lucene.

The full-text index module is tightly integrated with eXist-db's modularized indexing architecture: the index behaves like a plug-in which adds itself to the database's index pipelines. Once configured, the index will be notified of relevant events, like adding/removing a document, removing a collection or updating single nodes. No manual re-indexing is required to keep the index up-to-date.

The full-text index module also implements common interfaces which are shared with other indexes, for instance for highlighting matches (see KWIC). It is easy to switch between the Lucene index and, for instance, the ngram index without rewriting much XQuery code.

Configuring the Index

The index has a single configuration parameter on the <modules> / <module> element called buffer. It defines the amount of memory (in megabytes) Lucene will use for buffering index entries before they are written to disk. See the Lucene Javadocs.

Like other indexes, you create a Lucene index by configuring it in a collection.xconf document, as explained in the documentation. For example:

<collection xmlns="http://exist-db.org/collection-config/1.0">
  <index xmlns:wiki="http://exist-db.org/xquery/wiki" xmlns:html="http://www.w3.org/1999/xhtml" xmlns:atom="http://www.w3.org/2005/Atom">
    <!-- Lucene index is configured below -->
    <lucene>
      <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
      <analyzer id="ws" class="org.apache.lucene.analysis.core.WhitespaceAnalyzer"/>
      <text qname="TITLE" analyzer="ws"/>
      <text qname="p">
        <inline qname="em"/>
      </text>
      <text match="//foo/*"/>
      <!-- "inline" and "ignore" can be specified globally or per-index as shown above -->
      <inline qname="b"/>
      <ignore qname="note"/>
    </lucene>
  </index>
</collection>
collection.xconf (for version 3.0 and above).

You can define a Lucene index on a single element or attribute (qname="...") or a node path with wildcards (match="...", see below).
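Installing such a configuration can itself be scripted from XQuery. The following is a minimal sketch, not a definitive procedure: the data collection /db/shakespeare and the trimmed-down configuration are hypothetical, and the query is assumed to run as a user who may write below /db/system/config (typically a dba).

```xquery
xquery version "3.1";

(: Hypothetical data collection; adjust to your own layout. :)
declare variable $data-collection := "/db/shakespeare";

(: Collection configurations live below /db/system/config, mirroring the data path. :)
declare variable $config-collection := "/db/system/config" || $data-collection;

(: A reduced version of the configuration shown above. :)
declare variable $xconf :=
    <collection xmlns="http://exist-db.org/collection-config/1.0">
        <index>
            <lucene>
                <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
                <text qname="SPEECH"/>
            </lucene>
        </index>
    </collection>;

(: Create the configuration collection, store the configuration, then reindex existing data. :)
let $created := xmldb:create-collection("/db/system/config", substring-after($data-collection, "/"))
let $stored := xmldb:store($config-collection, "collection.xconf", $xconf)
return
    xmldb:reindex($data-collection)
```

Reindexing is only needed for documents that were already stored before the configuration was saved; documents added afterwards are indexed automatically.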

It is important to choose the right context for an index: it has to be the same context you use in your query. To better understand this, let's look at how index creation is handled by eXist-db and Lucene. For example: <text qname="SPEECH">

This creates an index on <SPEECH> only. What is passed to Lucene is the string value of <SPEECH>, which also includes the text of all its descendant text nodes (except those filtered out by an optional <ignore>).

Consider the fragment:

<SPEECH>
  <SPEAKER>
    Second Witch
  </SPEAKER>
  <LINE>
    Fillet of a fenny snake,
  </LINE>
  <LINE>
    In the cauldron boil and bake;
  </LINE>
</SPEECH>

If you have an index on <SPEECH>, Lucene will use the text "Second Witch Fillet of a fenny snake, In the cauldron boil and bake;" and index it. eXist-db internally links this Lucene document to the <SPEECH> node, but Lucene itself has no knowledge of that (it doesn't know anything about XML nodes).

Given this, take the following query:

//SPEECH[ft:query(., 'cauldron')]

This searches the index and finds the text, which eXist-db can trace back to the <SPEECH> node in the XML document.

However, it is required that you use the same context (<SPEECH>) for creating and querying the index. For instance:

//SPEECH[ft:query(LINE, 'cauldron')]

This will not return anything, even though <LINE> is a child of <SPEECH> and cauldron was indexed. This particular cauldron is linked to its ancestor <SPEECH>, not its parent <LINE>.

However, you are free to give the user both options, i.e. use <SPEECH> and <LINE> as context at the same time. For this define a second index on <LINE>:

<text qname="SPEECH"/>
<text qname="LINE"/>
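With both definitions in place, either element can serve as the query context. A short sketch, using only the element names from the fragment above:

```xquery
(: Match against the text of the whole speech and return the speech. :)
//SPEECH[ft:query(., 'cauldron')],

(: Match against individual lines only, but still return the enclosing speech. :)
//SPEECH[ft:query(LINE, 'cauldron')],

(: Match against individual lines and return just the matching lines. :)
//SPEECH/LINE[ft:query(., 'cauldron')]
```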

Let's use a different example to illustrate this. Assume you have a document with encoded place names:

<p>
  He loves
  <placeName>
    Paris
  </placeName>
  .
</p>

For a general query you probably want to search through all paragraphs. However, you may also want to provide an advanced search option, which allows users to restrict their queries to place names. To make this possible, simply define an index on <placeName> as well:

<lucene>
  <text qname="p"/>
  <text qname="placeName"/>
</lucene>

Based on this setup, you'll be able to query for the word 'Paris' anywhere in a paragraph:

//p[ft:query(., 'paris')]

And also on 'Paris' occurring within a <placeName>:

//p[ft:query(placeName, 'paris')]

Using match="..."

In addition to defining an index on a given qualified name, you can also specify a "path" with wildcards. This feature might be subject to change, so please be careful when using it.

Assume you want to define an index on all the possible elements below <SPEECH>. You can do this by creating one index for every element:

<text qname="LINE"/>
<text qname="SPEAKER"/>

As a shortcut, you can use a match attribute with a wildcard:

<text match="//SPEECH/*"/>

This will create a separate index on each child element of SPEECH it encounters. Please note that the argument to match is a simple path pattern, not a full XPath expression. For the time being, it only allows:

  • / and // to denote child or descendant steps,

  • * wildcard selector to match an arbitrary element,

  • matching a single attribute's value, e.g. foo[@bar = 'xyz']

As explained above, you have to figure out which parts of your document will likely be interesting as context for a full text query. The full text index works best if the context isn't too narrow. For example, if you have a document structure with section <div>s, headings and paragraphs, you would probably want to create an index on the <div>s and maybe on the headings, so the user can differentiate between the two.

In some cases, you could decide to put the index on the paragraph level. Then you don't need the index on the section, since you can always get from the paragraph back to the section.

If you query a larger context, you can use the KWIC module to show the user text surrounding each match. Or you can ask eXist-db to highlight each match with an <exist:match> tag, which you can later use to locate the matches within the text.
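Both options can be sketched in a single query. The example below assumes an index on <SPEECH> as defined earlier; the summary width of 40 characters is an arbitrary choice.

```xquery
xquery version "3.1";

import module namespace kwic="http://exist-db.org/xquery/kwic";

declare namespace exist="http://exist.sourceforge.net/NS/exist";

for $hit in //SPEECH[ft:query(., 'cauldron')]
order by ft:score($hit) descending
return
    <hit>
        {
            (: Keywords in context: a short summary around each match. :)
            kwic:summarize($hit, <config width="40"/>)
        }
        {
            (: Alternatively, expand the node and pick out the highlighted matches. :)
            util:expand($hit)//exist:match
        }
    </hit>
```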

Whitespace Treatment and Ignored Content

We'll go into more detail with two common requirements when using full-text indexes.

Inlined elements

By default, eXist-db's indexer assumes that element boundaries break on a word or token. For example, if you have an element:

<size>
  <width>
    12
  </width>
  <height>
    8
  </height>
</size>

You want 12 and 8 to be indexed as separate tokens, even though there's no whitespace between the elements. eXist-db will pass the content of the two elements to Lucene as separate strings and Lucene will see two tokens (instead of just 128).

However, you usually don't want this behaviour for mixed content nodes. For example:

<p>
  This is
  <b>
    un
  </b>
  clear.
</p>

In this case, you want unclear to be indexed as a single word. This can be done by telling eXist-db which nodes are inline nodes. The example configuration above uses:

<inline qname="b"/>

The <inline> option can be specified either globally or per index:

<text qname="p">
  <inline qname="em"/>
</text>
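Assuming the <p> fragment above is stored in a collection where <b> is declared inline, a query for the joined word should find the paragraph. A brief sketch:

```xquery
(: With <b> declared inline, "un" and "clear" reach Lucene as the single token "unclear". :)
//p[ft:query(., 'unclear')]
```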

Ignored elements

It is sometimes necessary to skip the content of an inline element. Notes are a good example:

<p>
  This is a paragraph
  <note>
    containing an inline note
  </note>
  .
</p>

Use an <ignore> element in the collection configuration to have eXist-db ignore the note:

<ignore qname="note"/>

Basically, <ignore> simply allows you to hide a chunk of text before Lucene sees it.

Like the <inline> tag, <ignore> may appear either globally or within a single index definition.

The <ignore> only applies to descendants of an indexed element. You can still create another index on the ignored element itself. For example, you can have index definitions for <p> and <note>:

<lucene>
  <text qname="p"/>
  <text qname="note"/>
  <ignore qname="note"/>
</lucene>

If <note> appears within <p>, it will not be added to the index on <p>, only to the index on <note>. For example:

//p[ft:query(., "note")]

This will not return a hit if "note" occurs only within a <note>, while this finds a match:

//p[ft:query(note, "note")]

Boost

A boost value can be assigned to an index to give it a higher score. The score for each match will be multiplied by the boost factor (default is: 1.0). For example, you may want to rank matches in titles higher than other matches.

Here's how to configure the documentation search indexes in eXist-db:

<lucene>
  <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
  <text qname="section">
    <ignore qname="title"/>
    <ignore qname="programlisting"/>
    <ignore qname="screen"/>
    <ignore qname="synopsis"/>
  </text>
  <text qname="para"/>
  <text qname="title" boost="2.0"/>
  <ignore qname="title"/>
</lucene>

The <title> index gets a boost of 2.0 to make sure that its matches get a higher score. Since the <title> element occurs within <section>, we add an ignore rule to the index definition on the section and create a separate index on title. We also ignore titles occurring inside paragraphs. Without this, title would be matched two times.

Because the title is now indexed separately, we need to query it explicitly. For example, to search the section and the title at the same time, one could issue the following query:

for $sect in /book//section[ft:query(., "ngram")] | /book//section[ft:query(title, "ngram")]
order by ft:score($sect) descending 
return $sect

Attribute boost

Starting with eXist-db 3.0, a boost value can also be assigned to an index by attribute. This can be used to weight your search results even if you have flat data structures with the same attribute value pairs throughout your documents. Two flavours of dynamic weighting are available through two pairs of child elements in the full-text index configuration: <match-sibling-attr> and <has-sibling-attr>, and <match-attr> and <has-attr>.

If you have data in Lexical Markup Framework (LMF) format, you will recognize these repeated structures of <feat> elements with att and val attributes within <LexicalEntry> elements, for instance <feat att='writtenForm' val='LMF feature value'>. Attribute boosting allows you to weight the results based on the value of the att attribute, so that hits in definitions come before hits in comments and examples. This behaviour is enabled by adding a child <match-sibling-attr> to a Lucene configuration <text> element. An example index configuration looks like this:

<text qname="@val">
  <match-sibling-attr boost="25" qname="att" value="writtenForm"/>
</text>

This means that the score reported by ft:score#1 for hits in val attributes is boosted by a factor of 25 when the sibling att attribute has the value writtenForm.

In the same way <match-attr> would be used for element qnames in the <text> element.

If the value of the attribute does not matter, use the <has-attr> (or <has-sibling-attr>) variant. An example index configuration with <has-attr> looks like this:

<text qname="feat">
  <has-attr boost="0" qname="xml:lang"/>
</text>

This means that <feat> elements carrying an xml:lang attribute get a score of zero and are pushed to the end of the result list, which can be useful for demoting hits in features written in a language other than the main entry language.

In the same way <has-sibling-attr> would be used for attributes in the <text> element.
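A query against such a configuration might look like the sketch below. The search term and the absence of a namespace on the LMF elements are assumptions made for illustration.

```xquery
(: Hypothetical LMF data: <LexicalEntry> elements containing <feat att="..." val="..."/>. :)
for $feat in //LexicalEntry/feat[ft:query(@val, 'feature')]
(: Hits whose att attribute matches a boosted value should sort first. :)
order by ft:score($feat) descending
return
    <hit att="{$feat/@att}" score="{ft:score($feat)}">{string($feat/@val)}</hit>
```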

Analyzers

One of the strengths of Lucene is that it allows the developer to determine nearly every aspect of text analysis. This is done through analyzer classes, which combine a tokenizer with a chain of filters to post-process the tokenized text. eXist-db's Lucene module allows different analyzers to be used for different indexes. Starting with eXist-db 5.3.0, you can also specify the analyzer used to process a particular query (see query-analyzer-id in Additional parameters).

<lucene>
  <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
  <analyzer id="ws" class="org.apache.lucene.analysis.core.WhitespaceAnalyzer"/>
  <text match="//SPEECH//*"/>
  <text qname="TITLE" analyzer="ws"/>
</lucene>

In the example above, we define that Lucene's StandardAnalyzer should be used by default (the <analyzer> element without id attribute). We provide an additional analyzer and assign it the id ws, by which the analyzer can be referenced in the actual index definitions and by queries specifying a particular analyzer that should be used to process them (see query-analyzer-id in Additional parameters).

The whitespace analyzer is the most basic one. As the name implies, it tokenizes the text at white space characters, but treats all other characters - including punctuation - as part of the token. The tokens are not converted to lower case and there's no stopword filter applied.
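As a sketch of the query-side option mentioned above (eXist-db 5.3.0 or later), the query string can be analyzed by the analyzer registered under the id ws. The search term is an arbitrary example.

```xquery
(: Analyze the query string with the whitespace analyzer, so "Hamlet" is not lower-cased. :)
//TITLE[ft:query(., 'Hamlet', map { "query-analyzer-id": "ws" })]
```

Because the whitespace analyzer preserves case, the query term has to be written exactly as it appears in the indexed text.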

eXist-db provides a special analyzer for characters with diacritics based on the StandardAnalyzer. The NoDiacriticsStandardAnalyzer can be switched on and off by setting the diacritics attribute on the <lucene> element of your index configuration file.

<collection xmlns="http://exist-db.org/collection-config/1.0">
  <index>
    <!-- Lucene indexes -->
    <lucene diacritics="no">
      <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
      <text match="//title[@xml:lang='Sa-Ltn']"/>
      <text match="/TEI/text">
        <ignore qname="text"/>
      </text>
    </lucene>
  </index>
</collection>

With diacritics switched off, ä, å, ā, etc. will all be indexed as a. Alternatively, this analyzer can also be referenced by its full class name.

<collection xmlns="http://exist-db.org/collection-config/1.0">
  <index xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <lucene>
      <analyzer class="org.exist.indexing.lucene.analyzers.NoDiacriticsStandardAnalyzer" id="nodiacritics"/>
      <text qname="letter" analyzer="nodiacritics">
        <field name="place" expression="place" analyzer="nodiacritics"/>
        <field name="from" expression="from" store="no"/>
        <field name="to" expression="to"/>
      </text>
    </lucene>
  </index>
</collection>

Configuring the Analyzer

You can pass configuration parameters to the constructor of the analyzer. These parameters must match a constructor signature of the analyzer's underlying Java class; please consult the Javadoc for the analyzer you wish to configure.

We currently support passing the following types:

  • java.lang.String (default if no type is specified)

  • java.lang.String[] (since eXist-db 5.4.0)

  • char[] (since eXist-db 5.4.0)

  • java.io.FileReader or file

  • java.lang.Boolean or boolean

  • java.lang.Integer or int

  • org.apache.lucene.analysis.util.CharArraySet or set

  • java.lang.reflect.Field

The value Version#LUCENE_CURRENT is always added as the first parameter for the analyzer constructor (a fallback mechanism is present for older analyzers). The previously valid values java.io.File and java.util.Set can no longer be used since Lucene 4.

For instance to add a stopword list, use one of the following constructions:

<analyzer id="stdstops" class="org.apache.lucene.analysis.standard.StandardAnalyzer">
  <param name="stopwords" type="java.io.FileReader" value="/tmp/stop.txt"/>
</analyzer>
<analyzer id="stdstops" class="org.apache.lucene.analysis.standard.StandardAnalyzer">
  <param name="stopwords" type="org.apache.lucene.analysis.util.CharArraySet">
    <value>
      the
    </value>
    <value>
      this
    </value>
    <value>
      and
    </value>
    <value>
      that
    </value>
  </param>
</analyzer>

To configure a custom analyzer of your own, you might use something like:

<analyzer id="my-custom-analyzer" class="tld.org.CustomAnalyzer">
    <param name="minimumTermLength" type="int" value="2"/>
    <param name="punctuationDictionary" type="char[]">
        <value>'</value>
        <value>-</value>
        <value>’</value>
    </param>
</analyzer>

Using the Snowball analyzer requires you to add additional libraries to lib/user.

<analyzer id="sbstops" class="org.apache.lucene.analysis.snowball.SnowballAnalyzer">
  <param name="name" value="English"/>
  <param name="stopwords" type="org.apache.lucene.analysis.util.CharArraySet">
    <value>
      the
    </value>
    <value>
      this
    </value>
    <value>
      and
    </value>
    <value>
      that
    </value>
  </param>
</analyzer>

Facets and Fields

Starting with eXist 5.0, an index configuration may define additional facets and fields. Both can hold arbitrary content, which will be attached to the indexed parent node and can be used to further refine a query, sort results or display additional information to the user:

facet

a facet defines a concept or information item by which the indexed items can be grouped. Typical facets would be categories taken from some pre-defined taxonomy, languages, dates, places or names occurring in a text corpus. The goal is to enable users to "drill down" into a potentially large result set by selecting from a list of facets displayed to them. For example, if you shop for a laptop, you are often presented with a list of facets with which you may restrict your result by CPU type, memory or screen size etc. As you select facets, the result set will become smaller and smaller.

Facets are always pre-defined at indexing time, so the drill down is very fast. They are meant for refining other queries, the assumption always is that the user selects one or more facets from a list of facet values associated with the current query results.

field

a field contains additional, searchable content attached to an indexed parent node. In many cases fields will contain constructed content which is not directly found in the indexed XML or requires costly computation. For example, determining publication dates or author names for a set of articles may require some pre-processing which may be too expensive at query time. A field allows you to pre-compute those information items at indexing time.

Fields can be queried in the same expression as the parent node, resulting in fast response times. Their content can optionally be stored to speed up display or sorting. Fields may also use a different analyzer than the parent node, which allows e.g. multiple languages to be handled separately.

Facet and Field Configuration

Facets and fields are configured in a similar way. Both should appear nested inside the parent index element they are attached to. Let's assume we have a collection of articles written in docbook. Each article will have a top-level <info> element describing the article. Each <info> element contains a <title>, one or more <author>s and a list of keywords in <keywordset>.

Keywords are a perfect candidate for a facet, so let's start with it:

<collection xmlns="http://exist-db.org/collection-config/1.0">
  <index xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:db="http://docbook.org/ns/docbook">
    <lucene>
      <text qname="db:article">
        <facet dimension="keyword" expression="db:info/db:keywordset/db:keyword"/>
      </text>
    </lucene>
  </index>
</collection>

Every facet needs to have a dimension attribute, defining the name of the facet dimension the items will be added to. The values associated with this facet dimension are determined by the expression attribute: it may contain an arbitrary XQuery expression rooted in the parent node being indexed. In the example the parent will be a <db:article> element, so the context item for the expression is set to this element.

The expression is evaluated and for each result item, a facet value is added to the dimension using the string value of the item. Therefore if the expression returns multiple items, a facet for that dimension will also hold multiple values. If the expression returns the empty sequence for the current parent node, the corresponding facet will be empty as well.
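Once indexed, the facet values attached to a query result can be listed from XQuery. The sketch below assumes the articles are stored in a collection /db/articles and uses ft:facets to return a map from facet label to count; the limit of 10 values is arbitrary.

```xquery
xquery version "3.1";

declare namespace db="http://docbook.org/ns/docbook";

let $hits := collection("/db/articles")//db:article[ft:query(., "xquery")]
(: Up to 10 keyword values occurring in the hits, each mapped to its count. :)
let $keywords := ft:facets($hits, "keyword", 10)
return
    map:for-each($keywords, function($label, $count) {
        <facet label="{$label}" count="{$count}"/>
    })
```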

A facet can also be defined to be hierarchical. A typical example would be a date, which consists of a year, month and day component. By indexing the single components as separate parts of a hierarchical facet, we enable the user to drill down by year first, then by month and finally by day. Let's assume each of our docbook articles has a <pubdate> containing a date in xs:date format:

<collection xmlns="http://exist-db.org/collection-config/1.0">
  <index xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:db="http://docbook.org/ns/docbook">
    <lucene>
      <text qname="db:article">
        <facet dimension="keyword" expression="db:info/db:keywordset/db:keyword"/>
        <facet dimension="date" expression="tokenize(db:info/db:pubdate, '-')" hierarchical="yes"/>
      </text>
    </lucene>
  </index>
</collection>
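A hierarchical facet can then be listed level by level. The sketch below assumes the four-argument form of ft:facets, whose last argument names the path to descend into; the collection /db/articles and the year 2017 are illustrative values only.

```xquery
xquery version "3.1";

declare namespace db="http://docbook.org/ns/docbook";

let $hits := collection("/db/articles")//db:article[ft:query(., "xquery")]
return (
    (: Top level of the hierarchy: counts per year. :)
    ft:facets($hits, "date", 10),
    (: One level down: counts per month within the assumed year 2017. :)
    ft:facets($hits, "date", 12, "2017")
)
```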

Hierarchical facets may also hold multiple values, for example if we would like to associate our documents with a subject classification on various levels of granularity (say: science with math and physics as subcategories, or humanities with art, sociology and history). This way we enable the user to drill down into a broad humanities or science subject first and choose particular topics afterwards. If the hierarchical facet expression evaluates to an array, each of the array members will be treated as a hierarchical value for that facet. In XQuery, such an array could look like [('science', 'math'), ('humanities', 'history')] and be the result of evaluating a function like idx:subject-hierarchy, stored in an imported module (see below):

<collection xmlns="http://exist-db.org/collection-config/1.0">
  <index xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:db="http://docbook.org/ns/docbook">
    <lucene>
      <module uri="http://exist-db.org/lucene/test/" prefix="idx" at="module.xql"/>
      <text qname="db:article">
        <facet dimension="keyword" expression="db:info/db:keywordset/db:keyword"/>
        <facet dimension="date" expression="tokenize(db:info/db:pubdate, '-')" hierarchical="yes"/>
        <facet dimension="subject" expression="idx:subject-hierarchy(db:info/db:subjectset/db:subject/db:subjectterm)" hierarchical="yes"/>
      </text>
    </lucene>
  </index>
</collection>

declare function idx:subject-hierarchy($key as xs:string*) {
    array:for-each (array {$key}, function($k) {
        doc('/db/subjects/subjects.xml')//subject[@name=$k]/ancestor-or-self::subject/@name
    })
};

This assumes a hierarchical subject structure stored in /db/subjects/subjects.xml:

<subject>
  <subject name="science">
    <subject name="math"/>
    <subject name="physics"/>
  </subject>
  <subject name="humanities">
    <subject name="art"/>
    <subject name="sociology"/>
    <subject name="history"/>
  </subject>
</subject>

Next, we may want to define fields for the authors and title of the article. In docbook, <author> can be a complex element, consisting e.g. of a <personname> with nested <surname> and <firstname>. For display to the user and sorting we want to pre-compute a normalized string out of those components:

<collection xmlns="http://exist-db.org/collection-config/1.0">
  <index xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:db="http://docbook.org/ns/docbook">
    <lucene>
      <text qname="db:article">
        <facet dimension="keyword" expression="db:info/db:keywordset/db:keyword"/>
        <field name="title" expression="db:info/db:title"/>
        <field name="author" expression="for $au in db:info/db:author return string-join(($au/db:personname/db:firstname, $au/db:personname/db:surname), ' ')"/>
      </text>
    </lucene>
  </index>
</collection>

A field does not need to define an expression attribute though: if no expression is given, the field's content will be taken from the parent element. This makes sense if you would like to index a node twice, for example with a different analyzer. Alternatively, you can specify index="no" on the parent element and index its content with an explicit field.

A field may use a different analyzer than the one used to index the parent content. Analyzers are referenced through the analyzer attribute as described above.

Typed fields: fields may also declare a type attribute: supported values are atomic types like xs:date, xs:dateTime, xs:time, xs:integer, xs:decimal and their sub-types. Defining a type is important with respect to sorting (see below), e.g. to get dates in the correct order. Typed fields can also be retrieved into corresponding XQuery atomic values, so no additional casting is necessary. However, typed fields cannot be queried using Lucene's default query parser, only retrieved with ft:field.

Storing fields: by default the complete content of a field is stored in the Lucene index, allowing later fast retrieval of the content using ft:field. You can disable storing the content by adding attribute store="no". The field will still be indexed and available for queries though.

Binary fields: a special type of field offering faster access times than a normal field. Binary fields are stored into a fast lookup table alongside the main index. Their content can be retrieved, but not queried, and you can declare a type (see typed fields above). The default type will be xs:string. Use a binary field if you frequently need to sort or filter large node sets by a given field, e.g. when sorting by date:

<collection xmlns="http://exist-db.org/collection-config/1.0">
  <index xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:db="http://docbook.org/ns/docbook">
    <lucene>
      <module uri="http://exist-db.org/lucene/test/" prefix="idx" at="module.xql"/>
      <text qname="db:article">
        <facet dimension="keyword" expression="db:info/db:keywordset/db:keyword"/>
        <field name="title" expression="db:info/db:title"/>
        <field name="date" expression="db:info/db:pubdate" type="xs:date" binary="yes"/>
        <field name="author" expression="idx:author(db:info/db:author)"/>
      </text>
    </lucene>
  </index>
</collection>

Importing external modules: as can be seen in the inline field definition for "author" shown earlier, expressions can easily become quite verbose, so writing them into an attribute is not convenient. It is thus also possible to import one or more XQuery modules into the index configuration and use the functions declared in the module:

<collection xmlns="http://exist-db.org/collection-config/1.0">
  <index xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:db="http://docbook.org/ns/docbook">
    <lucene>
      <module uri="http://exist-db.org/lucene/test/" prefix="idx" at="module.xql"/>
      <text qname="db:article">
        <facet dimension="keyword" expression="db:info/db:keywordset/db:keyword"/>
        <field name="title" expression="db:info/db:title"/>
        <field name="author" expression="idx:author(db:info/db:author)"/>
      </text>
    </lucene>
  </index>
</collection>

In this example we extract the code for computing the author field into a function idx:author located in an XQuery module, module.xql. Note that we're using a relative import path for the module in the at attribute. The path will be resolved relative to the collection to which the collection configuration applies (not where the collection configuration itself is stored). It is also important that the module and all dependencies it imports are stored before the collection configuration is saved and indexing starts.
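The module itself is not shown in this article. A minimal sketch of what module.xql could contain, reusing the inline author expression from the earlier field definition, is given below; the function body is an assumption based on that expression.

```xquery
xquery version "3.1";

(: Sketch of module.xql; the namespace URI must match the uri attribute of <module>. :)
module namespace idx="http://exist-db.org/lucene/test/";

declare namespace db="http://docbook.org/ns/docbook";

(: Builds one "firstname surname" string per author element. :)
declare function idx:author($authors as element(db:author)*) as xs:string* {
    for $author in $authors
    return
        string-join(
            ($author/db:personname/db:firstname, $author/db:personname/db:surname),
            ' '
        )
};
```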

Conditions: sometimes you may want to create a field only if a certain condition is met. For this purpose, an additional attribute if may be added, containing an XPath expression. If the expression evaluates to an effective boolean value of true, the field will be created. Otherwise it is skipped.

Conditions are useful, for example, to distinguish between different languages and apply an appropriate analyzer to each. Let's assume our docbook articles may have both a German and an English version. The language is indicated by the @xml:lang attribute on the top-level <article> element. We thus create a separate field for each language and connect it to the analyzer appropriate for the language:

<collection xmlns="http://exist-db.org/collection-config/1.0">
  <index xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:db="http://docbook.org/ns/docbook">
    <lucene>
      <analyzer class="org.apache.lucene.analysis.de.GermanAnalyzer" id="german"/>
      <analyzer class="org.apache.lucene.analysis.en.EnglishAnalyzer" id="english"/>
      <text qname="db:article" index="no">
        <field name="english" if="@xml:lang='en'" analyzer="english"/>
        <field name="german" if="@xml:lang='de'" analyzer="german"/>
      </text>
    </lucene>
  </index>
</collection>

Note that we skip indexing the parent <article> element with index="no" because we do not want a default index, but rather a separate field for each language, so we can target them in queries explicitly.
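Targeting one of the language fields then amounts to prefixing the query with the field name. A sketch, assuming the articles live in /db/articles and using arbitrary search terms:

```xquery
xquery version "3.1";

declare namespace db="http://docbook.org/ns/docbook";

(: Search only the field built with the GermanAnalyzer. :)
collection("/db/articles")//db:article[ft:query(., "german:datenbank")],

(: Search only the field built with the EnglishAnalyzer. :)
collection("/db/articles")//db:article[ft:query(., "english:database")]
```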

Querying the Index

Querying full text from XQuery is straightforward. For example:

for $m in //SPEECH[ft:query(., "boil bubble")]
order by ft:score($m) descending
return $m

The ft:query function takes a query string in Lucene's default query syntax. It returns a set of nodes which are relevant with respect to the query. Lucene assigns a relevance score or rank (a decimal number) to each match. This score is preserved by eXist-db and can be accessed through the ft:score function.

The higher the score, the more relevant the text. You can use Lucene's features to "boost" a certain term in the query: give it a higher or lower influence on the final rank.
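In Lucene's default query syntax, a term is boosted by appending a caret and a factor. A short sketch, with an arbitrarily chosen factor of 4:

```xquery
(: "boil" is boosted by a factor of 4 relative to "bubble" when the score is computed. :)
for $m in //SPEECH[ft:query(., "boil^4 bubble")]
order by ft:score($m) descending
return
    <hit score="{ft:score($m)}">{$m/SPEAKER/text()}</hit>
```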

Please note that the score is computed relative to the root context of the index. If you created an index on <SPEECH>, all scores will be computed based on text in <SPEECH> nodes, even though your actual query may only return <LINE> children of <SPEECH>.

The query string passed to ft:query may be empty. In this case all items from the context sequence are matched and returned. Using an empty query makes sense in combination with the options for retrieving facets and field values described below.

The Lucene module is fully supported by eXist-db's query-rewriting optimizer. This means that the query engine can rewrite the XQuery expression to make best use of the available indexes. All the rules and hints given in the tuning guide fully apply to the Lucene index.

To present search results in a Keywords in Context format, you may want to have a look at eXist-db's KWIC module.

Querying Fields

Fields associated with the indexed parent node (see above) can be queried with ft:query by prefixing parts of the query expression with the field name followed by a colon (':'), as described in the documentation for Lucene's default query syntax. For example, the following expression searches for a docbook article containing the term "xml" in the text and both "xquery" and "language" in the title:

//db:article[ft:query(., "title:(xquery AND language) AND xml")]

Note how subexpressions can be grouped with parentheses to clearly state to which field they apply.
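Field prefixes combine with the other features of the query syntax, such as phrases in double quotes. The sketch below assumes the title and author fields from the configuration above; the search terms are illustrative.

```xquery
xquery version "3.1";

declare namespace db="http://docbook.org/ns/docbook";

(: A phrase in the title field, a term in the author field and a term in the main text. :)
//db:article[ft:query(., 'title:"xquery language" AND author:wolfgang AND xml')]
```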

Retrieving Field Content

You can retrieve the content of a field for display or sorting purposes using the ft:field function or alternatively the ft:binary-field function if you created a binary field. However, fields are always bound to the result of a full text query, so you cannot retrieve them without calling ft:query first.

One of the most common uses for retrieving field contents will be for sorting the results of a query. The order by in the example below sorts results by title first and then by date.

for $article in collection("/db/articles")//db:article[ft:query(., "xquery")]
order by ft:field($article, "title"), ft:binary-field($article, "date", "xs:date")[1]
return
    $article

Note that even though fields are only available with the results of the ft:query, it is still possible to use them for sorting and displaying the whole available data set. For example, to view all articles in the collection you could pass in an empty sequence in place of the query string like this:

//db:article[ft:query(., ())]

Typed fields: If you declared a type other than xs:string on a field, remember to use the three-parameter variant of ft:field or ft:binary-field and pass the name of the desired target type as the third parameter. The reason is that Lucene stores most non-text data types as numbers, and eXist-db has no way to figure out the original type of the field. So if you defined a field (binary or not) with type xs:date, make sure to retrieve it with, for example, ft:binary-field($node, "date", "xs:date"); otherwise you will get either an error or garbage.
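
Putting these points together, a minimal sketch that lists every article, newest first, could look as follows. It assumes the index configuration defines a binary field named date of type xs:date and a field named title on db:article; the item element is a hypothetical output format.

```xquery
xquery version "3.1";

declare namespace db="http://docbook.org/ns/docbook";

(: Empty query: match every indexed article so that its fields become available :)
for $article in collection("/db/articles")//db:article[ft:query(., ())]
(: Pass the target type explicitly; a field may hold several values, so take the first :)
let $date := ft:binary-field($article, "date", "xs:date")[1]
(: Articles without a date sort to the end :)
order by $date descending empty least
return
    <item date="{$date}">{ft:field($article, "title")[1]}</item>
```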

Matches in fields

When retrieving the content of a field for display, you may still need an indication of where matches were found. You cannot use util:expand for that but need ft:highlight-field-matches instead. Effectively it provides the same mechanism as util:expand for full text matches but works on fields.

  • util:expand operates on nodes which are part of an XML document

  • fields are attached to a node, but can contain arbitrary string content which may, but does not have to, be derived from the document

Thus util:expand and ft:highlight-field-matches behave differently and produce slightly different output. ft:highlight-field-matches is also faster as it does not need to traverse the XML tree like util:expand.

ft:highlight-field-matches returns the content of the field named in its second parameter, for the node passed as its first parameter, with exist:match tags wrapped around the matches.

For example, to display each matching article with its title, author and matched full text as KWIC, you could start with the code below. Please note that ft:highlight-field-matches just returns exist:field with exist:match inside, so you'd need to further process the result for proper HTML output.

  
for $article in collection("/db/articles")//db:article[ft:query(., "lucene AND title:xquery AND author:wolfgang", map { "fields": ("title", "author") })]
    order by ft:field($article, "title")[1]
return
    <result>
      {ft:highlight-field-matches($article, 'title')}
      {ft:highlight-field-matches($article, 'author')}
      {kwic:summarize($article, <config width="40"/>)}
    </result>
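
The further processing mentioned above could be sketched as follows. The helper local:field-to-html is hypothetical, and the literal exist:field element is only a stand-in for what ft:highlight-field-matches returns, so that the sketch runs without an index. It copies the field text into a span and turns each exist:match into an HTML mark element.

```xquery
xquery version "3.1";

declare namespace exist="http://exist.sourceforge.net/NS/exist";

(: Hypothetical helper: copy the field text, turning each exist:match into <mark> :)
declare function local:field-to-html($field as element()?, $class as xs:string) as element(span)? {
    if (exists($field)) then
        <span class="{$class}">
        {
            for $node in $field/node()
            return
                typeswitch ($node)
                    case element(exist:match) return <mark>{$node/string()}</mark>
                    default return $node
        }
        </span>
    else
        ()
};

(: Stand-in for the result of ft:highlight-field-matches($article, "title") :)
let $highlighted :=
    <exist:field>Learning <exist:match>XQuery</exist:match> with eXist-db</exist:field>
return
    local:field-to-html($highlighted, "title")
```

In a real query you would pass the result of ft:highlight-field-matches to the helper instead of the literal element.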

Displaying Facet Counts

Facet counts for the query result can be retrieved if facets are associated with an indexed parent element. Facet counts for a particular dimension are available as a map containing an entry for each facet value occurring in one or more items of the query result. The map links the facet value given as a map key with a positive count corresponding to the number of times the value occurs in the result set. Facet values with zero count are never included.

For example, we may use the following query to display the facet counts for the "keyword" dimension in our set of docbook articles:

let $result := collection("/db/articles")//db:article[ft:query(., "xml")]
let $facets := ft:facets($result, "keyword", ())
return
    <table>
    {
        map:for-each($facets, function($label, $count) {
            <tr><td>{$label}</td><td>{$count}</td></tr>
        })
    }
    </table>

The ft:facets function expects a sequence of nodes belonging to a result set obtained from one or more calls to ft:query. If the sequence was combined from multiple expressions calling ft:query, the facet counts are merged. The second parameter specifies the dimension for which facet counts should be retrieved. The third parameter is either an empty sequence or a positive integer giving the maximum number of facet values to return: if it is smaller than the total number of facet values, only those with the highest counts are returned, while an empty sequence returns the counts for all facet values. As noted above, facet values which do not occur anywhere in the result are never returned.
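
As a sketch of the third parameter in use, the variant below asks for at most five keyword facets and sorts them by count before rendering, since a map does not guarantee any particular order of its entries. Collection path and dimension name are the same assumptions as in the example above.

```xquery
xquery version "3.1";

declare namespace db="http://docbook.org/ns/docbook";

let $result := collection("/db/articles")//db:article[ft:query(., "xml")]
(: Ask for at most five keyword facets; those with the highest counts are kept :)
let $facets := ft:facets($result, "keyword", 5)
return
    <table>
    {
        (: Maps are unordered, so sort explicitly: highest count first, then by label :)
        for $label in map:keys($facets)
        let $count := $facets($label)
        order by $count descending, $label
        return
            <tr><td>{$label}</td><td>{$count}</td></tr>
    }
    </table>
```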

For hierarchical facets only the top-most facet value in the hierarchy will be returned by default. For example, if you indexed a date facet with separate year, month and day component, a call to ft:facets($node, "date", ()) will return facet counts for years only. To also get counts for months, you have to call ft:facets with a fourth parameter, passing in the year for which sub-facet counts should be retrieved. To get days, you also need to specify month and so on. ft:facets($node, "date", (), ("2018", "06")) will thus return facet counts for all days in June 2018.
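
The drill-down described above can be sketched in a single query which collects the counts for each level of the hierarchy. It assumes a hierarchical "date" facet with year, month and day components has been configured on db:article; the map keys are purely illustrative.

```xquery
xquery version "3.1";

declare namespace db="http://docbook.org/ns/docbook";

let $result := collection("/db/articles")//db:article[ft:query(., "xml")]
return
    map {
        (: Top level only: one entry per year :)
        "years": ft:facets($result, "date", ()),
        (: One level down: months within 2018 :)
        "months-in-2018": ft:facets($result, "date", (), "2018"),
        (: Two levels down: days within June 2018 :)
        "days-in-june-2018": ft:facets($result, "date", (), ("2018", "06"))
    }
```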

Refining a Query with Facets

The main purpose of facets is to quickly narrow down a query result, limiting it to only items which match a certain facet value. To drill down by a given facet dimension and value, pass a key "facets" in the options map given in the third parameter of ft:query:

let $options := map { 
    "facets": map { 
        "keyword": ("indexing", "facets")
    }
}
return
    collection("/db/articles")//db:article[ft:query(., "xml", $options)]

If you specify multiple dimensions, these will be linked together with a logical and, limiting the result to elements matching both dimensions.

The treatment of multiple values for one facet dimension depends on the type of facet. For non-hierarchical facets, as in the example above, multiple values are linked together with a logical or, returning elements matching any of the alternative facet values for that dimension.

For hierarchical facets, a sequence of items is interpreted as a hierarchical value/subvalue facet path, so an expression like ("2018", "06", "25") for the date dimension mentioned earlier will return nodes from the 25th of June 2018. Nested sequences are not allowed in XQuery, so the only way to pass in multiple hierarchical facet paths is to wrap them in an array. The array members are then linked with a logical or. To query for elements from June or May 2018, we therefore need to specify the date dimension values as [("2018", "06"), ("2018", "05")]:

let $options := map { 
    "facets": map { 
        "keyword": ("indexing", "facets"),
        "date": [("2018", "06"), ("2018", "05")]
    }
}
return
    collection("/db/articles")//db:article[ft:query(., "xml", $options)]

Describing Queries in XML

Lucene's default query syntax does not provide access to all available features. However, eXist-db's ft:query function also accepts a description of the query in XML, as an alternative to passing a query string. The XML description closely mirrors Lucene's query API. It is transformed into an internal tree of query objects, which is directly passed to Lucene for execution. This has several advantages, for example you can specify if the order of terms should be relevant for a phrase query:

let $query :=
    <query>
        <near ordered="no">miserable nation</near>
    </query>
return
    //SPEECH[ft:query(., $query)]

Range queries using TO are also supported in a query string. Suppose you have marked up dates, indexed them under the name date, and wish to return only results between 1600 and 1610:

let $query := "date:[1600 TO 1610]"

return ft:search($col, $query)//exist:match

The following elements may occur within a query description:

<term>

Defines a single term to be searched in the index. If the root query element contains a sequence of term elements, wrap them in <bool/> and they will be combined as in a boolean "or" query. For example:

let $query :=
    <query>
        <bool><term>nation</term><term>miserable</term></bool>
    </query>
return
//SPEECH[ft:query(., $query)]

This finds all <SPEECH> elements containing either nation or miserable or both.

<wildcard>

A string with a * wildcard in it. This will be matched against the terms of a document. Can be used instead of a <term> element. For example:

let $query :=
    <query>
        <bool><term>nation</term><wildcard>miser*</wildcard></bool>
    </query>
return
//SPEECH[ft:query(., $query)]

<regex>

A regular expression which will be matched against the terms of a document. Can be used instead of a <term> element. For example:

let $query :=
    <query>
        <bool><term>nation</term><regex>miser.*</regex></bool>
    </query>
return
//SPEECH[ft:query(., $query)]

<bool>

Constructs a boolean query from its children. Each child element may have an occurrence indicator, which could be either must, should or not:

must

this part of the query must be matched

should

this part of the query should be matched, but doesn't need to

not

this part of the query must not be matched

For instance:

let $query :=
    <query>
        <bool><term occur="must">boil</term><term occur="should">bubble</term></bool>
    </query>
return //SPEECH[ft:query(LINE, $query)]

To optimize performance you can specify a minimum number of matches to prevent needless disjunctive searches, using the min attribute. If no occurrence indicator is provided the query will default to should, as this is the only indicator that supports min:

let $query := <query><bool min="3"><term>witch</term></bool></query>

return
  //SPEECH[ft:query(LINE, $query)]

<phrase>

Searches for a group of terms occurring in the correct order. The element may either contain explicit <term> elements or text content. Text will be automatically tokenized into a sequence of terms. For example:

let $query :=
    <query>
        <phrase>cauldron boil</phrase>
    </query>
return //SPEECH[ft:query(., $query)]

This has the same effect as:

let $query :=
    <query>
        <phrase><term>cauldron</term><term>boil</term></phrase>
    </query>
return //SPEECH[ft:query(., $query)]

The attribute slop can be used for a proximity search: Lucene will try to find terms which are within the specified distance:

let $query :=
    <query>
        <phrase slop="10"><term>frog</term><term>dog</term></phrase>
    </query>
return //SPEECH[ft:query(., $query)]

<near>

<near> is a powerful alternative to <phrase> and one of the features not available through the standard Lucene query parser.

If the element has text content only, it will be tokenized into terms and the expression behaves like <phrase>. Otherwise it may contain any combination of <term>, <first> and nested <near> elements. This makes it possible to search for two sequences of terms which are within a specific distance. For example:

let $query :=
    <query>
        <near slop="20"><term>snake</term><near slop="1">tongue dog</near></near>
    </query>
return //SPEECH[ft:query(., $query)]

Element <first> matches a span against the start of the text in the context node. It takes an optional attribute end to specify the maximum distance from the start of the text. For example:

let $query :=
    <query>
        <near slop="50"><first end="2"><near>second witch</near></first><near slop="1">tongue dog</near></near>
    </query>
return //SPEECH[ft:query(., $query)]

As shown above, the content of <first> can again be text, a <term> or <near>.

In contrast to <phrase>, <near> can be told to ignore the order of its components. Use the attribute ordered="yes|no" to change its behaviour. For example:

let $query :=
    <query>
        <near slop="100" ordered="no"><term>bubble</term><term>fillet</term></near>
    </query>
return //SPEECH[ft:query(., $query)]

All elements in a query may have an optional boost attribute (a float). The score of the nodes matching the corresponding query part will be multiplied by this factor.
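
A minimal sketch of boosting, assuming the same Shakespeare data as in the examples above: speeches mentioning "witch" are weighted twice as heavily as those matching only "cauldron", and ft:score is used to order the results by relevance. The boost value 2.0 is an arbitrary choice for illustration.

```xquery
xquery version "3.1";

let $query :=
    <query>
        <bool>
            <!-- matches on "witch" count double when the score is computed -->
            <term occur="should" boost="2.0">witch</term>
            <term occur="should">cauldron</term>
        </bool>
    </query>
for $speech in //SPEECH[ft:query(., $query)]
(: Highest relevance score first :)
order by ft:score($speech) descending
return
    $speech
```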

Additional parameters

The ft:query function accepts a third parameter for passing additional settings to the query engine. Besides a map, as used for the facets option above, this parameter can be an XML fragment which lists the configuration properties to be set as child elements:

let $options :=
    <options>
        <query-analyzer-id>ws</query-analyzer-id>
        <default-operator>and</default-operator>
        <phrase-slop>1</phrase-slop>
        <leading-wildcard>no</leading-wildcard>
        <filter-rewrite>yes</filter-rewrite>
        <lowercase-expanded-terms>yes</lowercase-expanded-terms>
    </options>
return
    //SPEECH[ft:query(., $query, $options)]

The meaning of those properties is as follows:

query-analyzer-id

Explicitly specifies the analyzer that should be used to process that particular query. The value provided should match the id attribute of an analyzer defined in collection.xconf. If you don't specify that property, your query will be processed using the same analyzer that was used to create the index you're querying. While this works for most cases, there are tasks that require this finer level of control. For example it is a feature of some Lucene analyzers that the parameters used for indexing a field are different from those used to query it. In such cases simply define a distinct analyzer (dubbed 'query analyzer') and explicitly refer to it with the query-analyzer-id parameter in each respective query invocation.

filter-rewrite

Controls how terms are expanded for wildcard or regular expression searches. If set to yes, Lucene will use a filter to pre-process matching terms. If set to no, all matching terms will be added to a single boolean query which is then executed. This may generate a "too many clauses" exception when applied to large data sets. Setting filter-rewrite to yes avoids those issues.

default-operator

The default operator with which multiple terms will be combined. Allowed values: or, and.

phrase-slop

Sets the default slop for phrases. If 0, then exact phrase matches are required. Default value is 0.

leading-wildcard

When set to yes, * or ? are allowed as the first character of a PrefixQuery and WildcardQuery. Note that this can produce very slow queries on big indexes.

lowercase-expanded-terms

An option used to set whether the terms in wildcard, prefix, fuzzy, or range queries should be automatically lower-cased or left in their original case. When set to yes, query terms are lower-cased (i.e., as if fn:lower-case() were applied to the query string). Default value is no.
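
The facet examples earlier pass options to ft:query as a map rather than as an XML fragment. The sketch below assumes that the map form accepts the same property names as keys and the same yes/no strings as values; treat this as an assumption and verify it against your eXist-db version.

```xquery
xquery version "3.1";

(: Assumption: map keys mirror the element names of the XML options fragment :)
let $options := map {
    "default-operator": "and",
    "phrase-slop": 1,
    "leading-wildcard": "no",
    "filter-rewrite": "yes"
}
return
    //SPEECH[ft:query(., "cauldron boil", $options)]
```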

Adding Constructed Fields to a Document

This feature allows you to add arbitrary fields to a binary or XML document and have them indexed with Lucene. It was developed as part of the content extraction framework, to attach metadata extracted from, for instance, a PDF to the binary document. It works equally well for XML documents, though, and is an efficient method to attach computed fields to a document, containing information which does not exist in the XML as such.

With the advent of the facets and fields functionality, it is recommended to use these instead of constructed fields.

The field indexes are not configured via collection.xconf. Instead we add fields programmatically from an XQuery (which could be run via a trigger):

ft:index("/db/demo/test.xml", <doc>
    <field name="title" store="yes">Indexing</field>
    <field name="author" store="yes">Me</field>
    <field name="date" store="yes">2013</field>
</doc>)

The store attribute indicates that the field's content should be stored as a string. Without this attribute, the content will be indexed for search, but you won't be able to retrieve it.

To get the contents of a field, use the ft:get-field function:

ft:get-field("/db/demo/test.xml", "title")

To query this index, use the ft:search function:

ft:search("/db/demo/test.xml", "title:indexing AND author:me")

Custom field indexes are automatically deleted when their parent document is removed. If you want to update fields without removing the document, you need to delete the old fields first though. This can be done using the ft:remove-index function:

ft:remove-index("/db/demo/test.xml")
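
An update could therefore be sketched as a remove followed by a fresh call to ft:index. The field values are made up for illustration, and the sketch assumes that the two calls in the sequence are evaluated from left to right.

```xquery
xquery version "3.1";

let $path := "/db/demo/test.xml"
return (
    (: Drop the fields currently attached to the document ... :)
    ft:remove-index($path),
    (: ... then attach the new set of fields :)
    ft:index($path,
        <doc>
            <field name="title" store="yes">Indexing Revisited</field>
            <field name="author" store="yes">Me</field>
            <field name="date" store="yes">2014</field>
        </doc>)
)
```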