Full Text Index
This article provides information on configuring and using eXist-db's full text index.
The full text index module is based on Apache Lucene.
The full-text index module is tightly integrated with eXist-db's modularized indexing architecture: the index behaves like a plug-in which adds itself to the database's index pipelines. Once configured, the index will be notified of relevant events, like adding/removing a document, removing a collection or updating single nodes. No manual re-indexing is required to keep the index up-to-date.
The full-text index module also implements common interfaces which are shared with other indexes, for instance for highlighting matches (see KWIC). It is easy to switch between the Lucene index and, for instance, the ngram index without rewriting much XQuery code.
The Lucene full text index is enabled by default (since eXist-db version 1.4). In case it is not enabled in your installation, here's how to get it up and running:
Enable it according to the instructions in the article on index modules.
Then (re-)build eXist-db using the provided build.bat script. The build process downloads the required Lucene jars automatically. If everything builds without errors, the build output includes a jar for the Lucene module.
Edit the main configuration file, conf.xml, and un-comment the Lucene-related section:

    <modules>
        <module id="lucene-index" class="org.exist.indexing.lucene.LuceneIndex" buffer="32"/>
        ...
    </modules>
The index has a single configuration parameter on the <module> element called buffer. It defines the amount of memory (in megabytes) Lucene will use for buffering index entries before they are written to disk. See the Lucene Javadocs.
Like other indexes, you create a Lucene index by configuring it in a collection.xconf document, as explained in the documentation. For example:

You can define a Lucene index on a single element or attribute (qname="...") or a node path with wildcards (match="...", see below).
It is important to choose the right context for an index, which has to be the same as in your query. To better understand this, let's have a look at how the index creation is handled by eXist-db and Lucene. For example:
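As a minimal sketch, a configuration of this kind could be stored from XQuery as shown here. The collection path /db/shakespeare is a hypothetical example, the query has to be run by a user with DBA rights, and it assumes that /db/system/config/db already exists, as it normally does on a stock installation.

```xquery
xquery version "3.1";

(: Sketch only: /db/shakespeare is a hypothetical data collection. :)
let $config :=
    <collection xmlns="http://exist-db.org/collection-config/1.0">
        <index>
            <lucene>
                <text qname="SPEECH"/>
            </lucene>
        </index>
    </collection>
(: Index configurations live in a mirror of the collection tree below /db/system/config. :)
let $config-collection := xmldb:create-collection("/db/system/config/db", "shakespeare")
let $stored := xmldb:store($config-collection, "collection.xconf", $config)
(: Re-index so that documents stored before the configuration are picked up as well. :)
return xmldb:reindex("/db/shakespeare")
```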
This creates an index on <SPEECH> only. What is passed to Lucene is the string value of <SPEECH>, which also includes the text of all its descendant text nodes (except those filtered out by an optional <ignore>).
Consider the fragment:
If you have an index on
<SPEECH>, Lucene will use the text
"Second Witch Fillet of a fenny snake, In the cauldron boil and
bake;" and index it. eXist-db internally links this Lucene document to the
<SPEECH> node, but Lucene itself has no knowledge
of that (it doesn't know anything about XML nodes).
Given this, take the following query:
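Such a query might look like this. It is a sketch that assumes the index on SPEECH discussed above is in place.

```xquery
(: Search the indexed string value of each SPEECH element for the term "cauldron". :)
//SPEECH[ft:query(., "cauldron")]
```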
This searches the index and finds the text, which eXist-db can trace back to the
<SPEECH> node in the XML document.
However, it is required that you use the same context (<SPEECH>) for creating and querying the index. For example:
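A query with a mismatched context could look like this (a sketch, assuming that only SPEECH carries an index):

```xquery
(: The context node of ft:query is LINE here, but the index was defined on SPEECH. :)
//SPEECH/LINE[ft:query(., "cauldron")]
```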
This will not return anything, even though <LINE> is a child of <SPEECH>, which was indexed. This particular cauldron is linked to its ancestor <SPEECH>, not its parent <LINE>.
However, you are free to give the user both options, i.e. use <SPEECH> and <LINE> as context at the same time. For this, define a second index on <LINE>.
Let's use a different example to illustrate this. Assume you have a document with encoded place names:
For a general query you probably want to search through all paragraphs. However, you may also want to provide an advanced search option which allows users to restrict their queries to place names. To make this possible, simply define an additional index on the element that encodes place names:
Based on this setup, you'll be able to query for the word 'Paris' anywhere in a paragraph:
And also for 'Paris' occurring within a place name element:
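The two searches could be sketched as follows. The element names p and placeName are assumptions made for illustration, since the sample markup is not reproduced here, and both elements are assumed to carry their own Lucene index.

```xquery
(: General search: "paris" anywhere within a paragraph. :)
let $in-paragraphs := //p[ft:query(., "paris")]
(: Advanced search: "paris" only where it is tagged as a place name. :)
let $in-place-names := //p[ft:query(placeName, "paris")]
return
    <hits paragraphs="{count($in-paragraphs)}" place-names="{count($in-place-names)}"/>
```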
In addition to defining an index on a given qualified name, you can also specify a "path" with wildcards. This feature might be subject to change, so please be careful when using it.
Assume you want to define an index on all the possible elements below <SPEECH>. You can do this by creating one index for every element:

As a shortcut, you can use a match attribute with a wildcard:
This will create a separate index on each child element of SPEECH it encounters. Please note that the argument to match is a simple path pattern, not a full XPath expression. It only allows / and // to denote child or descendant steps, plus the wildcard * to match an arbitrary element.
As explained above, you have to figure out which parts of your document will likely be interesting as context for a full text query. The
full text index works best if the context isn't too narrow. For example, if you have a document structure with section
headings and paragraphs, you would probably want to create an index on the
<div>s and maybe on the headings, so the user can
differentiate between the two.
In some cases, you could decide to put the index on the paragraph level. Then you don't need the index on the section, since you can always get from the paragraph back to the section.
If you query a larger context, you can use the KWIC module to
show the user text surrounding each match. Or you can ask eXist-db to highlight each match with an
<exist:match> tag, which you can later use to locate the
matches within the text.
By default, eXist-db's indexer assumes that element boundaries break on a word or token. For example, if you have an element:
You want the contents of its two child elements to be indexed as separate tokens, even though there's no whitespace between the elements. eXist-db will pass the content of the two elements to Lucene as separate strings and Lucene will see two tokens (instead of just one).
However, you usually don't want this behaviour for mixed content nodes. For example:
In this case, you want
unclear to be indexed as a single word. This can be done by telling eXist-db which nodes are inline
nodes. The example configuration above uses:
The <inline> option can be specified either globally or per index:
It is sometimes necessary to skip the content of an inline element. Notes are a good example:
Use an <ignore> element in the collection configuration to have eXist-db ignore the note:

<ignore> simply allows you to hide a chunk of text before Lucene sees it. It may appear either globally or within a single index definition.
<ignore> only applies to descendants of an indexed element. You can still create another index on the ignored element itself. For example, you can have index definitions for both <p> and <note>:

If <note> appears within <p>, it will not be added to the index on <p>, only to the index on <note>. A query on <p> may not return a hit if "note" occurs within a <note>, while a query on <note> finds a match:
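Assuming separate indexes on p and note, with note ignored in the index on p, the two cases could be sketched like this:

```xquery
(: Uses the index on p; text inside ignored note elements is not part of it. :)
let $via-p := //p[ft:query(., "note")]
(: Uses the separate index on note and still returns the enclosing p. :)
let $via-note := //p[ft:query(note, "note")]
return
    <hits via-p="{count($via-p)}" via-note="{count($via-note)}"/>
```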
A boost value can be assigned to an index to give it a higher score. The score for each match will be multiplied by the boost factor (default is: 1.0). For example, you may want to rank matches in titles higher than other matches.
Here's how to configure the documentation search indexes in eXist-db:
The <title> index gets a boost of 2.0 to make sure that its matches get a higher score. Since the <title> element occurs within <section>, we add an ignore rule to the index definition on the section and create a separate index on title. We also ignore titles occurring inside paragraphs. Without this, a title would be matched twice.
Because the title is now indexed separately, we need to query it explicitly. For example, to search the section and the title at the same time, one could issue the following query:
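One way to write such a combined query is sketched below. The search term is a placeholder, and the paths assume that section elements with title children are indexed as described.

```xquery
(: Union of sections matched through their own index and through the title index. :)
for $section in //section[ft:query(., "ngram")] | //section[ft:query(title, "ngram")]
order by ft:score($section) descending
return $section
```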
Starting with eXist-db 3.0, a boost value can also be assigned to an index by attribute. This can be used to weight your search results even if you have flat data structures with the same attribute value pairs throughout your documents. Two flavours of dynamic weighting are available through the new pairs of <match-sibling-attr>, <match-attr> and <has-sibling-attr>, <has-attribute> child elements in the full-text index configuration.
If you have data in Lexical Markup Framework (LMF) format, you will recognize these repeated structures of val attributes within <LexicalEntry> elements, for instance an element ending in val='LMF feature value'>. Attribute boosting allows you to weight the results based on the value of a sibling attribute, so that hits in definitions come before hits in comments and examples. This behaviour is enabled by adding a child <match-sibling-attr> to a Lucene configuration <text> element. An example index configuration for it looks like this:
This means that the ft:score#1 function will boost hits in val attributes by a factor of 25 when the sibling attribute has the value writtenForm. In the same way, <match-attr> would be used for element qnames.
If you do not care about the value of the sibling attribute, use the <has-attribute> index configuration variant. An example index configuration with <has-attr> looks like this:

This means that if your <feat> elements have an xml:lang attribute, they will be scored nil and pushed to the back of the pack, which might be useful to demote hits in features in languages other than the main entry language. In the same way, <has-sibling-attr> would be used for attributes.
One of the strengths of Lucene is that it allows the developer to determine nearly every aspect of text analysis. This is done through analyzer classes, which combine a tokenizer with a chain of filters to post-process the tokenized text. eXist-db's Lucene module already allows different analyzers to be used for different indexes.
In the example above, we define that Lucene's StandardAnalyzer should be used by default (the <analyzer> element without an id attribute). We provide an additional analyzer and assign it the id ws, by which the analyzer can be referenced in the actual index definitions.
The whitespace analyzer is the most basic one. As the name implies, it tokenizes the text at white space characters, but treats all other characters - including punctuation - as part of the token. The tokens are not converted to lower case and there's no stopword filter applied.
You can send configuration parameters to the instantiation of the Analyzer. These parameters must match a constructor signature on the underlying Java class of the Analyzer; please review the Javadoc for the Analyzer that you wish to configure.
We currently support passing the following types:
String (default if no type is specified)
java.io.FileReader (since Lucene 4) or
The value Version#LUCENE_CURRENT is always added as the first parameter for the analyzer constructor (a fallback mechanism is present for older analyzers). Previously valid values such as java.util.Set cannot be used since Lucene 4.
For instance, to add a stopword list, use one of the following constructions:
Using the Snowball analyzer requires you to add additional libraries to
Sometimes you want to define different Lucene indexes on the same set of elements, for instance to use a different analyzer. eXist-db allows you to name an index using the field attribute:

Such an index is called a named index. See Query a Named Index on how to query these indexes.
Querying full text from XQuery is straightforward. For example:
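A sketch of such a query, assuming an index on SPEECH, with the matches ordered by their relevance score:

```xquery
for $speech in //SPEECH[ft:query(., "boil bubble")]
(: ft:score returns the relevance score Lucene computed for this match. :)
let $score := ft:score($speech)
order by $score descending
return <hit score="{$score}">{$speech}</hit>
```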
The query function takes a query string in Lucene's default query syntax. It returns a set of nodes which are relevant with respect to the query. Lucene assigns a relevance score or rank (a decimal number) to each match. This score is preserved by eXist-db and can be accessed through the score function.
The higher the score, the more relevant the text. You can use Lucene's features to "boost" a certain term in the query: give it a higher or lower influence on the final rank.
Please note that the score is computed relative to the root context of the index. If you created an index on <SPEECH>, all scores will be computed based on text in <SPEECH> nodes, even though your actual query may only return <LINE> children of <SPEECH>.
The Lucene module is fully supported by eXist-db's query-rewriting optimizer. This means that the query engine can rewrite the XQuery expression to make best use of the available indexes. All the rules and hints given in the tuning guide fully apply to the Lucene index.
To present search results in a Keywords in Context format, you may want to have a look at eXist-db's KWIC module.
To query a named index (see Defining Fields), use ft:query-field($fieldName, $query) instead of ft:query:

ft:query-field works exactly like ft:query, except that the set of nodes to search is determined by the nodes in the named index. The function returns the nodes selected by the query, which would be <title> elements in the example.
You can use ft:query-field with an XPath filter expression, just as you would call ft:query:
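As a sketch, assuming a named index called title has been defined on title elements below section (both the field name and the element names are assumptions for illustration):

```xquery
(: Intended to select the sections for which the named index "title" yields a match. :)
//section[ft:query-field("title", "xml")]
```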
Lucene's default query syntax does not provide access to all available features. However, eXist-db's ft:query function also accepts a description of the query in XML, as an alternative to passing a query string. The XML description closely mirrors Lucene's query API. It is transformed into an internal tree of query objects, which is directly passed to Lucene for execution. This has several advantages. For example, you can specify whether the order of terms should be relevant for a phrase query:
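For instance, the following sketch uses the near element and its ordered attribute, both described later in this section, to match the two terms in either order:

```xquery
let $query :=
    <query>
        <near ordered="no">miserable nation</near>
    </query>
return //SPEECH[ft:query(., $query)]
```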
The following elements may occur within a query description:
<term>
Defines a single term to be searched in the index. If the root query element contains a sequence of term elements, wrap them in <bool></bool> and they will be combined as in a boolean "or" query. For example:

    let $query :=
        <query>
            <bool><term>nation</term><term>miserable</term></bool>
        </query>
    return //SPEECH[ft:query(., $query)]

This finds all <SPEECH> elements containing either nation or miserable.
<wildcard>
A string with a * wildcard in it. This will be matched against the terms of a document. Can be used instead of a <term> element. For example:

    let $query :=
        <query>
            <bool><term>nation</term><wildcard>miser*</wildcard></bool>
        </query>
    return //SPEECH[ft:query(., $query)]
<regex>
A regular expression which will be matched against the terms of a document. Can be used instead of a <term> element. For example:

    let $query :=
        <query>
            <bool><term>nation</term><regex>miser.*</regex></bool>
        </query>
    return //SPEECH[ft:query(., $query)]
<bool>
Constructs a boolean query from its children. Each child element may have an occurrence indicator in its occur attribute, which can be one of:
must: this part of the query must be matched
should: this part of the query should be matched, but doesn't need to
not: this part of the query must not be matched
For instance:

    let $query :=
        <query>
            <bool><term occur="must">boil</term><term occur="should">bubble</term></bool>
        </query>
    return //SPEECH[ft:query(LINE, $query)]
<phrase>
Searches for a group of terms occurring in the correct order. The element may either contain explicit <term> elements or text content. Text will be automatically tokenized into a sequence of terms. For example:

    let $query :=
        <query>
            <phrase>cauldron boil</phrase>
        </query>
    return //SPEECH[ft:query(., $query)]

This has the same effect as:

    let $query :=
        <query>
            <phrase><term>cauldron</term><term>boil</term></phrase>
        </query>
    return //SPEECH[ft:query(., $query)]

The attribute slop can be used for a proximity search: Lucene will try to find terms which are within the specified distance:

    let $query :=
        <query>
            <phrase slop="10"><term>frog</term><term>dog</term></phrase>
        </query>
    return //SPEECH[ft:query(., $query)]
<near>
<near> is a powerful alternative to <phrase> and one of the features not available through the standard Lucene query parser.
If the element has text content only, it will be tokenized into terms and the expression behaves like <phrase>. Otherwise it may contain any combination of <term>, <first> and nested <near> elements. This makes it possible to search for two sequences of terms which are within a specific distance. For example:

    let $query :=
        <query>
            <near slop="20"><term>snake</term><near slop="1">tongue dog</near></near>
        </query>
    return //SPEECH[ft:query(., $query)]
<first>
<first> matches a span against the start of the text in the context node. It takes an optional attribute end to specify the maximum distance from the start of the text. For example:

    let $query :=
        <query>
            <near slop="50"><first end="2"><near>second witch</near></first><near slop="1">tongue dog</near></near>
        </query>
    return //SPEECH[ft:query(., $query)]

As shown above, the content of <first> can again be text, a <term> or a <near>.
<near> can be told to ignore the order of its components. Use the parameter ordered="yes|no" to change near's behaviour. For example:

    let $query :=
        <query>
            <near slop="100" ordered="no"><term>bubble</term><term>fillet</term></near>
        </query>
    return //SPEECH[ft:query(., $query)]
All elements in a query may have an optional
boost parameter (float). The score of the nodes matching the corresponding
query part will be multiplied by this factor.
The ft:query function allows a third parameter for passing additional settings to the query engine. This parameter must be an XML fragment which lists the configuration properties to be set as child elements:
The meaning of those properties is as follows
filter-rewrite
Controls how terms are expanded for wildcard or regular expression searches. If set to yes, Lucene will use a filter to pre-process matching terms. If set to no, all matching terms will be added to a single boolean query which is then executed. This may generate a "too many clauses" exception when applied to large data sets. Setting filter-rewrite to yes avoids those issues.
default-operator
The default operator with which multiple terms will be combined. Allowed values: and, or.
phrase-slop
Sets the default slop for phrases. If 0, then exact phrase matches are required. Default value is 0.
leading-wildcard
When set to yes, * or ? are allowed as the first character of a PrefixQuery and WildcardQuery. Note that this can produce very slow queries on big indexes.
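A sketch of a call that passes such an options fragment. The property names used here (default-operator, phrase-slop, leading-wildcard, filter-rewrite) are given as understood from eXist-db's Lucene module and should be verified against your version; the values are arbitrary examples.

```xquery
let $options :=
    <options>
        <default-operator>and</default-operator>
        <phrase-slop>1</phrase-slop>
        <leading-wildcard>no</leading-wildcard>
        <filter-rewrite>yes</filter-rewrite>
    </options>
(: With default-operator set to "and", both terms have to occur in a matching SPEECH. :)
return //SPEECH[ft:query(., "boil bubble", $options)]
```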
This feature allows you to add arbitrary fields to a binary or XML document and have them indexed with Lucene. It was developed as part of the content extraction framework, to attach metadata extracted from, for instance, a PDF to the binary document. It works equally well for XML documents, though, and is an efficient method to attach computed fields to a document, containing information which does not exist in the XML as such.
The field indexes are not configured via
collection.xconf. Instead we add fields programmatically from an XQuery (which
could be run via a trigger):
The store attribute indicates that the field's content should be stored as a string. Without this attribute, the content will be indexed for search, but you won't be able to retrieve the contents.
To get the contents of a field, use the
To query this index, use the
Custom field indexes are automatically deleted when their parent document is removed. If you want to update fields without removing the
document, you need to delete the old fields first though. This can be done using the
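The whole life cycle might be sketched as follows. The function names ft:index, ft:get-field, ft:search and ft:remove-index are assumptions based on the Lucene module of eXist-db releases from this period and should be checked against the function documentation of your version; the document path and the field values are hypothetical.

```xquery
(: Hypothetical document path, cast to xs:anyURI for the functions that expect a URI. :)
let $path := xs:anyURI("/db/demo/test.xml")
(: Attach two stored fields to the document. :)
let $indexed :=
    ft:index($path,
        <doc>
            <field name="title" store="yes">Indexing</field>
            <field name="author" store="yes">Me</field>
        </doc>)
(: Retrieve the stored content of a field. :)
let $title := ft:get-field($path, "title")
(: Query the fields using Lucene's field:term syntax. :)
let $hits := ft:search($path, "title:indexing")
(: Delete the old fields before attaching updated ones. :)
let $removed := ft:remove-index($path)
return ($title, $hits)
```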