Legacy Range index
This article describes eXist-db's old range index. This has been replaced by a redesigned range index (since eXist 2.2). The old index is still available for compatibility reasons. Its use is discouraged.
Range indexes provide a shortcut for the database to directly select nodes based on their typed values. They are used when matching or
comparing nodes by way of standard XPath operators and functions. Without a range index, comparison operators like
< will default to a "brute-force" inspection of the DOM, which can be extremely slow if eXist-db has to
search through maybe millions of nodes: each node has to be loaded and cast to the target type.
To see how range indexes work, consider the following fragment:
With this short inventory, the text nodes of the
<price> elements have dollar values expressed as a floating-point number, (e.g.
"299.99"), which has an XML Schema
Definition (XSD) data type of
xs:double. Using this built-in type to define a range index, we can improve the
efficiency of searches for
<price> values. During indexing, eXist-db will apply this data type selection by attempting to cast all
<price> values as double floating point numbers and add appropriate values to the index. Values that cannot be cast as double are
ignored. This range index will be used by any expression that compares
<price> to an
xs:double value. For
//item[price > 100.0]
For non-string data types, the range index provides the query engine with a more efficient method of data conversion. Instead of retrieving
the value of each selected element and casting it as a
xs:double type, the engine can evaluate the expression by using the
range index as a form of lookup index. Without an index, eXist-db has to do a full scan over all price
<price> elements, retrieve the
string values of their text node and cast them to a double number. This is a time-consuming process which also scales very badly with growing
data sets. With a proper index, eXist-db needs just a single index lookup to evaluate
price = 100.0. The range expression
price > 100.0 is processed with an index scan starting at 100.
For string data, the index will also be used by the standard functions
To illustrate this functionality, let's return to the previous example. If you define a range index of type
<name>, a query on this element to select tall bookcases using
fn:matches() will be supported by the
fn:matches will by default try to match the regular expression anywhere in the string. We
can thus speed up the query dramatically by using "^" to restrict the match to the start of the string:
If you really need to search for an exact substring in a longer text sequence, it is often better to use the NGram index instead of the
range index, i.e. use
ngram:contains() instead of
fn:contains(). Unfortunately, there's no equivalent
NGram function for
In general, three conditions must be met in order to optimize a search using a range index:
The range index must be defined on all items in the input sequence.
For example, suppose you have two collections in the database: C1 and C2. If you have a range index defined for collection C1, but your query happens to operate on both C1 and C2, then the range index would not be used. The query optimizer selects an optimization strategy based on the entire input sequence of the query. Since, in this example, since only nodes in C1 have a range index, no range index optimization would be applied.
The index data type (first argument type) must match the test data type (second argument type).
In other words, with range indexes, there is no promotion of data types (i.e. no data type precedes or replaces another data type). For example, if you defined an index of type
<price>, a query that compares this element's value with a string literal would not use a range index, for instance:
//item[price = '1000.0']
In order to apply the range index, you would need to cast the value as a type
//item[price = xs:double($price)] (where $price is any test value)
Similarly, when we compare
xs:integervalues, as in, for instance:
//item[price = 1000]
the range index would again not be used since the
<price>data type differs from the test value type, although this conflict might not seem as obvious as it is with string values.
The right-hand argument has no dependencies on the current context item.
That is, the test or conditional value must not depend on the value against which it is being tested. For example, range indexes will not be applied given the following expression:
//item[price = self]
Concerning range indexes on strings there's another restriction to be considered: up to version 1.3, range indexes on strings can only be used with the default Unicode collation. Also, string indexes will always be case sensitive (while n-gram and full text indexes are not). It is not (yet) possible to define a string index on a different collation (e.g. for German or French) or to make it case-insensitive.
Range index configuration is done in
collection.xconf documents (see indexing):
A range index is configured by adding a
<create> element directly below the root
<index> element. As explained above, the
node to be indexed is either specified through a
path or a
Unlike the new range index, the
<create> elements of the old range index are
not wrapped inside a
As range indexes are type specific, the
type attribute is always required. The type should be one of the atomic XML
schema types, currently including
xs:integer and its derived types
xs:dateTime. If the name of the type is unknown, the index
configuration will be ignored and you will get a warning written into the logs.
Please note that the index configuration will only apply to the node specified via the
attribute, not to descendants of that node. Consider a mixed content element like:
If an index is defined on
<mixed>, the key for the index is built from the concatenated text nodes of element
all its descendants, so
uneven in this case. The created index will only be used to evaluate queries on
<mixed>, but not
for queries on
<s>. However, you can create an additional index on
<s> without getting into conflict with the existing index
It is important to note the difference between the
qname attributes used in above examples.
Both attributes are used to define the elements or attributes to which the index should apply. However, the
creates context-dependant indexes, while the
qname attribute does not.
path attribute takes a simple path expression:
path expression looks like XPath, but it's not. Index path syntax uses the following components to construct paths:
Elements are specified by their qname
Attributes are specified by
@attributeName, so if the attribute is called
attrib1, one uses
@attrib1in the index specification.
Child nodes are selected using the forward-slash (
All descendant nodes in a tree are selected using the double forward-slash (
The example above creates a range index of type string on all
<title> elements which are children of
which may occur at an arbitrary position in the document tree. All other
<title> elements, e.g. those being children of
<section> nodes, are not indexed. The path expression thus defines a selective index, which is also
context-dependant: we always need look at the context of each
<title> node before we can determine if this
particular title is to be indexed or not.
This kind of context-dependant index definition helps to keep the index small, but unfortunately it makes it hard for the query optimizer to properly rewrite the expression tree without missing some nodes. The optimizer needs to make an optimization decision at compile time, where the context of an expression is unknown or at least not exactly known (read the blog article to get the whole picture). This means that some of the optimization techniques can not be applied to context-dependant indexes!
We therefore had to introduce an alternative configuration method which is not context-dependant. To keep things simple, we decided to define the index on the qname of the element or attribute alone and to ignore the context altogether:
This results in an index being created on every
<title> element found in the document node tree. Section titles will be indexed as
well as chapter or book titles.
Indexes on attributes are defined as above by prepending "@" to the attribute's name:
This defines an index on all attributes named "type", but not on elements with the same name.
Defining indexes on qnames may result in a considerably larger index, but it allows the query engine to apply all available optimization techniques, which can improve query times by an order of magnitude. As always, there's a trade-off here between performance and storage space. In many cases, the performance win can be dramatic enough to justify the increase in index size.
To be on the safe side, and to benefit from current and future improvements in the query engine, you should prefer
path, unless you really need to exclude certain nodes from indexing.