Tuning the Database

(2Q19)


This article describes strategies for improving the performance of eXist-db. It covers the main factors that constrain performance, from memory and cache settings to the way a query is constructed.

Memory settings

Java always limits the maximum amount of memory available to a process. So eXist-db will not automatically use all of the available memory on your machine.

The maximum amount of memory Java will allocate is determined by the -Xmx parameter passed to Java on the command line.

When eXist-db is started via the graphical launcher or via one of the shell or batch scripts, memory settings are read from the file $EXIST_HOME/etc/launcher.properties:

# This file contains a list of VM parameters to be passed to Java
# when eXist is started

# Minimum and maximum memory
memory.max=2048
memory.min=64

…

On Unix/Linux systems, the memory settings are applied in the .sh scripts, for instance $EXIST_HOME/bin/startup.sh, via the JAVA_OPTS environment variable. You can override this variable globally in your own shell.
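For example (a sketch assuming a Bourne-compatible shell; the values are illustrative), you could override the memory settings before starting eXist-db:

# Grant Java a 64 MB initial and a 2 GB maximum heap
export JAVA_OPTS="-Xms64m -Xmx2048m"
$EXIST_HOME/bin/startup.sh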

On Windows, this is done in the main .bat files, for instance $EXIST_HOME\bin\startup.bat.

Cache settings

Each of the core database files and indexes has a page cache. The main purpose of this cache is to make sure that the most frequently used pages of these files are kept in memory. If a file's cache is too small, eXist starts to unload pages, only to reload them a few moments later. This "thrashing" effect results in an immediate performance drop, in particular while indexing documents.

All caches share a single memory pool, whose size is determined by the attribute cacheSize in the <db-connection> section of $EXIST_HOME/etc/conf.xml:

<db-connection cacheSize="48M" collectionCache="24M" database="native" files="../data" pageSize="4096" nodesBuffer="-1"/>

The global cache manager will dynamically grant more memory to caches while they are under load and free memory used by idle caches.

The default setting for cacheSize is very conservative (48M). It is fine for smaller databases, but you may soon experience a performance drop when indexing more than a few hundred megabytes of XML data. Consider increasing cacheSize up to approximately 1/3 of the main memory available to Java (determined by the -Xmx parameter passed to the Java command line). For example, with -Xmx2048m, a cacheSize of around 680M would be reasonable. If you are running eXist-db alongside other web applications in the same servlet engine, you may need to choose a smaller setting.

Warning:

Running out of memory will crash the database, so be careful!

The cacheSize is mostly relevant for storing/updating data. The effect on query speed is negligible (unless the index caches are much too small).

If you continue to experience performance issues while storing data, revisit your index configuration. Removing unused indexes provides more room to the other indexes. In particular, the full text index can grow very fast until it becomes a bottleneck. Try to disable the default full text index (see below).

The nodesBuffer attribute can be used to set eXist-db's temporary internal buffer to a fixed size. This buffer is used during indexing to cache nodes before they are flushed to disk. The default setting (nodesBuffer="-1") means use as much memory as is available. This can be problematic if you store large documents in a multi-user environment. For a production server, it is recommended to set nodesBuffer to 1000 or less when there are many concurrent write operations.
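For example, reusing the default <db-connection> settings shown above, a production configuration with a fixed nodes buffer might look like this (a sketch; tune the values to your workload):

<db-connection cacheSize="48M" collectionCache="24M" database="native" files="../data" pageSize="4096" nodesBuffer="1000"/>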

Index configuration

Below are some notes on optimizing the different indexes. More general information about these options is available in the individual articles on each index.

Don't rely on the default behaviour

eXist-db does not index element or attribute values by default. For example, assume the following query without an index:

//SPEECH[SPEAKER = "HAMLET"]

The query engine will perform a full scan over all <SPEAKER> elements in the database. This is very slow and limits concurrency. You should at least create a global index definition (in /db/system/config/db/collection.xconf) and add range indexes for the most frequently used comparisons.
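For the example above, a minimal configuration (a sketch, following the collection.xconf syntax used throughout this article) would define a string range index on <SPEAKER>:

<collection xmlns="http://exist-db.org/collection-config/1.0">
  <index>
    <create qname="SPEAKER" type="xs:string"/>
  </index>
</collection>

Note that documents already in the collection have to be reindexed before a newly added index takes effect.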

Prefer simple index definitions

Keeping your index definitions simple makes it easier for the query optimizer to resolve dependencies.

In particular, avoid context-dependent index definitions unless you really have a reason to use them. A context-dependent index is defined on a path like /book/chapter/title, whereas a general index is defined on a simple element or attribute qname. For instance:

<collection xmlns="http://exist-db.org/collection-config/1.0">
  <index>
    <!-- Range indexes by qname -->
    <create qname="title" type="xs:string"/>
    <create qname="@ID" type="xs:string"/>
    <!-- context-dependent configuration using the path attribute: -->
    <create path="/book/title" type="xs:string"/>
  </index>
</collection>

Defining indexes on qnames may result in a larger index, but it allows the query engine to apply all available optimization techniques, which can improve query times by an order of magnitude. Replacing a context-dependent index with a simple index on a qname can therefore result in a performance boost.

Use range indexes on strongly typed data or short strings

Range indexes work with the standard XQuery operators and string functions. For instance:

//book[year = 2000]

This will always be slow without an index: eXist-db scans over every <year> element in the db, casting its string value to an integer.
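A typed range index on <year> lets the query engine answer the comparison directly from the index. A minimal definition (a sketch, following the collection.xconf syntax shown above):

<create qname="year" type="xs:integer"/>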

For queries on string content, range indexes work well for exact comparisons (author = 'Joe Doe') or regular expressions (matches(author, "^Joe.*")). You can also use a full text index for the latter.

Range indexes on strings are by default case-sensitive. To be more precise: they are sensitive to the default collation. If you need case-insensitive queries, consider using an ngram index.

Consider an n-gram index for exact substring queries on longer text sequences

While range indexes tend to become slow for substring queries (like contains(title, "XSLT 2.0")), an n-gram index is nearly as fast as a full text index. It also indexes whitespace and punctuation and is case-insensitive.

For example, ngram:contains(title, "XSLT 2.0") will only match titles containing the exact phrase "XSLT 2.0".
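To use ngram:contains, an n-gram index must be configured for the elements in question. A minimal sketch for the title example:

<collection xmlns="http://exist-db.org/collection-config/1.0">
  <index>
    <ngram qname="title"/>
  </index>
</collection>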

Choose a full-text index for tokenizable text where whitespace/punctuation is mostly irrelevant

The full text index is very fast and should be used whenever you need to query for a sequence of separate words or tokens in a longer text. It might even be faster to post-process the returned node set and filter out false matches than to use a (usually much slower) regular expression.
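As a sketch, a full text index on the <SPEECH> elements used in later examples could be configured in collection.xconf like this:

<collection xmlns="http://exist-db.org/collection-config/1.0">
  <index>
    <lucene>
      <text qname="SPEECH"/>
    </lucene>
  </index>
</collection>

A query such as //SPEECH[ft:query(., "king")] (see the group by example below) is then answered from this index.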

Writing Queries

Besides configuring your indexes for better performance, you can also optimize the way you write your queries.

Prefer short paths

eXist-db uses internal indexes to directly locate an element or attribute by name. It doesn't need to traverse the document tree. This means that direct selection of a node through a single descendant step is faster than walking the child axis. For example:

a/b/c/d/e/f

This will be slower than

a//f

The first expression requires 6 (!) index lookups, whereas the second needs just two. The same applies to the ancestor axis, for instance f/ancestor::a.

Always process the most selective filter/expression first

If you need multiple steps to select nodes from a larger node set, try to process the most selective steps first. The earlier you reduce the node set to process, the faster your query.

For example, assume we have to find publications written by "Bjarne Stroustrup", published after the year 2000:

/dblp/*[year > 2000][author = 'Bjarne Stroustrup']

The database has 568824 records matching year > 2000, but only 41 of them were written by Stroustrup. Moving the filter on the author to the front of the expression results in better performance:

/dblp/*[author = 'Bjarne Stroustrup'][year > 2000]

Avoid unnecessary nested filters

Nesting filters in an XPath expression is often required and eXist-db will process them correctly. However, unnecessary nesting should be avoided because it has a negative impact on the query optimizer.

For example: //A[B[C = "D"]] could also be written as //A[B/C = "D"] without changing the result. The variant with only one filter is easier to optimize for eXist-db.

Likewise, if you are calling one of the optimized functions (contains, matches, ft:query, etc.), make sure you do not nest them, unless really required. For instance //A[B/C[contains(., "D")]] can be rewritten as //A[contains(B/C, "D")].

Allow eXist-db to process large node sets in one step

The query engine is optimized to process a path expression in a single operation. For instance:

//A/*[B = 'C']

This is evaluated in a single operation for all context items. It doesn't make a difference if the input set comes from a single large document, includes all the documents in a specific collection or even the entire database. The logic of the operation remains the same.

However, "bad" queries can force the query engine to partition the input sequence and process it in an item-by-item mode. For example, most function calls will force the query engine into item-by-item mode:

//A/*[f:process(B) = 'C']

The query engine needs to call a (non-optimized) function for each <B> and will therefore process the entire comparison once for every context item.

There are functions to which the above does not apply. This includes most functions which operate on indexes, e.g. contains, matches, starts-with, ngram:contains, etc. These calls are optimized, eXist-db only needs to call them once to process the entire context set. For example, using ngram:contains like in //A/*[ngram:contains(B, 'C')] is ok. But //A/*[ngram:contains(f:process(B), 'C')] will again force eXist-db into step-by-step evaluation.

Prefer XPath predicates over where expressions

This is a variation of what was discussed above. Many users formulate SQL-style queries using an explicit "where" clause:

for $e in //entry 
where $e/@type = 'subject'
return $e

This could be rewritten as:

for $e in //entry[@type = 'subject'] 
return $e

The "for … where" expression forces the query engine into a step-by-step iteration over the input sequence, testing each instance of $e against the where expression. Any possible optimizations are lost.

However, the XPath predicate expression can be processed in one single step, making best use of available indexes. There are of course cases which cannot be handled without where (for instance joins between multiple documents), but you should not use where if you can replace it with a simple XPath predicate.
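As an example of a case that does need where, consider a join between two documents (a sketch with hypothetical document paths and attribute names):

for $a in doc("/db/a.xml")//entry
for $b in doc("/db/b.xml")//entry
where $a/@ref = $b/@id
return $b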

The query engine will always try to process a where clause like an equivalent XPath with predicate, but it only detects the simple cases.

Use general comparisons to compare an item to a list of alternatives

General comparisons are handy if you need to compare something to multiple values. For example, you could use an "or" to find all <b> children whose string value is either "c" or "d":

//a[b eq 'c' or b eq 'd']

A shorter and more efficient way to express this is:

//a[b = ('c', 'd')]

If an index is defined on <b>, eXist-db needs only one index lookup to find all <b> elements matching the comparison. The equivalent "or" expression needs two separate index lookups.

Use "group by"

The XQuery 3.0 group by feature is more efficient than using the distinct-values function. For example, to group the results of a query by the value of the child element <SPEAKER>, one could write:

xquery version "3.0";

let $query := "king"
let $speeches := //SPEECH[ft:query(., $query)]
for $speaker in distinct-values($speeches/SPEAKER)
let $speechBySpeaker := $speeches[SPEAKER = $speaker]
order by $speaker
return
    <speaker name="{$speaker}">
    { $speechBySpeaker }
    </speaker>

The XQuery 3.0 variant with group by is much more efficient:

let $query := "king"
for $speechBySpeaker in //SPEECH[ft:query(., $query)]
group by $speaker := $speechBySpeaker/SPEAKER
order by $speaker
return
    <speaker name="{$speaker}">
    { $speechBySpeaker }
    </speaker>

Querying multiple collections

You could use a for loop to query over multiple collections on the same level in the collection hierarchy. However, this forces the query engine to process the remaining expression once for each collection. It is more efficient to construct the initial node set once and use it as input for the main expression. For example:

for $path in ('/db/a', '/db/b')
for $result in collection($path)//test[...]
return
    ...

This will be less efficient than:

let $docs :=
    for $path in ('/db/a', '/db/b') return collection($path)
for $result in $docs//test[...]
return
    ...

Use the ancestor or parent axis instead of a top-down approach

eXist-db can navigate the ancestor axis as fast as the descendant axis. It can therefore be more efficient to build a query bottom-up instead of top-down. Here's a top-down example:

for $section in collection("/db/articles")//section
for $match in $section//p[contains(., "XML")]
return
    <match>
        <section>{$section/title/text()}</section>
        {$match}
    </match>

This query walks through a set of sections and queries each of them for paragraphs containing the string "XML". For each matching paragraph it outputs the title of the enclosing section, followed by the paragraph itself. Note that a section's title is thus repeated for every matching paragraph it contains.

The nested for loop forces the query engine into a step-by-step iteration over the section elements. We can avoid this by using a bottom-up approach:

for $match in collection("/db/articles")//section//p[contains(., "XML")]
return
    <match>
        <section>{$match/ancestor::section/title/text()}</section>
        {$match}
    </match>

Match regular expressions against the start of a string

The function fn:matches returns true if a substring of its first argument matches the regular expression. The query engine needs to scan all index entries, as the match could be at any position of an entry.

You can reduce the range of entries scanned by anchoring your pattern at the start of a string:

fn:matches($str, "^XQuery")

Use fn:id to lookup xml:id attributes

eXist-db automatically indexes all xml:id attributes and other attributes with type ID as declared in a DTD (if validation is enabled). This automatic index is used by the standard id functions and provides a fast way to look up an element. For example, id("sect1")/head works through a fast index lookup.

However, the equivalent expression //section[@xml:id = 'sect1']/head will not use the id index.

Some users have reported that larger xml:id values have a negative performance impact.

Defer output generation until really needed

When working with large result sets within a query, it is important to understand the difference between stored nodes and in-memory XML: if a node set consists of nodes stored in the database, eXist-db will in most cases not load them into memory at all. It uses lightweight references instead for most processing steps, so even large node sets do not consume much memory.

However, all new XML nodes created within an XQuery reside in memory. The constructed XML fragments need to fit into the memory available to the Java VM. If a query generates too many nodes, the XQuery watchdog (if enabled) may step in and kill it.

A typical scenario: a query selects a large number of documents from the database and then iterates through each to generate some HTML output for display. However, only the first 10 results are returned to the user, the rest is stored into an HTTP session for later viewing. In this case it is important to limit the HTML generation to those items which are actually returned. Though the source XML documents may be large, eXist-db will not load them into memory. Storing those references into a session does not consume much memory:

let $nodes := (: select some nodes in the db :)
let $session := session:set-attribute("result", $nodes) (: store result into session :)
(: only return the first 10 nodes :)
for $node in subsequence($nodes, 1, 10)
return
    (: Generate HTML for output :)
    <div>(: Create complex HTML markup using $node :)</div>

Also note that eXist-db uses lazy evaluation when constructing new XML fragments. For example:

<book>{$node/title}</book>

Assuming that $node references a node in the database, the query engine will not copy $node/title into the constructed <book> element. Only a reference is inserted. The reference will not be expanded until the fragment is serialized or queried.