Tuning the Database
This article describes strategies for improving the performance of eXist-db and optimizing its efficiency. It covers subjects that constrain performance, from memory and cache settings to the way a query is constructed.
Java always limits the maximum amount of memory available to a process. So eXist-db will not automatically use all of the available memory on your machine.
The maximum amount of memory Java will allocate is determined by the
-Xmx parameter passed to Java on the command line.
When eXist-db is started via the graphical launcher or via one of the shell or batch
scripts, memory settings are read from the following properties file:

# This file contains a list of VM parameters to be passed to Java
# when eXist is started by double clicking on start.jar (or calling
# "java -jar start.jar" without parameters on the shell).

# Minimum and maximum memory
memory.max=2048
memory.min=64
…
For Unix systems, this is done in
$EXIST_HOME/bin/functions.d/eXist-settings.sh, with the
JAVA_OPTIONS variable. A corresponding variable does the same for the Java admin client. You can override those variables globally in your own shell.
On Windows, this is done in the main .bat files.
If you launch eXist-db as a service, all Java settings will be
controlled by the service wrapper. In this case, edit the wrapper configuration file:

# Maximum Java Heap Size (in MB)
# alternative for wrapper.java.additional.<n>=-Xmx
wrapper.java.maxmemory=NNNN
Each of the core database files and indexes has a page cache. The main purpose of this cache is to make sure that the most frequently used pages of these files are kept in memory. If a file's cache is too small, eXist starts to unload pages, only to reload them a few moments later. This "thrashing effect" results in an immediate performance drop, in particular while indexing documents.
All caches share a single memory pool, whose size is determined by the attribute
cacheSize in the
<db-connection> section of the main configuration file.
The global cache manager will dynamically grant more memory to caches while they are under load and free memory used by idle caches.
The default setting for cacheSize is very conservative (48M). It is OK for smaller databases, but you may soon experience a performance drop when indexing more than several hundred megabytes of XML data. Consider increasing cacheSize up to approximately 1/3 of the main memory available to Java (determined by the -Xmx parameter passed on the Java command line). If you are running eXist-db with other web applications in the same servlet engine, you may need to choose a smaller setting.
Running out of memory will crash the database, so be careful!
cacheSize is mostly relevant for storing/updating data. The effect
on query speed is negligible (unless the index caches are much too small).
If you continue to experience performance issues while storing data, revisit your index configuration. Removing unused indexes provides more room to the other indexes. In particular, the full text index can grow very fast until it becomes a bottleneck. Try to disable the default full text index (see below).
The nodesBuffer attribute can be used to set eXist-db's temporary
internal buffer to a fixed size. This buffer is used during indexing to cache nodes before
they are flushed to disk. The default setting (
nodesBuffer="-1") means use as
much memory as is available. This can be problematic if you store large documents in a
multi-user environment. For a production server, it is recommended to set nodesBuffer to 1000 or less when there are many concurrent write operations.
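Both settings are attributes of the <db-connection> element in eXist-db's main configuration file (typically conf.xml; treat the exact file location and the values below as assumptions to adapt). A minimal sketch:

```xml
<!-- sketch of a main-configuration fragment; values are illustrative -->
<db-connection database="native"
               files="webapp/WEB-INF/data"
               cacheSize="256M"
               nodesBuffer="1000">
    <!-- ... -->
</db-connection>
```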
eXist-db does not index element or attribute values by default. For example, assume the following query without an index:
//SPEECH[SPEAKER = "HAMLET"]
The query engine will perform a full scan over all
<SPEAKER> elements in the
database. This is very slow and limits concurrency. You should at least create a global
index definition (in
/db/system/config/db/collection.xconf) and add range
indexes for the most frequently used comparisons.
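For the query above, a global index definition might look like the following sketch (legacy range-index syntax; adapt it to your eXist-db version):

```xml
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index>
        <!-- range index on all SPEAKER elements, compared as strings -->
        <create qname="SPEAKER" type="xs:string"/>
    </index>
</collection>
```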
Keeping your index definitions simple makes it easier for the query optimizer to resolve dependencies.
In particular, avoid context-dependent index definitions unless you really have a reason
to use them. A context-dependent index is defined on a path like
/book/chapter/title. General indexes are defined on a simple element or
attribute qname. For instance:
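A sketch of the two styles inside a collection.xconf index definition (element names are illustrative):

```xml
<!-- context-dependent: only titles at this exact path are indexed -->
<create path="/book/chapter/title" type="xs:string"/>

<!-- general: every title element is indexed by qname -->
<create qname="title" type="xs:string"/>
```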
Defining indexes on qnames may result in a larger index, but allows the query engine to apply all available optimization techniques. This can improve query times by an order of magnitude. Replacing a context-dependent index with a simple index on a qname can therefore result in a performance boost.
Range indexes work with the standard XQuery operators and string functions. For instance:
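A hedged illustration of such a comparison (element names are hypothetical); a range index of type xs:integer on year would accelerate it:

```xquery
//book[year > 2000]
```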
This will always be slow without an index: eXist-db scans over every element selected by the expression, casting its string value to an integer.
For queries on string content, range indexes work well for exact comparisons (e.g. author = 'Joe Doe') or regular expressions (e.g. matches(author, "^Joe.*")). You can also use a full text index for the latter.
Range indexes on strings are by default case-sensitive. To be more precise: they are sensitive to the default collation. If you need case-insensitive queries, consider using an ngram index.
While range indexes tend to become slow for substring queries (like contains(title, "XSLT 2.0")), an n-gram index is nearly as fast as a full text index. It also indexes whitespace and punctuation and is case-insensitive: ngram:contains(title, "XSLT 2.0") will only match titles containing the exact phrase "XSLT 2.0".
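An n-gram index is configured per qname in collection.xconf; a minimal sketch (the element name is illustrative):

```xml
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index>
        <ngram qname="title"/>
    </index>
</collection>
```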
The full text index is very fast and should be used whenever you need to query for a sequence of separate words or tokens in a longer text. It might even be faster to post-process the returned node set and filter out wrong matches than to use a (usually much slower) regular expression!
eXist-db uses internal indexes to directly locate an element or attribute by name; it does not need to traverse the document tree. This means that direct selection of a node through a single descendant step is faster than walking the child axis. For example (element names are schematic):

//A/B/C/D/E/F

This will be slower than

//E/F

The first expression requires 6 (!) index lookups where the second just needs two. The same applies to the ancestor axis.
If you need multiple steps to select nodes from a larger node set, try to process the most selective steps first. The earlier you reduce the node set to process, the faster your query.
For example, assume we have to find publications written by "Bjarne Stroustrup", published after the year 2000:
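A hedged sketch of such a query (element names follow a hypothetical bibliography schema):

```xquery
//article[year > 2000][author = 'Bjarne Stroustrup']
```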
The database has 568824 records matching
year > 2000, but only 41 of
them were written by Stroustrup. Moving the filter on the author to the front of the
expression results in better performance:
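Again as a hedged sketch with the same hypothetical schema, the more selective author filter moves to the front:

```xquery
//article[author = 'Bjarne Stroustrup'][year > 2000]
```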
Nesting filters in an XPath expression is often required and eXist-db will process them correctly. However, unnecessary nesting should be avoided because it has a negative impact on the query optimizer.
//A[B[C = "D"]] could also be written as //A[B/C = "D"] without changing the result. The variant with only one filter is easier to optimize for eXist-db.
Likewise, if you are calling one of the optimized functions (ft:query, etc.), make sure you do not nest them unless really required. For instance, //A[B/C[contains(., "D")]] can be rewritten as //A[contains(B/C, "D")].
The query engine is optimized to process a path expression in a single operation. For instance:
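A minimal sketch (names follow the document's schematic A/B/C style):

```xquery
//A[B = "C"]
```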
This is evaluated in a single operation for all context items. It doesn't make a difference if the input set comes from a single large document, includes all the documents in a specific collection or even the entire database. The logic of the operation remains the same.
However, "bad" queries can force the query engine to partition the input sequence and process it in an item-by-item mode. For example, most function calls will force the query engine into item-by-item mode:
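A hedged sketch, where f:process stands for an arbitrary user-defined function:

```xquery
//A[f:process(B) = "C"]
```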
The query engine needs to call a (non-optimized) function for each
<B> and will
therefore process the entire comparison once for every context item.
There are functions to which the above does not apply. This includes most functions
which operate on indexes, e.g.
ngram:contains, etc. These calls are optimized:
eXist-db only needs to call them once to process the entire context set. For example, using
ngram:contains like in
//A/*[ngram:contains(B, 'C')] is ok. But
//A/*[ngram:contains(f:process(B), 'C')] will again force eXist-db into item-by-item mode.
This is a variation of what was discussed above. Many users formulate SQL-style queries using an explicit "where" clause:
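A hedged sketch (element names are hypothetical):

```xquery
for $e in //employee
where $e/name = 'John Doe'
return $e
```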
This could be rewritten as:
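The equivalent predicate form, using the same hypothetical schema:

```xquery
//employee[name = 'John Doe']
```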
The "for … where" expression forces the query engine into a step-by-step
iteration over the input sequence, testing each instance of
$e against the
where expression. Any possible optimizations are lost.
However, the XPath predicate expression can be processed in one single step, making best
use of available indexes. There are of course cases which cannot be handled without using
where (for instance joins between multiple documents), but you should not use
where if you can replace it with a simple XPath predicate.
The query engine will always try to process a where clause like an equivalent XPath predicate, but it only detects the simple cases.
General comparisons are handy if you need to compare something to multiple values. For
example, you could use an "or" to find all
<b> children whose string value is one of several alternatives.
A shorter and more efficient way to express this is to compare with a sequence:
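A hedged sketch of both variants (element and value names are illustrative):

```xquery
(: using "or" :)
//a[b = "value1" or b = "value2"]

(: general comparison against a sequence of values :)
//a[b = ("value1", "value2")]
```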
If an index is defined on
<b>, eXist-db will need only one index lookup to find
all b's matching the comparison. The equivalent "or" expression needs two separate index lookups.
The XQuery 3.0
group by feature is more efficient than using the
distinct-values function. For example, you may want to order the results of a query by the value of a child element.
The XQuery 3.0 variant with
group by is much more efficient:
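A hedged sketch of both variants (element names are hypothetical):

```xquery
(: distinct-values: the inner query re-scans the input for every value :)
for $title in distinct-values(//entry/title)
return <group title="{$title}">{ //entry[title = $title] }</group>

(: XQuery 3.0 group by: the input is partitioned in one pass :)
for $entry in //entry
group by $title := $entry/title
return <group title="{$title}">{ $entry }</group>
```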
You could use a
for loop to query over multiple collections on the same
level in the collection hierarchy. However, this forces the query engine to process the
remaining expression once for each collection. It is more efficient to construct the initial
node set once and use it as input for the main expression. For example:
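A hedged sketch (collection paths and element names are hypothetical):

```xquery
(: the remaining expression runs once per collection :)
for $path in ("/db/data/2022", "/db/data/2023")
return collection($path)//article[author = "Doe"]
```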
This will be less efficient than:
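The same query with the input node set built once, using the same hypothetical paths:

```xquery
let $docs := (collection("/db/data/2022"), collection("/db/data/2023"))
return $docs//article[author = "Doe"]
```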
eXist-db can navigate the ancestor axis as fast as the descendant axis. It can therefore be more efficient to build a query bottom-up instead of top-down. Here's a top-down example:
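A hedged sketch of the top-down form (element names are hypothetical):

```xquery
for $section in //section
return
    <result>
        <title>{ $section/title/text() }</title>
        { $section//para[contains(., "XML")] }
    </result>
```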
This query walks through a set of sections and queries each of them for paragraphs containing the string "XML". It then outputs the title of the section, followed by the matching paragraphs. Note that it will also return the title of all sections which do not have any matches.
The for loop forces the query engine into a step-by-step iteration
over the section elements. We can avoid this by using a bottom-up approach:
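The bottom-up form, with the same hypothetical element names, selects the matching paragraphs first and navigates up via the ancestor axis (it also avoids returning titles of sections without matches):

```xquery
for $para in //section//para[contains(., "XML")]
return
    <result>
        <title>{ $para/ancestor::section/title/text() }</title>
        { $para }
    </result>
```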
The matches function returns true if a substring of its first argument matches the regular expression given in the second. The query engine needs to scan all index entries, as the match could be at any position of an entry.
You can reduce the range of entries scanned by anchoring your pattern at the start of a string:
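A hedged sketch (names illustrative): the anchored pattern lets the engine skip index entries that cannot match:

```xquery
(: must inspect every index entry :)
//person[matches(name, "Joe")]

(: anchored: only entries starting with "Joe" need to be scanned :)
//person[matches(name, "^Joe")]
```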
eXist-db automatically indexes all
xml:id attributes and other attributes of type ID as declared in a DTD (if validation is enabled). This automatic
index is used by the standard
id functions and provides a fast way to look up
an element. For example,
id("sect1")/head works through a fast index lookup.
However, the equivalent expression
//section[@xml:id = 'sect1']/head will not use the automatic index.
Some users have reported that larger xml:id values have a negative impact on performance.
When working with large result sets within a query, it is important to understand the differences between stored nodes and in-memory XML: if a node set consists of nodes stored in the database, eXist-db will, in most cases, never load these into memory. It uses lightweight references instead for most processing steps. Even large node sets do not consume too much memory.
However, all new XML nodes created within an XQuery reside in memory. The constructed XML fragments need to fit into the memory available to the Java VM. If a query generates too many nodes, the XQuery watchdog (if enabled) may step in and kill it.
A typical scenario: a query selects a large number of documents from the database and then iterates through each to generate some HTML output for display. However, only the first 10 results are returned to the user, the rest is stored into an HTTP session for later viewing. In this case it is important to limit the HTML generation to those items which are actually returned. Though the source XML documents may be large, eXist-db will not load them into memory. Storing those references into a session does not consume much memory:
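A hedged sketch, assuming eXist-db's session module (session:set-attribute) and hypothetical element names; only the first 10 hits are expanded into HTML:

```xquery
let $hits := //article[contains(., "XML")]
return (
    (: stores node references, not copies of the documents :)
    session:set-attribute("results", $hits),
    for $hit in subsequence($hits, 1, 10)
    return <li>{ $hit/title/text() }</li>
)
```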
Also note that eXist-db uses lazy evaluation when constructing new XML fragments. For example:
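A minimal sketch of such a constructor:

```xquery
<book>{ $node/title }</book>
```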
If $node references a node in the database, the query engine will not copy $node/title into the constructed <book> element. Only a reference is inserted; the reference will not be expanded until the fragment is serialized.