Tuning the Database
(2Q19)
This article describes strategies for improving the performance of eXist-db and optimizing its efficiency. It covers the factors that constrain performance, from memory and cache settings to the way a query is constructed.
Memory settings
Java always limits the maximum amount of memory available to a process, so eXist-db will not automatically use all of the memory available on your machine. The maximum amount of memory Java will allocate is determined by the -Xmx parameter passed to Java on the command line.
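For example, to allow Java a maximum heap of 2 GB, the option would look as follows. (The startup command is only a sketch; how Java is invoked depends on your eXist-db version and installation.)

java -Xmx2048m -jar start.jar jetty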
When eXist-db is started via the graphical launcher or via one of the shell or batch scripts, memory settings are read from the file $EXIST_HOME/etc/launcher.properties:

# This file contains a list of VM parameters to be passed to Java
# when eXist is started

# Minimum and maximum memory
memory.max=2048
memory.min=64
…
For Unix/Linux systems, this is done in the .sh files, for instance $EXIST_HOME/bin/startup.sh, via the JAVA_OPTS environment variable. You can override this variable globally in your own shell.
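For example, a minimal sketch of such an override in a Bash shell (the values are only illustrative):

export JAVA_OPTS="-Xms512m -Xmx4096m"
$EXIST_HOME/bin/startup.sh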
On Windows, this is done in the main .bat files, for instance $EXIST_HOME\bin\startup.bat.
Cache settings
Each of the core database files and indexes has a page cache. The main purpose of this cache is to make sure that the most frequently used pages of these files are kept in memory. If a file's cache is too small, eXist starts to unload pages, only to reload them a few moments later. This "thrashing" effect results in an immediate performance drop, in particular while indexing documents.
All caches share a single memory pool, whose size is determined by the cacheSize attribute in the <db-connection> section of $EXIST_HOME/etc/conf.xml:

<db-connection cacheSize="48M" collectionCache="24M" database="native"
    files="../data" pageSize="4096" nodesBuffer="-1"/>
The global cache manager will dynamically grant more memory to caches while they are under load and free memory used by idle caches.
The default setting for cacheSize is very conservative (48M). It is fine for smaller databases, but you may soon experience a performance drop when indexing more than a few hundred megabytes of XML data. Consider increasing cacheSize up to approximately one third of the main memory available to Java (determined by the -Xmx parameter passed to the Java command line). If you are running eXist-db alongside other web applications in the same servlet engine, you may need to choose a smaller setting.
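For example, with -Xmx3072m passed to Java, a cacheSize of about 1 GB would follow the one-third rule. A sketch, with the other attributes copied from the default configuration shown above:

<db-connection cacheSize="1024M" collectionCache="24M" database="native"
    files="../data" pageSize="4096" nodesBuffer="-1"/>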
Warning:
Running out of memory will crash the database, so be careful!
The cacheSize setting is mostly relevant for storing and updating data. Its effect on query speed is negligible (unless the index caches are much too small).
If you continue to experience performance issues while storing data, revisit your index configuration. Removing unused indexes provides more room for the remaining ones. In particular, the full text index can grow very quickly until it becomes a bottleneck. Try disabling the default full text index (see below).
The nodesBuffer attribute can be used to set eXist-db's temporary internal buffer to a fixed size. This buffer is used during indexing to cache nodes before they are flushed to disk. The default setting (nodesBuffer="-1") means: use as much memory as is available. This can be problematic if you store large documents in a multi-user environment. For a production server, it is recommended to set nodesBuffer to 1000 or less when there are many concurrent write operations.
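For example, a sketch of this recommendation, again with the other attributes copied from the default configuration:

<db-connection cacheSize="48M" collectionCache="24M" database="native"
    files="../data" pageSize="4096" nodesBuffer="1000"/>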
Index configuration
Some notes on optimizing the different indexes. More general information on these options is available in the individual articles on each index.
Don't rely on the default behaviour
eXist-db does not index element or attribute values by default. For example, assume the following query without an index:
//SPEECH[SPEAKER = "HAMLET"]
The query engine will perform a full scan over all <SPEAKER> elements in the database. This is very slow and limits concurrency. You should at least create a global index definition (in /db/system/config/db/collection.xconf) and add range indexes for the most frequently used comparisons.
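For the query above, a minimal collection.xconf sketch defining a string range index on <SPEAKER> might look like this:

<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index>
        <!-- Range index supporting the comparison SPEAKER = "HAMLET" -->
        <create qname="SPEAKER" type="xs:string"/>
    </index>
</collection>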
Prefer simple index definitions
Keeping your index definitions simple makes it easier for the query optimizer to resolve dependencies.
In particular, avoid context-dependent index definitions unless you really have a reason to use them. A context-dependent index is defined on a path like /book/chapter/title. General indexes are defined on a simple element or attribute qname. For instance:
<collection xmlns="http://exist-db.org/collection-config/1.0">
<index>
<!-- Range indexes by qname -->
<create qname="title" type="xs:string"/>
<create qname="@ID" type="xs:string"/>
<!-- context-dependent configuration using the path attribute: -->
<create path="/book/title" type="xs:string"/>
</index>
</collection>
Defining indexes on qnames may result in a larger index, but allows the query engine to apply all available optimization techniques. This can improve query times by an order of magnitude. Replacing a context-dependent index with a simple index on a qname can therefore result in a performance boost.
Use range indexes on strongly typed data or short strings
Range indexes work with the standard XQuery operators and string functions. For instance:
//book[year = 2000]
This will always be slow without an index: eXist-db scans over every <year> element in the database, casting its string value to an integer.
For queries on string content, range indexes work well for exact comparisons (author = 'Joe Doe') or regular expressions (matches(author, "^Joe.*")). You can also use a full text index for the latter.
Range indexes on strings are by default case-sensitive. To be more precise: they are sensitive to the default collation. If you need case-insensitive queries, consider using an ngram index.
Consider an n-gram index for exact substring queries on longer text sequences
While range indexes tend to become slow for substring queries (like contains(title, "XSLT 2.0")), an n-gram index is nearly as fast as a full text index. It also indexes whitespace and punctuation and is case-insensitive.
For example, ngram:contains(title, "XSLT 2.0") will only match titles containing the exact phrase "XSLT 2.0".
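The index itself is declared in collection.xconf; a minimal sketch for the title example:

<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index>
        <!-- n-gram index on title elements -->
        <ngram qname="title"/>
    </index>
</collection>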
Choose a full-text index for tokenizable text where whitespace/punctuation is mostly irrelevant
The full text index is very fast and should be used whenever you need to query for a sequence of separate words or tokens in a longer text. It might even be faster to post-process the returned node set and filter out wrong matches than to use a (usually much slower) regular expression!
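As a sketch, a full text (Lucene) index on the <SPEECH> elements used in the examples above could be declared in collection.xconf and queried like this:

<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index>
        <lucene>
            <!-- full text index on SPEECH elements -->
            <text qname="SPEECH"/>
        </lucene>
    </index>
</collection>

//SPEECH[ft:query(., "love")]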
Writing Queries
Besides configuring your indexes for better performance, you can also optimize the way you write your queries.
Prefer short paths
eXist-db uses internal indexes to directly locate an element or attribute by name. It doesn't need to traverse the document tree. This means that direct selection of a node through a single descendant step is faster than walking the child axis. For example:
a/b/c/d/e/f
This will be slower than
a//f
The first expression requires six (!) index lookups, while the second needs just two.
The same applies to the ancestor axis, for instance f/ancestor::a.
Always process the most selective filter/expression first
If you need multiple steps to select nodes from a larger node set, try to process the most selective steps first. The earlier you reduce the node set to process, the faster your query.
For example, assume we have to find publications written by "Bjarne Stroustrup", published after the year 2000:
/dblp/*[year > 2000][author = 'Bjarne Stroustrup']
The database has 568824 records matching year > 2000, but only 41 of them were written by Stroustrup. Moving the filter on the author to the front of the expression results in better performance:
/dblp/*[author = 'Bjarne Stroustrup'][year > 2000]
Avoid unnecessary nested filters
Nesting filters in an XPath expression is often required and eXist-db will process them correctly. However, unnecessary nesting should be avoided because it has a negative impact on the query optimizer.
For example: //A[B[C = "D"]]
could also be written as //A[B/C =
"D"]
without changing the result. The variant with only one filter is easier
to optimize for eXist-db.
Likewise, if you are calling one of the optimized functions (contains, matches, ft:query, etc.), make sure you do not nest them unless really required. For instance, //A[B/C[contains(., "D")]] can be rewritten as //A[contains(B/C, "D")].
Allow eXist-db to process large node sets in one step
The query engine is optimized to process a path expression in a single operation. For instance:
//A/*[B = 'C']
This is evaluated in a single operation for all context items. It doesn't make a difference if the input set comes from a single large document, includes all the documents in a specific collection or even the entire database. The logic of the operation remains the same.
However, "bad" queries can force the query engine to partition the input sequence and process it in an item-by-item mode. For example, most function calls will force the query engine into item-by-item mode:
//A/*[f:process(B) = 'C']
The query engine needs to call a (non-optimized) function for each <B>
and will therefore process the entire comparison once for every context item.
There are functions to which the above does not apply. This includes most functions which operate on indexes, e.g. contains, matches, starts-with, ngram:contains, etc. These calls are optimized: eXist-db only needs to call them once to process the entire context set. For example, using ngram:contains as in //A/*[ngram:contains(B, 'C')] is fine. But //A/*[ngram:contains(f:process(B), 'C')] will again force eXist-db into step-by-step evaluation.
Prefer XPath predicates over where expressions
This is a variation of what was discussed above. Many users formulate SQL-style queries using an explicit "where" clause:
for $e in //entry
where $e/@type = 'subject'
return $e
This could be rewritten as:
for $e in //entry[@type = 'subject']
return $e
The "for … where
" expression forces the query engine into a
step-by-step iteration over the input sequence, testing each instance of
$e
against the where
expression. Any possible
optimizations are lost.
The XPath predicate expression, however, can be processed in a single step, making best use of the available indexes. There are of course cases which cannot be handled without where (for instance joins between multiple documents), but you shouldn't use where if you can replace it with a simple XPath predicate. The query engine will always try to process a where clause like an equivalent XPath predicate, but it only detects the simple cases.
Use general comparisons to compare an item to a list of alternatives
General comparisons are handy if you need to compare something to multiple values.
For example, you could use an "or" to find all <b> children whose string value is either "c" or "d":
//a[b eq 'c' or b eq 'd']
A shorter and more efficient way to express this is:
//a[b = ('c', 'd')]
If an index is defined on <b>, eXist-db will need only one index lookup to find all b's matching the comparison. The equivalent "or" expression needs two separate index lookups.
Use "group by"
The XQuery 3.0 group by feature is more efficient than using the distinct-values function. For example, to group and order the results of a query by the value of the child element <SPEAKER>:
xquery version "3.0";
let $query := "king"
let $speeches := //SPEECH[ft:query(., $query)]
for $speaker in distinct-values($speeches/SPEAKER)
let $speechBySpeaker := $speeches[SPEAKER = $speaker]
order by $speaker
return
<speaker name="{$speaker}">
{ $speechBySpeaker }
</speaker>
The XQuery 3.0 variant with group by is much more efficient:
let $query := "king"
for $speechBySpeaker in //SPEECH[ft:query(., $query)]
group by $speaker := $speechBySpeaker/SPEAKER
order by $speaker
return
<speaker name="{$speaker}">
{ $speechBySpeaker }
</speaker>
Querying multiple collections
You could use a for loop to query multiple collections on the same level of the collection hierarchy. However, this forces the query engine to process the remaining expression once for each collection. It is more efficient to construct the initial node set once and use it as input for the main expression. For example:
for $path in ('/db/a', '/db/b')
for $result in collection($path)//test[...]
return
...
This will be less efficient than:
let $docs :=
for $path in ('/db/a', '/db/b') return collection($path)
for $result in $docs//test[...]
return
...
Use the ancestor or parent axis instead of a top-down approach
eXist-db can navigate the ancestor axis as fast as the descendant axis. It can therefore be more efficient to build a query bottom-up instead of top-down. Here's a top-down example:
for $section in collection("/db/articles")//section
for $match in $section//p[contains(., "XML")]
return
<match>
<section>{$section/title/text()}</section>
{$match}
</match>
This query walks through a set of sections and queries each of them for paragraphs containing the string "XML". For each matching paragraph it outputs the title of the enclosing section, followed by the paragraph itself.
The nested for loop forces the query engine into a step-by-step iteration over the section elements. We can avoid this by using a bottom-up approach:
for $match in collection("/db/articles")//section//p[contains(., "XML")]
return
<match>
<section>{$match/ancestor::section/title/text()}</section>
{$match}
</match>
Match regular expressions against the start of a string
The function fn:matches returns true if a substring of its first argument matches the regular expression. The query engine therefore needs to scan all index entries, as the match could be at any position of an entry.
You can reduce the range of entries scanned by anchoring your pattern at the start of a string:
fn:matches($str, "^XQuery")
Use fn:id to lookup xml:id attributes
eXist-db automatically indexes all xml:id attributes, as well as other attributes with type ID as declared in a DTD (if validation is enabled). This automatic index is used by the standard id functions and provides a fast way to look up an element. For example, id("sect1")/head works through a fast index lookup.
However, the equivalent expression //section[@xml:id = 'sect1']/head will not use the id index.
Some users have reported that larger xml:id values have a negative performance impact.
Defer output generation until really needed
When working with large result sets within a query, it is important to understand the differences between stored nodes and in-memory XML: if a node set consists of nodes stored in the database, eXist-db will, in most cases, never load these into memory. It uses lightweight references instead for most processing steps. Even large node sets do not consume too much memory.
However, all new XML nodes created within an XQuery reside in memory. The constructed XML fragments need to fit into the memory available to the Java VM. If a query generates too many nodes, the XQuery watchdog (if enabled) may step in and kill it.
A typical scenario: a query selects a large number of documents from the database and then iterates through each to generate some HTML output for display. However, only the first 10 results are returned to the user; the rest are stored in an HTTP session for later viewing. In this case it is important to limit the HTML generation to those items which are actually returned. Though the source XML documents may be large, eXist-db will not load them into memory, and storing those references in a session does not consume much memory either:
let $nodes := (: select some nodes in the db :)
let $session := session:set-attribute("result", $nodes) (: store result into session :)
(: only return the first 10 nodes :)
for $node in subsequence($nodes, 1, 10)
return
(: Generate HTML for output :)
<div>(: Create complex HTML markup using $node :)</div>
Also note that eXist-db uses lazy evaluation when constructing new XML fragments. For example:
<book>{$node/title}</book>
Assuming that $node references a node in the database, the query engine will not copy $node/title into the constructed <book> element; only a reference is inserted. The reference will not be expanded until the fragment is serialized or queried.