eXist 2.2.RC1

We are very proud to announce the first release candidate for the next version of eXist, 2.2. The release candidate is feature complete, but not yet recommended for production use. Please let us know of any issues that you encounter so that we may resolve any unexpected bugs and finalise the release within the next month or so.

eXist 2.2 provides a new Range Index, which can accelerate your XQuery code and has proven to be up to 100 times faster than previous versions for some types of queries. This is the fastest release of eXist ever!

You can download the release from SourceForge.

# eXist 2.2.RC1

## New Range Index

2.2 features a reimplementation of the range index, the most important user-configurable index in eXist. As reported by users, some types of queries can run up to 100 times faster. The most dramatic performance increases have been observed on large data sets (with millions of documents) and queries on frequent strings. However, the new Lucene-based index also brings many benefits for those working with smaller data sets.

While updates on the previous index system did not scale well with increasing collection sizes, the new index removes those limitations, thus allowing queries and updates to scale up. As previously reported by some users, problems with slow updates and increased memory usage have disappeared since switching to the new index.

Please refer to the documentation for more information.

## Improved Crash Recovery

The new version also features a largely rewritten and simplified crash recovery, leading to a more robust recovery procedure and smaller transaction logs.

## Security

eXist 2.2.RC1 further extends its Unix permission model, and now includes setUid and setGid bits. This both allows stored XQueries to escalate permissions and enables the controlled sharing of Collection documents with groups of users. This is extremely important as it now makes it possible to call a query as an unprivileged user and have it switch to a different effective user without providing a target for attacks.

## Bug Fixes

2.2.RC1 includes numerous bug fixes. Some of the highlights are:
  • Crash Recovery - Exceptions during transaction rollback no longer cause the database recovery to be aborted; previously this was commonly seen as "page not initialized" errors.
  • Java Service Wrapper - No longer kills eXist if it takes longer than expected to shut down or start up.
  • Concurrency - Removed lockups on consecutive query invocations, and many other small fixes.
  • Memory - Memory Leaks in the full text index were fixed.
  • Optimizer - Now descends into XQuery Update expressions.
  • Java Admin Client - It is no longer possible to accidentally lock out the admin user by mis-changing her password.

XML Prague eXist Preconference

We are happy to announce the annual eXist-db users group meeting at the 'official' pre-conference day of the XML Prague 2014 conference!

The users meetup has become a tradition over the past years. It is the best opportunity to meet the eXist-db developers and the eXist-db community.

For more information, head over to our preconference page.

Redesigned Range Index How To

While the structural and full text indexes in eXist-db have seen redesigns during the past two years, range indexes were still largely unchanged. They increasingly became a bottleneck and limited scalability, at least for applications requiring frequent updates. This has changed: the current development version of eXist includes a rewritten, modularized range index. Under the hood it is based on Apache Lucene for super fast lookups. It also provides new optimizations to speed up some types of queries which failed to run efficiently with the old index.

Range indexes are extremely important in eXist-db. Without a proper index, evaluating a general comparison in a filter (like //foo[baz = "xyz"]) requires eXist to do a full scan over the context node set, checking the value of every node against the argument. This is not only slow, it also limits concurrency due to necessary locking and consumes memory for loading each of the nodes. With a well-defined index, queries will usually complete in a few milliseconds instead of taking seconds. The index allows the optimizer to rewrite the expression and process the index lookup in advance, assuming that the number of baz elements with content "xyz" is much smaller than the total number of elements.

The old range indexing code had three main issues though:

  1. Index entries were organized by collection, resulting in an unfortunate dependency between collection size and update speed. In simple words: updating or removing documents became slower as the collection grew. For a long time, the general recommendation was to split large document sets into multiple, smaller sub-collections if update speed was an issue.
  2. Queries on very frequent search strings were quite inefficient: for example, a query //term[@type = "main"][. = "xyz"] could be quite slow despite an index being defined if @type="main" occurred very often. Unfortunately this is a common use of attributes and to make it quick, you had to reformulate the query, e.g. by moving the non-selective step to the back: //term[. = "xyz"][@type = "main"].
  3. Range indexes were baked into the core of eXist-db, making maintenance and bug fixing difficult.

The rewritten range index addresses all three issues. First, indexes are now organized by document/node, so collection size no longer matters when updating an index entry. Concerning storage, the index is entirely based on Apache Lucene instead of the B+-tree which was previously used. Most range indexes tend to be strings, so why not leave the indexing to a technology like Lucene, which is known to scale well and does a highly efficient job on string processing? Since version 4, Lucene has added support for storing numeric data types and binary data into the index, so it seemed to be a perfect match for our requirements. Lucene is integrated into eXist on a rather low level with direct access to the indexes.

To address the second issue, it is now possible to combine several fields into one index definition, so the XPath above, //term[@type = "main"][. = "xyz"], can be evaluated with a single index lookup. We'll see in a minute how to define such an index.

Finally, the new range index is implemented as a pluggable module: a separate component which is not required for the core of eXist-db to work properly. For eXist, the index is a black box: it does not need to know what the index does. If the index is there, it will automatically plug itself into the indexing pipeline as well as the query engine. If it is not, eXist will fall back to default (brute force) query processing.

Index Configuration

We tried to keep the basic index configuration as backwards compatible as possible. The old range index is still supported to allow existing applications to run unchanged.

To switch the following index definition to the new range index, we simply wrap the create elements into a range element. Here's the old definition:

    <collection xmlns="http://exist-db.org/collection-config/1.0">
        <!--from Tamboti-->
        <index xmlns:mods="http://www.loc.gov/mods/v3">
            <fulltext default="none" attributes="no"/>
            <lucene>
                <text qname="mods:title"/>
            </lucene>
            <!-- Range indexes -->
            <create qname="mods:namePart" type="xs:string"/>
            <create qname="mods:dateIssued" type="xs:string"/>
            <create qname="@ID" type="xs:string"/>
        </index>
    </collection>

To use the new range index, wrap the range index definitions into a range element:

    <collection xmlns="http://exist-db.org/collection-config/1.0">
        <!--from Tamboti-->
        <index xmlns:mods="http://www.loc.gov/mods/v3">
            <fulltext default="none" attributes="no"/>
            <lucene>
                <text qname="mods:title"/>
            </lucene>
            <!-- Range indexes -->
            <range>
                <create qname="mods:namePart" type="xs:string" case="no"/>
                <create qname="mods:dateIssued" type="xs:string"/>
                <create qname="@ID" type="xs:string"/>
            </range>
        </index>
    </collection>

If you store this definition and do a reindex, you should find new index files in the webapp/WEB-INF/data/range directory (or wherever you configured your data directory to be).

Just like the old range index, the new indexes will be used automatically for general or value comparisons as well as string functions like fn:contains, fn:starts-with and fn:ends-with (fn:matches is currently not supported due to limitations in Lucene's regular expression handling).

The configuration above applies to documents using MODS, a standard for bibliographical metadata. To provide some examples, the following XPath expressions should use the created indexes:

    declare namespace mods="http://www.loc.gov/mods/v3";

    //mods:mods[mods:name/mods:namePart = "Dennis Ritchie"],
    //mods:mods[mods:originInfo/mods:dateIssued = "1978"],
    //mods:mods[mods:name/mods:namePart = "Dennis Ritchie"][mods:originInfo/mods:dateIssued = "1978"]

New Configuration Features

Case insensitive index

Add case="no" to create a case insensitive index on a string. This is a feature many users have asked for. With a case insensitive index on mods:namePart a match will also be found if you query for "dennis ritchie" instead of "Dennis Ritchie".
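As a small sketch of what this means for queries, the following expression uses a lowercase search string. It assumes the collection has been configured with case="no" on mods:namePart as in the configuration shown earlier, and that the index has been rebuilt; without that index the comparison would be case sensitive.

```xquery
xquery version "3.0";

declare namespace mods="http://www.loc.gov/mods/v3";

(: Assumes a range index on mods:namePart defined with case="no".
   With that index in place, the lowercase string is expected to
   match names stored as "Dennis Ritchie" as well. :)
//mods:mods[mods:name/mods:namePart = "dennis ritchie"]
```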

Collations

A collation changes how strings are compared. For example, you can change the strength property of the collation to ignore diacritics, accents or case. So to compare strings ignoring accents or case, you can define an index as follows:

<create qname="mods:namePart" type="xs:string" collation="?lang=en-US&amp;strength=primary"/>

Please refer to the ICU documentation (which is used by eXist) for more information on collations, strength etc.

Combining indexes

If you know you will often use a certain combination of filters, you can combine the corresponding indexes into one to further reduce query times. For example, the mods:name element has an attribute type which qualifies the name as being "personal", "corporate" or another predefined value. To speed up a query like //mods:mods[mods:name[@type = "personal"][mods:namePart = "Dennis Ritchie"]] you could create a combined index on mods:name as follows:

    <range>
        <create qname="mods:name">
            <field name="name-type" match="@type" type="xs:string"/>
            <field name="name-part" match="mods:namePart" type="xs:string"/>
        </create>
    </range>

This index will be used whenever the context of the filter expression is a mods:name and it filters on either or both of @type and mods:namePart. The advantage is that only one index lookup is required to evaluate such an expression, resulting in a huge performance boost, in particular if the combination of filters matches only a few names out of a large set!

Note that all three attributes of the field element (name, match and type) are required, with one exception for match described below. The name you give to the field can be arbitrary, but it should be unique within the index configuration document. The match attribute specifies the nodes to include in the field. It should be a simple path relative to the context element.

You can skip the match attribute if you want to index the content of the context node itself. In this case, an additional attribute: nested="yes|no" can be added to tell the indexer to skip the content of nested nodes to only index direct text children of the context node.
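A configuration along these lines might look as follows. This is only a sketch: the element name note and the field name note-text are hypothetical, and it assumes that nested="no" is the setting which restricts the index to the direct text children of the context node.

```xml
<range>
    <!-- "note" and "note-text" are hypothetical names used for illustration -->
    <create qname="note">
        <!-- no match attribute: the field indexes the content of note itself;
             nested="no" is assumed to skip text inside nested child elements -->
        <field name="note-text" type="xs:string" nested="no"/>
    </create>
</range>
```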

The index is also used if you only query one of the defined fields, e.g.: //mods:mods[mods:name[mods:namePart = "Dennis Ritchie"]]. It is important that the filter expression matches the index definition though, so the following will not be sped up by the index: //mods:mods[mods:name/mods:namePart = "Dennis Ritchie"] because the context of the filter expression here is mods:mods, not mods:name.

You can create as many combined indexes as you like, even if some of them refer to elements which are nested inside other elements having a different index. For example, to index a complete MODS record, we could create one nested index on the root element: mods:mods, and include all attributes or simple descendant elements we may want to query at the same time. mods:name - even though a child of mods:mods - is a complex element, so we want it to have a separate index as shown above. We thus define both indexes:

    <range>
        <create qname="mods:name">
            <field name="name-type" match="@type" type="xs:string"/>
            <field name="name-part" match="mods:namePart" type="xs:string"/>
        </create>
        <create qname="mods:mods">
            <field name="mods-dateIssued" match="mods:originInfo/mods:dateIssued" type="xs:string"/>
            <field name="mods-id" match="@ID" type="xs:string"/>
            <field name="mods-authority" match="@authority" type="xs:string"/>
            <field name="mods-lang" match="@lang" type="xs:string"/>
        </create>
    </range>

This allows a more complex query to be optimized: 

//mods:mods[mods:name[@type = "personal"][mods:namePart = "Dennis Ritchie"]] [mods:originInfo/mods:dateIssued = "1979"]

In this case, the mods:dateIssued lookup will be done first, even though it presumably returns more hits than the name lookup. For maximum performance it may thus still be faster to split the expression into two parts and do the name check first. Either way, average performance should be much better than with the old range index.
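One way to split the expression could look like the sketch below, which binds the result of the name lookup to a variable and applies the date filter to that smaller set. Whether the optimizer can still use an index for the second step is an assumption here, not something the description above guarantees, so it is worth measuring both variants on your own data.

```xquery
xquery version "3.0";

declare namespace mods="http://www.loc.gov/mods/v3";

(: Step 1: name check first, using the combined index on mods:name :)
let $byName :=
    //mods:mods[mods:name[@type = "personal"][mods:namePart = "Dennis Ritchie"]]
(: Step 2: filter the (presumably small) result by date :)
return
    $byName[mods:originInfo/mods:dateIssued = "1979"]
```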

The combined indexing feature was originally created to handle a type of not-so-nice XML which is hard to query efficiently. In the concrete use case, each document consisted of a larger number of parameter elements, each having nothing but a key and value, e.g.: <parameter name="key" value="value"/>. Queries usually looked like this: //parameter[@name="key"][@value="value"] and were pretty slow, even after applying some optimization tricks. Creating a combined index on @name/@value solved this issue.
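Following the same pattern as the mods:name example, a combined index for this use case could be sketched as below. The field names param-name and param-value are made up for illustration; only the element and attribute names come from the example above.

```xml
<range>
    <create qname="parameter">
        <!-- field names are arbitrary but should be unique within the configuration -->
        <field name="param-name" match="@name" type="xs:string"/>
        <field name="param-value" match="@value" type="xs:string"/>
    </create>
</range>
```

With such a definition, a query like //parameter[@name="key"][@value="value"] filters on both fields with parameter as the context, which is the situation the combined index is designed for.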

Availability

The rewritten range index is available in the develop branch of eXist-db on GitHub. It will be included in the next release once more people have reported successful adoption.

eXist-db 2.1 Released

We're proud to announce release 2.1 of eXist-db. Most notably, this version contains some important enhancements and bug fixes in critical areas like indexing, storage backend and query processing.

There are numerous improvements to the XQuery engine and query optimizer. XQuery error detection and reporting was refined. In combination with eXide, finding and fixing errors in XQuery code has become easier than ever. Note that 2.1 may find errors which went unnoticed in 2.0.

On the feature side, changes are mostly in bundled apps. eXide, the XQuery IDE, has been updated to 2.0. Next to a visual redesign it provides fullscreen mode, XQuery refactorings, quick fixes, code snippets, hints and much more.  See the webcast for more information.

A detailed change log can be found on the download page. eXist-db 2.1 is binary compatible with 2.0. Data files created by 2.0 can be read into 2.1.

Adding a Map Datatype to XQuery

Introduction

The "standard" way of passing a complex data structure in XQuery is to create an XML fragment and later query it using XPath. This approach works well most of the time, but sometimes you just can't use it:

  • wrapping stored data into an XML fragment will create a new copy of the data in memory (even though eXist-db will defer this until the data is actually used). Wrapping a large query result into an XML structure may thus use a considerable amount of memory.
  • the reference to the original document gets lost.
  • the power of higher-order functions in XQuery 3.0 makes me wish I could create data structures containing function items as values.

Maps provide a solution to the problems above. Michael Kay has posted a well thought out proposal for maps, which I decided to implement a few weeks ago.

Let's have a quick look at the map datatype as proposed by Michael and implemented in the current trunk of eXist-db. Note that this is not part of the XQuery 3.0 specification - though it is considered for later inclusion - and may be subject to change.

Creating a Map

You create a new map through either the literal syntax or the functions map:new and map:entry. Here's the literal syntax:

    let $daysOfWeek := map {
        "Sunday" := 1,
        "Monday" := 2,
        "Tuesday" := 3,
        "Wednesday" := 4,
        "Thursday" := 5,
        "Friday" := 6,
        "Saturday" := 7
    }

The keys are arbitrary atomic values while any sequence can be used as value. You are thus not limited to string keys: dates, numbers or QNames will work as well. Keys are compared for equality using the eq operator under the map's collation.

map:entry creates a map with a single key/value pair. Use this to create map items programmatically in combination with map:new (see map:new below):

map:entry("Sunday", 1)

map:new creates either an empty map or a new map from a sequence of maps. It accepts an optional collation string as second parameter:

    let $daysOfWeek := ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
    let $map := map:new(
        for $day at $pos in $daysOfWeek
        return map:entry($day, $pos),
        "?strength=primary"
    )

As you can see, the only way to create a map from a sequence programmatically is to merge single-item maps into the new map. The map implementation in eXist-db makes sure this is not too expensive (by using a lightweight wrapper for single key/value pairs).

In this example, the collation string "?strength=primary" causes keys to be compared in a case-insensitive way.
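To make the effect of the collation concrete, here is a self-contained sketch that builds the same map and then looks up keys in different cases. It uses only the functions described in this post and the syntax of the current eXist-db implementation; the variable name $days is my own. Under the "?strength=primary" collation, all three lookups are expected to hit the entry stored under "Tuesday".

```xquery
xquery version "3.0";

let $days := ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
(: Merge single-entry maps into one map whose keys are compared
   with a case-insensitive collation :)
let $map := map:new(
    for $day at $pos in $days
    return map:entry($day, $pos),
    "?strength=primary"
)
return (
    map:get($map, "Tuesday"),
    map:get($map, "tuesday"),
    map:contains($map, "TUESDAY")
)
```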

Look Up

To look up a key, use map:get:

map:get($map, "Tuesday")

But wait, there's a real cool shortcut to do a look up: a map is also a function item, which means you can directly call it as a function, passing the key to retrieve as single parameter:

$map("Tuesday")

Calling the map as a function item otherwise just behaves like map:get.

Because the empty sequence is allowed as a value, map:get does not tell you for sure if a key exists in a map or not. You can use map:contains to see if a key is present in the map:

map:contains($map, "Tuesday")
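The difference matters as soon as a key is mapped to the empty sequence. In the following sketch (the keys "present" and "missing" are arbitrary), map:get yields the empty sequence for both keys, so only map:contains can tell them apart:

```xquery
xquery version "3.0";

(: "present" is a key whose value is the empty sequence;
   "missing" is not a key of the map at all :)
let $map := map { "present" := (), "other" := 1 }
return (
    empty(map:get($map, "present")),
    empty(map:get($map, "missing")),
    map:contains($map, "present"),
    map:contains($map, "missing")
)
```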

map:keys retrieves all keys in the map as a sequence:

map:keys($map)

Please note that the order in which keys are returned is implementation-defined, so don't rely on it. In fact, eXist-db uses two different map implementations for better performance, depending on collation settings and key types.

Here's a complete example which combines the functions to access a map:

    xquery version "3.0";

    let $workDays := map {
        "Monday" := 2,
        "Tuesday" := 3,
        "Wednesday" := 4,
        "Thursday" := 5,
        "Friday" := 6
    }
    let $daysOfWeek := map:new(($workDays, map { "Sunday" := 1, "Saturday" := 7 }))
    for $day in map:keys($daysOfWeek)
    order by map:get($daysOfWeek, $day)
    return
        <day n="{$daysOfWeek($day)}" atWork="{map:contains($workDays, $day)}">{$day}</day>

Maps are Immutable

To remove a key/value pair, call

    let $newMap := map:remove($map, "Sunday")

At this point we definitely need to talk about an important feature: maps are immutable! Adding or removing a key/value pair will result in a new map. To illustrate this with an example:

    let $daysOfWeek := map {
        "Sunday" := 1,
        "Monday" := 2,
        "Tuesday" := 3,
        "Wednesday" := 4,
        "Thursday" := 5,
        "Friday" := 6,
        "Saturday" := 7
    }
    let $workDays := map:remove($daysOfWeek, "Sunday")
    return (
        map:contains($daysOfWeek, "Sunday") (: Still there :),
        map:contains($workDays, "Sunday") (: Nope :)
    )

Internally, eXist-db uses an efficient implementation of persistent immutable maps and hash tables taken from Clojure, a Lisp-like functional language for the Java VM.

Use Cases

So far I found maps to be useful in a number of scenarios:

  1. in my HTML templating framework for passing around application data between templates. In this case the sequences stored in the map can potentially be very large, e.g. if they include the result of queries into the database. Wrapping the data into an in-memory fragment would thus be a bad idea.
  2. to pass optional configuration parameters into a library module.
  3. to introduce additional levels of abstraction when working with heterogeneous data sets.

Function Items as Values

To understand the last scenario, we have to take a closer look at an important feature of maps: one can use function items as map values! For example, a library module may allow the calling module to register an optional function for resolving a resource, which only the calling module can know how to find:

    let $configuration := map {
        "resolve" := function($relPath as xs:string) {
            (: resolve resource :)
        }
    }
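On the library side, consuming such a configuration could look roughly like this. It is a sketch only: local:load, the fallback to doc() and the collection path /db/apps/demo/ are all hypothetical and stand in for whatever the real library module and application would do.

```xquery
xquery version "3.0";

(: Hypothetical library function: uses the caller-supplied "resolve"
   function if one was registered, otherwise falls back to doc() :)
declare function local:load(
    $relPath as xs:string,
    $configuration as map(xs:string, function(*))
) as item()* {
    if (map:contains($configuration, "resolve")) then
        $configuration("resolve")($relPath)
    else
        doc($relPath)
};

(: The calling module knows where its resources live :)
let $configuration := map {
    "resolve" := function($relPath as xs:string) {
        (: hypothetical base collection :)
        doc(concat("/db/apps/demo/", $relPath))
    }
}
return
    local:load("index.xml", $configuration)
```

A caller that has nothing to register could pass an empty map created with map:new() and get the fallback behaviour.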

You can even use maps and function items to simulate "objects". For example, one of my library modules has to display a short summary of documents using two different schemas: docbook and TEI. It thus needs to extract common metadata like title or author from the documents. Using maps, I could create a wrapper around the documents, which provides functions to access the data in object-oriented style:

    xquery version "3.0";

    declare namespace tei="http://www.tei-c.org/ns/1.0";
    declare namespace db="http://docbook.org/ns/docbook";

    declare function local:tei($root as element()) as map(xs:string, function(*)) {
        map {
            "title" := function() as xs:string {
                $root//tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title/string()
            }
        }
    };

    declare function local:docbook($root as element()) as map(xs:string, function(*)) {
        map {
            "title" := function() as xs:string {
                $root//db:info/db:title/string()
            }
        }
    };

    declare function local:wrap($root as element()) as map(xs:string, function(*))? {
        typeswitch ($root)
            case element(tei:TEI) return
                local:tei($root)
            case element(db:article) return
                local:docbook($root)
            default return
                ()
    };

    <ul>
    {
        for $doc in (doc("/db/db-test.xml")/*, doc("/db/tei-test.xml")/*)
        let $wrapped := local:wrap($doc)
        return
            <li>{$wrapped("title")()}</li>
    }
    </ul>

This approach has its limitations. There's no guarantee that the maps returned by local:wrap do indeed have a "title" function. XQuery is not - and was not designed to be - an object-oriented language. However, I can see that the technique could improve reusability of code libraries.
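One way to soften the first limitation is to check for the key before calling it. The helper below is a sketch of that idea, not part of the module described above: local:title-of and the placeholder string are my own, and the two sample maps only stand in for the wrappers returned by local:wrap.

```xquery
xquery version "3.0";

(: Hypothetical helper: returns the title if the wrapper exists and
   provides a "title" function, otherwise a placeholder string :)
declare function local:title-of($wrapped as map(xs:string, function(*))?) as xs:string {
    if (empty($wrapped)) then
        "(no title)"
    else if (map:contains($wrapped, "title")) then
        $wrapped("title")()
    else
        "(no title)"
};

let $withTitle := map { "title" := function() as xs:string { "Some title" } }
let $withoutTitle := map { "author" := function() as xs:string { "Someone" } }
return (
    local:title-of($withTitle),
    local:title-of($withoutTitle),
    local:title-of(())
)
```

This only guards against a missing key; it still cannot guarantee that the value stored under "title" is a zero-argument function returning a string.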

Availability

Maps as a data type are currently available in eXist-db trunk and will likely go into the final 2.0 release (only minor additions to the query engine were required). If you would like to test them right now, feel free to check out trunk.