XML Validation

(2Q19)


eXist-db supports validation of XML documents.

There are two ways to validate documents: implicitly, when they are stored in the database, and explicitly, by calling XQuery functions.

Implicit validation

Implicit validation is executed automatically when documents are inserted into the database.

To enable implicit validation, change the eXist-db configuration by editing conf.xml. The following two items must be configured:

  • Validation mode

  • Catalog Entity Resolver

Validation mode

The validation mode can be set in the <validation> element in conf.xml:

<validation mode="auto">
  <entity-resolver>
    <catalog uri="${WEBAPP_HOME}/WEB-INF/catalog.xml"/>
  </entity-resolver>
</validation>

The mode attribute controls the validation capabilities of the (Xerces) XML parser. Its values are:

yes

Switch validation on. All XML documents will be validated. If the grammar (XML schema, DTD) document(s) cannot be resolved, the XML document is rejected.

no

(default) Switch validation off. No grammar validation is performed and all well-formed XML documents will be accepted.

auto

Validation of an XML document is performed based on the content of the document.

  • When a document contains a reference to a grammar document (XML schema or DTD), the XML parser tries to resolve this grammar and the XML document is validated against it (equivalent to mode="yes"). Again, if the grammar cannot be resolved, the XML document will be rejected.

  • When the XML document does not contain a reference to a grammar, it is parsed but not validated (equivalent to mode="no").

Catalog Entity Resolver

All grammars (XML schema, DTD) used for implicit validation must be registered with eXist using OASIS catalog files. The actual resolving is performed by the Apache xml-commons resolver library.

Catalogs can be stored on disk and/or in the database.

Multiple <catalog> entries can be configured in the <entity-resolver> child element of <validation> in conf.xml. For instance:

<validation mode="auto">
  <entity-resolver>
    <catalog uri="xmldb:exist:///db/grammar/catalog.xml"/>
    <catalog uri="${WEBAPP_HOME}/WEB-INF/catalog.xml"/>
  </entity-resolver>
</validation>

A catalog stored in the database can be addressed by a URL like xmldb:exist:///db/mycollection/catalog.xml (note the three leading slashes, which imply localhost), or by the shorter equivalent /db/mycollection/catalog.xml.

In the preceding example ${WEBAPP_HOME} can be substituted by a file:// URL pointing to the webapp directory of eXist (for instance $EXIST_HOME/etc/webapp/).

Here is an example of a catalog file:

<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <public publicId="-//PLAY//EN" uri="entities/play.dtd"/>
  <system systemId="play.dtd" uri="entities/play.dtd"/>
  <system systemId="mondial.dtd" uri="entities/mondial.dtd"/>
  <uri name="http://exist-db.org/samples/shakespeare" uri="entities/play.xsd"/>
  <uri name="http://www.w3.org/XML/1998/namespace" uri="entities/xml.xsd"/>
  <uri name="http://www.w3.org/2001/XMLSchema" uri="entities/XMLSchema.xsd"/>
  <uri name="urn:oasis:names:tc:entity:xmlns:xml:catalog" uri="entities/catalog.xsd"/>
</catalog>
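To illustrate how such catalog entries map identifiers to local files, here is a minimal Java sketch. It is not eXist's resolver: eXist uses the Apache xml-commons resolver, whereas this sketch swaps in the JDK's own javax.xml.catalog API (Java 9 or later). The class name CatalogSketch and the temporary directory are assumptions made only for the illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.catalog.Catalog;
import javax.xml.catalog.CatalogFeatures;
import javax.xml.catalog.CatalogManager;

final class CatalogSketch {

    // A shortened copy of the catalog shown above.
    static final String CATALOG =
        "<catalog xmlns='urn:oasis:names:tc:entity:xmlns:xml:catalog'>"
      + "<system systemId='play.dtd' uri='entities/play.dtd'/>"
      + "<uri name='http://exist-db.org/samples/shakespeare' uri='entities/play.xsd'/>"
      + "</catalog>";

    /** Writes the catalog to a temporary directory (an assumption of this sketch) and loads it. */
    static Catalog load() {
        try {
            Path dir = Files.createTempDirectory("catalog-sketch");
            Path file = dir.resolve("catalog.xml");
            Files.write(file, CATALOG.getBytes(StandardCharsets.UTF_8));
            return CatalogManager.catalog(CatalogFeatures.defaults(), file.toUri());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Catalog catalog = load();
        // DTDs are looked up by system (or public) identifier ...
        System.out.println(catalog.matchSystem("play.dtd"));
        // ... and XSDs by their namespace URI.
        System.out.println(catalog.matchURI("http://exist-db.org/samples/shakespeare"));
        // Identifiers without a catalog entry yield null.
        System.out.println(catalog.matchSystem("unknown.dtd"));
    }
}
```

The relative uri values are resolved against the location of the catalog file, which is why the grammar files are usually kept in a directory next to it.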

Collection configuration

Within the database the validation mode for each individual collection can be configured using collection.xconf documents (in the same way these are used for configuring indexes). These documents need to be stored in /db/system/config/db/....

The following example collection.xconf file turns implicit validation off:

<collection xmlns="http://exist-db.org/collection-config/1.0">
  <validation mode="no"/>
</collection>

Explicit validation

Explicit validation is performed through the use of provided XQuery extension functions. The following validation options are provided:

  • JAXP

  • JAXV

  • Jing

Each of these options is discussed in the following sections. Consult the function documentation for details.

JAXP

The JAXP validation functions are based on the validation capabilities of the javax.xml.parsers API. The actual validation is performed by the Xerces2 library.

When the parser encounters a reference to a grammar (either DTD or XSD) while parsing an XML document, it attempts to resolve the reference in one of the following ways:

  • By following the XSD xsi:schemaLocation or xsi:noNamespaceSchemaLocation hints

  • By following the DTD DOCTYPE SYSTEM information

  • By outsourcing the retrieval of the grammars to an XML Catalog resolver. The resolver identifies XSDs by their (target) namespace and DTDs by their PublicId.
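The underlying javax.xml.parsers behaviour can be sketched in plain Java. This is only an illustration of a validating parse, not eXist's implementation: the class name JaxpDtdSketch and the sample documents are hypothetical, and the DTD is embedded as an internal subset (instead of a DOCTYPE SYSTEM reference) so that the example needs no external files.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

final class JaxpDtdSketch {

    // Hypothetical sample grammar, embedded as an internal DTD subset.
    static final String DTD =
        "<!DOCTYPE note [\n"
      + "  <!ELEMENT note (to, body)>\n"
      + "  <!ELEMENT to (#PCDATA)>\n"
      + "  <!ELEMENT body (#PCDATA)>\n"
      + "]>\n";

    static final String VALID = DTD + "<note><to>Reader</to><body>Hello</body></note>";
    static final String INVALID = DTD + "<note><body>Hello</body></note>";

    /** Parses the document with DTD validation switched on and returns the collected errors. */
    static List<String> validate(String xml) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setValidating(true); // validate against the grammar named in the DOCTYPE
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { /* ignored in this sketch */ }
                @Override public void error(SAXParseException e) { errors.add(e.getMessage()); }
                @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Not well-formed input or a configuration problem ends up here.
            errors.add(e.toString());
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println("valid:   " + validate(VALID));
        System.out.println("invalid: " + validate(INVALID));
    }
}
```

Validity problems are reported through the error handler rather than thrown, so a validating parse only fails outright on input that is not well-formed.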

The jaxp() and jaxp-report() functions accept the following parameters:

$instance

The XML instance document, either as document node, element node, xs:anyURI or as Java file object.

$cache-grammars

Set this to true() to enable grammar caching.

$catalogs

One or more OASIS catalog files referenced as xs:anyURI. Depending on its values different resolvers will be used:

  • When passing an empty sequence (), the catalog files defined in conf.xml are used.

  • If the URI ends with .xml the specified catalog is used.

  • If the URI ends with /, it is assumed to point to a collection. Grammar files are searched for in this collection and its sub-collections. XSDs are found by their targetNamespace attribute; DTDs are found by their publicId entries in catalog files.

JAXV

The JAXV validation functions are based on the javax.xml.validation API, which was introduced in Java 5 to provide a schema-language-independent interface to validation services. Although the specification officially allows the use of additional schema languages, in practice only XML Schema can be used so far.

The jaxv() and jaxv-report() functions accept the following parameters:

$instance

The XML instance document either as document node, element node, xs:anyURI or as Java file object.

$grammars

One or more grammar files either as document nodes, element nodes, xs:anyURI, or as Java file objects.

$language

The namespace of the schema language as xs:anyURI. The following values are supported by the javax.xml.validation.SchemaFactory:

  • For XSD 1.0 either http://www.w3.org/2001/XMLSchema or http://www.w3.org/XML/XMLSchema/v1.0

  • For XSD 1.1 http://www.w3.org/XML/XMLSchema/v1.1

  • For RELAX NG 1.0 http://relaxng.org/ns/structure/1.0
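As a rough illustration of what a JAXV-style check does underneath, the following Java sketch calls javax.xml.validation directly with the XSD 1.0 language URI. It is not eXist's code: the class name JaxvSketch, the method isValid and the one-element schema are assumptions chosen to keep the example self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

final class JaxvSketch {

    // Hypothetical grammar: a single element <c> holding a decimal value.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='c' type='xs:decimal'/>"
      + "</xs:schema>";

    /** Returns true when the instance is valid according to the given XSD. */
    static boolean isValid(String instance, String grammar) {
        try {
            // W3C_XML_SCHEMA_NS_URI is http://www.w3.org/2001/XMLSchema (XSD 1.0).
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(grammar)));
            // Without a custom error handler, the first validity error is thrown.
            schema.newValidator().validate(new StreamSource(new StringReader(instance)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<c>1.5</c>", XSD));
        System.out.println(isValid("<c>aaaaaaaa</c>", XSD));
    }
}
```

The language URI passed to SchemaFactory.newInstance plays the same role as the $language parameter described above.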

Jing

The Jing validation functions are based on James Clark's Jing library. eXist uses the maintained version that is available via Google Code. The library relies on the com.thaiopensource.validate.ValidationDriver which supports a wide range of grammar types:

  • XML schema (.xsd)

  • Namespace-based Validation Dispatching Language (.nvdl)

  • RelaxNG (.rng and .rnc)

  • Schematron 1.5 (.sch)

The jing() and jing-report() functions accept the following parameters:

$instance

The XML instance document as document node, element node, xs:anyURI, or as Java file object.

$grammar

The grammar file can be referenced either as document node, element node, xs:anyURI, binary document, or as Java file object.

You can use util:binary-doc() to pass .rnc files as a binary document.

Validation report

A validation report contains the following for a valid document:

<report>
  <status>valid</status>
  <namespace>MyNameSpace</namespace>
  <duration unit="msec">106</duration>
</report>

For an invalid document the following is returned:

<report>
  <status>invalid</status>
  <namespace>MyNameSpace</namespace>
  <duration unit="msec">39</duration>
  <message level="Error" line="3" column="20">cvc-datatype-valid.1.2.1: 'aaaaaaaa' is not a valid value for 'decimal'.</message>
  <message level="Error" line="3" column="20">cvc-type.3.1.3: The value 'aaaaaaaa' of element 'c' is not valid.</message>
</report>
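The way such a report can be assembled is sketched below in Java. This is an imitation under stated assumptions, not the code behind the *-report() functions: the class ReportSketch and its helper methods are hypothetical, the namespace element is omitted, and only the element names status, duration, message and exception are taken from the reports shown here.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

final class ReportSketch {

    // Hypothetical grammar: a single element <c> holding a decimal value.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='c' type='xs:decimal'/>"
      + "</xs:schema>";

    /** Validates the instance and returns a report-like XML string. */
    static String report(String instance, String grammar) {
        StringBuilder messages = new StringBuilder();
        long start = System.currentTimeMillis();
        try {
            Validator validator = SchemaFactory
                    .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(grammar)))
                    .newValidator();
            // Collect every problem instead of stopping at the first one.
            validator.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { /* ignored in this sketch */ }
                @Override public void error(SAXParseException e) { append(messages, "Error", e); }
                @Override public void fatalError(SAXParseException e) { append(messages, "Fatal", e); }
            });
            validator.validate(new StreamSource(new StringReader(instance)));
        } catch (Exception e) {
            messages.append("  <exception><class>").append(e.getClass().getName())
                    .append("</class></exception>\n");
        }
        long duration = System.currentTimeMillis() - start;
        String status = messages.length() == 0 ? "valid" : "invalid";
        return "<report>\n  <status>" + status + "</status>\n"
             + "  <duration unit=\"msec\">" + duration + "</duration>\n"
             + messages + "</report>";
    }

    private static void append(StringBuilder out, String level, SAXParseException e) {
        out.append("  <message level=\"").append(level)
           .append("\" line=\"").append(e.getLineNumber())
           .append("\" column=\"").append(e.getColumnNumber()).append("\">")
           .append(escape(String.valueOf(e.getMessage()))).append("</message>\n");
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;");
    }

    public static void main(String[] args) {
        System.out.println(report("<c>1.5</c>", XSD));
        System.out.println(report("<c>aaaaaaaa</c>", XSD));
    }
}
```

Installing an error handler is what turns a yes/no check into a report: the validator keeps going after an error and passes the line and column of each problem to the handler.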

When something goes wrong during validation itself, you might get the following:

<?xml version='1.0'?>
<report>
  <status>invalid</status>
  <duration unit="msec">2</duration>
  <exception>
    <class>java.net.MalformedURLException</class>
    <message>unknown protocol: foo</message>
    <stacktrace>java.net.MalformedURLException: unknown protocol: foo at java.net.URL.&lt;init&gt;(URL.java:574) at java.net.URL.&lt;init&gt;(URL.java:464) at java.net.URL.&lt;init&gt;(URL.java:413) at
      org.exist.xquery.functions.validation.Shared.getStreamSource(Shared.java:140) at org.exist.xquery.functions.validation.Shared.getInputSource(Shared.java:190) at org.exist.xquery.functions.validation.Parse.eval(Parse.java:179) at
      org.exist.xquery.BasicFunction.eval(BasicFunction.java:68) at ......
    </stacktrace>
  </exception>
</report>

Grammar management

The Xerces XML parser compiles all grammar files upon first use. For efficiency, these compiled grammars are cached, which makes subsequent validation significantly faster. However, sometimes it may be desirable to clear this cache manually. For this purpose the following grammar management functions are provided:

clear-grammar-cache()

Removes all cached grammars and returns the number of grammars removed.

pre-parse-grammar(xs:anyURI)

Parses the referenced grammar and returns the namespace of the parsed XSD.

show-grammar-cache()

Returns an XML report about all cached grammars. For instance:

<report>
  <grammar type="http://www.w3.org/2001/XMLSchema">
    <Namespace>http://www.w3.org/XML/1998/namespace</Namespace>
    <BaseSystemId>file:/Users/guest/existdb/trunk/webapp//WEB-INF/entities/XMLSchema.xsd</BaseSystemId>
    <LiteralSystemId>http://www.w3.org/2001/xml.xsd</LiteralSystemId>
    <ExpandedSystemId>http://www.w3.org/2001/xml.xsd</ExpandedSystemId>
  </grammar>
  <grammar type="http://www.w3.org/2001/XMLSchema">
    <Namespace>http://www.w3.org/2001/XMLSchema</Namespace>
    <BaseSystemId>file:/Users/guest/existdb/trunk/schema/collection.xconf.xsd</BaseSystemId>
  </grammar>
</report>

The <BaseSystemId> element typically does not provide useful information.
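The idea of compiling a grammar once and reusing it can be sketched in Java as follows. This does not show eXist's cache, which relies on Xerces; it swaps in a plain map of compiled javax.xml.validation.Schema objects. The class GrammarCacheSketch, its methods and the choice of the namespace as cache key are assumptions made for the illustration.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

final class GrammarCacheSketch {

    // Hypothetical grammar used in main().
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
      + " targetNamespace='urn:example:sketch'>"
      + "<xs:element name='c' type='xs:decimal'/>"
      + "</xs:schema>";

    private final Map<String, Schema> cache = new HashMap<>();
    private final SchemaFactory factory =
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);

    /** Compiles the XSD on first use; later calls with the same key reuse the compiled grammar. */
    synchronized Schema grammar(String namespace, String xsd) {
        Schema cached = cache.get(namespace);
        if (cached == null) {
            try {
                cached = factory.newSchema(new StreamSource(new StringReader(xsd)));
            } catch (SAXException e) {
                throw new IllegalArgumentException("cannot compile grammar " + namespace, e);
            }
            cache.put(namespace, cached);
        }
        return cached;
    }

    /** Empties the cache and returns the number of grammars removed. */
    synchronized int clear() {
        int removed = cache.size();
        cache.clear();
        return removed;
    }

    public static void main(String[] args) {
        GrammarCacheSketch grammars = new GrammarCacheSketch();
        Schema first = grammars.grammar("urn:example:sketch", XSD);
        Schema second = grammars.grammar("urn:example:sketch", XSD);
        System.out.println("same compiled grammar reused: " + (first == second));
        System.out.println("grammars removed: " + grammars.clear());
    }
}
```

Clearing such a cache matters mostly after a grammar file has changed, because a cached compiled grammar would otherwise continue to be used.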

Interactive Client

The interactive shell mode of the Java Admin Client provides a simple validate command that accepts arguments similar to those of the explicit validation functions.

Special notes

  • To avoid potential deadlocking it is considered good practice to store XML instance documents and grammar documents in separate collections.

  • Explicit validation is performed by Xerces (XML Schema, DTD) and by oNVDL, the oXygen XML NVDL implementation based on Jing (XSD, RelaxNG, Schematron, and Namespace-based Validation Dispatching Language).
