XML Validation
(2Q19)
eXist-db supports validation of XML documents.
There are two ways to validate documents:
- Implicit validation happens automatically when documents are inserted into the database.
- Explicit validation is performed by calling one of the provided XQuery extension functions.
Implicit validation
Implicit validation is executed automatically when documents are inserted into the database.
To enable implicit validation, change the eXist-db configuration by editing conf.xml. The following two items must be configured:
- Validation mode
- Catalog Entity Resolver
Validation mode
The validation mode is set on the <validation> element in conf.xml:
<validation mode="auto">
    <entity-resolver>
        <catalog uri="${WEBAPP_HOME}/WEB-INF/catalog.xml"/>
    </entity-resolver>
</validation>
The mode attribute switches the validation capabilities of the (Xerces) XML parser. Its values are:
- yes
  Switch validation on. All XML documents will be validated. If the grammar (XML schema, DTD) document(s) cannot be resolved, the XML document is rejected.
- no
  (default) Switch validation off. No grammar validation is performed and all well-formed XML documents will be accepted.
- auto
  Validation of an XML document is performed based on the content of the document.
  - When a document contains a reference to a grammar document (XML schema or DTD), the XML parser tries to resolve this grammar and the XML document is validated against it (equivalent to mode="yes"). Again, if the grammar cannot be resolved, the XML document will be rejected.
  - When the XML document does not contain a reference to a grammar, it is parsed without validation (equivalent to mode="no").
Catalog Entity Resolver
All grammars (XML schema, DTD) used for implicit validation must be registered with eXist using OASIS catalog files. The actual resolving is performed by the Apache xml-commons resolver library.
Catalogs can be stored on disk and/or in the database.
Multiple catalog entries can be configured in the <entity-resolver> child element of <validation> in conf.xml. For instance:
<validation mode="auto">
    <entity-resolver>
        <catalog uri="xmldb:exist:///db/grammar/catalog.xml"/>
        <catalog uri="${WEBAPP_HOME}/WEB-INF/catalog.xml"/>
    </entity-resolver>
</validation>
A catalog stored in the database can be addressed by a URL like xmldb:exist:///db/mycollection/catalog.xml (note the three leading slashes, which imply localhost), or by the shorter equivalent /db/mycollection/catalog.xml.
In the preceding example, ${WEBAPP_HOME} can be substituted by a file:// URL pointing to the webapp directory of eXist (for instance $EXIST_HOME/etc/webapp/).
Here is an example of a catalog file:
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
    <public publicId="-//PLAY//EN" uri="entities/play.dtd"/>
    <system systemId="play.dtd" uri="entities/play.dtd"/>
    <system systemId="mondial.dtd" uri="entities/mondial.dtd"/>
    <uri name="http://exist-db.org/samples/shakespeare" uri="entities/play.xsd"/>
    <uri name="http://www.w3.org/XML/1998/namespace" uri="entities/xml.xsd"/>
    <uri name="http://www.w3.org/2001/XMLSchema" uri="entities/XMLSchema.xsd"/>
    <uri name="urn:oasis:names:tc:entity:xmlns:xml:catalog" uri="entities/catalog.xsd"/>
</catalog>
Collection configuration
Within the database, the validation mode for each individual collection can be configured using collection.xconf documents (in the same way these are used for configuring indexes). These documents need to be stored in /db/system/config/db/... .
The following example collection.xconf file turns implicit validation off:
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <validation mode="no"/>
</collection>
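The configuration document can also be stored from XQuery. The following is a minimal sketch, not the only way to do it. It assumes the xmldb module is available, that the query runs as a user allowed to write to /db/system/config, that /db/system/config/db already exists, and that the data collection is called mycollection (a hypothetical name).

```xquery
xquery version "3.1";

import module namespace xmldb = "http://exist-db.org/xquery/xmldb";

(: Hypothetical data collection whose validation mode should be overridden. :)
let $data-collection := "mycollection"

(: Configuration collections mirror the data hierarchy below /db/system/config/db.
   Assumption: /db/system/config/db already exists. :)
let $config-collection :=
    xmldb:create-collection("/db/system/config/db", $data-collection)

(: Same configuration as shown above: implicit validation switched off. :)
let $xconf :=
    <collection xmlns="http://exist-db.org/collection-config/1.0">
        <validation mode="no"/>
    </collection>

(: Returns the path of the stored configuration document. :)
return
    xmldb:store($config-collection, "collection.xconf", $xconf)
```

The configuration collection path mirrors the path of the data collection, so a collection /db/mycollection is configured by /db/system/config/db/mycollection/collection.xconf.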
Explicit validation
Explicit validation is performed through the use of provided XQuery extension functions. The following validation options are provided:
- JAXP
- JAXV
- Jing
Each of these options is discussed in the following sections. Consult the function documentation for details.
JAXP
The JAXP validation functions are based on the validation capabilities of the javax.xml.parsers API. The actual validation is performed by the Xerces2 library.
When a reference to a grammar (either DTD or XSD) is found while parsing an XML document, the parser attempts to resolve the grammar reference in one of the following ways:
- By following the XSD xsi:schemaLocation or xsi:noNamespaceSchemaLocation hints.
- By following the DTD DOCTYPE SYSTEM information.
- By outsourcing the retrieval of the grammars to an XML Catalog resolver. The resolver identifies XSDs by their (target) namespace. DTDs are identified by their PublicId.
The jaxp() and jaxp-report() functions accept the following parameters:
- $instance
  The XML instance document, either as document node, element node, xs:anyURI or as Java file object.
- $cache-grammars
  Set this to true() to enable grammar caching.
- $catalogs
  One or more OASIS catalog files referenced as xs:anyURI. Depending on its value, different resolvers are used:
  - When passing an empty sequence (), the catalog files defined in conf.xml are used.
  - If the URI ends with .xml, the specified catalog is used.
  - If the URI ends with /, it is assumed to point to a collection. The grammar files are searched for in this collection and its sub-collections. XSDs are found by their targetNamespace attribute, DTDs are found by their publicId entries in catalog files.
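The three kinds of $catalogs value can be illustrated with a short query. This is a sketch under stated assumptions: the functions are taken to live in the validation module (namespace http://exist-db.org/xquery/validation), and all paths below /db/mycollection are hypothetical placeholders.

```xquery
xquery version "3.1";

import module namespace validation = "http://exist-db.org/xquery/validation";

(: Hypothetical paths; replace them with resources in your own database. :)
let $instance := xs:anyURI("/db/mycollection/data/hamlet.xml")
let $catalog  := xs:anyURI("/db/mycollection/catalog.xml")
let $grammars := xs:anyURI("/db/mycollection/grammars/")

(: 1. Empty sequence: use the catalogs configured in conf.xml, no caching.
   2. URI ending in .xml: use that catalog file, with grammar caching.
   3. URI ending in /: search the collection for grammars and return a full report. :)
return
    <results>
        <conf-catalogs>{ validation:jaxp($instance, false(), ()) }</conf-catalogs>
        <explicit-catalog>{ validation:jaxp($instance, true(), $catalog) }</explicit-catalog>
        <collection-search>{ validation:jaxp-report($instance, true(), $grammars) }</collection-search>
    </results>
```

jaxp() yields only a boolean, so it suits simple checks. jaxp-report() returns the report element described in the "Validation report" section.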
JAXV
The JAXV validation functions are based on the javax.xml.validation API, which was introduced in Java 5 to provide a schema-language-independent interface to validation services. Although the specification officially allows the use of additional schema languages, in practice only XML schemas can be used so far.
The jaxv() and jaxv-report() functions accept the following parameters:
- $instance
  The XML instance document, either as document node, element node, xs:anyURI or as Java file object.
- $grammars
  One or more grammar files, either as document nodes, element nodes, xs:anyURI, or as Java file objects.
- $language
  The namespace of the schema language as xs:anyURI. The following values are supported by the JAXV SchemaFactory:
  - For XSD 1.0, either http://www.w3.org/2001/XMLSchema or http://www.w3.org/XML/XMLSchema/v1.0
  - For XSD 1.1, http://www.w3.org/XML/XMLSchema/v1.1
  - For RELAX NG 1.0, http://relaxng.org/ns/structure/1.0
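As an illustration of these parameters, the following sketch validates an in-memory element against an in-memory XSD. The schema and instance are invented for this example, and the validation module namespace is an assumption. Passing element nodes relies on the parameter description above.

```xquery
xquery version "3.1";

import module namespace validation = "http://exist-db.org/xquery/validation";

(: Illustrative grammar: a single element that must hold a decimal value. :)
let $grammar :=
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
        <xs:element name="price" type="xs:decimal"/>
    </xs:schema>

(: Illustrative instance that deliberately violates the grammar. :)
let $instance := <price>aaaaaaaa</price>

(: The third argument selects the schema language, here XSD 1.0. :)
return
    validation:jaxv-report(
        $instance,
        $grammar,
        xs:anyURI("http://www.w3.org/2001/XMLSchema")
    )
```

Stored resources can be passed in the same way, using xs:anyURI values for $instance and $grammars.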
Jing
The Jing validation functions are based on James Clark's Jing library.
eXist uses the maintained version that is available via Google Code. The library relies on the com.thaiopensource.validate.ValidationDriver, which supports a wide range of grammar types:
- XML schema (.xsd)
- Namespace-based Validation Dispatching Language (.nvdl)
- RelaxNG (.rng and .rnc)
- Schematron 1.5 (.sch)
The jing() and jing-report() functions accept the following parameters:
- $instance
  The XML instance document, either as document node, element node, xs:anyURI, or as Java file object.
- $grammar
  The grammar file, referenced either as document node, element node, xs:anyURI, binary document, or as Java file object.
You can use util:binary-doc() to pass .rnc files as binary documents.
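A short sketch of both ways of passing a grammar is shown below. The module namespaces are assumptions about the usual eXist-db bindings, and the paths below /db/mycollection are hypothetical.

```xquery
xquery version "3.1";

import module namespace validation = "http://exist-db.org/xquery/validation";
import module namespace util = "http://exist-db.org/xquery/util";

(: Hypothetical resources; replace them with your own documents. :)
let $instance := doc("/db/mycollection/data/addressbook.xml")

(: RelaxNG in XML syntax, referenced by URI. :)
let $rng := xs:anyURI("/db/mycollection/grammars/addressbook.rng")

(: RelaxNG compact syntax is not XML, so it is passed as a binary document. :)
let $rnc := util:binary-doc("/db/mycollection/grammars/addressbook.rnc")

return
    <results>
        <rng>{ validation:jing($instance, $rng) }</rng>
        <rnc>{ validation:jing-report($instance, $rnc) }</rnc>
    </results>
```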
Validation report
A validation report contains the following for a valid document:
<report>
    <status>valid</status>
    <namespace>MyNameSpace</namespace>
    <duration unit="msec">106</duration>
</report>
For an invalid document the following is returned:
<report>
    <status>invalid</status>
    <namespace>MyNameSpace</namespace>
    <duration unit="msec">39</duration>
    <message level="Error" line="3" column="20">cvc-datatype-valid.1.2.1: 'aaaaaaaa' is not a valid value for 'decimal'.</message>
    <message level="Error" line="3" column="20">cvc-type.3.1.3: The value 'aaaaaaaa' of element 'c' is not valid.</message>
</report>
When something goes wrong, you might get the following:
<?xml version='1.0'?>
<report>
    <status>invalid</status>
    <duration unit="msec">2</duration>
    <exception>
        <class>java.net.MalformedURLException</class>
        <message>unknown protocol: foo</message>
        <stacktrace>java.net.MalformedURLException: unknown protocol: foo
            at java.net.URL.&lt;init&gt;(URL.java:574)
            at java.net.URL.&lt;init&gt;(URL.java:464)
            at java.net.URL.&lt;init&gt;(URL.java:413)
            at org.exist.xquery.functions.validation.Shared.getStreamSource(Shared.java:140)
            at org.exist.xquery.functions.validation.Shared.getInputSource(Shared.java:190)
            at org.exist.xquery.functions.validation.Parse.eval(Parse.java:179)
            at org.exist.xquery.BasicFunction.eval(BasicFunction.java:68)
            at ......
        </stacktrace>
    </exception>
</report>
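Because the report is plain XML, it can be post-processed with ordinary XQuery. The sketch below turns a report into a few readable lines. The function local:summarize is a hypothetical helper written for this example, not part of eXist-db, and it relies only on the element and attribute names shown in the reports above.

```xquery
xquery version "3.1";

(: Hypothetical helper: one string for the status, one per message,
   and one per exception found in the report. :)
declare function local:summarize($report as element(report)) as xs:string+ {
    "status: " || normalize-space($report/status),
    for $m in $report/message
    return
        $m/@level || " at " || $m/@line || ":" || $m/@column
            || " " || normalize-space($m),
    for $e in $report/exception
    return
        "exception: " || $e/class || " (" || $e/message || ")"
};

(: Sample input copied from the invalid-document report above. :)
local:summarize(
    <report>
        <status>invalid</status>
        <namespace>MyNameSpace</namespace>
        <duration unit="msec">39</duration>
        <message level="Error" line="3" column="20">cvc-type.3.1.3: The value 'aaaaaaaa' of element 'c' is not valid.</message>
    </report>
)
```

In practice the argument would be the result of one of the jaxp-report(), jaxv-report() or jing-report() calls.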
Grammar management
The Xerces XML parser compiles all grammar files upon first use. For efficiency reasons these compiled grammars are cached, resulting in a significant increase in validation processing performance. However, sometimes it may be desirable to inspect or clear this cache manually. For this purpose the following grammar management functions are provided:
- clear-grammar-cache()
  Removes all cached grammars and returns the number of removed grammars.
- pre-parse-grammar(xs:anyURI)
  Parses the referenced grammar and returns the namespace of the parsed XSD.
- show-grammar-cache()
  Returns an XML report about all cached grammars. For instance:
  <report>
      <grammar type="http://www.w3.org/2001/XMLSchema">
          <Namespace>http://www.w3.org/XML/1998/namespace</Namespace>
          <BaseSystemId>file:/Users/guest/existdb/trunk/webapp//WEB-INF/entities/XMLSchema.xsd</BaseSystemId>
          <LiteralSystemId>http://www.w3.org/2001/xml.xsd</LiteralSystemId>
          <ExpandedSystemId>http://www.w3.org/2001/xml.xsd</ExpandedSystemId>
      </grammar>
      <grammar type="http://www.w3.org/2001/XMLSchema">
          <Namespace>http://www.w3.org/2001/XMLSchema</Namespace>
          <BaseSystemId>file:/Users/guest/existdb/trunk/schema/collection.xconf.xsd</BaseSystemId>
      </grammar>
  </report>
  The <BaseSystemId> element typically does not provide useful information.
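The grammar management functions can be combined in a single query, for example to warm the cache and then reset it. This is a sketch: the validation module namespace is an assumption, the grammar path is hypothetical, and the result only makes sense if the enclosed expressions are evaluated in the order written.

```xquery
xquery version "3.1";

import module namespace validation = "http://exist-db.org/xquery/validation";

(: Hypothetical grammar location. :)
let $grammar := xs:anyURI("/db/mycollection/grammars/order.xsd")

(: Compile the grammar up front so that later validations can reuse it. :)
let $parsed := validation:pre-parse-grammar($grammar)

return
    <grammar-cache>
        <parsed-namespaces>{ $parsed }</parsed-namespaces>
        <before-clear>{ validation:show-grammar-cache() }</before-clear>
        <removed>{ validation:clear-grammar-cache() }</removed>
    </grammar-cache>
```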
Interactive Client
The interactive shell mode of the Java Admin Client provides a simple validate command that accepts arguments similar to those of the explicit validation functions.
Special notes
- To avoid potential deadlocks, it is considered good practice to store XML instance documents and grammar documents in separate collections.
- Explicit validation is performed by Xerces (XML schema, DTD) and by oNVDL, the oXygen XML NVDL implementation based on Jing (XSD, RelaxNG, Schematron and Namespace-based Validation Dispatching Language).
References
- Apache xml-commons resolver
- OASIS XML Catalog Specification V1.1
- Xerces caching grammars
- jing-trang: Schema validation and conversion based on RELAX NG
- NVDL (Namespace-based Validation Dispatching Language)