Extracting Content from Binary Files

(2Q19)


The Content Extraction module extends eXist-db's XML abilities to binary files. The module contains functions that extract the content of binary files and return it as XML. In this form, the content can be queried, indexed, and manipulated. This is especially useful in conjunction with Lucene indexes.

The Content Extraction module is built on the Apache Tika library. Tika understands a large variety of formats, ranging from PDF documents to spreadsheets and image metadata.

Usage

To import the module, use the following import statement:

import module namespace content="http://exist-db.org/xquery/contentextraction"
    at "java:org.exist.contentextraction.xquery.ContentExtractionModule";

The module provides three functions:

content:get-metadata($binary as xs:base64Binary) as document-node()
content:get-metadata-and-content($binary as xs:base64Binary) as document-node()
content:stream-content($binary as xs:base64Binary, $paths as xs:string*, $callback as function, $namespaces as element()?, $userData as item()*) as empty-sequence()

The first two functions need little explanation: get-metadata() just returns some metadata extracted from the resource, while get-metadata-and-content() will also provide the text body of the resource (if any). The third function is a streaming variant of the other two and is used to process larger resources whose content may not fit into memory.

All functions produce XHTML. The metadata is contained in the HTML head; the content goes into the body. The structure of the body HTML varies depending on the media type of the binary file. For example, the HTML resulting from a PDF is a sequence of <div> elements, one per page. That of a word processing document is often a sequence of paragraphs.
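
As a minimal sketch of how the extracted XHTML might be inspected, the following query reads a binary resource with util:binary-doc() and summarizes the result. The path /db/apps/demo/sample.pdf is a hypothetical example, not a resource shipped with eXist-db, and the wildcard prefix (*:) is used so the sketch does not depend on the exact namespace of the returned elements.

```xquery
xquery version "3.1";

import module namespace content="http://exist-db.org/xquery/contentextraction"
    at "java:org.exist.contentextraction.xquery.ContentExtractionModule";

(: Hypothetical path: replace with a binary resource stored in your database :)
let $binary := util:binary-doc("/db/apps/demo/sample.pdf")
let $xhtml := content:get-metadata-and-content($binary)
return
    <extracted>
        <!-- metadata entries found in the XHTML head -->
        <metadata>{ $xhtml//*:head/*:meta }</metadata>
        <!-- for a PDF, each top-level div in the body corresponds to one page -->
        <pages>{ count($xhtml//*:body/*:div) }</pages>
    </extracted>
```

For other media types the body may contain paragraphs rather than per-page <div> elements, so the page count is only meaningful for PDF input.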

Storage and Indexing Strategies

While you could store the HTML returned by the content extraction functions as an XML resource in the database, this may not be efficient. For example, a document search application may not need to retain the extracted HTML.

In such cases the ft:index() function from the full text indexing module can be useful. This function allows users to associate a full text index with any database resource, be it binary or XML. The index will be linked to the resource.

To create an index, call the function with the following arguments:

  1. The path of the resource to which the index should be linked as a string.

  2. An XML fragment describing the fields you want to add and the text content to index.

For example, to associate an index with the document test.txt, call ft:index() as follows:

ft:index("/db/apps/demo/test.txt", <doc>
    <field name="title" store="yes">Indexing</field>
    <field name="para" store="yes">This is the first paragraph.</field>
    <field name="para" store="yes">And a second paragraph.</field>
</doc>)

This creates a Lucene index document, indexes the content using the configured analyzers, and links it to the eXist document with the given path. You may link more than one Lucene document to the same eXist resource. The field elements map to Lucene fields. You can use as many fields as you want or add multiple fields with the same name.

The store="yes" attribute tells the indexer to also store the text string, so you can retrieve it later.
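
The two pieces can be combined: extract the text of a binary resource and pass it straight to ft:index() without storing the intermediate XHTML. The following is a sketch under assumptions: the path /db/apps/demo/sample.pdf and the field names title and page are hypothetical, and the per-page <div> structure applies to PDF input only.

```xquery
xquery version "3.1";

import module namespace content="http://exist-db.org/xquery/contentextraction"
    at "java:org.exist.contentextraction.xquery.ContentExtractionModule";

(: Hypothetical path: a PDF stored as a binary resource :)
let $path := "/db/apps/demo/sample.pdf"
let $xhtml := content:get-metadata-and-content(util:binary-doc($path))
let $fields :=
    <doc>
        <!-- take the first title in the head, if there is one -->
        <field name="title" store="yes">{ string(($xhtml//*:head/*:title)[1]) }</field>
        {
            (: one "page" field per top-level div in the body :)
            for $page in $xhtml//*:body/*:div
            return
                <field name="page" store="yes">{ normalize-space($page) }</field>
        }
    </doc>
return
    ft:index($path, $fields)
```

Storing every page with store="yes" makes the text retrievable in search results but increases the size of the index; omit the attribute on fields you only need to search.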

To query the created index, use the ft:search() function:

ft:search("/db/apps/demo/test.txt", "para:paragraph AND title:indexing")

The first parameter is the path to the resource or collection to query. The second specifies a Lucene query string. Note how each query term is prefixed with the name of its field.

Executing this query returns:

<results>
    <search uri="/db/apps/demo/test.txt" score="6.3111067">
        <field name="para">This is the first
            <exist:match>paragraph</exist:match>.</field>
        <field name="para">And a second
            <exist:match>paragraph</exist:match>.</field>
        <field name="title"><exist:match>Indexing</exist:match></field>
    </search>
</results>

Each matching resource is described by a search element. The score attribute expresses the relevance Lucene computed for the resource (the higher the better). Within the search element, every field which contributed to the query result is returned, but only if store="yes" was defined for this field at indexing time.
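
Because the result is ordinary XML, it can be post-processed with a FLWOR expression. The sketch below queries a whole collection and lists hits by descending score. The output element names hit and snippet are made up for illustration, and the query assumes the results are returned in no namespace, as shown in the sample output above.

```xquery
xquery version "3.1";

(: Query every indexed resource below the collection :)
let $results := ft:search("/db/apps/demo/", "para:paragraph")
for $hit in $results/search
order by xs:double($hit/@score) descending
return
    <hit uri="{ $hit/@uri }" score="{ $hit/@score }">{
        (: only fields indexed with store="yes" appear here :)
        for $field in $hit/field
        return
            <snippet field="{ $field/@name }">{ normalize-space($field) }</snippet>
    }</hit>
```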

Note how the matches in the text are enclosed in <exist:match> elements, just as if you did a full text query on an XML document. This makes it easy to post-process the query result, for example to create a keywords-in-context (KWIC) display using eXist's standard KWIC module.
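
A keywords-in-context display might be produced roughly as follows. This is a sketch, assuming that kwic:summarize() accepts the returned field elements in the same way it accepts a hit from a query on an XML document; the width of 40 characters is an arbitrary choice.

```xquery
xquery version "3.1";

import module namespace kwic="http://exist-db.org/xquery/kwic";

for $hit in ft:search("/db/apps/demo/test.txt", "para:paragraph")/search
(: each stored field carries its own exist:match markers :)
for $field in $hit/field[@name = "para"]
return
    kwic:summarize($field, <config width="40"/>)
```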

The document the index is linked to does not need to be a binary resource. You can also create additional indexes on XML documents. This is useful because it allows you to index and query information that is not directly contained in the XML itself. For example, you could add metadata fields and retrieve them later using ft:get-field(). Or you could use fields to pre-process and normalize information already present in the XML to speed up later access.
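
As an illustration of attaching metadata to an XML document, the sketch below adds a stored field and reads it back with ft:get-field(). The path /db/apps/demo/article.xml and the field name status are hypothetical, and the example assumes the two-argument form of ft:index() makes the field available to the following call.

```xquery
xquery version "3.1";

(: Hypothetical path: an XML document already stored in the database :)
let $path := "/db/apps/demo/article.xml"
(: store="yes" is required, otherwise the field value cannot be retrieved :)
let $indexed :=
    ft:index($path,
        <doc>
            <field name="status" store="yes">reviewed</field>
        </doc>)
return
    ft:get-field($path, "status")
```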