Content Extraction and Binary Resource Indexing

The content extraction module does not appear to be available in your eXist installation. To enable it, stop eXist, edit $EXIST_HOME/extensions/build.properties, and set the following property to true:

# Binary Content and Metadata Extraction Module
include.feature.contentextraction = true

Next, run build.sh (or build.bat on Windows) from eXist's top-level directory to build the module. The build output should show the required libraries being downloaded and installed. Restart eXist once the build has finished.

This page demonstrates how to query binary documents whose text content has been extracted and indexed with Lucene. The app defines a trigger on the "binary" collection, which sits below the "data" collection in the app's root collection. To test the indexing, upload a PDF to the "binary" collection; its contents will be extracted and indexed automatically.
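
As a rough illustration of the upload step, the following Python sketch builds an HTTP PUT request that would store a PDF in the "binary" collection through eXist's REST interface. Several things here are assumptions and not taken from this page: the app root /db/apps/demo, the server address http://localhost:8080/exist/rest, and the helper name build_upload_request are all hypothetical, so adjust them to your installation. The sketch only constructs the request; sending it, and supplying credentials for a user allowed to write to the collection, is left out.

```python
import urllib.request
from urllib.parse import quote


def build_upload_request(pdf_bytes, filename,
                         app_root="/db/apps/demo",
                         base_url="http://localhost:8080/exist/rest"):
    """Build (but do not send) a PUT request storing a PDF in <app_root>/data/binary.

    app_root and base_url are hypothetical defaults; replace them with the
    values that match your own eXist installation.
    """
    # The trigger described above watches the "binary" collection below "data".
    collection = f"{app_root.rstrip('/')}/data/binary"
    # Percent-encode the file name so spaces and similar characters are URL-safe.
    url = f"{base_url.rstrip('/')}{collection}/{quote(filename)}"
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        method="PUT",
        headers={"Content-Type": "application/pdf"},
    )


if __name__ == "__main__":
    request = build_upload_request(b"%PDF-1.4 placeholder", "my report.pdf")
    print(request.get_method(), request.full_url)
```

To actually upload, the request could be passed to urllib.request.urlopen after adding whatever authentication your server requires. If the trigger is active, the document should become searchable shortly after it is stored.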