Stanford NER XQuery Module

An XQuery module to integrate the Stanford Named Entity Recognizer into eXist-db. The package can be installed via the package manager in the eXist dashboard or you can build it yourself.

Examples

The module provides two main functions:

  1. classify-string: takes a plain string and returns a sequence of text nodes and elements. Each recognized entity is wrapped in an element named after its category.
  2. classify-node: recursively enhances a node (text node, element or entire document) by wrapping every recognized entity in an element. The structure of the original XML is preserved.

For all functions, the first argument specifies the serialized classifier model to use. The classifier model is a binary resource that can be stored in the database and is referenced via an xs:anyURI. Several models are available; their output differs depending on the data set they were trained on. Stanford NER also allows you to train your own classifiers.
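
Because the model is an ordinary binary resource, a query can check that it is present before classifying. The following is a minimal sketch, assuming the classifier was stored under the path used in the examples of this document; the error name local:missing-classifier and the sample sentence are made up for illustration.

```xquery
xquery version "3.0";

import module namespace ner="http://exist-db.org/xquery/stanford-ner";

(: Assumed location of the classifier; adjust to where the model is stored. :)
let $path := "/db/apps/stanford-ner/resources/classifiers/english.muc.7class.distsim.crf.ser.gz"
return
    (: util:binary-doc-available is eXist-db's check for a binary resource :)
    if (util:binary-doc-available($path)) then
        ner:classify-string(xs:anyURI($path), "Edward Snowden visited Hong Kong.")
    else
        (: hypothetical error name, used only in this sketch :)
        error(xs:QName("local:missing-classifier"), concat("Classifier not found: ", $path))
```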

String Processing

classify-string takes a simple string and returns a sequence of text nodes and elements. An element is returned for each identified entity. It has the same name (in lower case) as the category reported by NER.

xquery version "3.0";

import module namespace ner="http://exist-db.org/xquery/stanford-ner";

let $classifier := xs:anyURI("/db/apps/stanford-ner/resources/classifiers/english.muc.7class.distsim.crf.ser.gz")
let $text := "PRISM was first publicly revealed when classified documents about the program were leaked to journalists of The Washington Post and The Guardian by Edward Snowden – at the time an NSA contractor – during a visit to Hong Kong."
return
    ner:classify-string($classifier, $text)
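
Since the result mixes text nodes and elements, the entities can be picked out by keeping only the elements. The sketch below assumes the same classifier path as above and uses a shortened sample sentence; the entity wrapper element and its category attribute are names invented for this example and are not produced by the module.

```xquery
xquery version "3.0";

import module namespace ner="http://exist-db.org/xquery/stanford-ner";

let $classifier := xs:anyURI("/db/apps/stanford-ner/resources/classifiers/english.muc.7class.distsim.crf.ser.gz")
let $text := "Edward Snowden travelled to Hong Kong."
(: classify-string returns text nodes and elements; keep only the elements :)
for $entity in ner:classify-string($classifier, $text)[. instance of element()]
return
    (: the element name is the category, the string value is the entity text :)
    <entity category="{local-name($entity)}">{string($entity)}</entity>
```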

Node Processing

Simple Version

The two-argument version of classify-node wraps entities into simple elements having the same name as the corresponding category reported by Stanford NER.

xquery version "3.0";

import module namespace ner="http://exist-db.org/xquery/stanford-ner";

let $classifier := xs:anyURI("/db/apps/stanford-ner/resources/classifiers/english.muc.7class.distsim.crf.ser.gz")
let $text := <p>PRISM was first publicly revealed when classified documents about the program were leaked to journalists of The Washington Post and The Guardian by Edward Snowden – at the time an NSA contractor – during a visit to Hong Kong.</p>
return
    ner:classify-node($classifier, $text)
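
The enriched node can be queried like any other XML. As a rough sketch, the following counts the entities found per category. It assumes the input paragraph contains no elements of its own, so that every descendant element of the result is an entity; the category summary element is a name chosen for this example.

```xquery
xquery version "3.0";

import module namespace ner="http://exist-db.org/xquery/stanford-ner";

let $classifier := xs:anyURI("/db/apps/stanford-ner/resources/classifiers/english.muc.7class.distsim.crf.ser.gz")
let $text := <p>Edward Snowden travelled to Hong Kong and spoke to The Guardian.</p>
let $enriched := ner:classify-node($classifier, $text)
(: $text had no child elements, so every element found here was added by NER :)
for $entity in $enriched//*
group by $category := local-name($entity)
order by $category
return
    <category name="{$category}" count="{count($entity)}"/>
```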

Using a Callback Function

You can control the output generated for each entity by providing a callback function. It is called once for every entity and should take two parameters, the category reported by NER and the content of the entity:

xquery version "3.0";

import module namespace ner="http://exist-db.org/xquery/stanford-ner";

let $classifier := xs:anyURI("/db/apps/stanford-ner/resources/classifiers/english.muc.7class.distsim.crf.ser.gz")
let $text := <p>PRISM was first publicly revealed when classified documents about the program were leaked to journalists of The Washington Post and The Guardian by Edward Snowden – at the time an NSA contractor – during a visit to Hong Kong.</p>
return
    ner:classify-node($classifier, $text, function($tag, $content) {
        <span class="{lower-case($tag)}">{$content}</span>
    })
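
A callback can also treat categories differently. The sketch below marks up only persons and passes every other entity through as plain text. It rests on two assumptions that are not confirmed here: that the category for people is named "person" once lower-cased, and that whatever the callback returns is inserted into the output as-is.

```xquery
xquery version "3.0";

import module namespace ner="http://exist-db.org/xquery/stanford-ner";

let $classifier := xs:anyURI("/db/apps/stanford-ner/resources/classifiers/english.muc.7class.distsim.crf.ser.gz")
let $text := <p>Edward Snowden travelled to Hong Kong.</p>
return
    ner:classify-node($classifier, $text, function($tag, $content) {
        (: assumed category name; check the output of your classifier model :)
        if (lower-case($tag) = "person") then
            <span class="person">{$content}</span>
        else
            (: leave all other entities unmarked :)
            text { $content }
    })
```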

Support for Other Languages

From the Stanford NLP website, you can download two additional classifier models: one for German and one for Chinese. The German model should work out of the box once you load the corresponding resource into the database, but Chinese text has to be segmented first. The module provides variants of the two XQuery functions that handle the segmentation step: ner:classify-string-cn and ner:classify-node-cn.
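
For German, the call looks the same as for English once the model has been uploaded. The following is only a sketch: the file name german-model.crf.ser.gz and its location are placeholders for wherever you stored the downloaded classifier, and the sample sentence is invented.

```xquery
xquery version "3.0";

import module namespace ner="http://exist-db.org/xquery/stanford-ner";

(: Placeholder path: replace with the location of the German model you uploaded. :)
let $classifier := xs:anyURI("/db/apps/stanford-ner/resources/classifiers/german-model.crf.ser.gz")
let $text := "Angela Merkel reiste von Berlin nach Hamburg."
return
    ner:classify-string($classifier, $text)
```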

Please read the Chinese Language Support section in the build documentation.