An XQuery module to integrate the Stanford Named Entity Recognizer into eXist-db. The package can be installed via the package manager in the eXist dashboard or you can build it yourself.
The module provides two main functions:
classify-string
: takes a plain string and returns a sequence of text nodes and elements. Recognized entities are wrapped into
an element having the same name as the corresponding category
classify-node
: recursively enhances a node (text node, element or entire document) by wrapping all entities into an element.
The structure of the original XML is preserved.
For all functions, you specify the serialized classifier model to use as the first argument. The classifier model is a binary resource that can be stored in the database and is referenced via an xs:anyURI. Several models are available, and their output differs depending on the data set they were trained on. Stanford NER also allows you to train your own classifiers.
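Because the model is an ordinary binary resource, it can help to check that it is actually stored at the expected path before calling the module. The following is a minimal sketch, assuming eXist-db's `util:binary-doc-available` function and the classifier path used in the examples in this document. The sample sentence and the `<error>` element are illustrative only.

```xquery
xquery version "3.0";

import module namespace ner="http://exist-db.org/xquery/stanford-ner";
import module namespace util="http://exist-db.org/xquery/util";

(: Assumed location of the model; adjust to where your classifier is stored. :)
let $path := "/db/apps/stanford-ner/resources/classifiers/english.muc.7class.distsim.crf.ser.gz"
return
    if (util:binary-doc-available($path)) then
        (: The model exists as a binary resource, so pass it on as xs:anyURI. :)
        ner:classify-string(xs:anyURI($path), "Edward Snowden visited Hong Kong.")
    else
        (: Illustrative error element, not produced by the module itself. :)
        <error>Classifier model not found at {$path}</error>
```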
classify-string
takes a simple string and returns a sequence of text nodes and elements. An element is returned for each
identified entity. It has the same name (in lower case) as the category reported by NER.
    xquery version "3.0";

    import module namespace ner="http://exist-db.org/xquery/stanford-ner";

    let $classifier := xs:anyURI("/db/apps/stanford-ner/resources/classifiers/english.muc.7class.distsim.crf.ser.gz")
    let $text := "PRISM was first publicly revealed when classified documents about the program were leaked to journalists of the The Washington Post and The Guardian by Edward Snowden – at the time an NSA contractor – during a visit to Hong Kong."
    return
        ner:classify-string($classifier, $text)
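Since the result is a mixed sequence of text nodes and elements, the recognized entities can be picked out by keeping only the element items. The sketch below assumes the same classifier path as above. The shortened input sentence and the `<entity>` wrapper element are made up for illustration.

```xquery
xquery version "3.0";

import module namespace ner="http://exist-db.org/xquery/stanford-ner";

let $classifier := xs:anyURI("/db/apps/stanford-ner/resources/classifiers/english.muc.7class.distsim.crf.ser.gz")
let $text := "Edward Snowden leaked documents to The Guardian during a visit to Hong Kong."
(: Keep only the element items; the text nodes carry the unrecognized parts of the input. :)
for $entity in ner:classify-string($classifier, $text)[. instance of element()]
return
    (: The element name is the lower-cased category, its string value the entity text. :)
    <entity category="{local-name($entity)}">{string($entity)}</entity>
```

Which categories appear depends on the classifier model in use.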
The two-argument version of classify-node
wraps entities into simple elements having the same name as the corresponding category
reported by Stanford NER.
    xquery version "3.0";

    import module namespace ner="http://exist-db.org/xquery/stanford-ner";

    let $classifier := xs:anyURI("/db/apps/stanford-ner/resources/classifiers/english.muc.7class.distsim.crf.ser.gz")
    let $text := <p>PRISM was first publicly revealed when classified documents about the program were leaked to journalists of the The Washington Post and The Guardian by Edward Snowden – at the time an NSA contractor – during a visit to Hong Kong.</p>
    return
        ner:classify-node($classifier, $text)
You can control the output generated for each entity by passing a callback function as the third argument. It is called once for every entity and should take two parameters, the category reported for the entity and the content of the entity:
    xquery version "3.0";

    import module namespace ner="http://exist-db.org/xquery/stanford-ner";

    let $classifier := xs:anyURI("/db/apps/stanford-ner/resources/classifiers/english.muc.7class.distsim.crf.ser.gz")
    let $text := <p>PRISM was first publicly revealed when classified documents about the program were leaked to journalists of the The Washington Post and The Guardian by Edward Snowden – at the time an NSA contractor – during a visit to Hong Kong.</p>
    return
        ner:classify-node($classifier, $text, function($tag, $content) {
            <span class="{lower-case($tag)}">{$content}</span>
        })
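The callback can also be a named function, which keeps the mapping readable when categories need different treatment. The following sketch maps entities to TEI-style elements. The helper `local:to-tei` is hypothetical, and the category names `PERSON`, `LOCATION` and `ORGANIZATION` are assumptions about what the chosen model reports, so check them against the output of your own classifier.

```xquery
xquery version "3.0";

import module namespace ner="http://exist-db.org/xquery/stanford-ner";

declare namespace tei="http://www.tei-c.org/ns/1.0";

(: Hypothetical helper: maps an NER category to a TEI element.
   The category names are assumed, not taken from the module documentation. :)
declare function local:to-tei($tag, $content) {
    switch (upper-case($tag))
        case "PERSON" return <tei:persName>{$content}</tei:persName>
        case "LOCATION" return <tei:placeName>{$content}</tei:placeName>
        case "ORGANIZATION" return <tei:orgName>{$content}</tei:orgName>
        (: Any other category falls back to a generic element. :)
        default return <tei:name type="{lower-case($tag)}">{$content}</tei:name>
};

let $classifier := xs:anyURI("/db/apps/stanford-ner/resources/classifiers/english.muc.7class.distsim.crf.ser.gz")
let $text := <p>Edward Snowden leaked documents to The Guardian during a visit to Hong Kong.</p>
return
    (: Pass the named function as the callback via a function reference. :)
    ner:classify-node($classifier, $text, local:to-tei#2)
```

Normalizing the category with `upper-case` makes the comparison independent of the case in which the category is passed to the callback.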
From the Stanford NLP website, you can download two additional classifier models: one for German and one for Chinese. The German model should work out of the box once you load the corresponding resource into the database, whereas Chinese text has to be segmented first. We provide variants of the two XQuery functions that handle the segmentation step: ner:classify-string-cn and ner:classify-node-cn.
Please read the Chinese Language Support section in the build documentation.