The ISSA pipeline is designed to be open to extension. Adding a new kind of document processing requires just a few steps.
We anticipate that a new processing step would be of one of three types:
- performing indexation, i.e. associating terms with the entire text
- performing named entity recognition (NER), i.e. associating entities with an exact word or phrase in the text
- performing something else
In any case, the integration process is very similar. For example, the use-case-specific pyclinrec NER step can serve as a template for adding a new step.
Possible inputs:
- metadata: the global TSV file or separate JSON documents
- full document text (JSON)
- results of other processing steps (JSON)
Possible outputs:
- JSON files, one per document (preferred), all in the same folder
- a TSV file
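For illustration, the preferred convention of one JSON file per document can be produced with a few lines of Python. The sketch below is only an example: the output path, the field names and the helper name are assumptions rather than part of the pipeline API; what matters is keying each file on a document identifier (typically paper_id, which the MongoDB import relies on later).

import json
from pathlib import Path

# Hypothetical output folder; in the pipeline it is derived from the REL_* variable
# declared in env.sh (see below) through the step's configuration class.
OUTPUT_DIR = Path('../data/annotation/new_ner')

def save_document_output(paper_id: str, annotations: list) -> None:
    # One JSON file per document, named after the document key
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    record = {'paper_id': paper_id, 'annotations': annotations}
    with open(OUTPUT_DIR / f'{paper_id}.json', 'w', encoding='utf-8') as f:
        json.dump(record, f, ensure_ascii=False)

save_document_output('12345', [{'term': 'rice', 'start': 10, 'end': 14}])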
Update env.sh in the corresponding instance config directory with the relative path of the new step's output folder.
For example:
export REL_NEW_NER=annotation/new_ner # New annotations
👉 env.sh contains variables that are used across the ISSA pipeline and its environment
If a new processing step is developed in Python (which is preferable):
- update config.py in the corresponding instance config directory with a new configuration class derived from the cfg_annotation class, specifying input/output locations and other configurable parameters (follow the example of a similar existing step; a sketch is also given after this list)
- take advantage of the logging, file access and dictionary utility functions implemented in util.py
- if the new step can be classified as [NER](./ner/) or indexing, put its code into the respective directory; otherwise create a new folder for it.
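A minimal sketch of such a configuration class is given below. It assumes that cfg_annotation subclasses simply declare class-level attributes; the attribute names (INPUT_PATH, OUTPUT_PATH, MIN_CONFIDENCE) and the paths are illustrative and should be replaced with the ones used by a similar existing class in config.py.

# To be added to config.py of the instance config directory, next to the existing
# subclasses of cfg_annotation (the base class is already defined there).
import os   # typically already imported at the top of config.py

class cfg_new_ner(cfg_annotation):
    """Illustrative configuration for a hypothetical new NER step."""
    # Input: full-text JSON documents produced by an earlier step (assumed location)
    INPUT_PATH = '../data/json'
    # Output: one JSON file per document, in the folder declared by REL_NEW_NER in env.sh
    OUTPUT_PATH = os.path.join('../data', os.environ.get('REL_NEW_NER', 'annotation/new_ner'))
    # Any other configurable parameters of the new step
    MIN_CONFIDENCE = 0.5

The step's main script then reads everything instance-specific from this class and relies on the helpers in util.py, so that per-instance differences stay in the config directory.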
If a new step is not implemented in Python, make sure that its output files are put into the location defined in env.sh.
The transformation of the JSON or TSV output into Turtle-formatted RDF happens in two steps: loading it into MongoDB and mapping the fields of a MongoDB collection to Turtle with the xR2RML mapping language.
The mongo directory contains scripts that make this integration easy.
- for JSON output, add a line like the following to run_import.sh, where
  - new-collection-name is an arbitrary new collection name
  - document-id is the name of the JSON element that becomes the key of the collection (typically paper_id)
  - relative-path-to-output-directory is the output path defined in env.sh (see above)
  - post-load-script.js is an optional custom script that executes after the target collection is loaded and can aggregate or filter out unnecessary elements
docker exec -w $WDIR $CONTAINER \
    /bin/bash ./import-json-dir.sh \
    $DB <new-collection-name> <document-id> \
    $IDIR/<relative-path-to-output-directory> \
    $SDIR/<post-load-script.js> &>> $log
- for TSV output, add a line like the following, where
  - document-id is the column that becomes the collection key
  - new-tsv-file.tsv is the name of the file to load
docker exec -w $WDIR $CONTAINER \
    /bin/bash ./import-file.sh \
    $IDIR/<new-tsv-file.tsv> tsv \
    $DB <new-collection-name> <document-id> \
    $SDIR/<post-load-script.js> &>> $log
The only work beyond adding a line to the script is developing the optional post-load script, which requires some familiarity with MongoDB scripting.
The xR2RML directory contains tools that transform MongoDB collections into RDF using xR2RML mapping templates. The transformation templates for the existing pipeline are also stored there.
For a new kind of data, a new transformation template has to be added. The easiest way to develop such a template is to choose an existing one whose input resembles the new data and adapt it.
👉 to keep the RDF files at a manageable size, the named entity annotations can be split into separate files for title, abstract and body text.
New data should be added to the graph together with its provenance information, at minimum with the rdfs:isDefinedBy and prov:wasAttributedTo properties, as in the example below:
# Provenance
rr:predicateObjectMap [
rr:predicate rdfs:isDefinedBy;
rr:objectMap [ rr:constant issa:{{dataset}}; rr:termType rr:IRI ];
];
rr:predicateObjectMap [
rr:predicate prov:wasAttributedTo;
rr:objectMap [ rr:constant issa:Documentalist; rr:termType rr:IRI ];
]
A new step has to be defined as an Agent according to the PROV-O ontology; it can be a Person, an Organization, or a SoftwareAgent. The new agent has to be described in the provenance.ttl file.
After the template is developed, add a line to the run-transformation.sh script:
- for non-annotation data, add a line like the following, where
  - new-collection-name is the same collection name as used at the import step
  - new-xr2rml_template.tpl.ttl is the newly developed template
  - new-rdf-output.ttl is the target output file
docker_exec "Generate new output..." \
<new-xr2rml_template.tpl.ttl> \
<new-rdf-output.ttl> \
<new-collection-name>
- for NE annotations split by article part, add lines like the following, where
  - article-part is an article part such as title, abstract or body_text
./run_xr2rml_annotation.sh $DS <article-part> <new-collection-name> \
    <new-xr2rml_template.tpl.ttl> \
    $ODIR/<new-rdf-output.ttl>
docker_exec_multipart "Generate new output..." \
<new-xr2rml_template.tpl.ttl> \
<new-rdf-output-part.ttl> \
<new-collection-name>
👉 The word part in the name of the output file is important: it will be replaced with the actual annotated part such as title, abstract or body_text.
Identify a named graph into which the new triples will be uploaded. Most likely it has to be a new graph; use the existing naming convention to name it.
Determine whether this graph has to be fully or incrementally updated (most likely the latter).
Modify the import-all.isql script. Add a line:
ld_dir ('$u{IDIR}', 'new-rdf-output*.ttl', '$u{namespace}graph/new-graph-name');
In the case of a full update, add a line at the top of the script:
SPARQL CLEAR GRAPH <$u{namespace}graph/new-graph-name>;
👉 punctuation is important. Make sure that angle brackets and quotation marks are correctly applied.
If a new processing step performs indexation (i.e. associating terms with the entire text), add an execution call to the 3_index_articles.sh script.
If a processing step performs named entity recognition (NER) (i.e. associating entities with an exact word or phrase in the text), add a call to the 4_annotate_articles script.
If none of the above applies, a call can be added to run-pipeline.sh.
👉 Make sure that the new step is called after its prerequisite steps.