From the Old Norse word spá or spæ referring to prophesying and which is cognate with the present English word “spy,” continuing Proto-Germanic *spah- and the Proto-Indo-European root *(s)peḱ (to see, to observe) — vǫlva (wikipedia)
Unstructured PDF documents remain the main vehicle for dissemination of scientific findings. Those interested in gathering and assimilating data must therefore manually peruse published articles and extract from these the elements of interest. Evidence-based medicine provides a compelling illustration of this: many person-hours are spent each year extracting summary information from articles that describe clinical trials. Machine learning provides a potential means of mitigating this burden by automating extraction.
But, for automated approaches to be useful to end-users, we need tools that allow domain experts to interact with, and benefit from, model predictions. To this end, we present a web-based tool called Spá that accepts an article as input and provides as output an automatically, visually annotated rendering of this article. More generally, Spá provides a framework for visualizing predictions, both at the document and sentence level, for full-text PDFs.
Spá is our client-side library for rendering and editing annotations on PDF documents. It was initially conceived to render predictions of machine learning systems trained on full-text literature from the biomedical domain.
The original design was published as “Spá: A Web-Based Viewer for Text Mining in Evidence Based Medicine” in the Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD 2014) [doi, preprint].
Later, Spá was changed to work as a git submodule for the Vortext Annotate and Vortext Demo projects.
The major components of Spá are:
- Mozilla PDF.js
- React
- Backbone.js
- RequireJS
- Hypothesis dom-anchor-bitap (experimental)
PDF.js is responsible for rendering the document.
Normally PDF.js does this by rendering the document to a `<canvas>` and putting a series of `<div>`s on top for text selection (the `textLayer`).
We replaced the `textLayer` with our own custom React component, which gives us full control over what happens in the `textLayer` without resorting to hacks.
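To make the idea of a custom text layer concrete, here is a minimal sketch of the positioning step such a layer needs. It is not Spá's actual component. It assumes text items shaped like the ones PDF.js returns from `getTextContent()` (a `str` plus a six-element `transform`), an unrotated page, and a font ascent roughly equal to the font height. The function name `layoutTextItems` and the output shape are hypothetical.

```javascript
// Hypothetical helper: turn PDF.js-style text items into absolutely
// positioned node descriptors that a React component could render.
// Assumes an unrotated page; PDF user space has its origin at the
// bottom-left, the DOM at the top-left, so the y axis is flipped.
function layoutTextItems(items, pageHeight, scale) {
  return items.map((item) => {
    // transform is [a, b, c, d, e, f]; (e, f) is the text origin.
    const [, , c, d, e, f] = item.transform;
    // Approximate the rendered font height from the matrix.
    const fontHeight = Math.hypot(c, d) * scale;
    return {
      text: item.str,
      style: {
        position: "absolute",
        left: e * scale,
        // Move from the baseline up by (approximately) the ascent.
        top: (pageHeight - f) * scale - fontHeight,
        fontSize: fontHeight,
      },
    };
  });
}

// Example: one 12pt item on a US Letter page (792pt tall) at 1.5x zoom.
const sampleItems = [{ str: "Methods", transform: [12, 0, 0, 12, 72, 700] }];
const textNodes = layoutTextItems(sampleItems, 792, 1.5);
console.log(textNodes[0].style); // left 108, top 120, fontSize 18
```

Because the descriptors are plain data, a component can attach annotation highlights to individual nodes before rendering, which is the kind of control the stock text layer does not offer.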
To maintain state we use Backbone models and collections.
We coordinate the model layer and the view layer by using contraptions we call dispatchers.
Dispatchers are defined by the projects that include Spá, not here.
The general idea is that a dispatcher listens for model changes (Backbone events) and updates the React components’ state accordingly using the `setState` or `forceUpdate` methods.
The components receive the Backbone models as props, and are allowed to call their methods to initiate change.
It’s not as pretty as Flux with immutable data structures (or ClojureScript), but it does the job for now.
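The dispatcher pattern described above can be sketched roughly as follows. To keep the example runnable without dependencies, `Model` and `Component` are small stand-ins that mimic only the parts of Backbone (`get`, `set`, `on`, a `change` event) and React (`props`, `state`, `setState`) that the pattern relies on. `createDispatcher`, the `marginalia` model and the `annotations` attribute are hypothetical names, not part of Spá's API.

```javascript
// Stand-in for a Backbone model: attributes plus a "change" event.
class Model {
  constructor(attributes = {}) {
    this.attributes = { ...attributes };
    this.listeners = {};
  }
  on(event, callback) {
    (this.listeners[event] = this.listeners[event] || []).push(callback);
    return this;
  }
  trigger(event, ...args) {
    (this.listeners[event] || []).forEach((callback) => callback(...args));
    return this;
  }
  get(key) {
    return this.attributes[key];
  }
  set(key, value) {
    if (this.attributes[key] === value) return this; // no change, no event
    this.attributes[key] = value;
    return this.trigger("change", this);
  }
}

// Stand-in for a mounted React component: props, state and setState.
class Component {
  constructor(props) {
    this.props = props;
    this.state = {};
  }
  setState(partial) {
    this.state = { ...this.state, ...partial };
  }
}

// Hypothetical dispatcher: listens for model changes and pushes the
// relevant attributes into the component's state.
function createDispatcher(model, component) {
  const sync = () =>
    component.setState({ annotations: model.get("annotations") });
  model.on("change", sync);
  sync(); // seed the component with the model's current state
}

const marginalia = new Model({ annotations: [] });
const viewer = new Component({ marginalia });
createDispatcher(marginalia, viewer);

// The component initiates change through the model it received as a prop;
// the dispatcher then carries the new value back into its state.
viewer.props.marginalia.set("annotations", ["double-blind, randomized"]);
console.log(viewer.state.annotations);
```

In a project that includes Spá, the stand-ins would be replaced by real Backbone models and React components; the one-way flow (component calls a model method, model emits an event, dispatcher updates state) stays the same.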
Spá can be used by including it in other projects and defining a dispatcher. It is not meant to be used directly. Currently the following projects use Spá:
- Vortext demo (for running predictions)
- Vortext
Currently this is a research object. The API and organizational structure are subject to change. Comments and suggestions are much appreciated. For code contributions: fork, branch, and send a pull request.
Spá is open source and licensed under GPLv3. See license for more information.