The JavaScript front end for rendering text-extraction results on PDF documents

vortext/spa

From the Old Norse word spá or spæ referring to prophesying and which is cognate with the present English word “spy,” continuing Proto-Germanic *spah- and the Proto-Indo-European root *(s)peḱ (to see, to observe) — vǫlva (wikipedia)

Basic idea

Unstructured PDF documents remain the main vehicle for dissemination of scientific findings. Those interested in gathering and assimilating data must therefore manually peruse published articles and extract from these the elements of interest. Evidence-based medicine provides a compelling illustration of this: many person-hours are spent each year extracting summary information from articles that describe clinical trials. Machine learning provides a potential means of mitigating this burden by automating extraction.

But for automated approaches to be useful to end users, we need tools that allow domain experts to interact with, and benefit from, model predictions. To this end, we present a web-based tool called Spá that accepts an article as input and outputs a rendering of that article with automatically generated visual annotations. More generally, Spá provides a framework for visualizing predictions, at both the document and the sentence level, for full-text PDFs.

What is Spá concretely

Spá is our client-side library for rendering and editing annotations on PDF documents. It was initially conceived to render predictions of machine learning systems trained on full-text literature from the biomedical domain.

The original design was published as “Spá: A Web-Based Viewer for Text Mining in Evidence Based Medicine” in the Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD 2014) [doi, preprint].

Later, Spá was changed to work as a git submodule for the Vortext Annotate and Vortext Demo projects.

How does it work?

The major components of Spá are:

PDF.js is responsible for rendering the document. Normally PDF.js does this by rendering the document to a <canvas> and placing a series of <div> elements on top for text selection (the textLayer). We replaced the textLayer with our own custom React component. This gives us full control over what happens in the textLayer without resorting to hacks.

To maintain state we use Backbone models and collections. We coordinate the model layer and the view layer with contraptions we call dispatchers. Dispatchers are defined by the projects that include Spá, not here. The general idea is that a dispatcher listens for model changes (Backbone events) and updates the React components’ state accordingly, using the setState or forceUpdate methods. The components receive the Backbone models as props and are allowed to call their methods to initiate change. It’s not as pretty as Flux with immutable data structures (or ClojureScript), but it does the job for now.
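The dispatcher idea can be sketched roughly as follows. This is a minimal illustration, not code from Spá or from the projects that use it. To keep it runnable without dependencies, Node's built-in EventEmitter stands in for a Backbone model and a small FakeComponent class stands in for a React component. The names AnnotationModel, FakeComponent, createDispatcher, and the "highlighted" attribute are all hypothetical.

```javascript
// Sketch only: EventEmitter is a stand-in for a Backbone model, and
// FakeComponent is a stand-in for a React component. All names are hypothetical.
const { EventEmitter } = require("events");

// Model layer: holds attributes and emits "change" when one is set.
class AnnotationModel extends EventEmitter {
  constructor(attrs) {
    super();
    this.attrs = { ...attrs };
  }
  get(key) {
    return this.attrs[key];
  }
  set(key, value) {
    this.attrs[key] = value;
    this.emit("change", this);
  }
}

// View layer: receives the model as a prop and keeps its own state.
class FakeComponent {
  constructor(props) {
    this.props = props;
    this.state = {};
  }
  setState(partial) {
    this.state = { ...this.state, ...partial };
  }
  // Components may call model methods to initiate a change.
  highlight(text) {
    this.props.model.set("highlighted", text);
  }
}

// Dispatcher: listens for model changes and pushes them into component state.
// Returns a function that detaches the listener again.
function createDispatcher(model, component) {
  const onChange = (m) =>
    component.setState({ highlighted: m.get("highlighted") });
  model.on("change", onChange);
  onChange(model); // sync the initial state once
  return () => model.off("change", onChange);
}

const model = new AnnotationModel({ highlighted: null });
const view = new FakeComponent({ model });
const stop = createDispatcher(model, view);

view.highlight("randomized controlled trial");
console.log(view.state.highlighted); // prints "randomized controlled trial"
```

The data flow is one-directional: the component asks the model to change, the model emits an event, and the dispatcher alone writes the new value into the component state. Calling the returned stop function detaches the listener, which a real project would presumably do when the component is removed.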

How to use it?

Spá is not meant to be used directly. To use it, include it in another project and define a dispatcher there. Currently the following projects use Spá: Vortext Annotate and Vortext Demo.

Contributing

Currently this is a research object. The API and organizational structure are subject to change. Comments and suggestions are much appreciated. For code contributions: fork, branch, and send a pull request.

License

Spá is open source and licensed under GPLv3. See license for more information.
