Author: Rinke Hoekstra, VU University Amsterdam
Provenance is key to improving the transparency of scientific data publishing. However, most people use several very different systems to manipulate and analyse data. The goal of the Data2Semantics COMMIT/ project is to use the W3C PROV standard, which we helped develop, to integrate provenance tracking within and across these systems.
PROV-O-Matic is a library that integrates with the IPython interpreter, which can run any Python program, and in particular with the IPython Notebook environment. IPython Notebook is a popular data science environment, similar to R.
PROV-O-Matic does the following:
- It wraps Python functions and methods using a decorator that builds an RDF PROV-O representation of the inputs and outputs of the respective function. The provenance trace persists for the duration of a Python session.
- It integrates provenance tracing into IPython Notebook, a tool frequently used by scientists for analysing data and reporting on it. All functions defined in the notebook are automatically decorated, and all executions of steps in the notebook are recorded as well (including changes to variable values).
- It embeds a PROV-O-Viz instance in IPython Notebook for interactive visualization of the provenance graph.
- It can load existing provenance traces into the notebook and revive PROV entities as Python variables. Using and manipulating these variables builds a provenance trace that connects to the previous trace.
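To make the first point concrete, here is a minimal sketch of the general idea behind a provenance-recording decorator. It is not the PROV-O-Matic API: the names `traced`, `TRACE` and `_new_id` are hypothetical, and the trace is a plain list of tuples rather than an RDF graph. Only the `prov:` terms are taken from the PROV-O vocabulary.

```python
import functools

# Hypothetical in-memory trace: a list of (subject, predicate, object) triples.
# A real implementation would build an RDF graph instead.
TRACE = []
_counter = {"n": 0}


def _new_id(prefix):
    """Return a fresh identifier such as 'activity/1' (illustrative naming)."""
    _counter["n"] += 1
    return "%s/%d" % (prefix, _counter["n"])


def traced(func):
    """Record each call of `func` as a PROV-style activity (sketch only)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        activity = _new_id("activity")
        TRACE.append((activity, "rdf:type", "prov:Activity"))
        TRACE.append((activity, "rdfs:label", func.__name__))
        # Every argument becomes an entity that the activity used.
        for value in list(args) + list(kwargs.values()):
            entity = _new_id("entity")
            TRACE.append((entity, "rdf:type", "prov:Entity"))
            TRACE.append((entity, "prov:value", repr(value)))
            TRACE.append((activity, "prov:used", entity))
        result = func(*args, **kwargs)
        # The return value becomes an entity generated by the activity.
        output = _new_id("entity")
        TRACE.append((output, "rdf:type", "prov:Entity"))
        TRACE.append((output, "prov:value", repr(result)))
        TRACE.append((output, "prov:wasGeneratedBy", activity))
        return result
    return wrapper


@traced
def add(a, b):
    return a + b


total = add(2, 3)
```

After the call, `TRACE` holds one activity, two used entities and one generated entity. Because the list lives at module level, it keeps growing for as long as the session runs, which mirrors the session-persistent trace described above.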
This work is supported by the Dutch national programme COMMIT/ under the Data2Semantics project. See http://www.data2semantics.org and http://www.commit-nl.nl
PROV-O-Matic can be downloaded from GitHub at: https://github.com/Data2Semantics/prov-o-matic
PROV-O-Matic is released under the MIT License. See LICENCE.txt for details.
To start, you will need git, Python 2.7, pip and virtualenv (MacOS users, please use Homebrew to install a clean Python environment).
Start up your favourite terminal environment (we'll be using forward slashes, sorry Windows users).
Do a recursive clone of the PROV-O-Matic git repository to a directory of your choice, e.g. /example/provomatic:
git clone https://github.com/Data2Semantics/prov-o-matic.git /example/provomatic --recursive
This will create the /example/provomatic directory, if needed, and automatically check out the latest version of PROV-O-Matic and the git submodule for PROV-O-Viz.
(Obviously, if you clone to a different directory, every occurrence of /example/provomatic must be replaced with the proper path.)
Enter the directory
cd /example/provomatic
Initialize a virtual Python environment
virtualenv .
Start your favourite text editor and open the activate-replacement file in the /example/provomatic directory. Make the following changes.
Step 1: Set the VIRTUAL_ENV variable to point to the root directory of the provomatic installation. In our case, replace the line
VIRTUAL_ENV="/absolute/path/to/your/provomatic/clone/directory"
with
VIRTUAL_ENV="/example/provomatic/"
Step 2: Set the PYTHONPATH variable to also include the lib/provoviz/src directory inside the provomatic installation. In our case, replace the line
PYTHONPATH="$PYTHONPATH:/absolute/path/to/your/provomatic/clone/directory/lib/provoviz/src"
with
PYTHONPATH="$PYTHONPATH:/example/provomatic/lib/provoviz/src"
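If you prefer not to edit the file by hand, the two replacements above can be sketched in Python. This is only an illustration: the function name `fill_in_clone_path` is hypothetical, the `template` string is a two-line excerpt rather than the real file, and the sketch assumes the placeholder path appears only on the lines you want to change.

```python
# Placeholder path as it appears in the two lines quoted above.
PLACEHOLDER = "/absolute/path/to/your/provomatic/clone/directory"


def fill_in_clone_path(template_text, clone_dir):
    """Replace every occurrence of the placeholder with the clone directory."""
    # Strip a trailing slash so the PYTHONPATH entry does not get a double slash.
    return template_text.replace(PLACEHOLDER, clone_dir.rstrip("/"))


# Two-line excerpt standing in for the contents of activate-replacement.
template = (
    'VIRTUAL_ENV="/absolute/path/to/your/provomatic/clone/directory"\n'
    'PYTHONPATH="$PYTHONPATH:/absolute/path/to/your/provomatic/clone/directory/lib/provoviz/src"\n'
)

edited = fill_in_clone_path(template, "/example/provomatic")
```

Note that this sketch writes VIRTUAL_ENV without the trailing slash shown in Step 1; that difference should not affect path lookup. To apply it to the real file, you would read activate-replacement into a string, pass it through the function, and write the result back.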
Save the file, and overwrite the bin/activate file with the edited activate-replacement file:
cp activate-replacement bin/activate
You can now safely activate the virtual environment:
source bin/activate
The requirements.txt file lists all required libraries. Use
pip install -r requirements.txt
from your activated virtualenv to install the dependencies.
The full list of requirements is as follows:
Jinja2==2.7.3
MarkupSafe==0.23
SPARQLWrapper==1.6.4
backports.ssl-match-hostname==3.4.0.2
certifi==14.05.14
chardet==2.3.0
decorator==3.4.0
gnureadline==6.3.3
html5lib==0.999
ipython==2.3.0
isodate==0.5.0
networkx==1.9.1
numpy==1.9.0
pandas==0.14.1
pyparsing==1.5.7
python-dateutil==2.2
pytz==2014.7
pyzmq==14.3.1
rdfextras==0.4
rdflib==4.1.2
requests==2.4.3
six==1.8.0
tornado==4.0.2
wsgiref==0.1.2
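Every entry in the list above is an exact pin of the form name==version. Should you want to compare these pins against what is installed in your environment, a small parser along the following lines could be a starting point. The function name `parse_pins` is hypothetical, and the sample covers only three of the entries.

```python
def parse_pins(text):
    """Map each 'name==version' line to {name: version}.

    Blank lines and '#' comment lines are skipped.
    """
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, version = line.partition("==")
        pins[name] = version
    return pins


# Three entries copied from the list above, for illustration.
sample = """ipython==2.3.0
rdflib==4.1.2
pandas==0.14.1
"""

pins = parse_pins(sample)
```

The resulting dictionary can then be checked against the output of `pip freeze`, which uses the same name==version format.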
You can now start the IPython notebook by entering the src directory
cd src
and running
ipython notebook
This should open your browser at the address http://127.0.0.1:8888/tree
Open the PROV-O-Matic Examples notebook and follow the instructions. This should give you enough information to use PROV-O-Matic in your own notebooks.
Have fun!