Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add leorud post Working with Graph Databases in Python #57

Open
wants to merge 5 commits into
base: master
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
278 changes: 278 additions & 0 deletions _posts/2025-01-08-working-with-graph-databases-in-python.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,278 @@
---
title: "Working with RDF Graph Databases in Python"
author: Leo Rudczenko
categories:
- Software Engineering
tags:
- python
- graph
- database
- rdf
- linked-data
---

## What are Graph Databases?

The logic behind a [Graph Database](https://en.wikipedia.org/wiki/Graph_database),
comes from graph theory in mathematics, where a network of nodes representing data are connected using edges.
This makes a graph database more flexible than a typical hierarchical database,
as new nodes can be added and connected to any existing node, often without constraints.
This lack of constraints can be convenient, but it does mean that a user will need
additional context regarding the existing nodes before adding new ones, as the existing
data is unpredictable.
Constraints can be added to a graph database using additional tools if required.
For example, when using
a [Resource Description Framework (RDF)](https://en.wikipedia.org/wiki/Resource_Description_Framework)
graph, you can add constraints
using [Shapes Constraint Language (SHACL)](https://en.wikipedia.org/wiki/SHACL).

In a graph database, everything is represented using URIs (Uniform Resource Identifier) or literals.
The URIs are what allows us to create new connections between nodes in the graph,
and even connections between multiple graphs.
For this to work, data is stored in `triples`. In a triple, we have 3 pieces of information:
- Subject: the row in a typical database, referring to one feature (URI)
- Predicate: the column in a typical database, referring to one attribute name (URI)
- Object: the value found at the given row/column, the attribute value (URI or literal)

The predicate URI should also be a subject itself, containing more triples that describe its definition.
This means that all of the context behind the data is described within the graph database,
an advantage which cannot be done in a relational database.

Let's apply this concept to a lithology example. If we want to represent Rhyolite's lithology classification
as an Igneous Rock in a triple, we could do the following:
- Subject: URI to Rhyolite
- Predicate: URI to Parent Definition
- Object: URI to Igneous Rock

The value from this data structure comes from the fact that in this example, we could find the predicates
of Igneous Rock using its own URI, and then continue to another subject after that.

The nature of `triples` also means that when you iterate over the `triples` in a
graph database, you get a single predicate object for a single subject at a time.
You do not get all of the predicates for a subject at once.
Therefore, you will likely want to parse the data into a different structure
to suit your requirements, depending on what you wish to do with the data.

It is important to note that because predicates are defined using a URI,
we can use pre-defined predicates from online `namespaces`.
In the example below, we will see many attributes from the online `skos` namespace,
such as `skos:prefLabel` which is essentially a human readable name,
and `skos:broader` which is a parent node.

You can find more information about the `skos` data model [here](https://www.w3.org/2009/08/skos-reference/skos.html).

## Why are we working with graph databases?

We started with hundreds of lithology classifications which were going to be used in a
new Field Data Capture system, where users could input spatial lithology data.
We wanted to plot coloured points onto a map within the system,
so that a user could see all of their spatial lithology data visually, as this would allow them
to quickly make sense of their data.
However, applying hundreds of different colours to points on a map would make it very hard
to see differences between the colours, as they would be very similar.
Therefore, we wanted to select 50~ high level classifications,
so that every lithology could be attributed to one of these.
Finally, these 50~ classifications could be colour coded using colours which are more clearly different.

To do this, we decided to use the data from a graph database provided by the
[IUGS Commission for the Management and Application of Geoscience Information](https://cgi-iugs.org/),
called [CGI Geoscience Vocabularies](https://cgi.vocabs.ga.gov.au/).
This dataset contains information regarding lithology classifications and their relationships.
By parsing this data, we can produce a list of all parent lithology classifications,
for any given lithology classification.
In turn, this would help us identify our 50~ high level classifications.

## How to parse a graph database in Python

Now that we have an overview of graph databases, we will parse one using Python!
To do this, we are going to use the library [`rdflib`](https://rdflib.readthedocs.io/en/stable/).
You can install this library using [`pip`](https://pip.pypa.io/en/stable/).

```
pip install rdflib
```

The file we are going to parse is a Turtle file (.ttl), which is a format for storing
graph databases. This files comes from the CGI Geoscience Vocabularies dataset, which you can find on GitHub
[here](https://raw.githubusercontent.com/CGI-IUGS/cgi-vocabs/9dfe161affbe91de4c25622a9c2cfab5aa65c642/vocabularies/geosciml/simplelithology.ttl).

Firstly, we have to create an empty graph using `rdflib`, and then use it to parse the given `ttl` file.

```python
import rdflib

uri = "https://raw.githubusercontent.com/CGI-IUGS/cgi-vocabs/9dfe161affbe91de4c25622a9c2cfab5aa65c642/vocabularies/geosciml/simplelithology.ttl"
# Create new graph
graph = rdflib.Graph()
# Parse the ttl file from the URI
graph.parse(uri, format="ttl")
```

Next, we are only going to be looking for `lithology` subjects from the graph database,
and we are only going to be looking for their `skos:prefLabel` and `skos:broader` predicates.
Therefore, we will define some URIs which we will use to filter our data.

```python
lithology_subject_prefix = "http://resource.geosciml.org/classifier/cgi/lithology/"
skos_pref_label_uri = rdflib.term.URIRef("http://www.w3.org/2004/02/skos/core#prefLabel")
skos_broader_uri = rdflib.term.URIRef("http://www.w3.org/2004/02/skos/core#broader")
```

We are now ready to start iterating over the `triples` within the graph. We will add this data to
a Python dictionary. Remember, when we iterate over the graph, we get a single predicate object
for a single subject. Therefore, we need to ensure we have the correct
structure to store all of the objects for each predicate of each subject.

```python
graph_dict = {}

for subject, predicate, object_ in graph:

# Filter the triples where the subject URI starts with the lithology URI prefix
if subject.startswith(lithology_subject_prefix):

# Ensure the subject has a child dictionary to store predicates
if subject not in graph_dict:
graph_dict[subject] = {}

# Ensure the predicate has a child list to store objects
# We store the objects as lists because sometime there is more than 1 value
# E.g. parent lithologies
if predicate not in graph_dict[subject]:
graph_dict[subject][predicate] = []

# Add the new object to the dictionary
graph_dict[subject][predicate].append(object_)

print(f"Found {len(graph_dict)} lithologies in the graph database!")
```

```
Found 265 lithologies in the graph database!
```

We now have a dictionary representation of our lithology classifications, from our graph database!
We can use this to find the data for `Rhyolite`, which was mentioned in our earlier example.

```python
# Import pretty printer to output the dictionary nicely
from pprint import pprint
rhyolite_uri = rdflib.term.URIRef('http://resource.geosciml.org/classifier/cgi/lithology/rhyolite')
pprint(graph_dict[rhyolite_uri])
```

```
{rdflib.term.URIRef('http://purl.org/dc/terms/provenance'): [rdflib.term.Literal('LeMaitre et al. 2002', lang='en')],
rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'): [rdflib.term.URIRef('http://www.w3.org/2004/02/skos/core#Concept')],
rdflib.term.URIRef('http://www.w3.org/2000/01/rdf-schema#isDefinedBy'): [rdflib.term.URIRef('http://resource.geosciml.org/classifierscheme/cgi/2016.01/simplelithology')],
rdflib.term.URIRef('http://www.w3.org/2004/02/skos/core#altLabel'): [rdflib.term.Literal('liparita', lang='es'),
rdflib.term.Literal('ryolit', lang='sv'),
rdflib.term.Literal('Liparit', lang='de'),
rdflib.term.Literal('liparite', lang='it'),
rdflib.term.Literal('liparit', lang='ru'),
rdflib.term.Literal('lipariitti', lang='fi')],
rdflib.term.URIRef('http://www.w3.org/2004/02/skos/core#broader'): [rdflib.term.URIRef('http://resource.geosciml.org/classifier/cgi/lithology/rhyolitoid')],
rdflib.term.URIRef('http://www.w3.org/2004/02/skos/core#definition'): [rdflib.term.Literal('rhyolitoid in which the ratio of plagioclase to total feldspar is between 0.1 and 0.65.', lang='en')],
rdflib.term.URIRef('http://www.w3.org/2004/02/skos/core#exactMatch'): [rdflib.term.URIRef('http://inspire.ec.europa.eu/codelist/LithologyValue/rhyolite')],
rdflib.term.URIRef('http://www.w3.org/2004/02/skos/core#example'): [rdflib.term.Literal('liparite', lang='en'),
rdflib.term.Literal('rhyodacite', lang='en')],
rdflib.term.URIRef('http://www.w3.org/2004/02/skos/core#inScheme'): [rdflib.term.URIRef('http://resource.geosciml.org/classifierscheme/cgi/2016.01/simplelithology')],
rdflib.term.URIRef('http://www.w3.org/2004/02/skos/core#prefLabel'): [rdflib.term.Literal('silarIyU:lIt', lang='km'),
rdflib.term.Literal('流纹岩', lang='zh'),
rdflib.term.Literal('Rhyolith', lang='de'),
rdflib.term.Literal('ryolit', lang='vi'),
rdflib.term.Literal('流紋岩', lang='ja'),
rdflib.term.Literal('유문암', lang='ko'),
rdflib.term.Literal('riolit', lang='id'),
rdflib.term.Literal('riolit', lang='ru'),
rdflib.term.Literal('ryolit', lang='sv'),
rdflib.term.Literal('¹ó\xadÄëºÄìª', lang='lo'),
rdflib.term.Literal('rhyolite', lang='en'),
rdflib.term.Literal('riolita', lang='es'),
rdflib.term.Literal('ryoliitit', lang='fi'),
rdflib.term.Literal('หินไรโอไรต์', lang='th'),
rdflib.term.Literal('riolit', lang='ms'),
rdflib.term.Literal('rhyolite', lang='fr'),
rdflib.term.Literal('riolite', lang='it')]}
```

If we want to see its parents, we can access the object behind the predicate `skos:broader`.
Here we will also use the predicate `skos:prefLabel` to access the English name for the parents.

```python
broader_uris = graph_dict[rhyolite_uri][skos_broader_uri]
parents = []
for parent_uri in broader_uris:
parent_labels = graph_dict[parent_uri][skos_pref_label_uri]
# Get just the English label for each parent
english_label = [label.toPython() for label in parent_labels if label.language == "en"][0]
parents.append(english_label)

pprint(parents)
```

```
['rhyolitoid']
```

Finally, in this format of linked data, recursive functions are incredibly useful to traverse a graph
from one node to another. Or, in the case of `triples`, from one subject to another. Here, we will
use recursion to find the list of all parent lithology classifications above `Rhyolite`.

```python
def find_lithology_parents(
graph_dict,
target_lithology_uri,
parent_labels = set(),
):
"""
Find all of the parent lithology classifications above the given target_lithology_uri.
A set of all parent labels is returned.
We use a set because this ensures only unique values are stored within it.

The argument 'parent_labels' can be ignored when calling this function,
as it is used to start an empty set which is populated as the recursion iterates.
"""
# If the target lithology has parents
if skos_broader_uri in graph_dict[target_lithology_uri]:

for parent_uri in graph_dict[target_lithology_uri][skos_broader_uri]:
# Get the current parent English label
current_parent_labels = graph_dict[parent_uri][skos_pref_label_uri]
english_label = [label.toPython() for label in current_parent_labels if label.language == "en"][0]

# Add the current parent to the set of all parents
parent_labels.add(english_label)

# Find the parents of the parent
parent_parents = find_lithology_parents(graph_dict, parent_uri, parent_labels)
# Add the parents of the parent to the set of all parents
parent_labels = parent_labels.union(parent_parents)

return parent_labels


rhyolite_parents = find_lithology_parents(graph_dict, rhyolite_uri)
pprint(rhyolite_parents)
```

```
{'acidic_igneous_material',
'acidic_igneous_rock',
'compound_material',
'fine_grained_igneous_rock',
'igneous_material',
'igneous_rock',
'rhyolitoid',
'rock'}
```

## Alternative Solutions

It is worth noting that there are other ways of parsing data within a graph database.
A common method includes [`SPARQL`](https://en.wikipedia.org/wiki/SPARQL),
which allows you to build graph database queries resembling
[`SQL`](https://en.wikipedia.org/wiki/SQL).
You can even use `rdflib` to run `SPARQL` queries in Python.
More information on this from the `rdflib` documentation can be found
[here](https://rdflib.readthedocs.io/en/7.1.1/intro_to_sparql.html).