This work is carried out in the framework of the GDRI Zoomathia which aims to study the transmission of zoological knowledge from Antiquity to the Middle Ages.
This application is being developed within the framework of the HisINum project funded by the Academy of Excellence 5 of IdEx UCA JEDI.
It aims to support the study of the transmission of zoological knowledge from Antiquity to the Middle Ages through the analysis of a corpus of texts on animals compiled within the framework of the Zoomathia GDRI, funded by the CNRS.
It allows:
- exploration of the corpus, via a search for works by concept;
- exploration of a selected work from the corpus, with visualisation of the concepts annotating each of its parts;
- visualisation of the results of queries implementing competency questions on a selected work from the corpus.
It relies on the exploitation of a knowledge graph annotating the Zoomathia corpus of texts with concepts from the TheZoo thesaurus. The pipeline for the automatic construction of this knowledge graph was developed within the framework of the AutomaZoo project, funded by the Academy of Excellence 5 of IdEx UCA JEDI, and further refined within the HisINum project. Project GitHub repository: https://github.com/Wimmics/zoomathia
Access to the knowledge graph through its SPARQL endpoint: http://zoomathia.i3s.unice.fr/sparql
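A quick way to probe the endpoint from Python is sketched below. It assumes the SPARQLWrapper package (not one of the pipeline dependencies listed further down) and runs a generic query rather than one of the project's competency questions.

```python
# Minimal sketch: querying the Zoomathia SPARQL endpoint with SPARQLWrapper.
# SPARQLWrapper is an assumption here, not a stated project dependency.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("http://zoomathia.i3s.unice.fr/sparql")
endpoint.setQuery("SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5")
endpoint.setReturnFormat(JSON)

results = endpoint.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["s"]["value"], binding["p"]["value"], binding["o"]["value"])
```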
- Web application front end: https://github.com/Wimmics/zoomathia/tree/main/web-app/web-zoomathia
- Web application back end: https://github.com/Wimmics/zoomathia/tree/main/web-app/backend
- pandas
pip install pandas
- deep translator
pip install deep-translator
- spacy (requires Python 3.x, earlier than 3.13)
pip install spacy
pip install spacy-dbpedia-spotlight
pip install spacyfishing
models needed for spacy:
python -m spacy download en_core_web_lg
python -m spacy download en_core_web_sm
- dbpedia_spotlight docker
# pull the official image
docker pull dbpedia/dbpedia-spotlight
# create a volume for persistently saving the language models
docker volume create spotlight-models
# start the container (here assuming we want the en model, but any other supported language code can be used)
docker run -ti --restart unless-stopped --name dbpedia-spotlight.en --mount source=spotlight-models,target=/opt/spotlight -p 2222:80 dbpedia/dbpedia-spotlight spotlight.sh en
When the container is launched, it downloads the English knowledge base. Wait for this download to finish and for the service to start before using the spaCy NER step (a sketch of that setup follows the dependency list).
- lxml-xml
pip install lxml
- PyMongo
pip install pymongo
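Once the DBpedia Spotlight container is ready, the NER step can point the spacy-dbpedia-spotlight pipe at the local service. The snippet below is a minimal sketch of that setup, assuming the container listens on port 2222 as in the docker run command above; it is not the project's actual annotation script.

```python
# Minimal sketch: spaCy NER with entity linking against the local DBpedia
# Spotlight container (assumed to be reachable on http://localhost:2222).
import spacy
import spacy_dbpedia_spotlight  # noqa: F401 -- registers the 'dbpedia_spotlight' pipe

nlp = spacy.load("en_core_web_lg")
nlp.add_pipe(
    "dbpedia_spotlight",
    config={"dbpedia_rest_endpoint": "http://localhost:2222/rest"},
)

doc = nlp("The eagle is the strongest and the fastest of all birds.")
for ent in doc.ents:
    # ent.kb_id_ holds the linked DBpedia resource URI
    print(ent.text, "->", ent.kb_id_)
```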
To use this annotation pipeline, start with the Python script xml_to_csv.py. It scans every folder at the same level as the script to find all XML files, then extracts each work's metadata, structure and paragraph text. For each XML file, this information is transformed into 4 CSV files in the output folder:
- xxx_annotations.csv: all the annotations of the work
- xxx_link.csv: the work's structure
- xxx_metadata.csv: the work's metadata (author, title, editor...)
- xxx_paragraph.csv: all the paragraphs of the work
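For a quick sanity check, the generated files can be loaded with pandas (already a dependency of the pipeline). The file names below are placeholders following the xxx_*.csv convention; the actual names depend on the input XML files.

```python
# Quick inspection of the CSVs produced by xml_to_csv.py.
# "MyWork" is a placeholder; actual file names depend on the input XML files.
import pandas as pd

annotations = pd.read_csv("output/MyWork_annotations.csv")
paragraphs = pd.read_csv("output/MyWork_paragraph.csv")

print(annotations.shape, paragraphs.shape)
print(annotations.head())
```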
When the xml_to_csv step is over, the next step is to launch the morph_mongo.py script, which uploads each generated CSV to its respective MongoDB collection. This process also filters annotations based on the class URIs specified in the filter_class.json file and finds close concepts in the TheZoo thesaurus based on their labels. The filter_class.json file lists the DBpedia class URIs to filter on, for example:
{
"class": [
"<http://dbpedia.org/class/yago/WikicatAnimatedCharacters>",
"<http://dbpedia.org/class/yago/WikicatTelevisionCharacters>",
"<http://dbpedia.org/class/yago/WikicatTheSimpsonsCharacters>",
...
]
}
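As an illustration of what this filtering step amounts to (the exact behaviour and field names used by morph_mongo.py may differ), an annotation whose DBpedia classes intersect the "class" list above would be discarded:

```python
# Illustrative sketch of class-based filtering (assumed behaviour of
# morph_mongo.py; field names such as 'dbpedia_types' are hypothetical).
import json

with open("filter_class.json") as f:
    excluded_classes = set(json.load(f)["class"])

def keep_annotation(annotation: dict) -> bool:
    """Keep an annotation only if none of its DBpedia classes is excluded."""
    # Wrap URIs in angle brackets to match the notation used in filter_class.json.
    classes = {f"<{uri}>" for uri in annotation.get("dbpedia_types", [])}
    return not (classes & excluded_classes)
```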
The last step after the morph_mongo process is the graph generation with morph-xR2RML in the rules folder. Each mapping file and its output file have to be specified in the morph.properties file (the snippet below enables the paragraph mapping; the other mappings are commented out):
# xR2RML mapping file. Mandatory.
mappingdocument.file.path=paragraph.ttl
#mappingdocument.file.path=link.ttl
#mappingdocument.file.path=metadata.ttl
#mappingdocument.file.path=annotation.ttl
#mappingdocument.file.path=vocab.ttl
# -- Where to store the result of the processing. Default: result.txt
output.file.path=output/paragraph.ttl
#output.file.path=output/link.ttl
#output.file.path=output/metadata.ttl
#output.file.path=output/annotation.ttl
#output.file.path=output/vocab.ttl
The produced graph consists of 5 Turtle files. The vocab.ttl file contains the alignments between DBpedia and the TheZoo thesaurus.
The docx extraction script only works on Windows, because it relies on optimisations available only through the win32com API (a sketch of this access pattern is shown after the dependency list below).
- pandas
pip install pandas
- pywin32
pip install pywin32
- python-docx
pip install python-docx
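The docx comments are read through the Word COM interface, hence the pywin32 and Windows requirements noted above. A minimal sketch of that access pattern, with a placeholder file path, might look as follows; the actual extraction script is more involved.

```python
# Minimal sketch: reading docx comments through the Word COM API (the reason
# this pipeline is Windows-only). The file path is a placeholder.
import win32com.client

word = win32com.client.Dispatch("Word.Application")
word.Visible = False
doc = word.Documents.Open(r"C:\path\to\annotated_work.docx")
try:
    for comment in doc.Comments:
        annotated_text = comment.Scope.Text   # text span the comment targets
        labels = comment.Range.Text           # TheZoo label(s) written in the comment
        print(annotated_text, "->", labels)
finally:
    doc.Close(False)
    word.Quit()
```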
- QC: Jupyter notebooks with the SPARQL implementation of the competency questions used to evaluate the graph
- Script: the docx files to be extracted, the extraction pipeline scripts, and the CSV output of the extraction needed by morph-xR2RML
- Ontology: all the Turtle files of the graph
- Mapping: all the xR2RML mapping files used to build the graph
The script extracts TheZoo thesaurus annotation labels from the docx comments, based on the following patterns:
concept label
parent label : child label : grand child label
concept label1 ; concept label2
Each extracted label is matched with a concept in TheZoo if its label is close enough. The script generates multiple CSV files used to build the graph and to correct label-matching errors. A sketch of such matching is shown below.
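As an illustration of the kind of "close enough" matching involved (the script's actual matching strategy and threshold may differ), a simple fuzzy match against TheZoo labels could look like this:

```python
# Illustrative fuzzy label matching with difflib; the script's actual
# matching strategy and threshold may differ.
from difflib import get_close_matches

def match_label(label, thezoo_labels, cutoff=0.8):
    """Return the closest TheZoo label, or None if nothing is close enough."""
    by_lower = {l.lower(): l for l in thezoo_labels}
    hits = get_close_matches(label.strip().lower(), list(by_lower), n=1, cutoff=cutoff)
    return by_lower[hits[0]] if hits else None

# Hypothetical labels, for illustration only:
print(match_label("eagel", ["eagle", "weasel", "heron"]))  # -> 'eagle'
```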
The last steps are the manual upload of the files to the MongoDB collections "Paragraphe" and "Annotation", and the graph generation with morph-xR2RML in the rules folder. Each mapping file and its output file have to be specified in the morph.properties file:
# xR2RML mapping file. Mandatory.
mappingdocument.file.path=paragraph.ttl
#mappingdocument.file.path=annotation.ttl
# -- Where to store the result of the processing. Default: result.txt
output.file.path=output/paragraph.ttl
#output.file.path=output/annotation.ttl