This demo app is part of the document pseudonymization effort led at Etalab's Lab IA. Other Lab IA projects can be found at the Lab IA.
The purpose of this repo is to provide a quick demo of the pseudonymization tool we developed. The larger goal of the pseudonymization project is to help France's Conseil d'État open its court decisions to the general public, as required by law. More information about pseudonymization and this project can be found in our French pseudonymization guide. Behind this website, an API handles the text tagging and pseudonymization.
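To make the "tagging, then pseudonymization" idea concrete, here is a minimal sketch in plain Python. It is not the project's implementation: the span format `(start, end, label)`, the `pseudonymize` function, and the `<LABEL>` placeholders are all assumptions made for illustration. In the real tool, the spans would come from the NER model.

```python
from typing import List, Tuple

# Hypothetical span format: (start, end, label), with `end` exclusive.
Span = Tuple[int, int, str]


def pseudonymize(text: str, spans: List[Span]) -> str:
    """Replace each tagged span with a placeholder built from its label.

    Assumes the spans do not overlap.
    """
    result = text
    # Work from right to left so that earlier offsets remain valid.
    for start, end, label in sorted(spans, reverse=True):
        result = result[:start] + f"<{label}>" + result[end:]
    return result


example = "M. Jean Dupont habite Lyon."
# Hand-written spans standing in for the output of a NER tagger.
example_spans = [(3, 14, "PER"), (22, 26, "LOC")]
print(pseudonymize(example, example_spans))
```

Replacing from the end of the text toward the beginning avoids having to recompute offsets after each substitution.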
- Natural Language Processing: Information Extraction : Named Entity Recognition
- Natural Language Processing: Language Modelling / Feature Learning: Word embeddings
- Machine Learning: Deep Learning: Recurrent Networks: BiLSTM+CRF
- Python
- Flair, sacremoses
- Dash
- SQLite
- Pandas
The demo consists of four tabs:
- Introduction of the project: a brief overview of our pseudonymization project.
- Upload of a document to be pseudonymized: lets you upload an imageless .doc, .docx, or .txt file (up to 100 kB).
- Comparison of volume of training data vs annotation performance: we try to answer the question "how much data do I need to get decent results?"
- API Stats: the usage statistics of the API that actually does the work.
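The upload constraints listed above (accepted extensions and the 100 kB limit) can be sketched as a small check. This is an illustrative sketch, not the app's actual code: the function name `is_upload_acceptable` is hypothetical, 1 kB is assumed to mean 1000 bytes, and the "imageless" requirement is not checked here.

```python
from pathlib import Path

# Extensions taken from the tab description above.
ALLOWED_EXTENSIONS = {".doc", ".docx", ".txt"}
# Assumption: 1 kB = 1000 bytes; the app may use 1024 instead.
MAX_SIZE_BYTES = 100 * 1000


def is_upload_acceptable(filename: str, size_bytes: int) -> bool:
    """Return True if the file has an accepted extension and size."""
    extension = Path(filename).suffix.lower()
    return extension in ALLOWED_EXTENSIONS and 0 < size_bytes <= MAX_SIZE_BYTES


print(is_upload_acceptable("decision.docx", 20_000))
print(is_upload_acceptable("decision.pdf", 20_000))
```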
By default, this demo depends on the pseudo API. The API is automatically pulled from its repo by the `docker-compose` file.
You need to train your own NER model with the Flair library. Unfortunately, we can share neither the model nor the data it was trained on, as they contain non-public information.
The easiest way to run this application is by using Docker and Docker Compose.
- Clone this repo (for help see this tutorial).
- Create a `.env` file in the repo folder and set the following variables in it: the path of the local model (`PSEUDO_MODEL_PATH`), the path of the API database (`PSEUDO_API_DB_PATH`), and the URL of the API (`PSEUDO_REST_API_URL`). Note that you could also pass this environment variable to the app directly, in which case you would not need to run the API.
- Launch the wrapper bash file `run_docker.sh`. This file cleans and rebuilds the required Docker containers by calling `docker-compose.yml`.
- Go to `localhost/pseudo/`.
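Before launching the containers, it can help to confirm that the `.env` file defines the three variables named in the steps above. The following is a minimal sketch, assuming a simple `KEY=value` file format; the helper names `parse_env` and `missing_variables` and the sample values are hypothetical and not part of this repo.

```python
from typing import Dict, List

# Variable names taken from the setup steps above.
REQUIRED_VARIABLES = [
    "PSEUDO_MODEL_PATH",
    "PSEUDO_API_DB_PATH",
    "PSEUDO_REST_API_URL",
]


def parse_env(content: str) -> Dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for line in content.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_variables(content: str) -> List[str]:
    """List the required variables that are absent or empty."""
    values = parse_env(content)
    return [name for name in REQUIRED_VARIABLES if not values.get(name)]


# Placeholder values for illustration only; use your own paths and URL.
sample = """
PSEUDO_MODEL_PATH=/path/to/your/model.pt
PSEUDO_API_DB_PATH=/path/to/your/api.db
"""
print(missing_variables(sample))
```

To check a real file, read it first, for example with `missing_variables(open(".env").read())`.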
- This Demo
- Pseudonymization API
- Pseudonymization Guide
- Feel free to contact @pedevineau or @psorianom or other Lab IA team members with any questions or if you are interested in contributing!