This is a guide to setting up and using Label Studio to annotate data for the entity extraction model.
The Finding Fossils team set up a privately hosted version of Label Studio using Hugging Face Spaces. The main steps were:
- Create an Azure blob storage container to house both the data to be labelled and the output labelled files.
- Use this guide: Azure Quickstart Upload/Download Blobs
- Create a `raw` folder in the blob storage by specifying the folder name in the "Upload to folder" field under the Advanced tab, and upload a blank TXT file to it (as a placeholder).
- Repeat the previous step for a `labelled` folder.
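The placeholder upload can also be scripted. The following is a minimal sketch, not the team's actual procedure: it assumes `container_client` is an `azure.storage.blob.ContainerClient` (or any object with a compatible `upload_blob` method), and the file name `placeholder.txt` is a hypothetical choice. In a standard (flat namespace) storage account a folder only exists while at least one blob name starts with its prefix, which is why an empty file is uploaded.

```python
def create_placeholder_folders(container_client, folders=("raw", "labelled")):
    """Upload an empty placeholder file under each folder prefix.

    `container_client` is assumed to behave like
    azure.storage.blob.ContainerClient; only upload_blob() is used.
    """
    created = []
    for folder in folders:
        # Hypothetical placeholder name; any file name would do.
        blob_name = f"{folder}/placeholder.txt"
        # overwrite=True makes the call safe to repeat.
        container_client.upload_blob(name=blob_name, data=b"", overwrite=True)
        created.append(blob_name)
    return created
```

With the Azure SDK installed, a client could be built with `ContainerClient.from_connection_string(...)` and passed to this function.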
To allow Label Studio to access the objects stored in the container, update the Cross-Origin Resource Sharing (CORS) settings of the blob storage, following the Prerequisites section of the Label Studio documentation.
- Create an Azure Database for PostgreSQL - Flexible Server to take advantage of 750 hours of free compute time, which should be enough to run the database for free each month. Make sure you remember the admin account name and password.
- Create a database once the Postgres instance is up and running, e.g. `labelstudiodev`.
- Install the Postgres Azure extensions required by Label Studio. Follow the "How to use PostgreSQL extensions" section of the Azure docs:
  - `pg_trgm` - Postgres trigram text extension
  - `btree_gin` - support for indexing common datatypes in GIN
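On a Flexible Server, extensions generally have to be allow-listed in the server parameters and then created inside the target database. As a small sketch of the second half, the snippet below only builds the standard PostgreSQL statements for the two extensions listed above; how they are executed (psql, a database client, the Azure portal) is left open and is not specified by this guide.

```python
REQUIRED_EXTENSIONS = ("pg_trgm", "btree_gin")


def extension_statements(extensions=REQUIRED_EXTENSIONS):
    """Return one CREATE EXTENSION statement per extension name."""
    # IF NOT EXISTS makes the statements safe to run more than once.
    return [f"CREATE EXTENSION IF NOT EXISTS {name};" for name in extensions]


for statement in extension_statements():
    print(statement)
```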
In Label Studio, follow the instructions under the section Enable Configuration Persistence and set the required environment variables under settings in Label Studio:
- `DJANGO_DB`: Set this to `default`.
- `POSTGRE_NAME`: Set this to the name of the Postgres database.
- `POSTGRE_USER`: Set this to the Postgres username.
- `POSTGRE_PASSWORD`: Set this to the password for your Postgres user.
- `POSTGRE_HOST`: Set this to the host that your Postgres database is running on. On Azure this is under Overview and is called Server name, e.g. `findingfossilslabelstudio.postgres.database.azure.com`
- `POSTGRE_PORT`: Set this to the port that your Postgres database is running on. Default is 5432 (recommended).
- `STORAGE_PERSISTENCE`: Set this to `1` to remove the warning about ephemeral storage.
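Because a single missing variable is easy to overlook, a quick check of the settings listed above can help before restarting the instance. This is a minimal sketch: the function name `check_settings` is hypothetical, and treating an unset `POSTGRE_PORT` as 5432 is an assumption based on the default mentioned above.

```python
import os

# Variables that must simply be non-empty.
NON_EMPTY = ("POSTGRE_NAME", "POSTGRE_USER", "POSTGRE_PASSWORD", "POSTGRE_HOST")
# Variables that this guide sets to a fixed value.
FIXED_VALUES = {"DJANGO_DB": "default", "STORAGE_PERSISTENCE": "1"}


def check_settings(env):
    """Return a list of human-readable problems found in `env` (a mapping)."""
    problems = []
    for name in NON_EMPTY:
        if not env.get(name):
            problems.append(f"{name} is not set")
    for name, expected in FIXED_VALUES.items():
        actual = env.get(name)
        if actual != expected:
            problems.append(f"{name} should be {expected!r}, found {actual!r}")
    # Assumption: an unset port falls back to the default 5432.
    port = env.get("POSTGRE_PORT", "5432")
    if not port.isdigit():
        problems.append(f"POSTGRE_PORT should be a number, found {port!r}")
    return problems


if __name__ == "__main__":
    for problem in check_settings(os.environ):
        print(problem)
```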
Inside the Label Studio instance, fill in the storage settings:
- Storage title: chosen name to be displayed in Label Studio
- Container name: the Azure blob container name `finding-fossils-labelling-dev`
- Container prefix: folder path within the container from which to load the data `labelling-raw`
- File filter Regex: pattern matching, `.*` for all files
- Account name: name of the storage account `findingfossilsdev`
- Account key: from Azure storage page --> Access keys
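To make the interaction between the container prefix and the file filter regex concrete, here is a small sketch of the selection logic. It is an illustration, not Label Studio's implementation: it assumes the pattern is tested from the start of the full blob name (prefix included), and the blob names are made up.

```python
import re


def select_blobs(blob_names, prefix, regex_filter=".*"):
    """Return blob names under `prefix` whose name matches `regex_filter`."""
    pattern = re.compile(regex_filter)
    selected = []
    for name in blob_names:
        # Only blobs inside the configured folder are considered.
        if not name.startswith(prefix):
            continue
        # Assumption: the regex is matched from the start of the full name.
        if pattern.match(name):
            selected.append(name)
    return selected


# Hypothetical container contents.
names = ["labelling-raw/a.json", "labelling-raw/b.txt", "labelled/1"]
print(select_blobs(names, "labelling-raw"))
print(select_blobs(names, "labelling-raw", r".*\.json"))
```

With `.*` every file under the prefix is selected; a narrower pattern such as `.*\.json` restricts the sync to one file type.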
To download all currently labelled files, follow these steps:
- Navigate to the Azure blob storage account and locate the `labelled` folder set up above.
- All files in `labelled` are numbered according to the Label Studio task ID and are JSON files even though they don't have the extension.
- Expand the window and scroll all the way to the bottom of the file list to ensure all files are in view.
- At the top, select the radio button that selects all the files, then select the download button in the menu bar at the top right. This will begin downloading all files one by one (yes, it's not perfect).
- From your downloads folder, select all the files and move them into a folder `data/entity-extraction/raw/<DATE>_label-export/`
- Now the NER model training can have this folder entered as the raw label input path.
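The last two steps can be scripted once the files are in a downloads folder. The sketch below is one possible approach under stated assumptions: exported files are named with the numeric task ID only, `<DATE>` is written as an ISO date (the guide does not fix a format), and the function name `collect_label_export` is hypothetical. Files that do not parse as JSON are left where they are.

```python
import json
import shutil
from datetime import date
from pathlib import Path


def collect_label_export(downloads_dir, raw_root="data/entity-extraction/raw",
                         export_date=None):
    """Move extensionless, numerically named JSON files into a dated folder."""
    # Assumption: <DATE> is an ISO date such as 2023-06-01.
    export_date = export_date or date.today().isoformat()
    target = Path(raw_root) / f"{export_date}_label-export"
    target.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(Path(downloads_dir).iterdir()):
        # Exported tasks are named by task ID and have no extension.
        if not path.is_file() or path.suffix or not path.name.isdigit():
            continue
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except ValueError:
            # Not valid JSON (or not valid UTF-8): skip rather than move.
            continue
        shutil.move(str(path), str(target / path.name))
        moved.append(path.name)
    return target, moved
```

The returned `target` path is the folder to pass to the NER model training as the raw label input path.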
- Create a Hugging Face hub account: https://huggingface.co/join
- Send your profile name to Ty Andrews (email: [email protected]) to be added to the Finding Fossils organization on Hugging Face, or create a new organization for a different project to work collaboratively with teammates.
- Once in the organization, navigate to the organization page from your profile.
- Create a Label Studio account and record your password in your password manager.
- Open the green project named something like Finding Fossils Labelling - Production, or create a new one.
- Navigate to the settings menu of the project. Here, several options are available to adapt the settings to your task:
  - Labeling configuration: After syncing the buckets, the final step is to define the categories of entities that the named entity recognition model will be trained to predict. A configuration file defines the classes and initializes the UI components that help a user label entities. A sample config file has the following tags:
<View>
  <View style="display:flex;align-items:start;gap:8px;flex-direction:row-reverse">
    <Text name="text" valueType="text" value="$text" granularity="word"/>
    <Labels name="label" toName="text" showInline="false">
      <Label value="SITE" background="#336cf0"/>
      <Label value="GEOG" background="#D4380D"/>
      <Label value="AGE" background="#f0c528"/>
      <Label value="ALTI" background="#86d425"/>
      <Label value="TAXA" background="#925ff2"/>
      <Label value="EMAIL" background="#ff941a"/>
      <Label value="REGION" background="#ff9ee5"/>
    </Labels>
  </View>
</View>
For more information about config files to set up a custom Label Studio NER labeling task, refer to this documentation.
For general information, visit the Label Studio templates page.
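Before pasting an edited config into the project settings, it can be worth checking it locally. The sketch below parses the sample config with Python's standard XML parser, lists the entity classes, and runs two checks that seem useful: that each `Labels` tag points at an existing `Text` tag, and that no label is defined twice. The function names are hypothetical, and these checks are this example's own, not rules taken from the Label Studio documentation.

```python
import xml.etree.ElementTree as ET

SAMPLE_CONFIG = """
<View>
  <View style="display:flex;align-items:start;gap:8px;flex-direction:row-reverse">
    <Text name="text" valueType="text" value="$text" granularity="word"/>
    <Labels name="label" toName="text" showInline="false">
      <Label value="SITE" background="#336cf0"/>
      <Label value="GEOG" background="#D4380D"/>
      <Label value="AGE" background="#f0c528"/>
      <Label value="ALTI" background="#86d425"/>
      <Label value="TAXA" background="#925ff2"/>
      <Label value="EMAIL" background="#ff941a"/>
      <Label value="REGION" background="#ff9ee5"/>
    </Labels>
  </View>
</View>
"""


def label_values(config_xml):
    """Return the entity class names in document order."""
    root = ET.fromstring(config_xml)
    return [label.get("value") for label in root.iter("Label")]


def check_config(config_xml):
    """Return a list of problems found in the labeling config."""
    root = ET.fromstring(config_xml)
    text_names = {el.get("name") for el in root.iter("Text")}
    problems = []
    for labels in root.iter("Labels"):
        if labels.get("toName") not in text_names:
            problems.append(f"Labels toName={labels.get('toName')!r} matches no Text tag")
    values = label_values(config_xml)
    for value in sorted(set(values)):
        if values.count(value) > 1:
            problems.append(f"label {value!r} is defined more than once")
    return problems


print(label_values(SAMPLE_CONFIG))
print(check_config(SAMPLE_CONFIG))
```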
- Select the task with a `global_index` of 0 (a global index of 0 indicates the start of the article), then label each task in turn by moving on to the next `global_index` number.
- Check that pre-labelled entities are correct and fix any that are not: we have tried to auto-tag entities to make this faster, but the auto-tagging is not perfect and is what we are improving, so it commonly misses entities or gets them only partially right.
- Label any missed entities: these can be things with typos, words run together, etc.