This tutorial demonstrates a complete workflow of training a machine learning model with the aid of Active Learning using Lightly and Label Studio.
Assume we have a new unlabelled dataset and want to train a new model. We do not want to label all samples because not all of them are valuable. Lightly can help select a good subset of samples to kick off labeling and model training. The loop is as follows:
- Lightly chooses a subset of the unlabelled samples.
- This subset is labeled using Label Studio.
- A machine learning model is trained on the labeled data and generates predictions for the entire dataset.
- Lightly consumes predictions and performs Active Learning to choose the next batch of samples to be labeled.
- This new batch of samples is labeled in Label Studio.
- The machine learning model is re-trained on the enriched labeled dataset to achieve better performance.
Let's get started!
Make sure you have an account for the Lightly Web App.
You also need to know your API token, which is shown under your USERNAME -> Preferences.
Clone this repo and install all Python package requirements in the requirements.txt file, e.g. with pip.
git clone https://github.com/lightly-ai/Lightly_LabelStudio_AL.git
cd Lightly_LabelStudio_AL
pip install -r requirements.txt
We want to train a classifier to predict the weather displayed in an image. We use this dataset: Multi-class Weather Dataset for Image Classification. Download the dataset (zip file) from here to this directory.
After downloading and extracting the zip file, you will see the extracted directory as follows:
dataset2
├── cloudy1.jpg
├── cloudy2.jpg
├── cloudy3.jpg
├── cloudy4.jpg
...
Here we have images in 4 weather conditions: cloudy, rain, shine, and sunrise.
To compare results between iterations, we first split the entire dataset into a full training set and a validation set. The training set will be used to select samples, and the validation set will be used as "new data" to evaluate the model's performance.
Run the script below to split the dataset:
python source/setup_data.py
After this, you will find the following files and directories in the current directory:
- train_set: Directory that contains all samples to be used for training the model. Here we pretend that these samples are all unlabelled.
- val_set: Directory that contains all samples to be used for model validation. Samples are labeled.
- full_train.json: JSON file that records paths to all files in train_set.
- val.json: JSON file that records paths and labels of all files in val_set.
These will be used in the following steps.
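The split can be pictured with a short sketch. This is not the code in source/setup_data.py: the validation fraction, the random seed, the helper names, and the exact JSON layout are all assumptions made for illustration. It only relies on the fact that file names such as cloudy1.jpg start with the class name.

```python
import json
import random
import re
from pathlib import Path


def label_from_name(file_name: str) -> str:
    """Derive the class from a file name, e.g. 'cloudy12.jpg' -> 'cloudy'."""
    match = re.match(r"[a-zA-Z]+", Path(file_name).stem)
    if match is None:
        raise ValueError(f"cannot derive a label from {file_name!r}")
    return match.group(0)


def split_files(file_names, val_fraction=0.2, seed=0):
    """Shuffle deterministically and return (train, val) lists of file names.

    val_fraction and seed are illustrative values, not taken from the repo.
    """
    names = sorted(file_names)
    random.Random(seed).shuffle(names)
    n_val = int(len(names) * val_fraction)
    return names[n_val:], names[:n_val]


def write_manifests(train, val, out_dir="."):
    """Write full_train.json (paths only) and val.json (paths and labels).

    The record layout is an assumption; the real files may differ.
    """
    out = Path(out_dir)
    train_paths = [f"train_set/{name}" for name in train]
    (out / "full_train.json").write_text(json.dumps(train_paths, indent=2))
    val_records = [
        {"path": f"val_set/{name}", "label": label_from_name(name)} for name in val
    ]
    (out / "val.json").write_text(json.dumps(val_records, indent=2))
```

Fixing the seed keeps the split reproducible, which matters here because the validation set is reused to compare the two training rounds.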
In this tutorial, samples are stored in the cloud, and Lightly Worker will read the samples from the cloud data source. For details, please refer to Set Up Your First Dataset. Here we use Amazon S3 as an example.
Under your S3 bucket, create two directories: data and lightly. We will upload all training samples to data. For example, run the AWS CLI tool:
aws s3 sync train_set s3://<bucket>/data
After uploading the samples, your S3 bucket should look like
s3://bucket/
├── lightly/
└── data/
    ├── cloudy1.jpg
    ├── cloudy2.jpg
    ├── ...
Now, with all unlabelled data samples in your training dataset, we want to select a good subset, label them, and train our classification model with them. Lightly can do this selection for you in a simple way. The script run_first_selection.py does the job for you. You need to first set up Lightly Worker on your machine and put the correct configuration values in the script. Please refer to Install Lightly and Set Up Your First Dataset for more details.
Run the script after your worker is ready:
python source/run_first_selection.py
In this script, Lightly Worker first creates a dataset named weather-classification, selects 30 samples based on embeddings of the training samples, and records them in this dataset. These 30 samples are the ones that we are going to label in the first round. You can see the selected samples in the Web App.
We do this using the open source labeling tool Label Studio, which is a browser-based tool hosted on your machine. You have already installed it and can run it from the command line. It will need access to your local files. We will first download the selected samples, import them in Label Studio, label them, and export the annotations.
Curious to get started with Label Studio? Check out this tutorial for help getting started!
We can download the selected samples from the Lightly Platform. The download_samples.py script will do everything for you and download the samples to a local directory called samples_for_labelling.
python source/download_samples.py
Lightly Worker created a tag for the selected samples. This script pulls information about samples in this tag and downloads the samples.
Now we can launch Label Studio.
export LABEL_STUDIO_LOCAL_FILES_SERVING_ENABLED=true && label-studio start
You should see it in your browser. Create an account and log in.
Create a new project called "weather-classification".
Then, head to Settings -> Cloud Storage -> Add Source Storage -> Storage Type: Local files.
Set the Absolute local path to the absolute path of the samples_for_labelling directory.
Enable the option Treat every bucket object as a source file.
Then click Add Storage. It will show you that you have added a storage.
Now click on Sync Storage to finally load the 30 images.
In Settings -> Labeling Interface, open the Code view and insert:
<View>
  <Image name="image" value="$image"/>
  <Choices name="choice" toName="image">
    <Choice value="cloudy"/>
    <Choice value="rain"/>
    <Choice value="shine"/>
    <Choice value="sunrise"/>
  </Choices>
</View>
It tells Label Studio that there is an image classification task with 4 distinct choices.
If you want someone else to help you label the images, navigate to Settings -> Instructions and add some instructions.
Now if you click on your project again, you see 30 tasks and the corresponding images.
Click on Label All Tasks and get those 30 images labeled.
Pro Tip! Use the keys 1, 2, 3, and 4 on your keyboard as hotkeys to be faster!
Export the labels via Export in the format JSON-MIN.
Rename the file to annotation-0.json and place it in the root directory of this repository.
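To sanity-check the export before training, the annotations can be read back into a file-name-to-label mapping. The sketch below rests on two assumptions about the JSON-MIN export: each record carries the image reference under "image" and the selected class under "choice" (the names used in the labeling interface above), and image references from local storage look like /data/local-files/?d=<path>. The helper names are hypothetical, and the repo's training scripts may parse the file differently.

```python
import json
from pathlib import Path
from urllib.parse import parse_qs, urlparse


def file_name_from_task(image_ref: str) -> str:
    """Extract the bare file name from a Label Studio image reference."""
    parsed = urlparse(image_ref)
    query = parse_qs(parsed.query)
    # Assumed local-files form: '/data/local-files/?d=<path to image>'.
    path = query["d"][0] if "d" in query else parsed.path
    return Path(path).name


def load_labels(annotation_path: str) -> dict:
    """Map file name -> label for every record that has a choice."""
    records = json.loads(Path(annotation_path).read_text())
    labels = {}
    for record in records:
        if "choice" not in record:
            # Skipped or still unlabelled task: nothing to train on.
            continue
        labels[file_name_from_task(record["image"])] = record["choice"]
    return labels
```

Calling load_labels("annotation-0.json") should then return 30 entries if every task was labeled.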
We can train a classification model with the 30 labeled samples. The train_model_1.py script loads samples from annotation-0.json and performs this task.
python source/train_model_1.py
The following steps are performed in this script:
- Load the annotations and the labeled images.
- Load the validation set.
- Train a simple model as in model.py.
- Make predictions for all samples for training, including unlabeled samples.
- Dump the predictions in Lightly Prediction format into the directory lightly_predictions.
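The last step above can be sketched as follows. This is not the code in train_model_1.py; it assumes the Lightly classification prediction format consists of a file_name plus a predictions list whose single entry holds a category_id and the per-class probabilities, with one JSON file per image. The function names and the out_dir parameter are hypothetical.

```python
import json
from pathlib import Path

# Order must match the category ids used for the prediction task.
CATEGORIES = ["cloudy", "rain", "shine", "sunrise"]


def to_lightly_prediction(file_name, probabilities):
    """Build one prediction record (assumed layout) for a single image."""
    if len(probabilities) != len(CATEGORIES):
        raise ValueError("need exactly one probability per category")
    total = sum(probabilities)
    probs = [p / total for p in probabilities]  # renormalise to sum to 1
    category_id = max(range(len(probs)), key=probs.__getitem__)
    return {
        "file_name": file_name,
        "predictions": [{"category_id": category_id, "probabilities": probs}],
    }


def dump_predictions(all_probs, out_dir):
    """Write one '<image stem>.json' file per image into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for file_name, probs in all_probs.items():
        record = to_lightly_prediction(file_name, probs)
        (out / f"{Path(file_name).stem}.json").write_text(json.dumps(record))
```

Predictions are written for every training image, labeled or not, because the next selection round needs a score for each candidate.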
We can see that the model performance is not good:
Training Acc: 60.000 Validation Acc: 19.027
It is okay for now. We will improve this. Predictions will be used for active learning.
Lightly Worker also does active learning for you based on predictions. It consumes predictions stored in the data source. We need to place the predictions we just acquired in the data source. For detailed information, please refer to Predictions Folder Structure. Here we still use the AWS S3 bucket as an example.
In the lightly directory you created earlier in your S3 bucket, you will have a subdirectory .lightly/predictions where predictions are kept. You need two additional files: tasks.json (the list of prediction tasks) and schema.json (the categories of the task). You can create these files directly by copying the two code blocks below, in that order.
["weather-classification"]
We only have one task here; let's name it weather-classification.
{
  "task_type": "classification",
  "categories": [
    {
      "id": 0,
      "name": "cloudy"
    },
    {
      "id": 1,
      "name": "rain"
    },
    {
      "id": 2,
      "name": "shine"
    },
    {
      "id": 3,
      "name": "sunrise"
    }
  ]
}
Place these files in the lightly directory in your bucket, along with the predictions from the directory lightly_prediction.
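If you prefer to generate the two files rather than copy them, the sketch below writes them into a local folder that mirrors the .lightly/predictions layout, ready to be uploaded to the lightly directory in the bucket. The function names and the local root argument are hypothetical; the file names, task name, and categories are the ones used in this tutorial.

```python
import json
from pathlib import Path

TASK_NAME = "weather-classification"
CATEGORIES = ["cloudy", "rain", "shine", "sunrise"]


def build_schema(categories):
    """Classification schema with ids assigned in list order."""
    return {
        "task_type": "classification",
        "categories": [
            {"id": index, "name": name} for index, name in enumerate(categories)
        ],
    }


def write_prediction_metadata(root):
    """Create <root>/.lightly/predictions with tasks.json and schema.json."""
    predictions_dir = Path(root) / ".lightly" / "predictions"
    task_dir = predictions_dir / TASK_NAME
    task_dir.mkdir(parents=True, exist_ok=True)
    # tasks.json lists every prediction task; there is only one here.
    (predictions_dir / "tasks.json").write_text(json.dumps([TASK_NAME]))
    # schema.json sits inside the task directory, next to the prediction files.
    (task_dir / "schema.json").write_text(
        json.dumps(build_schema(CATEGORIES), indent=2)
    )
    return predictions_dir
```

Keeping the category order in one list helps ensure the ids in schema.json match the category_id values in the prediction files.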
After uploading these files, your S3 bucket should look like
s3://bucket/
├── lightly/
│   └── .lightly/
│       └── predictions/
│           ├── tasks.json
│           └── weather-classification/
│               ├── schema.json
│               ├── cloudy1.json
│               ├── cloudy2.json
│               ├── ...
└── data/
    ├── cloudy1.jpg
    ├── cloudy2.jpg
    ├── ...
where files like cloudy1.json and cloudy2.json are prediction files in lightly_prediction.
With the predictions, Lightly Worker can perform active learning and select new samples for us. The run_second_selection.py script does the job.
python source/run_second_selection.py
This time, Lightly Worker goes through all training samples again and selects another 30 samples based on active learning scores computed from the predictions we uploaded in the previous step. For more details, please refer to Selection Scores and Active Learning Scorer.
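To give an intuition for what an active learning score is, here is a simplified sketch of uncertainty sampling with normalised entropy. It is an illustration only, not Lightly's scorer or selection logic: the function names are hypothetical, and the actual selection may combine scores with other criteria.

```python
import math


def entropy_score(probabilities):
    """Normalised Shannon entropy in [0, 1]; higher means less certain."""
    n_classes = len(probabilities)
    if n_classes < 2:
        raise ValueError("entropy needs at least two classes")
    entropy = -sum(p * math.log(p) for p in probabilities if p > 0)
    return entropy / math.log(n_classes)


def select_most_uncertain(predictions, n_samples):
    """Return the n_samples file names whose predictions are least certain.

    predictions maps file name -> list of class probabilities.
    """
    ranked = sorted(
        predictions,
        key=lambda name: entropy_score(predictions[name]),
        reverse=True,
    )
    return ranked[:n_samples]


if __name__ == "__main__":
    toy_predictions = {
        "a.jpg": [0.97, 0.01, 0.01, 0.01],  # confident
        "b.jpg": [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
        "c.jpg": [0.60, 0.20, 0.10, 0.10],  # somewhat uncertain
    }
    print(select_most_uncertain(toy_predictions, 2))
```

Samples where the model spreads its probability across several classes rank highest, so labeling them tends to tell the model the most.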
You can see the results in the Web App.
You can repeat step 3 to label new samples. To import new samples, go to Settings -> Cloud Storage and then click Sync Storage on the Source Cloud Storage you created earlier. A message Synced 30 task(s) should show up.
Then, you can go back to the project page and label the new samples. After finishing annotating the samples, export the annotations again. Rename the file to annotation-1.json and place it in the root directory of this repository.
Much like the script in step 4, train_model_2.py loads samples from annotation-1.json and retrains the classification model, now with all 60 labeled samples.
python source/train_model_2.py
The model indeed does better this time on the validation set:
Training Acc: 90.000 Validation Acc: 44.248