Official implementation of Rectified Contrastive Pseudo Supervision in Semi-Supervised Medical Image Segmentation
Authors:
Xiangyu Zhao, Zengxin Qi, Sheng Wang, Qian Wang, Xuehai Wu, Ying Mao, Lichi Zhang
manuscript link:
This repo contains the implementation of the proposed Rectified Contrastive Pseudo Supervision (RCPS) method, evaluated on two public medical image segmentation benchmarks.
If you find our work useful, please cite the paper:
@article{zhao2023rcps,
title={RCPS: Rectified Contrastive Pseudo Supervision for Semi-Supervised Medical Image Segmentation},
author={Zhao, Xiangyu and Qi, Zengxin and Wang, Sheng and Wang, Qian and Wu, Xuehai and Mao, Ying and Zhang, Lichi},
journal={IEEE Journal of Biomedical and Health Informatics},
doi={10.1109/JBHI.2023.3322590},
year={2023}
}
✅ Provide code for data preparation
✅ Publish model checkpoints
✅ Publish full training code
✅ Publish code for inference
✅ Add support for custom data training
Following previous works, we validated our method on two benchmark datasets: the 2018 Atrial Segmentation Challenge dataset and the NIH Pancreas dataset.
Please note that we do not have permission to redistribute the data. If you are interested, please follow the instructions below to download and process the data; otherwise, your results may not match ours.
Atrial Segmentation: http://atriaseg2018.cardiacatlas.org/
- The above link seems to be out of service. You may find the data at: https://www.cardiacatlas.org/atriaseg2018-challenge/
Pancreas dataset: https://wiki.cancerimagingarchive.net/display/Public/Pancreas-CT
- If you encounter issues downloading the data, you may find the same data at: https://academictorrents.com/details/80ecfefcabede760cdbdf63e38986501f7becd49
- Please note that the orientation of the data downloaded from this link is not correct; please correct it manually.
We split the data following previous works. The detailed splits are stored as .csv files in the data folder.
Download the data from the URLs above, then run the scripts prepare_la_dataset.py and prepare_pancreas_dataset.py, passing the data location as an argument.
RCPS can be extended to other datasets with some modifications.
- All of the data should be stored as NIfTI files, NumPy arrays, or another file format that the MONAI library can handle.
- Please modify the prepare_experiment function in configs/experiment.py: define your own task name, and pass the number of classes, the class names, and the affine matrix of the 3D volume data.
- You need to create a $YOURTASK.cfg file in the configs folder to pass the necessary arguments to the algorithm, where $YOURTASK is the task name you defined.
In this scenario, all of the training images are labeled. Semi-supervised learning is used to investigate model performance under different labeled-data ratios.
In this case, split your data into training and validation sets under the root path of your data storage. The expected directory structure is listed below:
- data_root
  - train_images
  - train_labels
  - val_images
  - val_labels
Note that all of the images should be labeled. If some images are treated as "unlabeled", their segmentation labels remain untouched during training, i.e., the labels are not used to supervise the training.
In this scenario, some of the training images are unlabeled. Semi-supervised learning is used to enhance segmentation performance.
In this case, split your data into training and validation sets under the root paths of your data storage. The expected directory structure is listed below:
- labeled_root
  - train_images
  - train_labels
  - val_images
  - val_labels
- unlabeled_root
  - train_images
In labeled_root, you should store the labeled training and validation data; in unlabeled_root, you should store your extra unlabeled data.
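Before training, it can help to confirm that both roots contain the expected sub-folders. The following is a minimal sketch using only the standard library; the helper names (missing_subdirs, check_semi_supervised_layout) are hypothetical and are not part of this repo.

```python
from pathlib import Path

# Sub-folders named in the two-root layout described above.
LABELED_SUBDIRS = ("train_images", "train_labels", "val_images", "val_labels")
UNLABELED_SUBDIRS = ("train_images",)


def missing_subdirs(root, required):
    """Return the names in `required` that are not directories under `root`."""
    root = Path(root)
    return [name for name in required if not (root / name).is_dir()]


def check_semi_supervised_layout(labeled_root, unlabeled_root):
    """Map each root to its missing sub-folders (empty lists mean the layout is complete)."""
    return {
        "labeled_root": missing_subdirs(labeled_root, LABELED_SUBDIRS),
        "unlabeled_root": missing_subdirs(unlabeled_root, UNLABELED_SUBDIRS),
    }
```

The same missing_subdirs helper can be pointed at a single data_root with LABELED_SUBDIRS to check the fully labeled layout.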
The train.py file should be modified as follows:
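from torch.utils.data.distributed import DistributedSampler
from utils.iteration.load_data_v2 import RealSemiSupervisionPipeline
...
# Build the labeled training, unlabeled training, and validation datasets from the two roots
data_pipeline = RealSemiSupervisionPipeline(labeled_root, unlabeled_root)
trainset, unlabeled_set, valset = data_pipeline.get_dataset(train_aug, val_aug, cache_dataset=False)
# Create one DistributedSampler per dataset for DDP training
train_sampler = DistributedSampler(trainset, shuffle=True)
unlabeled_sampler = DistributedSampler(unlabeled_set, shuffle=True)
val_sampler = DistributedSampler(valset)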
By using RealSemiSupervisionPipeline, you can generate the labeled training dataset, the unlabeled dataset, and the validation dataset, respectively. Training then proceeds exactly as for the LA or Pancreas dataset.
The difference between the real semi-supervised scenario and the changing-label-ratio scenario is that all of the training data in the latter case is labeled (but some images are treated as unlabeled data, so their labels are ignored and never used for training).
- In the latter case (changing ratios), you can evaluate the effectiveness of the semi-supervised algorithm and check its performance as the ratio of unlabeled data changes;
- In the former case (real semi-supervised scenario), you can use the algorithm to improve segmentation performance over training with labeled data alone.
However, these two scenarios do NOT alter model effectiveness. Consider these two cases:
- 100 labeled images (labeled ratio is set to 10%)
- 10 labeled images with 90 unlabeled ones
The model should yield very close performance in these two cases, as they are essentially identical.
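The equivalence of the two cases can be illustrated with a small sketch: splitting 100 labeled cases at a 10% labeled ratio leaves 10 cases whose labels are used and 90 whose labels are ignored. The function below is a hypothetical illustration that uses a seeded random shuffle; the repo itself relies on the fixed .csv splits in the data folder, so this is not how the benchmark splits were produced.

```python
import random


def split_by_labeled_ratio(case_ids, labeled_ratio, seed=0):
    """Split case IDs into (labeled, unlabeled) lists for a given labeled ratio."""
    if not 0.0 < labeled_ratio <= 1.0:
        raise ValueError("labeled_ratio must be in (0, 1]")
    ids = sorted(case_ids)
    # A fixed seed keeps the split reproducible across runs.
    random.Random(seed).shuffle(ids)
    num_labeled = max(1, round(len(ids) * labeled_ratio))
    return ids[:num_labeled], ids[num_labeled:]


# Hypothetical case names, used only for illustration.
cases = [f"case_{i:03d}" for i in range(100)]
labeled, unlabeled = split_by_labeled_ratio(cases, labeled_ratio=0.10)
print(len(labeled), len(unlabeled))  # 10 90
```

From the training loop's point of view, the 90 cases in the second list behave like unlabeled data whether or not label files exist for them on disk.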
We provide pretrained checkpoints for RCPS on the LA and NIH-Pancreas datasets.
Google Drive: https://drive.google.com/drive/folders/15-2oBw-11bNMhSCRxzLSHisTbyOcD3gv?usp=sharing
Baidu Netdisk: https://pan.baidu.com/s/1wNY06gOmxy8lZzcKiYVR-g?pwd=0512
Extraction code: 0512
To run our code, you need a Linux PC equipped with at least one NVIDIA graphics card. At least 8GB of video memory is recommended. Graphics cards from generations newer than Turing are also recommended, as they get an extra speed-up from PyTorch native mixed precision training.
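A quick way to compare your hardware against these recommendations is sketched below. It assumes that "later than Turing" corresponds to a CUDA compute capability above 7.5 (Turing's value), which is an interpretation rather than something the repo states, and it measures the 8GB threshold in decimal gigabytes because cards sold as 8GB usually report slightly less than 8 GiB. The function names are hypothetical.

```python
MIN_MEMORY_BYTES = 8 * 10**9   # recommended video memory, in decimal gigabytes
TURING_CAPABILITY = (7, 5)     # CUDA compute capability of Turing GPUs


def meets_recommendation(total_memory_bytes, capability):
    """Return (enough_memory, newer_than_turing) for one GPU."""
    enough_memory = total_memory_bytes >= MIN_MEMORY_BYTES
    newer_than_turing = tuple(capability) > TURING_CAPABILITY
    return enough_memory, newer_than_turing


def report_gpus():
    """Check every visible GPU; returns an empty list if PyTorch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    results = []
    for index in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(index)
        capability = torch.cuda.get_device_capability(index)
        results.append((props.name, *meets_recommendation(props.total_memory, capability)))
    return results


if __name__ == "__main__":
    for name, enough_memory, newer_than_turing in report_gpus():
        print(f"{name}: enough memory={enough_memory}, newer than Turing={newer_than_turing}")
```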
If you encounter CUDA OOM issues, please change the SAMPLE_NUM argument in the cfg file in the configs folder to a smaller value (100, for example).
In order to run our code, please install the latest versions of the following packages:
numpy
scipy
pandas
matplotlib
pyyaml
wandb
pytorch
torchvision
monai
nibabel
tqdm
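Some of the package names above differ from the module names used in import statements (pyyaml is imported as yaml, and pytorch as torch). The sketch below checks which of the listed packages can be found in the current environment; it only tests for presence, not for versions, and the helper name missing_packages is hypothetical.

```python
from importlib.util import find_spec

# Package name from the list above -> module name used in `import` statements.
IMPORT_NAMES = {
    "numpy": "numpy",
    "scipy": "scipy",
    "pandas": "pandas",
    "matplotlib": "matplotlib",
    "pyyaml": "yaml",
    "wandb": "wandb",
    "pytorch": "torch",
    "torchvision": "torchvision",
    "monai": "monai",
    "nibabel": "nibabel",
    "tqdm": "tqdm",
}


def missing_packages(import_names=IMPORT_NAMES):
    """Return the listed packages whose top-level module cannot be found."""
    return [pkg for pkg, module in import_names.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    print("All packages found." if not missing else f"Missing: {', '.join(missing)}")
```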
Please enter the following command in the terminal:
CUDA_VISIBLE_DEVICES=$CUDA_DEVICE_NUMBER torchrun --nproc_per_node=$NUM_GPU train.py --mixed --benchmark --task $TASK --exp_name $EXP_NAME --wandb --entity $USER_NAME
- CUDA_DEVICE_NUMBER: the CUDA device number visible to the training script, which can be found with the nvidia-smi command;
- NUM_GPU: number of GPUs used during training, at least 1 (our DDP supports the single-card scenario);
- TASK: task name;
- EXP_NAME: experiment name;
- USER_NAME: user name for WandB.
For further instructions, please run the command with the -h argument.
Please enter the following command in the terminal:
CUDA_VISIBLE_DEVICES=$CUDA_DEVICE_NUMBER torchrun --nproc_per_node=$NUM_GPU eval.py --mixed --benchmark --task $TASK --exp_name $EXP_NAME -pc $CKPT
- CUDA_DEVICE_NUMBER: the CUDA device number visible to the evaluation script, which can be found with the nvidia-smi command;
- NUM_GPU: number of GPUs used during evaluation, at least 1 (our DDP supports the single-card scenario);
- TASK: task name;
- EXP_NAME: experiment name;
- CKPT: the path of the checkpoint file.
For further instructions, please run the command with the -h argument.
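If you prefer to launch training or evaluation from a Python script, the two commands above can be assembled programmatically. This is a minimal sketch that uses only the flags shown in this README; the helper names, task name, experiment name, user name, and checkpoint path are all hypothetical placeholders.

```python
import os
import shlex


def build_command(script, num_gpu, task, exp_name, extra_args=()):
    """Assemble the torchrun argument list for train.py or eval.py."""
    if num_gpu < 1:
        raise ValueError("num_gpu must be at least 1")
    return [
        "torchrun", f"--nproc_per_node={num_gpu}", script,
        "--mixed", "--benchmark",
        "--task", task, "--exp_name", exp_name,
        *extra_args,
    ]


def build_env(cuda_devices):
    """Copy the current environment and restrict the visible CUDA devices."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = cuda_devices
    return env


# Hypothetical values; replace them with your own task, names, and paths.
train_cmd = build_command("train.py", 1, "my_task", "my_exp", ["--wandb", "--entity", "my_user"])
eval_cmd = build_command("eval.py", 1, "my_task", "my_exp", ["-pc", "checkpoints/model.pth"])
print(shlex.join(train_cmd))
print(shlex.join(eval_cmd))
# To launch: subprocess.run(train_cmd, env=build_env("0"), check=True)
```

Keeping the command as a list of arguments avoids shell quoting problems when a task name or checkpoint path contains spaces.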