This repository contains the implementation of DIPS, a data-centric method that improves pseudo-labeling under imperfect or noisy 'labeled' data, introduced in the paper "You can’t handle the (dirty) truth: Data-centric insights improve pseudo-labeling".
DIPS improves a variety of state-of-the-art pseudo-labeling (semi-supervised learning) algorithms via data-centric insights.
For more details, please read our DMLR paper: You can’t handle the (dirty) truth: Data-centric insights improve pseudo-labeling.
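To make the data-centric idea concrete, here is a minimal, hypothetical sketch of selecting reliably-labeled samples via training dynamics before pseudo-labeling. This is an illustration of the general technique only: the function name, threshold, and selection rule are assumptions, not the repository's actual API.

```python
import numpy as np

def select_confident_samples(probs_per_epoch, labels, conf_threshold=0.75):
    """Toy illustration of data-centric sample selection (not DIPS's exact rule).

    probs_per_epoch: array of shape (epochs, n_samples, n_classes) holding a
    classifier's predicted probabilities recorded at each training epoch.
    Samples whose average confidence in their (possibly noisy) label stays
    high across training are kept; the rest are flagged as unreliable.
    """
    probs_per_epoch = np.asarray(probs_per_epoch)
    labels = np.asarray(labels)
    # Average predicted probability of each sample's assigned label across epochs
    avg_confidence = probs_per_epoch[:, np.arange(len(labels)), labels].mean(axis=0)
    # Boolean mask: True for samples deemed reliable enough to learn from
    return avg_confidence >= conf_threshold
```

A selection step of this kind can be applied both to the initial labeled set and to pseudo-labeled samples at each iteration, which is the spirit of the data-centric insight.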
- Clone the repository
- (a) Create a new virtual environment with Python 3.10, e.g.:
virtualenv dips_env
- (b) Alternatively, create a new conda environment with Python 3.10, e.g.:
conda create -n dips_env python=3.10
- With the venv or conda env activated, run the following commands from the repository directory:
- Install the minimum requirements to run DIPS:
pip install -r requirements.txt
- Link the environment to a Jupyter kernel:
python -m ipykernel install --user --name=dips_env
Outputs from scripts can be logged to Weights & Biases (wandb). An account is required, and your WANDB_API_KEY and entity need to be set in the provided wandb.yaml file.
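The exact schema of wandb.yaml is defined by this repository; as a rough, hypothetical sketch (the key names below are illustrative assumptions, not verified against the repo), it might look like:

```yaml
# Hypothetical wandb.yaml sketch -- key names are illustrative assumptions
WANDB_API_KEY: "your-api-key-here"
entity: "your-wandb-entity"
```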
To get started with DIPS, try the tutorial.ipynb notebook in the root folder.
To run the tabular experiments, run the bash scripts found in the scripts folder; results are logged to wandb. For example:
bash run_real_tabular.sh
To run the notebook experiments, run any of the Jupyter notebooks (.ipynb) found in the notebooks folder.
Details on running DIPS for computer vision tasks (such as FixMatch) can be found in the fixmatch folder. Requirements specific to these experiments are contained therein.
If you use this code, please cite the associated paper:
@article{dips2024,
title={You can't handle the (dirty) truth: Data-centric insights improve pseudo-labeling},
author={Nabeel Seedat and Nicolas Huynh and Fergus Imrie and Mihaela van der Schaar},
journal={Journal of Data-centric Machine Learning Research},
year={2024},
}