Skip to content

Latest commit

 

History

History
50 lines (41 loc) · 3.47 KB

README.md

File metadata and controls

50 lines (41 loc) · 3.47 KB

Confound-Leakage: Confound Removal In Machine Learing Leads To Leakage

In this repo I am uploading the code and instructions for replication for the paper. To have a look at how the paper's plots and anlyses were created see: Analyses

What you need

  • python3: please install python and all the requirements (pip install -r requirements.txt) ** I did run the analyses again while revising the manuscript using sklearn=1.1 to get some new visualisations So, you need to update to that as well if you want all visualisations to work
  • The data comes from the UCI repository.
  • Computation infrastructure: This repo uses an HTCondor server to submit jobs. You can run all scripts manually on any modern machine, but it could take some time without some copute cluster

How to execute:

  1. setup the environment as mentioned above
  2. Get Data
    • Many datasets are downloaded directly from UCI Repo, but the following once need to be downloaded in put into ./data/raw/:
      • /bank-additional/bank-additional-full.csv
      • /raw/student/student-mat.csv student
        So, please go to the UCI repository following the links above, next click on Data Folder. Here, you will find .zip files with the same name as the .csv you see above. Dowload and unzip them. Lastly, create a ./data/raw/ folder (if not existing yet) and put .csv into it.
  3. Run data preparation script: ./00_prepare_data.sh
  4. Run experiments: ./01_condor_submission.sh:
    • For exact reproduction:
      • This is easiest with a HTCondor. If you are on one just use bash ./01_condor_submission.sh
      • Else you will have to adjust the submission process to your ecosystem:
        • if you run all the lines in ./01_condor_submission.sh wihtout the | condor_submit you will see all the things I am running in the HTCondor submit file style
        • from here you can either run things manually or rewrite it for you environment
    • A good alternative might be:
      • run the experiments you are interested in by running the python file: python3 ./src/run_analysis.py ... where instead of ... you put in the needed arguments
      • you can find the arguements needed in the if __name__ == '__main__': at the bottom of the ./src/run_analysis.py
  5. Now you can create the jupyter book to get an overview of the anaysis by using:
    • bash ./02_build_analyses.sh
    • or run the python script in the hydro style inside of ./analyses/content/

Overview used UCI datasets (with links):