Skip to content

Latest commit

 

History

History
42 lines (42 loc) · 2.65 KB

post-processing.md

File metadata and controls

42 lines (42 loc) · 2.65 KB

NOTE THAT ALL FOLLOWING SCRIPTS ASSUME THAT ALL ALGORITHMS HAVE BEEN RUN FOR ALL DATASETS AND MAY NOT WORK WITHOUT ALL DATA PRESENT

Post-processing

  • result_gathering.ipynb
    • Gathers the predictions from the cluster labelling methods and combines them with the predictions from the cell based methods
    • NOTE Because adobo and scCATCH are semi-supervised and use their own gene signature database, their predictions may not exactly match what is present in the gold standards (for instance we may only know for sure that a cell is a T Cell, but it may be predicted to be a memory T Cell or helper T Cell). I could not find a way to make these match automatically, so the labels are renamed manually prior to running this script
  • fmeasure_scripts/F-Measure-Analysis-Bootstrap.R
    • Obtains bootstrapped F1 Scores for each algorithm on each dataset
    • Disclaimer This takes a long time to run. Multiple hours, especially if it is not modified to exlude some algorithms or the largest datasets.
    • Inputs: Prediction files for each dataset
    • Ouputs: Large dataframe containing bootstrapped F1 scores for each dataset
  • score_all_methods.ipynb
    • Obtains per-class F1 score among other metrics for all datasets and algorithms, gathered into a single large dataframe
    • Inputs: Prediction files for each dataset
    • Ouputs: Large dataframe containing per-class as well as macro and micro averaged scores for each dataset-algorithm pair
  • main_figures.ipynb
    • Generates most of the main figures for our paper
    • Inputs:
      • data_sizes.tsv: A hardcoded dataframe outlining aspects of the datasets for figure 1
      • Rdata/F-Measure-Bootstrap-Ensemble.tsv: A file containing bootstrapped F1 scores from all methods. Output from F-Measure-Analysis-Bootstrap.R
      • times/df_for_heatmap.tsv: A dataframe with running times output from an earlier script
      • times/df_coef.tsv: A dataframe with regression coefficients from ./cell_based_program/other_scripts/time_coefficient_plot.ipynb
      • performance/seurat/bigdf.tsv: The large dataframe output from score_all_methods.ipynb
    • Outputs: plots for the paper under ./plots
  • underrepresented_cell_types.ipynb
    • Makes the main and supplemental underrepresented cell types plots
    • Inputs: performance/seurat/bigdf.tsv: The large dataframe output from score_all_methods.ipynb
    • Outputs: The two underrepresented cell types plots