This version is a major update. The new features and benchmarks are explained in a technical note titled “Improving the performance of ClairS and ClairS-TO with new real cancer cell-line datasets and PoN”. A summary of changes:
- Starting from this version, ClairS will provide two model types. ssrs is a model trained initially with synthetic samples and then real samples augmented (e.g., ont_r10_dorado_sup_5khz_ssrs), ss is a model trained from synthetic samples (e.g., ont_r10_dorado_sup_5khz_ss). The ssrs model provides better performance and fits most usage scenarios. ss model can be used when missing a cancer-type in model training is a concern. In v0.4.0, four real cancer cell-line datasets (HCC1937/BL, HCC1954/BL, H1437/BL, and H2009/BL) covering two cancer types (breast cancer, lung cancer) published by Park et al. were used for ssrs model training.
- Added BQ jittering in model training to address the BQ distribution difference between the training and calling datasets that leads to performance drop.
- Added the --indel_min_af option and adjusted the default minimum allelic fraction requirement to 0.1 for Indels in ONT platform.