Scripts and data used to prepare a Kaggle dataset.
Generate the dataset from the ClinVar .vcf with VEP annotations:
python process_clinvar.py
This generates a version of the file clinvar_conflicting.csv with VEP annotations.
Check out the notebook to see some exploratory data analysis.
The objective is to predict whether a ClinVar variant will have conflicting classifications.
A variant has conflicting classifications when its submissions fall into at least two of the following three classification categories. Two submissions within the same category are not considered conflicting.
- Likely Benign or Benign
- VUS
- Likely Pathogenic or Pathogenic
The CLASS feature in clinvar_conflicting.csv is a binary representation of whether or not a variant has conflicting classifications, where 0 represents consistent classifications and 1 represents conflicting classifications.
Since this problem only relates to variants with multiple classifications, I removed all variants from the original ClinVar .vcf that had only one submission.
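The labeling rule described above can be sketched in a few lines of Python. This is only an illustration and not the code in process_clinvar.py: the CATEGORY mapping, the function names conflict_class and label_variants, and the input shape (a dict from a variant id to a list of submitted classifications) are all assumptions made for the example.

```python
from typing import Dict, Iterable, List

# Assumed mapping from classification terms to the three categories above.
CATEGORY = {
    "benign": "benign",
    "likely benign": "benign",
    "uncertain significance": "vus",
    "vus": "vus",
    "likely pathogenic": "pathogenic",
    "pathogenic": "pathogenic",
}


def conflict_class(submissions: Iterable[str]) -> int:
    """Return 1 if the submissions span two or more categories, else 0."""
    categories = set()
    for term in submissions:
        # Normalize case and underscores, e.g. "Likely_benign" -> "likely benign".
        key = term.strip().lower().replace("_", " ")
        if key in CATEGORY:
            categories.add(CATEGORY[key])
    return 1 if len(categories) >= 2 else 0


def label_variants(variants: Dict[str, List[str]]) -> Dict[str, int]:
    """Drop single-submission variants, then label the remaining ones."""
    return {
        variant_id: conflict_class(subs)
        for variant_id, subs in variants.items()
        if len(subs) > 1
    }
```

For example, a variant submitted as Benign and Likely benign would get CLASS 0 under this sketch, while one submitted as Benign and Uncertain significance would get CLASS 1.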
ClinVar is a public resource containing annotations about human genetic variants. These variants are classified on a spectrum from benign through likely benign, uncertain significance, and likely pathogenic to pathogenic. Variants that have conflicting classifications (defined above) can cause confusion when clinicians or researchers try to interpret whether the variant has an impact on the disease of a given patient.
I'm exploring ideas for applying machine learning to genomics. I'm hoping this project will encourage others to think about the additional feature engineering that's probably necessary to predict conflicts with confidence. There could be benefit in identifying single-submission variants that may yet be assigned a conflicting classification.
Ensembl's Variant Effect Predictor (VEP) was used to annotate the original ClinVar .vcf. It provides additional information about variants that can serve as features for the dataset.
Download the annotated .vcf and rename it clinvar.annotated.vcf.
Create the new dataset with VEP annotations:
python process_clinvar.py
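Once clinvar_conflicting.csv has been generated, a quick sanity check is to count how many rows carry each CLASS value. The sketch below assumes only that the file has a header row containing a CLASS column, as described above. The function name class_balance is made up for this example, and the values are counted as the strings read from the file.

```python
import csv
from collections import Counter
from typing import Iterable


def class_balance(lines: Iterable[str]) -> Counter:
    """Count rows per CLASS value from an iterable of CSV text lines."""
    reader = csv.DictReader(lines)
    return Counter(row["CLASS"] for row in reader)


# Usage with the generated file (an open file handle is an iterable of lines):
# with open("clinvar_conflicting.csv", newline="") as handle:
#     print(class_balance(handle))
```

A strongly uneven count between 0 and 1 would be worth knowing about before training a classifier on the data.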