-
Notifications
You must be signed in to change notification settings - Fork 34
Variant effect prediction
Anusri Pampari edited this page Jul 3, 2023
·
8 revisions
Note: chrombpnet snp_score command will be deprecated in future versions, please refer to https://github.com/kundajelab/variant-scorer for the updated tool.
The following assumptions are made with this script - make changes accordingly if the assumptions dont hold.
- The script is designed to work only with SNPs - so make sure reference and alternate allele are just one character.
- The script returns the following three effect scores -
- log counts difference in alternate allele and reference allele predictions (
log_counts_diff
) - Sum of absolute difference in log probabilities per base (
log_probs_diff_abs_sum
) between alternate allele and reference allele predictions - Jensenshannon distance between alternate allele profile probability predictions and reference allele profile probability predictions (
probs_jsd_diff
).
- log counts difference in alternate allele and reference allele predictions (
- Carefully read the input format of
SNP_DATA
below. - The input sequence length (
inputlen
) is inferred from the model. In chromBPNet models theinputlen
used is an even number to ensure symmetry. So here we insert the allele atinputlen
//2 locus (assuming 0-based indexing of sequence[0,inputlen)
) - which means that the sequence left of allele is 1bp longer than the sequence right of allele. - If the reference/alternate allele are at the edge of chromosome - preventing us from generating an
inputlen
sequence - we will ignore this SNP and print the locus information of the ignored SNP. This might result in the final number of output SNPs being predicted on being smaller than the given input SNPs. Read the output format section to see how this effects the final output.
chrombpnet snp_score [-h] -snps SNP_DATA -m MODEL_H5 -g GENOME -op OUTPUT_PREFIX [-bs BATCH_SIZE] [-dm DEBUG_MODE_ON]
required arguments:
-snps SNP_DATA, --snp-data SNP_DATA
Path to a tsv output with the following information in columns - chr, position to insert allele (0-based), ref
allele, alt allele
-m MODEL_H5, --model-h5 MODEL_H5
Path model .h5 file
-g GENOME, --genome GENOME
Genome fasta
-op OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX
Output prefix for bigwig files
optional arguments:
-bs BATCH_SIZE, --batch-size BATCH_SIZE
batch size to use for prediction
-dm DEBUG_MODE_ON, --debug-mode-on DEBUG_MODE_ON
Use this mode to print the flanks of first five SNP insert locations
-
SNP_DATA
: A TSV file with the following 5 columns - chr, position (0-based) to insert the allele, reference allele, alternate alllele, meta-data. You can leave the meta-data empty too.- Meta-data column can be used to provide the following information such as p-value significance of the SNP, observed effect scores etc. This information will be added as column information to the output (see below) which you can use for downtream analysis. Provide any SNP related information to this column as comma separated values.
- The reference genome loader used is
pyfaidx
which is 0-based and hence the allele position provided insnp_data_tsv
is expected to be 0-based too. Always check if you are inserting the allele correctly by using the code in thedebug_mode_on
. - Make sure there are no duplicate rows in this file.
- Example
SNP_DATA
with 7 snp insertions looks as follows. The meta-data column provides observed effect scores comma separated with ground truth labels.
chr17 18967175 A G 1.9114519946647748,1
chr4 176935912 C A 3.2978830709253413,1
chr1 144534082 C T -2.6229086754131736,1
chr17 19015380 T A 5.703283573989607,1
chr1 191859377 A G 0.021914150011206716,0
chr1 157776262 T G 0.007600049165843743,0
chr1 179303792 G A 0.0013620075140122916,0
-
DEBUG_MODE_ON
: Takes a value of 1 or 0. This is by default set to 0. When set to 1 we score only the first 5 SNPs inSNP_DATA
. In addition we also print to the console the right and left flank of the locus where the SNP is being inserted in. You can check if these flanks match with the flanks as reported in existing databases such as dbSNP (https://www.ncbi.nlm.nih.gov/snp/). If it does not your position values inSNP_DATA
will need correction.
-
output_prefix
_snp_scores.tsv: A TSV file with 8 columns - five columns copied in from thesnp_data_tsv
along with with the following three added columns -log_counts_diff
,log_probs_diff_abs_sum
,probs_jsd_diff
which represent the following --
log_counts_diff
: log counts difference in alternate allele and reference allele predictions -
log_probs_diff_abs_sum
: Sum of absolute difference in log probabilites per base between alternate allele and reference allele predictions -
probs_jsd_diff
: Jensenshannon distance between alternate allele profile probability predictions and reference allele profile probability predictions. The number of rows in this TSV file can be less than rows provided in thesnp_data_tsv
- this is because we are skipping reference/alternate allele that fall at the edge of chromosome preventing us from generating aninputlen
sequence.
-
-
output_prefix
_predictions_at_snp.pkl: A pickle file containing a dictionary with the following keys -rsids
,ref_logcount_preds
,alt_logcount_preds
,ref_prob_preds
,alt_prob_preds
. This pickle stores the model predictions at each of the SNP locations.-
rsids
consists of a list of strings - each value formed by concatenating the following 5 values (Chr, position (0-based) to insert the allele, reference allele, alternate alllele, meta-data) seperated by a underscore. An example rsid will look like this -chr1_100054_A_G_0.55
when meta data is provided, when meta-data is empty it looks like this -chr1_100054_A_G_
-
ref_logcount_preds
: Consists of the log count predictions when the reference allele is inserted. Has the same length as thersids
-
ref_logcount_preds
: Consists of the log count predictions when the alternate allele is inserted. Has the same length as thersids
-
ref_prob_preds
: Consists of the profile probability predictions when the reference allele is inserted. Has the same length as thersids
-
alt_prob_preds
: Consists of the profile probability predictions when the alternate allele is inserted. Has the same length as thersids
-