- use the effect_start and effected_end to build the reference Is it enough to replace and flatten afterwards? i.e. ref[effected_start:effected_end] = chosen_alt
- we do this only if we are doing `call_base`, not ambiguous, right?
- you don’t want to actually push around the reference until after you do the fckn calls; so could represent
- don’t need to worry about calling N because we fill in the ref ahead of time–(though it seems like this could get it wrong but it’s what we’ve been doing)
- copy ambiguous table from exisitng consensus code
Position = int
- then run on DoD cluster?
- !! look at what someone else is doing w/ viral variant calling, esp. prep for [other] variant callers: we won’t do that prerp in NGSBasecaller runs
- ngs-doit, assuming it’s finished
- consider just using a straightforward (linear) bash-style python script <= prefer this probably to ngs-shake
- use samtools stats and plot-bamstats [might fail for novaseq?]
- look at indel correction, marking duplicates, etc.
- replace tagreads
- See Needs to run on windows
- ignore the sophisticated progress reporting and release pretty looking
STRONGLY prefer electron, because it’s pretty, and it’s mostly finished
- hopefully electron still works? I was going to port that onto VDB? Or I though I could run a server on the cluster?
- or there was some way to run it is an an HTML file in the browser?
- can I run a java app on it? or even python?
int a=1;
int b=1;
printf("%d\n", a+b);
(setq org-confirm-babel-evaluate nil)
(setq org-src-preserve-indentation t)
(org-babel-do-load-languages
'org-babel-load-languages
'(
(emacs-lisp . t)
;; (shell . t)
(C . t)
(python . t)))
(print "hetoo")
- time sheet
- start clock
- discuss goals and map plan
- update Pap after lunch
Papa: what does the end goal look like?
- clone our template
- test file and source file and requirements file
- must start with test_
- goes in tests/test_whatever.py
- just install pytest
- in future would automate
digraph G {
FastaFile -> {RefSeq1, RefSeq2};
VCFFile -> {Chrom1, Chrom2};
Chrom1 -> RowPerPosition;
Bamfile -> DepthPerPosition;
FastaFile -> BasePerPosition;
{BasePerposition, UserOptions, DepthPerPosition, RowPerPosition} -> Caller;
Caller -> {Consensus, Report};
}
note that this is different form the SOP in base_caller,py, which changes the quality of bases based on depth and so on: epilog = ”’WRAIR VDB Base Calling SOP: Depth < 10: All bases with base quality < 25 get set to N Then the base is called on the percentage(See below) Depth > 10: All bases with base quality < 25 get removed Then the base is called on the percentage(See below)
Calling on Percentage: Any base with >= 80% majority is called
- or -
N is called if the depth was < 10 and N is > 20%
- or -
The specific IUPAC ambiguious base is called for all bases over 20%
- or -
The majority base is called ”’,
options = CallOptions(3, 2, 3, 4)
def call_all(ref: str, vars: Dict[Position, Variant], raw_depths: Dict[Position, int], VARCLASS: type, options: CallOptions) -> Sequence[Call]:
calls: Sequence[Call] = []
# consensus2d: Sequence[Sequenece[Base]]
# for i, var in enumerate(vars):
# wait, don't need to cover NGSBasecaller because we already have a consensus generator for it, pretty much, right!?
assert len(ref) == len(raw_depths)
# for pos, in range(len(ref)):
schemas = load_user_call_schemas(VARCLASS, user_schema_path)
# ref_len = len(ref)
MAX_CONSENSUS_LEN = len(ref) # works b/c we're not doing insertions
while pos < ref_len:
ref_base, samtools_depth = ref[pos], raw_depths[pos]
var = vars.get(pos, None)
# if VARCLASS == NGSBaseCaller:
if samtools_depth < options.minTotalDepthNotN:
call = UnderDepth((pos, pos+1), total_depth, MIN_DEPTH)
pos += 1
# use variant caller options in order to enforce
elif (not var):
call = RefNoVariant( (pos, pos+1), [ref_base], ref_ratio)
pos += 1
else:
# TODO: try calling per-bases using user schemas
result = try_call_alts_seperately(var, schemas['callAlternate'])
if not result:
# write now possible insertions give
# TODO: remove duplicate VCF rows
# note that it is possible to have /overlapping/ variants in differen positions with freebayes.
# should be able to over come this in the build_consensus function by taking the union
# of the two calls-basically lookup in the ambiguity table.
# This matches requirements because when there is more than one alt we call it ambiguous.
# For now, we just pick the first one and move on. However, we need some way of telling if it's overlapped I guess? No, we just skip it anyway
# TODO: the above 'more-than-one-alt' needs to be coded somewhere around here.
# or if # sum(var.AF) < options.whatev
non_alt_ratio = 1.0 - sum( var.AF )
call = RefCallOverVairant( (pos, pos+1), [ref_base], non_alt_ratio, var.DP)
calls.append(call) # [ call.window.start : call.window.inclusiveEnd ]
return calls
We give the opportunity here for [schema-]specification of behavior when more than one ALT is found. this can’t be represented in the datatype schema itself. callSingleDespiteOtherVariantsAF : .8 but default to 1, for now. The below `try_call_alts_seperately` should be split up into a ‘call-single’ function which simply applies a `schema` For now we do my default schema. Another function, perhaps simply in the above one, decides whic ALT to use or how to combine them (via ambiguity table). Note that the logic is very simple here because we are just checing the AF threshold. In the case of an overlapped call, if one is ambiguous, we pull out the `contributing bases` to be combined with the overlapped call. However, we shouldn’t actually do this, but instead delay the ambiguiuty collapse until
We also collapse the overlap variants. NOTE that any call made here can be overlapped by a call on a preceding position (ugh). We trust the data of our original call, though, because it should on
#+NAME find_overlapping_variants
from itertools import groupby
TmpVar = dict
def findOverlappingVariants(vars: Sequence[TmpVar]) -> Sequence[Sequence[TmpVar]]:
def overlap(zss, a):
print(zss, a)
try:
z = zss[0][0]
if a[0] <= z[1]: # replace [0] and [1] with window start and end
return [zss[0] + [a]] + zss[1:]
except IndexError as e:
print(e)
pass
default = [[a], zss]
return default
result = reduce(overlap, inputs, [])
return reversed(result)
inputs = [(0, 3), (2, 4), (5, 6), (5, 7), (20, 21)]
from functools import reduce
print(reduce(overlap, inputs, []))
overlap: groupby(
- Main function: interacts with outside world
- Parses everything
- takes user inputs
- validate inputs
- @calls the system logic
- writes the results of logic out somewhere
- Visit decision ‘nodes’ in order. If they ‘pass’ the check, return, else go to next node
digraph G {
UnderDepth -> CallAlt -> CallAmbiguous -> CallRef [label="NO"];
// CallAlt -> Return [label="YES"];
}
# how far can we get with json schema?
# we can express multiple possible alts by using `anyOf`
# -> express that we pick 'that one'
# callAmbiguous requires `sum`, however;
# -> where does the `minimum percentage contribute degen` come from?
percent20 = { "callN" : {"DP" : { "exclusiveMaximum" : 10, "<RESULT>" : "N" } },
"callAlt" : { "AF" : { "inclusiveMinimum" : 80, "<RESULT>" : "ALT" } },
"callAmbiguous" : { "inclusiveMinimum" : 20, "exclusiveMaximum" : 80, "<RESULT>" : "???" },
"callRef" : { "or" : [{ "PRC" : 20 }, { "AF" : { "inclusiveMaximum" : 100 - 80 } }]}}
# 95 & 1 are lofreq and don't use PRC
Percent1 = { "callN" : {"DP" : { "exclusiveMaximum" : 1000, "<RESULT>" : "N" } },
"callAlt" : { "AF" : { "inclusiveMinimum" : 1, "<RESULT>" : "ALT" } },
"callAmbiguous" : { "inclusiveMinimum" : 1, "exclusiveMaximum" : 99, "<RESULT>" : "???" },
"callRef" : { "or" : [{ "PRC" : >99 }, { "AF" : { "inclusiveMaximum" : 100 - 99 } }]}}
Percent5 = { "callN" : {"DP" : { "exclusiveMaximum" : 500, "<RESULT>" : "N" } },
"callAlt" : { "AF" : { "inclusiveMinimum" : 5, "<RESULT>" : "ALT" } },
"callAmbiguous" : { "inclusiveMinimum" : 5, "exclusiveMaximum" : 95, "<RESULT>" : "???" },
"callRef" : { "or" : [{ "PRC" : >95 }, { "AF" : { "inclusiveMaximum" : 100 - 95 } }]}}
@dataclass SchemasExample: underDepth = { “DP” : { “exclusiveMaximum” : MINDEPTH, “<RESULT>” : “N” } }
call_alt = { “AF” : { “inclusiveMinimum” : 0.8 }, “DP” : _over10, “INDEL” : False, “<RESULT>” : “{ALT}” } call_ambiguous_sum = { “DP” : _over10, “AF” { “inclusiveMinimum” : 0.01 }, “<RESULT>” : “<DEGEN>” } _consider_ambiguous_single = { “AF” { “inclusiveMinimum” : 0.01, “exclusiveMaximum” : 0.8 } } # consider this single base for an ambiguous call, “<RESULT>” : “<call_ambiguous_sum>” } call_ref = { “AF” : { “exclusiveMaximum” : 0.01 }, “DP” : _over10, “<RESULT>” : “{REF}” }
collapse_single_position_call as_ambiguous = ”.join(sorted(over20.keys()))
return AMBIGUITY_TABLE[as_ambiguous] if as_ambiguous != ” else ” def try_call_alts_seperately(var_dict): for i in range(num_alts): by_alts_single = dtz.valmap(get(i), by_alts_info) single_var_dict = dtz.merge(var_dict, by_alts_single) def get_debug_info(): return str(single_var_dict)
instance = VARCLASS(**single_var_dict) validations = [instance.validate(s) for s in [schemas.call_alt, schemas.call_ambiguous, # these two are done elesewhere schemas.call_ref] # they require the other ALT infos ] assert sum(map(is_valid, validations)) > 1, f”More than one valid call for {get_info()}, see {validations}” try:
failure1 = instance.validate(schemas.call_alt) if failure1: failure2 = instance.validate(schemas.call_ambiguous) if failure2: failure3 = instance.validate(schemas.call_ref) if failure3: raise ValueError(f”No call for {get_info()}”) except ValueError as e: CALL_LOGGER_DEBUG(f”Did not {trying} due to failure {e}”) from Bio.Seq import Seq from Bio.SeqRecord import SeqRecord #done
def build_consensus(consensus_id: str, calls: Sequence[Call]) -> SeqRecord: consensus: Sequence[Base] = []
for call in calls:
if (call.window.start == call.window.inclusiveEnd): # deletion pass consensus.extend(call.bases) return SeqRecord(seq=Seq(”.join(new_ref)), id=consensus_id)
@dataclass class Window: start: int inclusiveEnd: int
from dataclasses import dataclass
import abc
from typing import *
import vcf
Base = str
decimal = float
@dataclass
class CallOptions:
minTotalDepthNotN: int # => DP (or if missing samtools depth) minimum depth for it not to be N
callVariantRatio: decimal # => AF if any alternate is &&'s with [[minaltdepth]]
minAltDepth: int # => AC minimum depth for the alt to be called
callAmbiguousRatio: decimal # => AF if any alternate is over this ratio, it will be called as ambiguous, unless it is called as a variant (see [[callVariantRatio]])
@dataclass
class AbstractCall(metaclass=abc.ABCMeta):
window = Tuple[int, int] # effected_start: int effected_end: int
bases: Sequence[Base]
def size(self) -> int:
return window.inclusiveEnd - window.start
class UnderDepth(AbstractCall):
bases = ['N']
totalDepth: int
threshold: int
class RefNoVariant(AbstractCall): ...
''' somestimes won't have this info (if not in the vcf, for exmaple)
could put the samtools depth in here but the problem is that will be capped '''
samtoolsDepht: int
class RefCallOverVairant(AbstractCall):
nonAltRatio: decimal
totalDepth: int
record: vcf._Record
class Ambiguous(AbstractCall):
contributing_alts: Sequence[Base]
record: vcf._Record
class Variant(AbstractCall):
af: decimal # note this is a single value. AC etc. can be computed from it w/ totalDepth
totalDepth: int
record: vcf._Record
class Deletion(Variant):
bases = ['']
class Insertion(Variant):
def size(self) -> int:
return len(bases)
class SNP(Variant): ...
- Succeed the test case
- complete process code basic
- Real-world data – get data
- Ambiguity handling
- multiple alts
- picking result 1.
- Not in vcf but under depth
@dataclass
class VCFRow:
AF: float
DP: int
INDEL: bool
ALT: Base
POS: int
REF: Base
i = 0
x = VCFRow(AF=1, DP=1, INDEL=False, ALT='X', POS=i+1, REF='C')
print(x)
!which conda
from dataclasses import dataclass
from collections import OrderedDict
import jsonschema
from toolz import dicttoolz as dtz
from jsonschema.exceptions import ValidationError
Base = str
RefSeq = List[Base]
SamDepth = int
@dataclass
class VCFRow:
AF: float
DP: int
INDEL: bool
ALT: Base
POS: int
REF: Base
Recall that a consensus creation script for ngs_mapper base_caller.py exists already
from typing import List,Tuple
Schema = dict
# MAJORITY = 80
Log = str
from functools import partial
def percentN(MIND, MAJORITY): #TODO: requied
return OrderedDict({ "callN" : {"DP" : { "exclusiveMaximum" : MIND}, "<RESULT>" : lambda x: "N" } ,
"callAlt" : { "AF" : { "inclusiveMinimum" : MAJORITY} , "<RESULT>" : lambda x: x['ALT'] }, #TODO: ambiguous
"callAmbiguous" : {"AF?" : { "inclusiveMinimum" : 100 - MAJORITY, "exclusiveMaximum" : MAJORITY }, "<RESULT>" : "???" },
"callRef" : { "anyOf" : [{ "PRC" : 100 - MAJORITY },
{ "AF" : { "inclusiveMaximum" : 100 - MAJORITY } }] },
"<RESULT>" : lambda x: x["REF"] })
def consensus(depths: List[SamDepth], ref: RefSeq, alts: List[VCFRow], mind: int, majority: int) -> List[Base]:
keys = ["callN", "callAlt", "callAmbiguous", "callRef"]
schema = percentN(mind, majority)
result = list(map(partial(call_base, schema), depths, ref, alts))
return result
def call_base(schemaNodes: OrderedDict, depth: SamDepth, refBase: Base, alt: VCFRow) -> Tuple[Base, Log]:
vcf_dict = alt.__dict__
log = []
for key, node in schemaNodes.items():
try:
justSchema = {"type" : "object", "properties" : dtz.dissoc(node, '<RESULT>') }
print( vcf_dict, justSchema)
jsonschema.validate( vcf_dict, justSchema)
result = node["<RESULT>"](vcf_dict)
return result, log
except ValidationError as e:
log.append((vcf_dict, e)) # TODO: handle`jsonschema.exceptions.SchemaError
log.append( ValueError(f"FAIL"))
result = None
return result, log
def test_simplest_consensus20():
refSeq = [ 'A', 'C', 'C', 'C' ]
depths = [ 100 ] * 4
dumbRows = [VCFRow(AF=1, DP=1, INDEL=False, ALT='X', POS=i+1, REF='C') for i in range(3)]
alts = [VCFRow(AF=81, DP=100, INDEL=False, ALT='G', POS=0, REF='A')] + dumbRows
actual = consensus(depths, refSeq, alts, 10, 80)
actual, log = zip(*actual) # [x[0] for x in actual]
print(f"log: {log}")
expected = ( 'G', 'N', 'N', 'N' )
assert actual == expected, f"{actual} != {expected}"
print ("test_simplest_consensus20 passed!")
return OrderedDict({ "callN" : {"DP" : { "exclusiveMaximum" : MIND}, "<RESULT>" : lambda x: "N" } ,
"callAlt" : { "AF" : { "inclusiveMinimum" : MAJORITY} , "<RESULT>" : lambda x: x['ALT'] }, #TODO: ambiguous
"callAmbiguous" : {"AF?" : { "inclusiveMinimum" : 100 - MAJORITY, "exclusiveMaximum" : MAJORITY }, "<RESULT>" : "???" },
"callRef" : { "anyOf" : [{ "PRC" : 100 - MAJORITY },
{ "AF" : { "inclusiveMaximum" : 100 - MAJORITY } }] },
"<RESULT>" : lambda x: x["REF"] })
# create an "AMBIG" field for each alt which has the summed AF and ambiguous data already?
# only consider the highest AF-sum variant if there are /overlapping/ variants
# there cannot be more than one AF that will be over majority to be called an alt.
# QUESTION: what's the minimum percentage to consider for a degenerate base?
# At what percentage should the ref be considered?
# note that refcall and ambiguous call really operate on SUMAF.
_consider_ambiguous_single = { "AF" { "inclusiveMinimum" : 0.01,
test_simplest_consensus20()
jsonschema.validate({'AF': 81, 'DP': 100, 'INDEL': False, 'ALT': 'G', 'POS': 0, 'REF': 'A'}, {'required' : ['DP'], 'DP': {'exclusiveMaximum': 10}})
Correct behavior:
- be called ambiguous if ANY alt is AF > (100 - MAJORITY)
- consider any ALT or REF in the ambiguous mix if the percentage is greater than (100 - MAJORITY)
- call REF if
- a valid REF can have ambig bases.
Change:
- `consensus` assumes alts for each position: perhaps replace empty positions with a description of the samtools depth and the reference, or handle that after-the-fact.
- match the consensus calls up by POS, not assuming correct order.
A. Setup
- [X] Put nodes in a list
- [ ] setup log
B. Do for all positions:
- Iterate over nodes,
- return <RESULT> when sucess
- log failure
- [X] capture the error
- [ ] get input
- [X] make schema result right place
- [X] make schema and alt match (by expanding __dict__)
- [ ] capture
RefSeq = List[Base]
Base = str
SamDepth = int
@dataclass
class VCFRow:
AF: float
DP: int
INDEL: bool
ALT: Base
POS: int
REF: Base
from dataclasses import dataclass
def test_simplest_consensus20():
refSeq = [ 'A', 'C', 'C', 'C' ]
depths = [ 100 ] * 4
dumbRows = [VCFRow(AF=1, DP=1, INDEL=False, ALT='X', POS=i+1, REF='C') for i in range(3)]
alts = [VCFRow(AF=81, DP=100, INDEL=False, ALT='G', POS=0, REF='A')] + dumbRows
actual = consensus(depths, refSeq, alts)
expected = [ 'G', 'N', 'N', 'N' ]
assert actual == expected, f"{actual} != {expected}"
def main(): # entry point
def consensus(depths: List[SamDepth], ref: RefSeq, alt: List[VCFRow]) -> List[Base]: ...
''' len(ref) >= max(POS in VCFRow)
len(depths) == len(ref)
# ref base in rows and ref should match
Process:
Postcondition: Consensus sequence #TODO: define better
Report: for each possibility, what was met and not and how
'''
results = map(call_base,
def call_base(depth: SamDepth, refBase: Base, alt: VCFRow) -> Base: ... # report
'''
'''
class VCFRow: POS: int DP: int REF: Base VCFCall(VCFRow): AF: float INDEL: bool ALT: Base VCFNoCall(VCFRow): … Sequences are represented as List[Base] (where Base == str) (and in IUPAC_AMBIG)
def consensus(depths: List[int], ref: List[Base], alts: List[VCFRow], pa: PercentAnalysis) -> Tuple[List[Base], List[Log]]: assert len(depths) == len(ref), f”{depths}, {ref}” assert len(depths) > 0 assert 0 <= pa.mind <= 100 assert 0 <= pa.majority <= 100 assert all(map(IUPAC_AMBIG.__contains__, ref))
return a Tuple of List[Base] and List[Log], where there is a base and a log for every base in input reference assert len(returnVal[0]) == len(ref) returnBases[i] will be ‘N’ when depths[i] < pa.mind # as defined by schema returnBases[i] will be an ALT call when callAlt in schema validates for ref[i] returnBases[i] will be an ambiguous call from the union of all ALTS at i when callAmbiguous validates returnBases[i] will be ref[i] when REF `in` IUPAC_AMBIG and the above schemas were not validated (as default)
returnLog[i] will be a list of strings representing only the successful schema’s properties on successful validation, and a list of strings including the falied validation message for each failed validation and schema