refactor: merge master in branch (#549)
* chore(deps): bump pre-commit/action from 2.0.3 to 3.0.0 (#542)

Bumps [pre-commit/action](https://github.com/pre-commit/action) from 2.0.3 to 3.0.0.
- [Release notes](https://github.com/pre-commit/action/releases)
- [Commits](https://github.com/pre-commit/action/compare/v2.0.3...v3.0.0)

---
updated-dependencies:
- dependency-name: pre-commit/action
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* fix: voc probability (#541)

* fix: Replace outdated df append (#544)

* replaced all df.append with pd.concat

* replaced df.append with pd.concat

* replace df.append with pd.concat

* replaced df.append with pd.concat

* replaced df.append with pd.concat

* replaced df.append with pd.concat

Co-authored-by: lenakinzel <[email protected]>
Co-authored-by: Alexander Thomas <[email protected]>

* refactor: fgbio read clipping (#545)

* change primer file formatting

* change read clipping

* fmt

* fix primer clipping plot

* fix primer clipping plot

* fix

* fix

* typo

* test

* Update pangolin to 4.1 (#546)

Co-authored-by: thomasbtf <[email protected]>
Co-authored-by: Ann-Kathrin Brüggemann <[email protected]>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Alexander Thomas <[email protected]>
Co-authored-by: vBassewitz <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: thomasbtf <[email protected]>
Co-authored-by: Ann-Kathrin Brüggemann <[email protected]>
7 people authored Jul 13, 2022
1 parent f00c1b5 commit 6b98645
Showing 15 changed files with 122 additions and 87 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/main.yml
@@ -54,7 +54,7 @@ jobs:
     steps:
       - uses: actions/checkout@v2
       - uses: actions/setup-python@v3
-      - uses: pre-commit/action@v2.0.3
+      - uses: pre-commit/action@v3.0.0

   Technology-Tests:
     runs-on: ubuntu-latest
14 changes: 12 additions & 2 deletions resources/lineage-variant-table-formatter.js
@@ -117,13 +117,23 @@
       Probability: function (prob) {
         prob = parseFloat(prob)
         if (!isNaN(prob)) {
-          var lighting = (0.9 - prob * 0.4) * 100;
+          if (prob < 0.5) {
+            var lighting = (0.5 + (prob * 0.8)) * 100;
+            var color = "10, 85%, " + lighting + "%"
+
+          } else if (prob < 0.95 && prob >= 0.5){
+            var lighting = ((0.9 * prob) + 0.05) * 100;
+            var color = "47, 100%, " + lighting + "%"
+          } else {
+            var lighting = (8.5 - (prob * 8)) * 100;
+            var color = "138, 72%, " + lighting + "%"
+          }
           if (prob < 0.1) {
             prob = prob.toExponential(2)
           } else {
             prob = prob.toFixed(2)
           }
-          return `<span class="badge p-0 m-1"><span class="p-1 rounded" style="background:hsl(138, 72%, ${lighting}%);">${prob}</span></span>`;
+          return `<span class="badge p-0 m-1"><span class="p-1 rounded" style="background:hsl(${color});">${prob}</span></span>`;
         } else {
           return "&nbsp;"
         }
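Note on the change above: the badge color now encodes the lineage-call probability in three hue bands (red below 0.5, yellow up to 0.95, green above) instead of only darkening a single green. A minimal Python re-expression of the new mapping, for illustration only; the shipped code is the JavaScript above:

def prob_to_hsl(prob: float) -> str:
    """Mirror of the badge color logic in the lineage variant table."""
    if prob < 0.5:
        lighting = (0.5 + prob * 0.8) * 100    # hue 10 = red band
        return f"hsl(10, 85%, {lighting:.0f}%)"
    elif prob < 0.95:
        lighting = (0.9 * prob + 0.05) * 100   # hue 47 = yellow band
        return f"hsl(47, 100%, {lighting:.0f}%)"
    else:
        lighting = (8.5 - prob * 8) * 100      # hue 138 = green band
        return f"hsl(138, 72%, {lighting:.0f}%)"

print(prob_to_hsl(0.3))   # hsl(10, 85%, 74%)
print(prob_to_hsl(0.99))  # hsl(138, 72%, 58%)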
2 changes: 1 addition & 1 deletion workflow/envs/pangolin.yaml
@@ -2,4 +2,4 @@ channels:
   - bioconda
   - conda-forge
 dependencies:
-  - pangolin =4.0.6
+  - pangolin =4.1
51 changes: 18 additions & 33 deletions workflow/rules/read_clipping.smk
@@ -21,63 +21,48 @@ rule samtools_sort:
         "0.74.0/bio/samtools/sort"


-rule bed_to_bedpe:
+rule bed_to_tsv:
     input:
         check_bed_for_URL(config["preprocessing"]["amplicon-primers"]),
     output:
-        "resources/primer.bedpe",
+        "resources/primer.tsv",
     log:
         "logs/bed-to-bedpe.log",
     conda:
         "../envs/python.yaml"
     script:
-        "../scripts/bed-to-bedpe.py"
+        "../scripts/bed-to-tsv.py"


-rule bamclipper:
+rule trim_primers_fgbio:
     input:
         bam="results/{date}/read-sorted/{read_type}~position/{sample}.initial.bam",
         bai="results/{date}/read-sorted/{read_type}~position/{sample}.initial.bam.bai",
-        bedpe="resources/primer.bedpe",
-    output:
-        temp(
-            "results/{date}/read-clipping/softclipped/{read_type}/{sample}/{sample}.initial.primerclipped.bam"
+        ref="resources/genomes/{reference}.fasta".format(
+            reference=config["preprocessing"]["amplicon-reference"]
         ),
-    params:
-        output_dir=get_output_dir,
-        cwd=lambda w: os.getcwd(),
-        bed_path=lambda w, input: os.path.join(os.getcwd(), input.bedpe),
-        bam=lambda w, input: os.path.basename(input.bam),
+        primers="resources/primer.tsv",
+    output:
+        "results/{date}/read-clipping/hardclipped/{read_type}/{sample}/{sample}.bam",
     log:
-        "logs/{date}/bamclipper/{read_type}/{sample}.log",
+        "logs/{date}/fgbio_primer_clipping/{read_type}/{sample}.log",
     conda:
-        "../envs/bamclipper.yaml"
-    threads: 6
+        "../envs/fgbio.yaml"
     shell:
-        "(cp {input.bam} {params.output_dir} &&"
-        " cp {input.bai} {params.output_dir} &&"
-        " cd {params.output_dir} &&"
-        " bamclipper.sh -b {params.bam} -p {params.bed_path} -n {threads} -u 5 -d 5) "
-        " > {params.cwd}/{log} 2>&1"
+        "fgbio TrimPrimers -i {input.bam} -p {input.primers} -o {output} -H true -r {input.ref} > {log} 2>&1"


-rule fgbio:
+rule filter_bam_fgbio:
     input:
-        bam="results/{date}/read-clipping/softclipped/{read_type}/{sample}/{sample}.initial.primerclipped.bam",
-        bai="results/{date}/read-clipping/softclipped/{read_type}/{sample}/{sample}.initial.primerclipped.bam.bai",
-        ref="resources/genomes/{reference}.fasta".format(
-            reference=config["preprocessing"]["amplicon-reference"]
-        ),
+        bam="results/{date}/read-clipping/hardclipped/{read_type}/{sample}/{sample}.bam",
     output:
-        temp(
-            "results/{date}/read-clipping/hardclipped/{read_type}/{sample}/{sample}.bam"
-        ),
+        "results/{date}/read-clipping/hc_filtered/{read_type}/{sample}/{sample}.bam",
     log:
-        "logs/{date}/fgbio/{read_type}/{sample}.log",
+        "logs/{date}/fgbio_filter_bam/{read_type}/{sample}.log",
     conda:
         "../envs/fgbio.yaml"
    shell:
-        "fgbio --sam-validation-stringency=LENIENT ClipBam -i {input.bam} -o {output} -H true -r {input.ref} > {log} 2>&1"
+        "fgbio FilterBam -i {input.bam} -o {output} --min-insert-size 100 --remove-single-end-mappings > {log} 2>&1"


 rule samtools_fastq_pe:
@@ -132,7 +117,7 @@ rule plot_primer_clipping:
         ),
     params:
         samples=lambda wildcards: get_samples_for_date(wildcards.date),
-        bedpe="resources/primer.bedpe",
+        bedpe="resources/primer.tsv",
     log:
         "logs/{date}/plot-primer-clipping.log",
     conda:
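Summary of the refactor above: bamclipper's BEDPE-driven soft clipping is replaced by two fgbio passes. TrimPrimers hard-clips primer bases from the reads (-H true), and FilterBam then drops pairs with an insert size under 100 bp as well as single-end mappings. TrimPrimers takes its primer locations from the tab-separated file written by bed_to_tsv, with the five columns the converter script emits (chrom, left_start, left_end, right_start, right_end). An illustrative primer.tsv is shown below; the reference name and coordinates are made-up examples, and the real file is tab-delimited:

chrom        left_start  left_end  right_start  right_end
MN908947.3   30          54        385          410
MN908947.3   320         342       704          726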
workflow/scripts/bed-to-tsv.py (renamed from workflow/scripts/bed-to-bedpe.py)
@@ -23,10 +23,9 @@
     df_sense["chrom"],
     df_sense["start"],
     df_sense["end"],
-    df_antisense["chrom"],
     df_antisense["start"],
     df_antisense["end"],
 ]
-headers = ["chrom1", "start1", "end1", "chrom2", "start2", "end2"]
+headers = ["chrom", "left_start", "left_end", "right_start", "right_end"]
 df_bedpe = pd.concat(data, axis=1, keys=headers)
-df_bedpe.to_csv(snakemake.output[0], header=None, index=None, sep="\t", mode="a")
+df_bedpe.to_csv(snakemake.output[0], index=None, sep="\t", mode="a")
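The converter now keeps a single chrom column (both primers of an amplicon sit on the same chromosome) and writes a header row. A self-contained sketch of the column-wise pairing, with toy coordinates:

import pandas as pd

# Toy input: one amplicon, sense (left) and antisense (right) primer.
df_sense = pd.DataFrame({"chrom": ["MN908947.3"], "start": [30], "end": [54]})
df_antisense = pd.DataFrame({"start": [410], "end": [429]})

data = [
    df_sense["chrom"],
    df_sense["start"],
    df_sense["end"],
    df_antisense["start"],
    df_antisense["end"],
]
headers = ["chrom", "left_start", "left_end", "right_start", "right_end"]
pairs = pd.concat(data, axis=1, keys=headers)  # align the series side by side
print(pairs.to_csv(index=False, sep="\t"))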
2 changes: 1 addition & 1 deletion workflow/scripts/collect_lineage_calls.py
@@ -57,7 +57,7 @@ def collect_calls(sm_input, sm_output, states, lineage, number, length):
     ]

     # bring them together
-    call = pangolin_calls.append(call)
+    call = pd.concat([pangolin_calls, call])

     call.to_csv(sm_output, sep="\t", index=False)
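This and the remaining script changes are the same mechanical migration: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, and pd.concat is the drop-in substitute. A minimal before/after sketch with toy data:

import pandas as pd

a = pd.DataFrame({"lineage": ["BA.1"], "prob": [0.97]})
b = pd.DataFrame({"lineage": ["BA.2"], "prob": [0.88]})

# old (removed in pandas >= 2.0): combined = a.append(b, ignore_index=True)
combined = pd.concat([a, b], ignore_index=True)  # equivalent result
print(combined)
#   lineage  prob
# 0    BA.1  0.97
# 1    BA.2  0.88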
2 changes: 1 addition & 1 deletion workflow/scripts/evaluate-strain-call-error.py
@@ -62,7 +62,7 @@ def eval_error(paths, sm_output, max_reads, prefix, separator, percentage, load_

     df = df.merge(org_mix_df, how="outer").fillna(0)

-    results_df = results_df.append(df)
+    results_df = pd.concat([results_df, df])

     for sample in results_df["mix"].unique():
         sample_rmse = rmse(
4 changes: 3 additions & 1 deletion workflow/scripts/extract-strains-from-gisaid-provision.py
@@ -21,7 +21,9 @@ def extract_strains_from_provision(
     chunks = pd.read_json(path_to_provision, lines=True, chunksize=9000)
     for i, chunk in enumerate(chunks):
         print(f"Parsing chunk {i}", file=sys.stderr)
-        provision = provision.append(select_oldest_strains(chunk), ignore_index=True)
+        provision = pd.concat(
+            [provision, select_oldest_strains(chunk)], ignore_index=True
+        )
     provision = select_oldest_strains(provision)

     # save strain genomes
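The chunked pattern above is worth noting: pd.read_json with lines=True and chunksize yields an iterator of DataFrames, so the multi-gigabyte GISAID provision file never has to fit in memory at once. A runnable sketch of the same shape; the file name and the reducer are illustrative stand-ins, not the script's real helper:

import pandas as pd

def keep_oldest(df: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for select_oldest_strains: keep the earliest record per strain.
    return df.sort_values("date").drop_duplicates("strain", keep="first")

provision = pd.DataFrame()
for chunk in pd.read_json("provision.jsonl", lines=True, chunksize=9000):
    provision = pd.concat([provision, keep_oldest(chunk)], ignore_index=True)
provision = keep_oldest(provision)  # final pass across chunk boundaries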
55 changes: 39 additions & 16 deletions workflow/scripts/generate-lineage-variant-table.py
@@ -21,11 +21,19 @@ def phred_to_prob(phred):


 # np.prod returns 1's as values for a pd series with NaN's. A list would return NaN's
-def prod_prob_not_present(probs):
+# switched for use of minimum
+# def prod_prob_not_present(probs):
+#     if pd.isna(probs).any():
+#         return pd.NA
+#     else:
+#         return np.prod(probs)
+
+
+def min_prob_not_present(probs):
     if pd.isna(probs).any():
         return pd.NA
     else:
-        return np.prod(probs)
+        return np.min(probs)


 def has_numbers(inputString):
@@ -67,29 +75,44 @@ def rename_enumeration(list_length):
         lineages = record.info["LINEAGES"]
         for signature in signatures:
             # generate df with all signatures + VAF and Prob_not_present from calculation
-            variants_df = variants_df.append(
-                {
-                    "Mutations": signature,
-                    "Frequency": vaf,
-                    "ReadDepth": dp,
-                    "Prob_not_present": prob_not_present,
-                },
+            variants_df = pd.concat(
+                [
+                    variants_df,
+                    pd.DataFrame(
+                        {
+                            "Frequency": vaf,
+                            "Mutations": signature,
+                            "Prob_not_present": prob_not_present,
+                            "ReadDepth": dp,
+                        },
+                        index=[0],
+                    ),
+                ],
                 ignore_index=True,
             )
             # generate df with lineage matrix for all signatures
-            lineage_df = lineage_df.append(
-                {
-                    "Mutations": signature,
-                    **{lineage.replace(".", " "): "x" for lineage in lineages},
-                },
+
+            lineage_df = pd.concat(
+                [
+                    lineage_df,
+                    pd.DataFrame(
+                        {
+                            "Mutations": [signature],
+                            **{
+                                lineage.replace(".", " "): "x"
+                                for lineage in lineages
+                            },
+                        },
+                        index=[0],
+                    ),
+                ],
                 ignore_index=True,
             )

     # aggregate both dataframes by summing up repeating rows for VAR (maximum=1) and multiply Prob_not_present
     variants_df = variants_df.groupby(["Mutations"]).agg(
         func={
             "Frequency": lambda x: min(sum(x), 1.0),
-            "Prob_not_present": prod_prob_not_present,
+            "Prob_not_present": min_prob_not_present,
             "ReadDepth": np.min,
         },
         axis=1,
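Besides the pd.concat migration, this hunk changes the aggregation semantics behind the voc-probability fix (#541): the probability that a signature is not present is now summarized by the minimum over its calls rather than their product. For probabilities in [0, 1] the product is always less than or equal to the minimum, so the product (which assumes the calls are independent) produces more extreme values, while the minimum is the conservative summary. A toy comparison:

import numpy as np
import pandas as pd

probs = pd.Series([0.2, 0.5, 0.9])  # per-call Prob_not_present
print(np.prod(probs))               # ~0.09 (independence assumption)
print(np.min(probs))                # 0.2   (conservative, as now used)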
2 changes: 1 addition & 1 deletion workflow/scripts/generate-overview-table.py
@@ -65,7 +65,7 @@ def is_patient_report():
         columns=[eukaryota, bacteria, viruses, sars_cov2, unclassified]
     ).fillna(0)
     kraken_results["sample"] = sample
-    species_columns = species_columns.append(kraken_results, ignore_index=True)
+    species_columns = pd.concat([species_columns, kraken_results], ignore_index=True)

     data = data.join(species_columns.set_index("sample"))
2 changes: 1 addition & 1 deletion workflow/scripts/plot-all-coverage.py
@@ -23,7 +23,7 @@ def plot_coverage(sm_input, sm_output, min_coverage):

     sample_df["Sample"] = sample_df["#CHROM"].apply(lambda x: str(x).split(".")[0])

-    coverage = coverage.append(sample_df, ignore_index=True)
+    coverage = pd.concat([coverage, sample_df], ignore_index=True)

     coverage["# Coverage"] = coverage.Coverage.apply(
         lambda x: f"< {min_coverage}"
52 changes: 33 additions & 19 deletions workflow/scripts/plot-assembly-comparison.py
@@ -18,29 +18,43 @@ def register_lengths(sample, file_list, state, amplicon_state, data):
     for file, assembler in zip(file_list, snakemake.params.assembler):
         if state in ("initial", "scaffolded"):
             with pysam.FastxFile(file) as infile:
-                data = data.append(
-                    {
-                        "Sample": sample,
-                        "Assembler": assembler,
-                        "Amplicon": amplicon_state,
-                        "length (bp)": max(len(contig.sequence) for contig in infile),
-                        "State": state,
-                    },
+                data = pd.concat(
+                    [
+                        data,
+                        pd.DataFrame(
+                            {
+                                "Sample": sample,
+                                "Assembler": assembler,
+                                "Amplicon": amplicon_state,
+                                "length (bp)": max(
+                                    len(contig.sequence) for contig in infile
+                                ),
+                                "State": state,
+                            },
+                            index=[0],
+                        ),
+                    ],
                     ignore_index=True,
                 )
         else:
             quastDf = pd.read_csv(file, sep="\t")
-            data = data.append(
-                {
-                    "Sample": sample,
-                    "Assembler": assembler,
-                    "Amplicon": amplicon_state,
-                    "length (bp)": quastDf.loc[0, "N50"],
-                    "State": "N50",
-                    "Genome fraction (%)": quastDf.loc[0, "Genome fraction (%)"]
-                    if "Genome fraction (%)" in quastDf.columns
-                    else float("nan"),
-                },
+            data = pd.concat(
+                [
+                    data,
+                    pd.DataFrame(
+                        {
+                            "Sample": sample,
+                            "Assembler": assembler,
+                            "Amplicon": amplicon_state,
+                            "length (bp)": quastDf.loc[0, "N50"],
+                            "State": "N50",
+                            "Genome fraction (%)": quastDf.loc[0, "Genome fraction (%)"]
+                            if "Genome fraction (%)" in quastDf.columns
+                            else float("nan"),
+                        },
+                        index=[0],
+                    ),
+                ],
                 ignore_index=True,
            )
     return data
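One design note on the single-row wrapping above: pd.concat copies both frames on every call, so growing a DataFrame one row at a time is quadratic in the number of rows. That is harmless for the handful of assemblers iterated here, but the usual alternative is to collect plain dicts and build the frame once. A sketch with illustrative values, not part of the commit:

import pandas as pd

rows = []
for assembler, n50 in [("megahit", 29000), ("metaspades", 29900)]:  # toy values
    rows.append({"Assembler": assembler, "length (bp)": n50, "State": "N50"})
data = pd.DataFrame(rows)  # single construction instead of repeated concat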
2 changes: 1 addition & 1 deletion workflow/scripts/plot-dependency-of-pangolin-call.py
@@ -35,7 +35,7 @@ def plot_dependency_of_pangolin_call(sm_input, sm_output):
     pangolin_output["mixture_content"] = input.split(MIXTURE_PREFIX, 1)[-1].split(
         "."
     )[0]
-    all_sampes = all_sampes.append(pangolin_output, ignore_index=True)
+    all_sampes = pd.concat([all_sampes, pangolin_output], ignore_index=True)

     all_sampes["mixture_content"] = all_sampes["mixture_content"].str.replace("-", ".")
2 changes: 1 addition & 1 deletion workflow/scripts/plot-pangolin-conflict.py
@@ -35,7 +35,7 @@ def plot_pangolin_conflict(sm_input, sm_output):
     pangolin_output = pd.read_csv(input)
     pangolin_output["true_lineage"] = true_lineage
     pangolin_output["true_lineage_percent"] = percent
-    all_sampes = all_sampes.append(pangolin_output, ignore_index=True)
+    all_sampes = pd.concat([all_sampes, pangolin_output], ignore_index=True)

     all_sampes["correct_lineage_assignment"] = (
         all_sampes["lineage"] == all_sampes["true_lineage"]
12 changes: 7 additions & 5 deletions workflow/scripts/plot-primer-clipping.py
@@ -13,8 +13,10 @@
 from intervaltree import IntervalTree

 # read primer bedpe to df
-PRIMER = pd.read_csv(snakemake.params.get("bedpe", ""), delimiter="\t", header=None)
-PRIMER.drop(PRIMER.columns[[0, 3]], axis=1, inplace=True)
+PRIMER = pd.read_csv(snakemake.params.get("bedpe", ""), delimiter="\t", header=0)
+print(PRIMER)
+PRIMER.drop(PRIMER.columns[[0]], axis=1, inplace=True)
+print(PRIMER)
 PRIMER.columns = ["p1_start", "p1_end", "p2_start", "p2_end"]

 # convert df to interval trees
@@ -116,7 +118,7 @@ def count_intervals(file):
             "uncut primer within",
             "cut primer exact",
             "cut primer within",
-            "no mathing win",
+            "no matching win",
         ],
     }
     )
@@ -147,13 +149,13 @@ def plot_classes(counters):
     counts_before = count_intervals(file)
     counts_before["sample"] = sample
     counts_before["state"] = "before"
-    all_df = all_df.append(counts_before, ignore_index=True)
+    all_df = pd.concat([all_df, counts_before], ignore_index=True)

 for sample, file in iter_with_samples(snakemake.input.clipped):
     counts_after = count_intervals(file)
     counts_after["sample"] = sample
     counts_after["state"] = "after"
-    all_df = all_df.append(counts_after, ignore_index=True)
+    all_df = pd.concat([all_df, counts_after], ignore_index=True)

 bars, text = plot_classes(all_df)
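The before/after counting in count_intervals relies on interval-tree queries against the primer coordinates loaded above. A minimal, self-contained illustration of the intervaltree API the script builds on, with made-up coordinates:

from intervaltree import IntervalTree

tree = IntervalTree()
tree[30:55] = "left primer"     # half-open interval [30, 55)
tree[410:440] = "right primer"

print(tree[42])                 # {Interval(30, 55, 'left primer')} -> inside a primer
print(tree[100])                # set() -> no primer covers this position
print(tree.overlap(50, 60))     # intervals overlapping the window [50, 60)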
