Trim primer sequences #39

ArtPoon · 2022-01-20T19:43:15Z

Currently we are using cutadapt to trim adapter sequences before mapping with minimap2. We are not removing primer sequences from the amplicon data (where primers may appear in the reads due to read-through), but we should be. Primer trimming is possible with cutadapt:
https://cutadapt.readthedocs.io/en/stable/recipes.html#trimming-amplicon-primers-from-both-ends-of-paired-end-reads
but I'm not sure whether this will support large numbers of primers as used in the ARTIC protocol.

ArtPoon · 2022-01-21T16:25:42Z

For an arbitrarily chosen set of FASTQ files, we have the following results on trimming adapters:

art@Langley:~/git/gromstole/data$ cutadapt -a AGATCGGAAGAGC -A AGATCGGAAGAGC -o test1.fastq.gz -p test2.fastq.gz -j 4 -m 10 redacted_L001_R1_001.fastq.gz redacted_L001_R2_001.fastq.gz 
This is cutadapt 1.18 with Python 3.6.9
Command line parameters: -a AGATCGGAAGAGC -A AGATCGGAAGAGC -o test1.fastq.gz -p test2.fastq.gz -j 4 -m 10 REDACTED_L001_R1_001.fastq.gz REDACTED_L001_R2_001.fastq.gz
Processing reads on 4 cores in paired-end mode ...
Finished in 1.20 s (17 us/read; 3.58 M reads/minute).

=== Summary ===

Total read pairs processed:             71,365
  Read 1 with adapter:                   1,495 (2.1%)
  Read 2 with adapter:                   1,678 (2.4%)
Pairs that were too short:                   0 (0.0%)
Pairs written (passing filters):        71,365 (100.0%)

Total basepairs processed:    35,825,230 bp
  Read 1:    17,912,615 bp
  Read 2:    17,912,615 bp
Total written (filtered):     35,814,462 bp (100.0%)
  Read 1:    17,907,758 bp
  Read 2:    17,906,704 bp

=== First read: Adapter 1 ===

Sequence: AGATCGGAAGAGC; Type: regular 3'; Length: 13; Trimmed: 1495 times.

No. of allowed errors:
0-9 bp: 0; 10-13 bp: 1

Bases preceding removed adapters:
  A: 29.2%
  C: 16.8%
  G: 38.4%
  T: 15.7%
  none/other: 0.0%

Overview of removed sequences
length	count	expect	max.err	error counts
3	1264	1115.1	0	1264
4	197	278.8	0	197
5	27	69.7	0	27
6	3	17.4	0	3
7	1	4.4	0	1
10	1	0.1	1	1
21	1	0.0	1	0 1
86	1	0.0	1	0 1

=== Second read: Adapter 2 ===

Sequence: AGATCGGAAGAGC; Type: regular 3'; Length: 13; Trimmed: 1678 times.

No. of allowed errors:
0-9 bp: 0; 10-13 bp: 1

Bases preceding removed adapters:
  A: 25.4%
  C: 16.6%
  G: 17.0%
  T: 40.9%
  none/other: 0.0%

Overview of removed sequences
length	count	expect	max.err	error counts
3	1087	1115.1	0	1087
4	390	278.8	0	390
5	188	69.7	0	188
6	6	17.4	0	6
7	2	4.4	0	2
9	1	0.3	0	1
10	1	0.1	1	0 1
11	1	0.0	1	0 1
17	1	0.0	1	0 1
53	1	0.0	1	0 1

Following workflow developed by @cfe-lab at https://github.com/cfe-lab/MiCall/blob/master/micall/core/trim_fastqs.py:

art@Langley:~/git/gromstole/data$ cutadapt -j 4 -g file:primers_sarscov2_left.fasta -A file:primers_sarscov2_left_end.fasta -o test1_left.fastq.gz -p test2_left.fastq.gz --trimmed-only --pair-filter=both test1.fastq.gz test2.fastq.gz 
This is cutadapt 1.18 with Python 3.6.9
Command line parameters: -j 4 -g file:primers_sarscov2_left.fasta -A file:primers_sarscov2_left_end.fasta -o test1_left.fastq.gz -p test2_left.fastq.gz --trimmed-only --pair-filter=both test1.fastq.gz test2.fastq.gz
Processing reads on 4 cores in paired-end mode ...
Finished in 9.44 s (132 us/read; 0.45 M reads/minute).

=== Summary ===

Total read pairs processed:             71,365
  Read 1 with adapter:                  58,195 (81.5%)
  Read 2 with adapter:                  57,701 (80.9%)
Pairs written (passing filters):        68,762 (96.4%)

Total basepairs processed:    35,814,462 bp
  Read 1:    17,907,758 bp
  Read 2:    17,906,704 bp
Total written (filtered):     34,016,062 bp (95.0%)
  Read 1:    16,975,652 bp
  Read 2:    17,040,410 bp

=== First read: Adapter nCoV-2019_1_LEFT ===
[OMITTED]

art@Langley:~/git/gromstole/data$ cutadapt -j 4 -G file:primers_sarscov2_right.fasta -a file:primers_sarscov2_right_end.fasta -o test1_right.fastq.gz -p test2_right.fastq.gz --trimmed-only --pair-filter=both test1_left.fastq.gz test2_left.fastq.gz 
This is cutadapt 1.18 with Python 3.6.9
Command line parameters: -j 4 -G file:primers_sarscov2_right.fasta -a file:primers_sarscov2_right_end.fasta -o test1_right.fastq.gz -p test2_right.fastq.gz --trimmed-only --pair-filter=both test1_left.fastq.gz test2_left.fastq.gz
Processing reads on 4 cores in paired-end mode ...
Finished in 9.45 s (137 us/read; 0.44 M reads/minute).

=== Summary ===

Total read pairs processed:             68,762
  Read 1 with adapter:                  54,153 (78.8%)
  Read 2 with adapter:                  60,348 (87.8%)
Pairs written (passing filters):        66,939 (97.3%)

Total basepairs processed:    34,016,062 bp
  Read 1:    16,975,652 bp
  Read 2:    17,040,410 bp
Total written (filtered):     32,626,077 bp (95.9%)
  Read 1:    16,319,040 bp
  Read 2:    16,307,037 bp

=== First read: Adapter nCoV-2019_1_RIGHT ===

[OMITTED]

Need to check if this is working correctly, i.e., are we cutting actual template sequence or primers?

'

ArtPoon · 2022-01-22T03:42:53Z

art@Langley:~/git/gromstole$ python3 scripts/trim_fastqs.py data/redacted_L001_R1_001.fastq.gz data/G1-1007_S4_L001_R2_001.fastq.gz test.R1.fastq.gz test.R2.fastq.gz --datadir data
art@Langley:~/git/gromstole$ python3 scripts/minimap2.py -o . -p iss39 -t 8 --nocut --ref data/NC_045512.fa test.R1.fastq.gz test.R2.fastq.gz 
50000.0 reads, 41649.0 (83%) mapped

ArtPoon · 2022-01-22T17:12:47Z

Drops in coverage in most cases, due to removal of reads (blue=original, red=filtered):

Comparing line counts, only about 20% of the data makes it through:

art@Langley:~/git/gromstole$ gunzip -c test.R1.fastq.gz | wc -l
285428
art@Langley:~/git/gromstole$ gunzip -c /data/wastewater/uploads/waterloo/run12/redacted_L001_R1_001.fastq.gz | wc -l
1457936

Coverage does go up in some spots, need to explain why that is..

ArtPoon · 2022-01-22T17:19:39Z

Comparing mutation frequencies between original and trimmed outputs:

We need to validate on simulated data, or compare to other pipelines.

ArtPoon · 2022-01-22T17:28:44Z

Compare MiCall trimming to iVar on the same sample, let's see if the latter is trimming as aggressively.

ArtPoon · 2022-01-22T17:32:58Z

Hm, iVar trims from aligned BAM files - we could apply the same method directly to the minimap2 output stream. Look at the iVar source code, but it could be as simple as appending positions that correspond to primers to the missing fields.

ArtPoon · 2022-01-22T17:52:37Z

Deactivating the --trimmed-only and --pair-filter=both options in the MiCall script did not change the output size at all.

Removed B.1.1.529 mutations from BA.1 and BA.2 lists - otherwise attempting to estimate the frequency of either sublineage is confounded. Fixed docstrings in minimap2.py Starting to work on issue #39

ArtPoon · 2022-02-08T20:23:44Z

load ARTIC v3 primer coordinates from file
for each read, determine midpoint coordinate of mapped location
estimate amplicon based on read midpoint, using Python module https://pypi.org/project/intervaltree/ ?
based on amplicon index, censor the positions in the read corresponding to the primers using the missing field

ArtPoon · 2022-02-15T20:27:19Z

Priority of this is downgraded due to issue #45

ArtPoon · 2022-03-15T19:50:11Z

Deprecated with the replacement of minimap2 with viralrecon

ArtPoon added a commit that referenced this issue Jan 21, 2022

working on issue #39'

1cc4dcc

'

ArtPoon self-assigned this Feb 1, 2022

ArtPoon added the help wanted Extra attention is needed label Feb 15, 2022

ArtPoon removed their assignment Feb 15, 2022

ArtPoon removed the help wanted Extra attention is needed label Mar 2, 2022

ArtPoon closed this as completed Mar 15, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Trim primer sequences #39

Trim primer sequences #39

ArtPoon commented Jan 20, 2022

ArtPoon commented Jan 21, 2022 •

edited

Loading

ArtPoon commented Jan 22, 2022 •

edited

Loading

ArtPoon commented Jan 22, 2022 •

edited

Loading

ArtPoon commented Jan 22, 2022

ArtPoon commented Jan 22, 2022

ArtPoon commented Jan 22, 2022

ArtPoon commented Jan 22, 2022

ArtPoon commented Feb 8, 2022

ArtPoon commented Feb 15, 2022

ArtPoon commented Mar 15, 2022

Trim primer sequences #39

Trim primer sequences #39

Comments

ArtPoon commented Jan 20, 2022

ArtPoon commented Jan 21, 2022 • edited Loading

ArtPoon commented Jan 22, 2022 • edited Loading

ArtPoon commented Jan 22, 2022 • edited Loading

ArtPoon commented Jan 22, 2022

ArtPoon commented Jan 22, 2022

ArtPoon commented Jan 22, 2022

ArtPoon commented Jan 22, 2022

ArtPoon commented Feb 8, 2022

ArtPoon commented Feb 15, 2022

ArtPoon commented Mar 15, 2022

ArtPoon commented Jan 21, 2022 •

edited

Loading

ArtPoon commented Jan 22, 2022 •

edited

Loading

ArtPoon commented Jan 22, 2022 •

edited

Loading