-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Trim primer sequences #39
Comments
For an arbitrarily chosen set of FASTQ files, we have the following results on trimming adapters: art@Langley:~/git/gromstole/data$ cutadapt -a AGATCGGAAGAGC -A AGATCGGAAGAGC -o test1.fastq.gz -p test2.fastq.gz -j 4 -m 10 redacted_L001_R1_001.fastq.gz redacted_L001_R2_001.fastq.gz
This is cutadapt 1.18 with Python 3.6.9
Command line parameters: -a AGATCGGAAGAGC -A AGATCGGAAGAGC -o test1.fastq.gz -p test2.fastq.gz -j 4 -m 10 REDACTED_L001_R1_001.fastq.gz REDACTED_L001_R2_001.fastq.gz
Processing reads on 4 cores in paired-end mode ...
Finished in 1.20 s (17 us/read; 3.58 M reads/minute).
=== Summary ===
Total read pairs processed: 71,365
Read 1 with adapter: 1,495 (2.1%)
Read 2 with adapter: 1,678 (2.4%)
Pairs that were too short: 0 (0.0%)
Pairs written (passing filters): 71,365 (100.0%)
Total basepairs processed: 35,825,230 bp
Read 1: 17,912,615 bp
Read 2: 17,912,615 bp
Total written (filtered): 35,814,462 bp (100.0%)
Read 1: 17,907,758 bp
Read 2: 17,906,704 bp
=== First read: Adapter 1 ===
Sequence: AGATCGGAAGAGC; Type: regular 3'; Length: 13; Trimmed: 1495 times.
No. of allowed errors:
0-9 bp: 0; 10-13 bp: 1
Bases preceding removed adapters:
A: 29.2%
C: 16.8%
G: 38.4%
T: 15.7%
none/other: 0.0%
Overview of removed sequences
length count expect max.err error counts
3 1264 1115.1 0 1264
4 197 278.8 0 197
5 27 69.7 0 27
6 3 17.4 0 3
7 1 4.4 0 1
10 1 0.1 1 1
21 1 0.0 1 0 1
86 1 0.0 1 0 1
=== Second read: Adapter 2 ===
Sequence: AGATCGGAAGAGC; Type: regular 3'; Length: 13; Trimmed: 1678 times.
No. of allowed errors:
0-9 bp: 0; 10-13 bp: 1
Bases preceding removed adapters:
A: 25.4%
C: 16.6%
G: 17.0%
T: 40.9%
none/other: 0.0%
Overview of removed sequences
length count expect max.err error counts
3 1087 1115.1 0 1087
4 390 278.8 0 390
5 188 69.7 0 188
6 6 17.4 0 6
7 2 4.4 0 2
9 1 0.3 0 1
10 1 0.1 1 0 1
11 1 0.0 1 0 1
17 1 0.0 1 0 1
53 1 0.0 1 0 1 Following workflow developed by @cfe-lab at https://github.com/cfe-lab/MiCall/blob/master/micall/core/trim_fastqs.py: art@Langley:~/git/gromstole/data$ cutadapt -j 4 -g file:primers_sarscov2_left.fasta -A file:primers_sarscov2_left_end.fasta -o test1_left.fastq.gz -p test2_left.fastq.gz --trimmed-only --pair-filter=both test1.fastq.gz test2.fastq.gz
This is cutadapt 1.18 with Python 3.6.9
Command line parameters: -j 4 -g file:primers_sarscov2_left.fasta -A file:primers_sarscov2_left_end.fasta -o test1_left.fastq.gz -p test2_left.fastq.gz --trimmed-only --pair-filter=both test1.fastq.gz test2.fastq.gz
Processing reads on 4 cores in paired-end mode ...
Finished in 9.44 s (132 us/read; 0.45 M reads/minute).
=== Summary ===
Total read pairs processed: 71,365
Read 1 with adapter: 58,195 (81.5%)
Read 2 with adapter: 57,701 (80.9%)
Pairs written (passing filters): 68,762 (96.4%)
Total basepairs processed: 35,814,462 bp
Read 1: 17,907,758 bp
Read 2: 17,906,704 bp
Total written (filtered): 34,016,062 bp (95.0%)
Read 1: 16,975,652 bp
Read 2: 17,040,410 bp
=== First read: Adapter nCoV-2019_1_LEFT ===
[OMITTED]
art@Langley:~/git/gromstole/data$ cutadapt -j 4 -G file:primers_sarscov2_right.fasta -a file:primers_sarscov2_right_end.fasta -o test1_right.fastq.gz -p test2_right.fastq.gz --trimmed-only --pair-filter=both test1_left.fastq.gz test2_left.fastq.gz
This is cutadapt 1.18 with Python 3.6.9
Command line parameters: -j 4 -G file:primers_sarscov2_right.fasta -a file:primers_sarscov2_right_end.fasta -o test1_right.fastq.gz -p test2_right.fastq.gz --trimmed-only --pair-filter=both test1_left.fastq.gz test2_left.fastq.gz
Processing reads on 4 cores in paired-end mode ...
Finished in 9.45 s (137 us/read; 0.44 M reads/minute).
=== Summary ===
Total read pairs processed: 68,762
Read 1 with adapter: 54,153 (78.8%)
Read 2 with adapter: 60,348 (87.8%)
Pairs written (passing filters): 66,939 (97.3%)
Total basepairs processed: 34,016,062 bp
Read 1: 16,975,652 bp
Read 2: 17,040,410 bp
Total written (filtered): 32,626,077 bp (95.9%)
Read 1: 16,319,040 bp
Read 2: 16,307,037 bp
=== First read: Adapter nCoV-2019_1_RIGHT ===
[OMITTED] Need to check if this is working correctly, i.e., are we cutting actual template sequence or primers? |
art@Langley:~/git/gromstole$ python3 scripts/trim_fastqs.py data/redacted_L001_R1_001.fastq.gz data/G1-1007_S4_L001_R2_001.fastq.gz test.R1.fastq.gz test.R2.fastq.gz --datadir data
art@Langley:~/git/gromstole$ python3 scripts/minimap2.py -o . -p iss39 -t 8 --nocut --ref data/NC_045512.fa test.R1.fastq.gz test.R2.fastq.gz
50000.0 reads, 41649.0 (83%) mapped |
Compare MiCall trimming to iVar on the same sample, let's see if the latter is trimming as aggressively. |
Hm, iVar trims from aligned BAM files - we could apply the same method directly to the minimap2 output stream. Look at the iVar source code, but it could be as simple as appending positions that correspond to primers to the |
Deactivating the |
Removed B.1.1.529 mutations from BA.1 and BA.2 lists - otherwise attempting to estimate the frequency of either sublineage is confounded. Fixed docstrings in minimap2.py Starting to work on issue #39
|
Priority of this is downgraded due to issue #45 |
Deprecated with the replacement of minimap2 with viralrecon |
Currently we are using cutadapt to trim adapter sequences before mapping with minimap2. We are not removing primer sequences from the amplicon data (where primers may appear in the reads due to read-through), but we should be. Primer trimming is possible with cutadapt:
https://cutadapt.readthedocs.io/en/stable/recipes.html#trimming-amplicon-primers-from-both-ends-of-paired-end-reads
but I'm not sure whether this will support large numbers of primers as used in the ARTIC protocol.
The text was updated successfully, but these errors were encountered: