Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trim primer sequences #39

Closed
ArtPoon opened this issue Jan 20, 2022 · 10 comments
Closed

Trim primer sequences #39

ArtPoon opened this issue Jan 20, 2022 · 10 comments

Comments

@ArtPoon
Copy link
Contributor

ArtPoon commented Jan 20, 2022

Currently we are using cutadapt to trim adapter sequences before mapping with minimap2. We are not removing primer sequences from the amplicon data (where primers may appear in the reads due to read-through), but we should be. Primer trimming is possible with cutadapt:
https://cutadapt.readthedocs.io/en/stable/recipes.html#trimming-amplicon-primers-from-both-ends-of-paired-end-reads
but I'm not sure whether this will support large numbers of primers as used in the ARTIC protocol.

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Jan 21, 2022

For an arbitrarily chosen set of FASTQ files, we have the following results on trimming adapters:

art@Langley:~/git/gromstole/data$ cutadapt -a AGATCGGAAGAGC -A AGATCGGAAGAGC -o test1.fastq.gz -p test2.fastq.gz -j 4 -m 10 redacted_L001_R1_001.fastq.gz redacted_L001_R2_001.fastq.gz 
This is cutadapt 1.18 with Python 3.6.9
Command line parameters: -a AGATCGGAAGAGC -A AGATCGGAAGAGC -o test1.fastq.gz -p test2.fastq.gz -j 4 -m 10 REDACTED_L001_R1_001.fastq.gz REDACTED_L001_R2_001.fastq.gz
Processing reads on 4 cores in paired-end mode ...
Finished in 1.20 s (17 us/read; 3.58 M reads/minute).

=== Summary ===

Total read pairs processed:             71,365
  Read 1 with adapter:                   1,495 (2.1%)
  Read 2 with adapter:                   1,678 (2.4%)
Pairs that were too short:                   0 (0.0%)
Pairs written (passing filters):        71,365 (100.0%)

Total basepairs processed:    35,825,230 bp
  Read 1:    17,912,615 bp
  Read 2:    17,912,615 bp
Total written (filtered):     35,814,462 bp (100.0%)
  Read 1:    17,907,758 bp
  Read 2:    17,906,704 bp

=== First read: Adapter 1 ===

Sequence: AGATCGGAAGAGC; Type: regular 3'; Length: 13; Trimmed: 1495 times.

No. of allowed errors:
0-9 bp: 0; 10-13 bp: 1

Bases preceding removed adapters:
  A: 29.2%
  C: 16.8%
  G: 38.4%
  T: 15.7%
  none/other: 0.0%

Overview of removed sequences
length	count	expect	max.err	error counts
3	1264	1115.1	0	1264
4	197	278.8	0	197
5	27	69.7	0	27
6	3	17.4	0	3
7	1	4.4	0	1
10	1	0.1	1	1
21	1	0.0	1	0 1
86	1	0.0	1	0 1

=== Second read: Adapter 2 ===

Sequence: AGATCGGAAGAGC; Type: regular 3'; Length: 13; Trimmed: 1678 times.

No. of allowed errors:
0-9 bp: 0; 10-13 bp: 1

Bases preceding removed adapters:
  A: 25.4%
  C: 16.6%
  G: 17.0%
  T: 40.9%
  none/other: 0.0%

Overview of removed sequences
length	count	expect	max.err	error counts
3	1087	1115.1	0	1087
4	390	278.8	0	390
5	188	69.7	0	188
6	6	17.4	0	6
7	2	4.4	0	2
9	1	0.3	0	1
10	1	0.1	1	0 1
11	1	0.0	1	0 1
17	1	0.0	1	0 1
53	1	0.0	1	0 1

Following workflow developed by @cfe-lab at https://github.com/cfe-lab/MiCall/blob/master/micall/core/trim_fastqs.py:

art@Langley:~/git/gromstole/data$ cutadapt -j 4 -g file:primers_sarscov2_left.fasta -A file:primers_sarscov2_left_end.fasta -o test1_left.fastq.gz -p test2_left.fastq.gz --trimmed-only --pair-filter=both test1.fastq.gz test2.fastq.gz 
This is cutadapt 1.18 with Python 3.6.9
Command line parameters: -j 4 -g file:primers_sarscov2_left.fasta -A file:primers_sarscov2_left_end.fasta -o test1_left.fastq.gz -p test2_left.fastq.gz --trimmed-only --pair-filter=both test1.fastq.gz test2.fastq.gz
Processing reads on 4 cores in paired-end mode ...
Finished in 9.44 s (132 us/read; 0.45 M reads/minute).

=== Summary ===

Total read pairs processed:             71,365
  Read 1 with adapter:                  58,195 (81.5%)
  Read 2 with adapter:                  57,701 (80.9%)
Pairs written (passing filters):        68,762 (96.4%)

Total basepairs processed:    35,814,462 bp
  Read 1:    17,907,758 bp
  Read 2:    17,906,704 bp
Total written (filtered):     34,016,062 bp (95.0%)
  Read 1:    16,975,652 bp
  Read 2:    17,040,410 bp

=== First read: Adapter nCoV-2019_1_LEFT ===
[OMITTED]

art@Langley:~/git/gromstole/data$ cutadapt -j 4 -G file:primers_sarscov2_right.fasta -a file:primers_sarscov2_right_end.fasta -o test1_right.fastq.gz -p test2_right.fastq.gz --trimmed-only --pair-filter=both test1_left.fastq.gz test2_left.fastq.gz 
This is cutadapt 1.18 with Python 3.6.9
Command line parameters: -j 4 -G file:primers_sarscov2_right.fasta -a file:primers_sarscov2_right_end.fasta -o test1_right.fastq.gz -p test2_right.fastq.gz --trimmed-only --pair-filter=both test1_left.fastq.gz test2_left.fastq.gz
Processing reads on 4 cores in paired-end mode ...
Finished in 9.45 s (137 us/read; 0.44 M reads/minute).

=== Summary ===

Total read pairs processed:             68,762
  Read 1 with adapter:                  54,153 (78.8%)
  Read 2 with adapter:                  60,348 (87.8%)
Pairs written (passing filters):        66,939 (97.3%)

Total basepairs processed:    34,016,062 bp
  Read 1:    16,975,652 bp
  Read 2:    17,040,410 bp
Total written (filtered):     32,626,077 bp (95.9%)
  Read 1:    16,319,040 bp
  Read 2:    16,307,037 bp

=== First read: Adapter nCoV-2019_1_RIGHT ===

[OMITTED]

Need to check if this is working correctly, i.e., are we cutting actual template sequence or primers?

ArtPoon added a commit that referenced this issue Jan 21, 2022
@ArtPoon
Copy link
Contributor Author

ArtPoon commented Jan 22, 2022

art@Langley:~/git/gromstole$ python3 scripts/trim_fastqs.py data/redacted_L001_R1_001.fastq.gz data/G1-1007_S4_L001_R2_001.fastq.gz test.R1.fastq.gz test.R2.fastq.gz --datadir data
art@Langley:~/git/gromstole$ python3 scripts/minimap2.py -o . -p iss39 -t 8 --nocut --ref data/NC_045512.fa test.R1.fastq.gz test.R2.fastq.gz 
50000.0 reads, 41649.0 (83%) mapped

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Jan 22, 2022

Drops in coverage in most cases, due to removal of reads (blue=original, red=filtered):
image

Comparing line counts, only about 20% of the data makes it through:

art@Langley:~/git/gromstole$ gunzip -c test.R1.fastq.gz | wc -l
285428
art@Langley:~/git/gromstole$ gunzip -c /data/wastewater/uploads/waterloo/run12/redacted_L001_R1_001.fastq.gz | wc -l
1457936

Coverage does go up in some spots, need to explain why that is..

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Jan 22, 2022

Comparing mutation frequencies between original and trimmed outputs:

We need to validate on simulated data, or compare to other pipelines.

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Jan 22, 2022

Compare MiCall trimming to iVar on the same sample, let's see if the latter is trimming as aggressively.

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Jan 22, 2022

Hm, iVar trims from aligned BAM files - we could apply the same method directly to the minimap2 output stream. Look at the iVar source code, but it could be as simple as appending positions that correspond to primers to the missing fields.

@ArtPoon
Copy link
Contributor Author

ArtPoon commented Jan 22, 2022

Deactivating the --trimmed-only and --pair-filter=both options in the MiCall script did not change the output size at all.

ArtPoon added a commit that referenced this issue Jan 24, 2022
Removed B.1.1.529 mutations from BA.1 and BA.2 lists - otherwise attempting to estimate the frequency of either sublineage is confounded.
Fixed docstrings in minimap2.py
Starting to work on issue #39
@ArtPoon ArtPoon self-assigned this Feb 1, 2022
@ArtPoon
Copy link
Contributor Author

ArtPoon commented Feb 8, 2022

  • load ARTIC v3 primer coordinates from file
  • for each read, determine midpoint coordinate of mapped location
  • estimate amplicon based on read midpoint, using Python module https://pypi.org/project/intervaltree/ ?
  • based on amplicon index, censor the positions in the read corresponding to the primers using the missing field

@ArtPoon ArtPoon added the help wanted Extra attention is needed label Feb 15, 2022
@ArtPoon ArtPoon removed their assignment Feb 15, 2022
@ArtPoon
Copy link
Contributor Author

ArtPoon commented Feb 15, 2022

Priority of this is downgraded due to issue #45

@ArtPoon ArtPoon removed the help wanted Extra attention is needed label Mar 2, 2022
@ArtPoon
Copy link
Contributor Author

ArtPoon commented Mar 15, 2022

Deprecated with the replacement of minimap2 with viralrecon

@ArtPoon ArtPoon closed this as completed Mar 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant