Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assertion Error during MSA #10

Closed
jdakota1305 opened this issue Jan 7, 2021 · 10 comments
Closed

Assertion Error during MSA #10

jdakota1305 opened this issue Jan 7, 2021 · 10 comments

Comments

@jdakota1305
Copy link

Hi,

First just wanted to say thanks for creating such a valuable tool.

I am currently experiencing an issue when running the MSA step of the workflow. I am getting this:

MSA length: 1,756,900 bp
Traceback (most recent call last):
File "/home/djones2/anaconda3/envs/trycycler/bin/trycycler", line 8, in
sys.exit(main())
File "/home/djones2/anaconda3/envs/trycycler/lib/python3.8/site-packages/trycycler/main.py", line 46, in main
msa(args)
File "/home/djones2/anaconda3/envs/trycycler/lib/python3.8/site-packages/trycycler/msa.py", line 36, in msa
merge_pieces(temp_dir, args.cluster_dir, seqs)
File "/home/djones2/anaconda3/envs/trycycler/lib/python3.8/site-packages/trycycler/msa.py", line 187, in merge_pieces
assert seqs[n] == msa_minus_dashes
AssertionError

Does it seem like there is a resolve for this? Any experience with this issue? Thanks in advance!

@rrwick
Copy link
Owner

rrwick commented Jan 7, 2021

Thank is a strange one! After the sequences are aligned in the MSA, the Trycycler code checks to make sure that they are the same sequences as before. I.e. if you take each sequence and remove the gaps (-), you should have the same sequence you started with. This was intended a sanity check and really shouldn't fail 😬

I can't say what the problem is without deeper investigation. Is there any chance you could share the 2_all_seqs.fasta file with me? If I could reproduce the problem on my end, I could get to the bottom of it.

Ryan

@jdakota1305
Copy link
Author

Thank is a strange one! After the sequences are aligned in the MSA, the Trycycler code checks to make sure that they are the same sequences as before. I.e. if you take each sequence and remove the gaps (-), you should have the same sequence you started with. This was intended a sanity check and really shouldn't fail 😬

I can't say what the problem is without deeper investigation. Is there any chance you could share the 2_all_seqs.fasta file with me? If I could reproduce the problem on my end, I could get to the bottom of it.

Ryan

@jdakota1305
Copy link
Author

2_all_seqs.fasta.gz

Hi Ryan,

Thanks for the quick response. I have attached the requested file. I think this might have to do with the size of the max indel--much appreciated for checking into this.

Best,
Dakota

@rrwick
Copy link
Owner

rrwick commented Jan 10, 2021

This turned out to be a fairly straightforward bug: the lowercase characters in your inputs confused Trycycler. I guess I never tested it with an assembler that uses lowercase in its FASTA files!

I've fixed the problem by making Trycycler explicitly convert to uppercase loading and saving FASTAs. You can grab the fixed version by either pulling from the master branch or from this fresh release: v0.4.2.

Thanks for letting me know and sharing your sequence for debugging purposes!

Ryan

@rrwick rrwick closed this as completed Jan 10, 2021
@spock
Copy link

spock commented May 28, 2021

I have encountered the same issue with version 0.5.0, so this isn't upper/lower case anymore.
And I haven't seen any warnings about MUSCLE alignments, so probably not #4 either.

At first, I had 4 assemblies, which looked like this after reconcile:

A_tig00000001 vs B_contig_3...    98.94% identity, max indel = 42
A_tig00000001 vs C_Utg2960...     99.53% identity, max indel = 242
A_tig00000001 vs D_ctg1...        99.92% identity, max indel = 3
   B_contig_3 vs C_Utg2960...     98.51% identity, max indel = 264
   B_contig_3 vs D_ctg1...        98.90% identity, max indel = 251
    C_Utg2960 vs D_ctg1...        99.48% identity, max indel = 224

Pairwise identities:
  A_tig00000001: 100.00%   98.94%   99.53%   99.92%
  B_contig_3:     98.94%  100.00%   98.51%   98.90%
  C_Utg2960:      99.53%   98.51%  100.00%   99.48%
  D_ctg1:         99.92%   98.90%   99.48%  100.00%

Maximum insertion/deletion sizes:
  A_tig00000001:   0   42  242    3
  B_contig_3:     42    0  264  251
  C_Utg2960:     242  264    0  224
  D_ctg1:          3  251  224    0

(To achieve the above, I had to slightly increase the max_indel_size to 265 and max_add_seq to 11000.)

At trycyler msa step this resulted in:

    Each of the MSA pieces are now merged together and saved to file.

MSA length: 3,327,393 bp
Traceback (most recent call last):
  File ".local/bin/trycycler", line 8, in <module>
    sys.exit(main())
  File ".local/lib/python3.7/site-packages/trycycler/__main__.py", line 51, in main
    msa(args)
  File ".local/lib/python3.7/site-packages/trycycler/msa.py", line 36, in msa
    merge_pieces(temp_dir, args.cluster_dir, seqs)
  File ".local/lib/python3.7/site-packages/trycycler/msa.py", line 187, in merge_pieces
    assert seqs[n].upper() == msa_minus_dashes
AssertionError

I then tried removing the most obvious outlier C:

A_tig00000001 vs B_contig_3...    98.94% identity, max indel = 42
A_tig00000001 vs D_ctg1...        99.92% identity, max indel = 3
   B_contig_3 vs D_ctg1...        98.90% identity, max indel = 251

Pairwise identities:
  A_tig00000001: 100.00%   98.94%   99.92%
  B_contig_3:     98.94%  100.00%   98.90%
  D_ctg1:         99.92%   98.90%  100.00%

Maximum insertion/deletion sizes:
  A_tig00000001:   0   42    3
  B_contig_3:     42    0  251
  D_ctg1:          3  251    0

Still got AssertionError, so I removed the "second worst" B, and then trycycler msa worked:

A_tig00000001 vs D_ctg1...        99.92% identity, max indel = 3

Pairwise identities:
  A_tig00000001: 100.00%   99.92%
  D_ctg1:         99.92%  100.00%

Maximum insertion/deletion sizes:
  A_tig00000001: 0  3
  D_ctg1:        3  0

Unfortunately, I can't share the sequences for detailed debugging, but I do have some observations about these assemblies:

  • the B assembly (flye) had an isolated 36kb fragment which all other assemblers placed within the chromosome; it is also the smallest assembly of all
  • the C assembly (raven) only stands out to me with perceptibly higher base error rate, but I didn't spot other irregularities with it

This is unlikely to help much, but here are some more outputs from the first stages of the pipeline:

A (canu_2019.fasta):
  288,715 alignments, mean depth = 605.7x
  A_tig00000001: 3,335,954 bp, 609.6x
  A_tig00000003:    29,452 bp, 167.2x

B (flye_2019.fasta):
  288,162 alignments, mean depth = 610.7x
  B_contig_1:    36,610 bp, 499.2x
  B_contig_3: 3,286,113 bp, 611.9x

C (raven_2019.fasta):
  288,532 alignments, mean depth = 608.5x
  C_Utg2958:    18,561 bp, 231.9x
  C_Utg2960: 3,319,309 bp, 610.6x

D (redbean_2019.fasta):
  287,825 alignments, mean depth = 611.1x
  D_ctg2:     8,729 bp,   0.0x
  D_ctg1: 3,307,815 bp, 612.8x

...

    Mash is used to build a distance matrix of all contigs in the assemblies.

A_tig00000001: 0.000  0.250  0.192  0.001  0.250  0.002  0.001
A_tig00000003: 0.250  0.000  0.250  0.250  0.005  0.250  0.250
B_contig_1:    0.192  0.250  0.000  0.250  0.250  0.192  0.204
B_contig_3:    0.001  0.250  0.250  0.000  0.250  0.002  0.001
C_Utg2958:     0.250  0.005  0.250  0.250  0.000  0.250  0.250
C_Utg2960:     0.001  0.250  0.178  0.002  0.250  0.000  0.002
D_ctg1:        0.001  0.250  0.204  0.001  0.250  0.002  0.000

# I have only worked on cluster_001: cluster_003 is a clear artifact,
# and I'm not entirely sure what cluster_002 is - could be a  plasmid assembled to 2x of it's length
# (at least that's what the dotplot suggests).

2019/cluster_001/1_contigs:
  2019/cluster_001/1_contigs/A_tig00000001.fasta: 3,335,954 bp, 609.6x
  2019/cluster_001/1_contigs/B_contig_3.fasta:    3,286,113 bp, 611.9x
  2019/cluster_001/1_contigs/C_Utg2960.fasta:     3,319,309 bp, 610.6x
  2019/cluster_001/1_contigs/D_ctg1.fasta:        3,307,815 bp, 612.8x

2019/cluster_002/1_contigs:
  2019/cluster_002/1_contigs/A_tig00000003.fasta:    29,452 bp, 167.2x
  2019/cluster_002/1_contigs/C_Utg2958.fasta:        18,561 bp, 231.9x

2019/cluster_003/1_contigs:
  2019/cluster_003/1_contigs/B_contig_1.fasta:       36,610 bp, 499.2x

and a dotplot for cluster_001

image

@DingJingZhi
Copy link

Merging MSA (2024-09-13 16:33:34)
Each of the MSA pieces are now merged together and saved to file.
File "/home/djz/miniconda3/envs/trycycler/lib/python3.10/site-packages/trycycler/msa.py", line 196, in merge_pieces
assert seqs[n].upper() == msa_minus_dashes
AssertionError
my solution: line 196, add "#", results is ok, passed!

AssertionError
File "/home/djz/miniconda3/envs/trycycler/lib/python3.10/site-packages/trycycler/consensus.py", line 675, in sanity_check_msa
assert seqs[n] == msa_seqs[n].replace('-', '')

my solution: line 675, add "#", results is ok, passed!

then, I harvested this:
Saving sequence to file: trycycler/cluster_001/7_final_consensus.fasta

RRwich, OK?

@rrwick
Copy link
Owner

rrwick commented Sep 16, 2024

It might be okay, but I would like to understand why that assertion caused a crash. Simply commenting out the assertion line prevents the crash, but it might have allowed corrupted data through.

If you can share your data, could you send me the 2_all_seqs.fasta file? Then I can try running trycycler msa myself to replicate the problem.

@marade
Copy link

marade commented Oct 2, 2024

Hi Ryan. I've encountered this error a couple times recently. While I cannot send the sequences, I did inspect 2_all_seqs.fasta in Vim. Specifically I used :/[^ATCG] to search for non-ATCG characters, and there were none except for the headers:

>C_utg000001c
>F_utg000001c
>G_contig_2

Hope this helps.

@rrwick
Copy link
Owner

rrwick commented Oct 3, 2024

Sorry, but that doesn't shed any light on the problem 😕

If you're able to email me the file, I'll delete it once I've debugged the issue. I won't include my email here (for bot-scraping reasons), but you can find it in the license message at the top of Trycycler's source files.

@marade
Copy link

marade commented Oct 3, 2024

Okay, I sent you an e-mail.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants