Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assertion Error during MSA #10

Closed
jdakota1305 opened this issue Jan 7, 2021 · 16 comments
Closed

Assertion Error during MSA #10

jdakota1305 opened this issue Jan 7, 2021 · 16 comments

Comments

@jdakota1305
Copy link

Hi,

First just wanted to say thanks for creating such a valuable tool.

I am currently experiencing an issue when running the MSA step of the workflow. I am getting this:

MSA length: 1,756,900 bp
Traceback (most recent call last):
File "/home/djones2/anaconda3/envs/trycycler/bin/trycycler", line 8, in
sys.exit(main())
File "/home/djones2/anaconda3/envs/trycycler/lib/python3.8/site-packages/trycycler/main.py", line 46, in main
msa(args)
File "/home/djones2/anaconda3/envs/trycycler/lib/python3.8/site-packages/trycycler/msa.py", line 36, in msa
merge_pieces(temp_dir, args.cluster_dir, seqs)
File "/home/djones2/anaconda3/envs/trycycler/lib/python3.8/site-packages/trycycler/msa.py", line 187, in merge_pieces
assert seqs[n] == msa_minus_dashes
AssertionError

Does it seem like there is a resolve for this? Any experience with this issue? Thanks in advance!

@rrwick
Copy link
Owner

rrwick commented Jan 7, 2021

Thank is a strange one! After the sequences are aligned in the MSA, the Trycycler code checks to make sure that they are the same sequences as before. I.e. if you take each sequence and remove the gaps (-), you should have the same sequence you started with. This was intended a sanity check and really shouldn't fail 😬

I can't say what the problem is without deeper investigation. Is there any chance you could share the 2_all_seqs.fasta file with me? If I could reproduce the problem on my end, I could get to the bottom of it.

Ryan

@jdakota1305
Copy link
Author

Thank is a strange one! After the sequences are aligned in the MSA, the Trycycler code checks to make sure that they are the same sequences as before. I.e. if you take each sequence and remove the gaps (-), you should have the same sequence you started with. This was intended a sanity check and really shouldn't fail 😬

I can't say what the problem is without deeper investigation. Is there any chance you could share the 2_all_seqs.fasta file with me? If I could reproduce the problem on my end, I could get to the bottom of it.

Ryan

@jdakota1305
Copy link
Author

2_all_seqs.fasta.gz

Hi Ryan,

Thanks for the quick response. I have attached the requested file. I think this might have to do with the size of the max indel--much appreciated for checking into this.

Best,
Dakota

@rrwick
Copy link
Owner

rrwick commented Jan 10, 2021

This turned out to be a fairly straightforward bug: the lowercase characters in your inputs confused Trycycler. I guess I never tested it with an assembler that uses lowercase in its FASTA files!

I've fixed the problem by making Trycycler explicitly convert to uppercase loading and saving FASTAs. You can grab the fixed version by either pulling from the master branch or from this fresh release: v0.4.2.

Thanks for letting me know and sharing your sequence for debugging purposes!

Ryan

@rrwick rrwick closed this as completed Jan 10, 2021
@spock
Copy link

spock commented May 28, 2021

I have encountered the same issue with version 0.5.0, so this isn't upper/lower case anymore.
And I haven't seen any warnings about MUSCLE alignments, so probably not #4 either.

At first, I had 4 assemblies, which looked like this after reconcile:

A_tig00000001 vs B_contig_3...    98.94% identity, max indel = 42
A_tig00000001 vs C_Utg2960...     99.53% identity, max indel = 242
A_tig00000001 vs D_ctg1...        99.92% identity, max indel = 3
   B_contig_3 vs C_Utg2960...     98.51% identity, max indel = 264
   B_contig_3 vs D_ctg1...        98.90% identity, max indel = 251
    C_Utg2960 vs D_ctg1...        99.48% identity, max indel = 224

Pairwise identities:
  A_tig00000001: 100.00%   98.94%   99.53%   99.92%
  B_contig_3:     98.94%  100.00%   98.51%   98.90%
  C_Utg2960:      99.53%   98.51%  100.00%   99.48%
  D_ctg1:         99.92%   98.90%   99.48%  100.00%

Maximum insertion/deletion sizes:
  A_tig00000001:   0   42  242    3
  B_contig_3:     42    0  264  251
  C_Utg2960:     242  264    0  224
  D_ctg1:          3  251  224    0

(To achieve the above, I had to slightly increase the max_indel_size to 265 and max_add_seq to 11000.)

At trycyler msa step this resulted in:

    Each of the MSA pieces are now merged together and saved to file.

MSA length: 3,327,393 bp
Traceback (most recent call last):
  File ".local/bin/trycycler", line 8, in <module>
    sys.exit(main())
  File ".local/lib/python3.7/site-packages/trycycler/__main__.py", line 51, in main
    msa(args)
  File ".local/lib/python3.7/site-packages/trycycler/msa.py", line 36, in msa
    merge_pieces(temp_dir, args.cluster_dir, seqs)
  File ".local/lib/python3.7/site-packages/trycycler/msa.py", line 187, in merge_pieces
    assert seqs[n].upper() == msa_minus_dashes
AssertionError

I then tried removing the most obvious outlier C:

A_tig00000001 vs B_contig_3...    98.94% identity, max indel = 42
A_tig00000001 vs D_ctg1...        99.92% identity, max indel = 3
   B_contig_3 vs D_ctg1...        98.90% identity, max indel = 251

Pairwise identities:
  A_tig00000001: 100.00%   98.94%   99.92%
  B_contig_3:     98.94%  100.00%   98.90%
  D_ctg1:         99.92%   98.90%  100.00%

Maximum insertion/deletion sizes:
  A_tig00000001:   0   42    3
  B_contig_3:     42    0  251
  D_ctg1:          3  251    0

Still got AssertionError, so I removed the "second worst" B, and then trycycler msa worked:

A_tig00000001 vs D_ctg1...        99.92% identity, max indel = 3

Pairwise identities:
  A_tig00000001: 100.00%   99.92%
  D_ctg1:         99.92%  100.00%

Maximum insertion/deletion sizes:
  A_tig00000001: 0  3
  D_ctg1:        3  0

Unfortunately, I can't share the sequences for detailed debugging, but I do have some observations about these assemblies:

  • the B assembly (flye) had an isolated 36kb fragment which all other assemblers placed within the chromosome; it is also the smallest assembly of all
  • the C assembly (raven) only stands out to me with perceptibly higher base error rate, but I didn't spot other irregularities with it

This is unlikely to help much, but here are some more outputs from the first stages of the pipeline:

A (canu_2019.fasta):
  288,715 alignments, mean depth = 605.7x
  A_tig00000001: 3,335,954 bp, 609.6x
  A_tig00000003:    29,452 bp, 167.2x

B (flye_2019.fasta):
  288,162 alignments, mean depth = 610.7x
  B_contig_1:    36,610 bp, 499.2x
  B_contig_3: 3,286,113 bp, 611.9x

C (raven_2019.fasta):
  288,532 alignments, mean depth = 608.5x
  C_Utg2958:    18,561 bp, 231.9x
  C_Utg2960: 3,319,309 bp, 610.6x

D (redbean_2019.fasta):
  287,825 alignments, mean depth = 611.1x
  D_ctg2:     8,729 bp,   0.0x
  D_ctg1: 3,307,815 bp, 612.8x

...

    Mash is used to build a distance matrix of all contigs in the assemblies.

A_tig00000001: 0.000  0.250  0.192  0.001  0.250  0.002  0.001
A_tig00000003: 0.250  0.000  0.250  0.250  0.005  0.250  0.250
B_contig_1:    0.192  0.250  0.000  0.250  0.250  0.192  0.204
B_contig_3:    0.001  0.250  0.250  0.000  0.250  0.002  0.001
C_Utg2958:     0.250  0.005  0.250  0.250  0.000  0.250  0.250
C_Utg2960:     0.001  0.250  0.178  0.002  0.250  0.000  0.002
D_ctg1:        0.001  0.250  0.204  0.001  0.250  0.002  0.000

# I have only worked on cluster_001: cluster_003 is a clear artifact,
# and I'm not entirely sure what cluster_002 is - could be a  plasmid assembled to 2x of it's length
# (at least that's what the dotplot suggests).

2019/cluster_001/1_contigs:
  2019/cluster_001/1_contigs/A_tig00000001.fasta: 3,335,954 bp, 609.6x
  2019/cluster_001/1_contigs/B_contig_3.fasta:    3,286,113 bp, 611.9x
  2019/cluster_001/1_contigs/C_Utg2960.fasta:     3,319,309 bp, 610.6x
  2019/cluster_001/1_contigs/D_ctg1.fasta:        3,307,815 bp, 612.8x

2019/cluster_002/1_contigs:
  2019/cluster_002/1_contigs/A_tig00000003.fasta:    29,452 bp, 167.2x
  2019/cluster_002/1_contigs/C_Utg2958.fasta:        18,561 bp, 231.9x

2019/cluster_003/1_contigs:
  2019/cluster_003/1_contigs/B_contig_1.fasta:       36,610 bp, 499.2x

and a dotplot for cluster_001

image

@DingJingZhi
Copy link

Merging MSA (2024-09-13 16:33:34)
Each of the MSA pieces are now merged together and saved to file.
File "/home/djz/miniconda3/envs/trycycler/lib/python3.10/site-packages/trycycler/msa.py", line 196, in merge_pieces
assert seqs[n].upper() == msa_minus_dashes
AssertionError
my solution: line 196, add "#", results is ok, passed!

AssertionError
File "/home/djz/miniconda3/envs/trycycler/lib/python3.10/site-packages/trycycler/consensus.py", line 675, in sanity_check_msa
assert seqs[n] == msa_seqs[n].replace('-', '')

my solution: line 675, add "#", results is ok, passed!

then, I harvested this:
Saving sequence to file: trycycler/cluster_001/7_final_consensus.fasta

RRwich, OK?

@rrwick
Copy link
Owner

rrwick commented Sep 16, 2024

It might be okay, but I would like to understand why that assertion caused a crash. Simply commenting out the assertion line prevents the crash, but it might have allowed corrupted data through.

If you can share your data, could you send me the 2_all_seqs.fasta file? Then I can try running trycycler msa myself to replicate the problem.

@marade
Copy link

marade commented Oct 2, 2024

Hi Ryan. I've encountered this error a couple times recently. While I cannot send the sequences, I did inspect 2_all_seqs.fasta in Vim. Specifically I used :/[^ATCG] to search for non-ATCG characters, and there were none except for the headers:

>C_utg000001c
>F_utg000001c
>G_contig_2

Hope this helps.

@rrwick
Copy link
Owner

rrwick commented Oct 3, 2024

Sorry, but that doesn't shed any light on the problem 😕

If you're able to email me the file, I'll delete it once I've debugged the issue. I won't include my email here (for bot-scraping reasons), but you can find it in the license message at the top of Trycycler's source files.

@marade
Copy link

marade commented Oct 3, 2024

Okay, I sent you an e-mail.

@marade
Copy link

marade commented Oct 31, 2024

Hi Ryan. It happened again. I hesitate to send you the sequences because I think it'll just mysteriously work for you again. Taking a cue from @DingJingZhi, I simply commented out line 196 of msa.py and lines 673-675 of consensus.py to get it to work. Note I needed to comment another two lines from consensus.py compared to @DingJingZhi.

@rrwick
Copy link
Owner

rrwick commented Nov 1, 2024

Very frustrating! Since your previous case worked on my computer but not on yours, I suspect this could be a machine-specific problem. Do you have access to another computer where you could try running it?

Also, you could try both MUSCLE v3 and MUSCLE v5. Both worked for me on your last case, but since MUSCLE is doing the heavy lifting for Trycycler MSA, I wouldn't be surprised if the problem relates to MUSCLE somehow.

Ryan

@marade
Copy link

marade commented Nov 1, 2024

I've been running this on AWS EC2 custom AMIs. Each time I do a run, I'm starting with a fresh machine image, with all the Bioconda stuff preinstalled. So it's a predefined, pristine sandbox each time with the same software versions. For MUSCLE, I've been using the 3.8.31 version from Bioconda. I'll try some other versions.

@marade
Copy link

marade commented Nov 1, 2024

...and that did it. Bioconda muscle-3.8.1551 works, but muscle-3.8.31 does not. I can upgrade and downgrade, and these results are reproducable.

@rrwick
Copy link
Owner

rrwick commented Nov 2, 2024

Excellent!

I can see on the MUSCLE v3 download page that v3.8.425 has a 'bug fix for long sequences', and presumably v3.8.1551 is a later version and also contains that fix. The alignment @marade sent me had a very long partition, so I suspect the MUSCLE bug may be behind this Trycycler problem.

Hopefully using v3.8.1551 also fixes the problem for others. I'll add a note about this to Trycycler's FAQ.

@DingJingZhi
Copy link

DingJingZhi commented Nov 2, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants