Continue crashed analysis from tree inference step #61

Open
diegomarquezp opened this issue Aug 9, 2021 · 8 comments

Comments

@diegomarquezp

diegomarquezp commented Aug 9, 2021

Hello Siavash, I hope you are well.
I'm reaching the final steps of building a tree from the Silva 13.8 dataset.
Unfortunately, the run crashed during the tree inference step.
The single iteration of the realignment step took a bit more than 3 weeks. I checked the wiki for a way to continue from the alignment produced by that step, but the --aligned option apparently still performs a realignment step.
Based on timing estimates from subsets of the Silva database, finishing the tree should take about 3 more days if we can skip the realignment; otherwise it would take another 3 weeks.

Is there an option in the wiki that I'm missing for continuing from this substep of the iteration? Otherwise, I can try to modify the code so that the last alignment is passed to the first iteration. In that case, could you point me to the files involved in this change, or to any development documentation that would help?

Update: I found out that the inference step consists of a call to fasttreeMP; the debug output shows the exact arguments the binary is executed with. I think the final steps would involve running a modified version of treeholder.py.

Thanks in advance for your help.

@smirarab
Owner

smirarab commented Aug 9, 2021 via email

@diegomarquezp
Author

diegomarquezp commented Aug 9, 2021

Last month I started PASTA with -i pastajob_temp_iteration_initialsearch_seq_alignment.txt --aligned
As of today, the last file produced in the folder is pastajob_temp_iteration_0_seq_alignment.txt

So is it possible to obtain the final tree directly with fasttreeMP from ...iteration_0_seq_alignment.txt?

Judging from the logs of the subset tests, I'm guessing the command below is the one to use for this last step:

/home/ec2-user/pasta-code/pasta/bin/fasttreeMP -quiet -nt -gtr -gamma -fastest -intree /home/ec2-user/.pasta/pastajob/tempBaXNdl/step0/mincluster/tempfasttreeGOBzhJ/start.tre -log /home/ec2-user/.pasta/pastajob/tempBaXNdl/step0/mincluster/tempfasttreeGOBzhJ/log /home/ec2-user/.pasta/pastajob/tempBaXNdl/step0/mincluster/tempfasttreeGOBzhJ/input.fasta

(assuming input.fasta == ...iteration_0_seq_alignment.txt)
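
For reference, the invocation quoted above could be assembled and run from a small Python wrapper roughly as follows. This is only a sketch: the flags are copied from the debug output in this comment, while the function names and the file paths in the usage line are hypothetical placeholders, and whether the old start.tre is still available (or appropriate to pass via -intree) is an assumption to check against your own run.

```python
import shlex
import subprocess
from typing import List, Optional


def build_fasttree_cmd(
    binary: str,
    alignment: str,
    log_path: str,
    start_tree: Optional[str] = None,
) -> List[str]:
    """Assemble a FastTree argument list with the flags seen in the debug output."""
    cmd = [binary, "-quiet", "-nt", "-gtr", "-gamma", "-fastest"]
    if start_tree is not None:
        # Only pass -intree when a starting tree is actually available.
        cmd += ["-intree", start_tree]
    # The alignment is the last positional argument.
    cmd += ["-log", log_path, alignment]
    return cmd


def run_fasttree(cmd: List[str], tree_out: str) -> None:
    """Run FastTree; it writes the Newick tree to standard output."""
    with open(tree_out, "w") as handle:
        subprocess.run(cmd, stdout=handle, check=True)


if __name__ == "__main__":
    # Placeholder paths (assumptions): substitute the real files from the job folder.
    cmd = build_fasttree_cmd(
        "fasttreeMP",
        "pastajob_temp_iteration_0_seq_alignment.txt",
        "fasttree_restart.log",
        start_tree="start.tre",
    )
    # Print the command for inspection instead of launching a multi-day job.
    print(shlex.join(cmd))
```

Printing the command first makes it easy to compare against the line in the PASTA debug log before calling run_fasttree on the full alignment.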

edit: Yes, it did perform a realignment step (one iteration)

Thank you!

@smirarab
Owner

smirarab commented Aug 10, 2021 via email

@diegomarquezp
Author

Thanks so much, Siavash. I'll let you know how FastTree goes.

@smirarab
Owner

smirarab commented Aug 10, 2021 via email

@diegomarquezp
Author

Hi Siavash.

Thanks for adding the steps to the tutorial.
I was able to restart the crashed step using the alignment from the first (and only) iteration, and I finally obtained a tree this week.
With the tree and the aligned sequences, I tried to run SEPP through QIIME after importing the PASTA results, but it took far too long (15+ hours) compared with the SEPP reference database that you published for SILVA 12.8 (40 minutes).
I noticed that the aligned sequences contained in the 12.8 QZA file are only a subset of the whole 12.8 reference database.
Did you use any particular criteria to extract that subset? Would restricting the aligned sequence set to 2-3 sequences per species do the job? That would roughly match the size of the 12.8 sequence set.
That would be the only remaining step, since we already have the alignment.
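
To make the proposal concrete, a cap of 2-3 sequences per species could be sketched as below. This is a minimal illustration, not something taken from PASTA or SEPP: the function name, the `species_of` mapping (sequence ID to species label, e.g. parsed from a taxonomy file) and the choice to keep the first sequences encountered are all assumptions. The tree would presumably also need to be pruned to the same set of sequences afterwards.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def pick_per_species(
    seq_ids: Iterable[str],
    species_of: Dict[str, str],
    max_per_species: int = 3,
) -> List[str]:
    """Return sequence IDs, keeping at most max_per_species per species label.

    Sequences are kept in input order (first come, first kept). A sequence
    with no label in species_of is treated as its own group and always kept.
    """
    counts: Dict[str, int] = defaultdict(int)
    kept: List[str] = []
    for seq_id in seq_ids:
        # Fall back to the ID itself so unlabeled sequences are never dropped.
        species = species_of.get(seq_id, seq_id)
        if counts[species] < max_per_species:
            counts[species] += 1
            kept.append(seq_id)
    return kept


if __name__ == "__main__":
    # Toy data (hypothetical IDs and labels).
    ids = ["a1", "a2", "a3", "a4", "b1", "u1"]
    labels = {"a1": "A", "a2": "A", "a3": "A", "a4": "A", "b1": "B"}
    print(pick_per_species(ids, labels, max_per_species=2))
```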

Thank you very much.

@smirarab
Owner

Hi Diego,

There are two potential reasons.

  1. The default output of PASTA is not masked for super gappy sites. Many sites contain just a couple of letters among millions of species. These sites need to be removed before the alignment is used as input to SEPP. For removing gappy sites, I suggest you use the run_seqtools.py method that you learned about in the tutorial. I would remove sites with 99.9% or 99% gaps. You can try different thresholds and see how many sites are left in the final alignment. You should hopefully end up with something on the same order as 12.8 (thousands of sites).

  2. Once (1) is taken care of, if the running time is still high, we can think about removing sequences that are too similar to each other. For that, I would suggest a threshold of around 99% similarity. You can also use our tool TreeCluster (https://github.com/niemasd/TreeCluster) to find the optimal subset given the tree you already have.
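
The masking idea in (1) can be illustrated with a short standalone sketch. It is not run_seqtools.py, whose options should be taken from the tutorial; the function names here are hypothetical, only "-" is treated as a gap (an assumption), and the whole alignment is held in memory, which is fine for a toy example but likely too heavy for a full SILVA alignment.

```python
from typing import Dict, List


def gap_fractions(seqs: List[str]) -> List[float]:
    """Return, for each alignment column, the fraction of sequences with a gap."""
    if not seqs:
        return []
    n_cols = len(seqs[0])
    gap_counts = [0] * n_cols
    for seq in seqs:
        if len(seq) != n_cols:
            raise ValueError("all aligned sequences must have the same length")
        for i, ch in enumerate(seq):
            if ch == "-":  # assumption: '-' is the only gap character
                gap_counts[i] += 1
    return [count / len(seqs) for count in gap_counts]


def mask_gappy_sites(alignment: Dict[str, str], threshold: float) -> Dict[str, str]:
    """Drop every column whose gap fraction is at or above the threshold."""
    names = list(alignment)
    fractions = gap_fractions([alignment[name] for name in names])
    keep = [i for i, frac in enumerate(fractions) if frac < threshold]
    return {name: "".join(alignment[name][i] for i in keep) for name in names}


if __name__ == "__main__":
    # Toy alignment (hypothetical data) to compare thresholds.
    toy = {"s1": "AC-G", "s2": "A--G", "s3": "A--T", "s4": "-C-T"}
    for threshold in (0.5, 0.99):
        masked = mask_gappy_sites(toy, threshold)
        n_sites = len(next(iter(masked.values())))
        print(f"threshold={threshold}: {n_sites} sites left")
```

Looping over a few thresholds, as in the usage block, mirrors the suggestion to try 99% and 99.9% and compare how many sites remain.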

Thanks

@diegomarquezp
Author

Hi Siavash, thanks for the response. I will let you know how this goes.
