Building a better pangenome #306

Open
GeorgeBGM opened this issue Jun 18, 2023 · 25 comments

@GeorgeBGM

Hi, how can I add new sample genomes and contigs to an existing pan-genome produced by PGGB? Can it be done directly with the Minigraph or GraphAligner tools? Any suggestions on how to do this would be appreciated.

@subwaystation
Member

Hi @George-du,

there are several possibilities:

  • Rebuild the whole graph with the new sample genomes and contigs added.
  • You might be able to map against the graph using GraphAligner, but only if you masked all complex and repetitive regions in your input sequences. Otherwise it won't scale.
  • minigraph is not an option here, because it is reference-based and only accepts rGFA.

I would recommend the first option, though I am aware of the computational overhead.
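
A minimal sketch of the first option, assuming the original assemblies and the new ones are available as separate FASTA files. The file names (`old.fa.gz`, `new.fa.gz`, `combined.fa.gz`) and the output directory are hypothetical; the `pggb` flags are the ones that appear in commands elsewhere in this thread, and `-n` is assumed to be the haplotype count, so it would have to grow with the new samples. The script only prints the commands so they can be reviewed before anything expensive is launched.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print the commands for rebuilding a graph from old + new assemblies.
# All file names are hypothetical placeholders.

rebuild_commands() {
  old_fa="$1"; new_fa="$2"; out_dir="$3"; n_haps="$4"; threads="$5"
  combined="combined.fa.gz"

  # Merge both inputs into one bgzip-compressed FASTA and index it.
  echo "zcat $old_fa $new_fa | bgzip -@ $threads > $combined"
  echo "samtools faidx $combined"

  # -p, -s and -k values are copied from commands in this thread; adjust as needed.
  echo "pggb -i $combined -o $out_dir -t $threads -p 98 -s 50000 -n $n_haps -k 311"
}

# Example: 236 existing haplotypes plus 2 new ones.
rebuild_commands old.fa.gz new.fa.gz ./graphs/rebuilt 238 30
```

Printing instead of executing keeps the sketch cheap to check; once the three lines look right they can be pasted into a job script.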

@GeorgeBGM
Author

Thank you for your reply. I will take your suggestion. I also think that a PGGB feature for adding new samples to an existing graph would be very useful.

@ekg
Collaborator

ekg commented Jun 20, 2023 via email

@GeorgeBGM
Author

Sounds great, thanks so much for your help.

@GeorgeBGM
Author

Hi, I am using the PGGB pipeline to build a pan-genome chromosome by chromosome, and smoothxg is failing on some of the chromosomes. Do you have any suggestions about the errors reported below? Thanks!

Software : Smoothxg(v0.6.8-0-ga8a0e9e)
Error1:
smoothxg -t 30 -T 30 -g ./graphs/chrY.pan/chrY.pan.new.fa.gz.bf8016f.04f1c29.seqwish.gfa -r 114 --base ./graphs/chrY.pan --chop-to 100 -I .9800 -R 0 -j 0 -e 0 -l 700,900,1100 -P 1,19,39,3,81,1 -O 0.03 -Y 11400 -d 0 -D 0 -S -Q Consensus_ -V -o ./graphs/chrY.pan/chrY.pan.new.fa.gz.bf8016f.04f1c29.5ef21f9.smooth.gfa
259730.80s user 26577.66s system 82% cpu 345179.20s total 53544752Kb max memory

image

Error2:
smoothxg -t 30 -T 30 -g ./graphs/chr9.pan/chr9.pan.new.fa.gz.2ca993e.04f1c29.seqwish.gfa -r 236 --base ./graphs/chr9.pan --chop-to 100 -I .9800 -R 0 -j 0 -e 0 -l 700,900,1100 -P 1,19,39,3,81,1 -O 0.03 -Y 23600 -d 0 -D 0 -S -Q Consensus_ -V -o ./graphs/chr9.pan/chr9.pan.new.fa.gz.2ca993e.04f1c29.03ca4fb.smooth.gfa
image

@GeorgeBGM
Author

Hi, did I describe the problem clearly? Are there any suggestions for solving it?

@GeorgeBGM
Author

Hi, developers!
What should I do to avoid the above reported error?
Should I re-run the smoothxg program without the -Q Consensus_ parameter and with -O 0, or do I need to reduce my mash length from 50kb to 10kb (reference: #182)? Are there any other suggestions? Also, do these two strategies have a significant impact on the final result?

@GeorgeBGM
Author

GeorgeBGM commented Aug 2, 2023

Hi, @subwaystation @ekg ,

I tried the above strategies on human chromosome 13, but the smoothxg step still fails. Are there any suggestions for this problem, or can I just use the results from before the smoothxg step?

  1. Re-run the smoothxg program without the -Q Consensus_ parameter and with -O 0

the command:
smoothxg -t 30 -T 30 -g ./graphs/chr13.pan/chr13*seqwish.gfa -r 236 --base ./graphs/chr13.pan --chop-to 100 -I .9800 -R 0 -j 0 -e 0 -l 700,900,1100 -P 1,19,39,3,81,1 -O 0 -Y 23 -D 0 -o ./g13.pan/chr13.pan/chr13.pan.fa.gz.2ca993e.04f1c29.03ca4fb.smooth.gfa

the error message:
[smoothxg::(1-3)::smooth_and_lace] embedding 79826 path fragments: 0.01% @ 2.81e+04/s elapsed: 00:00:00:00 remain: 00:00:00:02smoothxg: /opt/conda/conda-bld/smoothxg_1671059618733/work/src/smooth.cpp:2117: odgi::graph_t* smoothxg::smooth_and_lace(const xg::XG&, smoothxg::blockset_t*&, int, int, int, int, int, int, const bool&, const uint64_t&, float, uint64_t, bool, int, int, const string&, std::string&, bool, bool, double, bool, const string&, std::vector<std::__cxx11::basic_string >&, bool, uint64_t, const string&): Assertion `false' failed.

  2. Reduce my mash length from 50kb to 10kb

the command:
$RUN_PGGB -r -i /home/u20111010010/Project/Pan-genome/002.Merge_Pan_V2/Merge-V1/001.Sequence_partitioning/parts/chr$i.pan.new.fa.gz -o ./graphs/new_chr$i.pan -t 30 -p 98 -s 10000 -n 236 -k 311 -O 0.03 -T 30

the error message:
[smoothxg::(1-3)::break_and_split_blocks] cutting and splitting 869849 blocks: 100.00% @ 4.33e+04/s elapsed: 00:00:00:20 remain: 00:00:00:00smoothxg: /opt/conda/conda-bld/smoothxg_1671059618733/work/build/sdsl-lite-prefix/src/sdsl-lite-build/include/sdsl/enc_vector.hpp:193: sdsl::enc_vector<t_coder, t_dens, t_width>::value_type sdsl::enc_vector<t_coder, t_dens, t_width>::operator[](sdsl::enc_vector<t_coder, t_dens, t_width>::size_type) const [with t_coder = sdsl::coder::elias_delta; unsigned int t_dens = 128; unsigned char t_width = 0; sdsl::enc_vector<t_coder, t_dens, t_width>::value_type = long unsigned int; sdsl::enc_vector<t_coder, t_dens, t_width>::size_type = long unsigned int]: Assertion `i < m_size' failed.
Command terminated by signal 6

I'm looking forward to your reply.
Best, Du

@AndreaGuarracino
Member

Can you try the same command lines, but with PGGB installed via Docker/Singularity?

@GeorgeBGM
Author

Hi, developers!

I will try installing PGGB via Docker/Singularity. Do I need to install a specific version?

@AndreaGuarracino
Member

The latest version available, thanks!

@GeorgeBGM
Author

Got that. I'll try it again.

@GeorgeBGM
Author

GeorgeBGM commented Aug 7, 2023

Hi, @subwaystation @ekg @AndreaGuarracino,

I installed the latest PGGB (pggb 8eaf354) using Singularity with non-root privileges, but still get a similar error. The details of the reported error are as follows:

1. Re-run the smoothxg program without the -Q Consensus_ parameter and with -O 0 (mash length: 50kb/10kb):

the command:

10kb
RUN_PGGB="singularity exec /home/Software/pggb/pggb.simg pggb"
$RUN_PGGB -r -i chr13.pan.new.fa.gz -o new_chr13.pan -t 45 -p 98 -s 10000 -n 236 -k 311 -O 0.03 -T 45

50kb
singularity exec /home/Software/pggb/pggb.simg smoothxg -t 30 -T 30 -g chr13*seqwish.gfa -r 236 --base ./graphs/chr13.pan --chop-to 100 -I .9800 -R 0 -j 0 -e 0 -l 700,900,1100 -P 1,19,39,3,81,1 -O 0 -Y 23 -D 0 -o ./graphs/chr13.pan/chr13.pan.fa.gz.2ca993e.04f1c29.03ca4fb.smooth.gfa

the error message:

10kb
e+04 bp/s elapsed: 00:00:00:14 remain: 00:00:00:00
[smoothxg::(2-3)::smooth_and_lace] embedding 395114099 path fragments: 0.00% @ 0.00e+00 bp/s elapsed: 00:00:00:00 remain: 00:00:00:00
[smoothxg::(2-3)::smooth_and_lace] embedding 395114099 path fragments: 0.00% @ 2.25e+04 bp/s elapsed: 00:00:00:00 remain: 00:04:52:01
smoothxg: /smoothxg/src/smooth.cpp:2551: odgi::graph_t* smoothxg::smooth_and_lace(const xg::XG&, smoothxg::blockset_t*&, int, int, int, int, int, int, const bool&, const uint64_t&, float, uint64_t, bool, int, int, const string&, std::string&, bool, bool, double, bool, const string&, std::vector<std::__cxx11::basic_string >&, uint64_t, const string&): Assertion `false' failed.
Command terminated by signal 6
INFO: Cleaning up image...
smoothxg -t 45 -T 45 -g ./graphs/new_chr13.pan/chr13.pan.new.fa.gz.402d19f.04f1c29.seqwish.gfa -r 236 --base ./graphs/new_chr13.pan --chop-to 100 -I .9800 -R 0 -j 0 -e 0 -l 700,900,1100 -P 1,19,39,3,81,1 -O 0.03 -Y 23600 -d 0 -D 0 -Q Consensus_ -V -o ./graphs/new_chr13.pan/chr13.pan.new.fa.gz.402d19f.04f1c29.03ca4fb.smooth.gfa
1732131.33s user 1389324.41s system 2502% cpu 124738.32s total 286104940Kb max memory

50kb
02 remain: 00:00:00:04
[smoothxg::(1-3)::smooth_and_lace] adding edges from 992731 graphs: 100.00% @ 3.97e+05 bp/s elapsed: 00:00:00:02 remain: 00:00:00:00
[smoothxg::(1-3)::smooth_and_lace] embedding 76537735 path fragments: 0.00% @ 0.00e+00 bp/s elapsed: 00:00:00:00 remain: 00:00:00:00
[smoothxg::(1-3)::smooth_and_lace] embedding 76537735 path fragments: 0.00% @ 1.97e+04 bp/s elapsed: 00:00:00:00 remain: 00:01:04:52
smoothxg: /smoothxg/src/smooth.cpp:2551: odgi::graph_t* smoothxg::smooth_and_lace(const xg::XG&, smoothxg::blockset_t*&, int, int, int, int, int, int, const bool&, const uint64_t&, float, uint64_t, bool, int, int, const string&, std::string&, bool, bool, double, bool, const string&, std::vector<std::__cxx11::basic_string >&, uint64_t, const string&): Assertion `false' failed.
INFO: Cleaning up image...

2. re-run the PGGB pipeline using Singularity:

the command:

RUN_PGGB="singularity exec /home/Software/pggb/pggb.simg pggb"
$RUN_PGGB -r -i chr13.pan.new.fa.gz -o ./graphs/rerun-new_chr13.pan -t 45 -p 98 -s 10000 -n 236 -k 311 -O 0.03 -T 45

the error message:

[wfmash::skch::Map::mapQuery] count of mapped reads = 13369, reads qualified for mapping = 13641, total input reads = 13641, total input bp = 24623027601
[wfmash::map] time spent mapping the query: 3.71e+03 sec
[wfmash::map] mapping results saved in: /dev/stdout
wfmash -s 10000 -l 50000 -p 98 -n 235 -k 19 -H 0.001 -X -t 45 --tmp-base ./graphs/rerun-new_chr13.pan chr13.pan.new.fa.gz --approx-map
126560.51s user 7903.55s system 3462% cpu 3883.39s total 20414144Kb max memory
/usr/local/bin/pggb: line 497: /dev/fd/63: No such file or directory
INFO: Cleaning up image...

Do you have any suggestions for these reported errors? Thanks!

I'm looking forward to your reply.
Best, Du

@ekg
Collaborator

ekg commented Aug 8, 2023

It looks like two different issues.

If you re-run, do you ever get the exact same error in smooth_and_lace?
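
One way to answer this is to pull the failing source location out of each run's log and compare. A minimal sketch, assuming the stderr of each run was saved to a file (the log file names below are hypothetical); it relies only on the `file.cpp:line` pattern visible in the assertion messages quoted in this thread.

```shell
#!/usr/bin/env bash
# assertion_sites LOG...: count failed assertions per source location.
assertion_sites() {
  # Progress output uses carriage returns, so the assertion text may share a
  # line with progress messages; grep -o extracts just the file:line part.
  grep -h 'Assertion' "$@" \
    | grep -oE '[A-Za-z0-9_./-]+\.(cpp|hpp):[0-9]+' \
    | sort | uniq -c
}

# Hypothetical usage: compare two reruns of the same command.
# assertion_sites run1.log run2.log
```

If every rerun reports the same location, the failure is probably deterministic for that input; differing locations would point more towards the environment.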

@GeorgeBGM
Author

@ekg @subwaystation @AndreaGuarracino

Hi, developers!

The second attempt ran the PGGB pipeline completely from scratch using the Singularity image (non-root install). It produced an error after the wfmash step, so it never reached the smoothxg step.

The first attempt started from the output of the native Linux installation, where the smoothxg step had failed, and re-ran only that step using the smoothxg binary in the Singularity image.

So these really are two different issues. Thanks in advance!
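
On the second failure: `/dev/fd/63` is the kind of path bash creates for process substitution (`<(...)`), so the `No such file or directory` message suggests that `/dev/fd` was not usable where the pipeline ran. That is an assumption, not something confirmed in this thread. A small probe, as a sketch:

```shell
#!/usr/bin/env bash
# check_procsub: report whether bash process substitution works here.
check_procsub() {
  # Run the probe through bash explicitly, since <(...) is a bash feature.
  if [ "$(bash -c 'cat <(printf ok)' 2>/dev/null)" = "ok" ]; then
    echo "process substitution works"
  else
    echo "process substitution failed: check that /dev/fd is available"
    return 1
  fi
}

check_procsub
```

To test the container and not the host, the same script could be run as `singularity exec /home/Software/pggb/pggb.simg bash check.sh`, where `check.sh` is a hypothetical name for the file holding the script above.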

@subwaystation
Member

Hi @George-du,
would it be possible to share your input data or a tiny subset of it, which produces the issues? Thanks!

@GeorgeBGM
Author

Dear @subwaystation @AndreaGuarracino,

Here is the raw data I used for the above pipeline. Could you please help me pin down the errors? Thanks!
(https://sandbox.zenodo.org/record/1234413)

@GeorgeBGM
Author

GeorgeBGM commented Aug 25, 2023 via email

@GeorgeBGM
Author

Dear @subwaystation @AndreaGuarracino,

Can the data be downloaded and used properly?

@AndreaGuarracino
Member

@George-du, thank you for the data. I am running pggb with it on our cluster, installed by building each tool from GitHub source (so no Docker/Singularity).

pggb -i chr13.pan.new.fa.gz -p 98 -s 50000 -n 236 -k 311 -t 48 -o xxx -D /scratch

It is taking a while. At the moment it is at the 2nd round of SPOA, without issues.

@GeorgeBGM
Author

GeorgeBGM commented Aug 30, 2023

Dear @AndreaGuarracino @subwaystation,

Wow, that sounds good. The software versions I'm using in the PGGB pipeline are listed below. Additionally, I found that the smoothxg step for some chromosomes takes an extraordinarily long time to run and ends with an error (chr15: ~1 month, then "Command terminated by signal 7"). The details are as follows:

The software version of PGGB pipeline:
Wfmash : v0.10.3-3-g8ba3c53
Seqwish : v0.7.9-0-gd9e7ab5
Smoothxg : v0.6.8-0-ga8a0e9e
Odgi : v0.8.2-0-g8715c55

The commands and results are as follows (chr15 ; ~1 month ; Command terminated by signal 7) :
RUN_PGGB="/home/Software/Anaconda/mambaforge-pypy3/envs/pggb/bin/pggb"
sbatch -p tissue --job-name=chr15 --mem=300G -c 30 -o ./log/001.test-pggb-graph-chr15.out --wrap "$RUN_PGGB -r -i /home/Project/Pan-genome/002.Merge_Pan_V2/Merge-V1/001.Sequence_partitioning/parts/chr15.pan.new.fa.gz -o ./graphs/chr15.pan -t 30 -p 98 -s 50000 -n 236 -k 311 -O 0.03 -T 30"

image

Looking forward to the resolution of this issue. Thanks in advance.

@subwaystation
Member

Is there a possibility for you @George-du to run our latest Docker image?
You have quite a lot of data as input ^^
Maybe you ran out of disk space?
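
A quick way to check that, as a sketch: the function below reports free space for the working directory and the temporary directory. The directories checked and the 500 GiB threshold are placeholders, not recommendations from the maintainers. Signal 7 on Linux x86-64 is SIGBUS, which can be raised when a memory-mapped file cannot be backed by storage, so a full disk is a plausible cause, though nothing in this thread confirms it.

```shell
#!/usr/bin/env bash
# free_kb DIR: print the space available (in KiB) on the filesystem holding DIR.
free_kb() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# has_space DIR NEEDED_KB: succeed if at least NEEDED_KB KiB are free.
has_space() {
  [ "$(free_kb "$1")" -ge "$2" ]
}

needed_kb=$((500 * 1024 * 1024))   # placeholder threshold: 500 GiB
for dir in . "${TMPDIR:-/tmp}"; do
  if has_space "$dir" "$needed_kb"; then
    echo "$dir: OK ($(free_kb "$dir") KiB free)"
  else
    echo "$dir: low on space ($(free_kb "$dir") KiB free)"
  fi
done
```

If the temporary directory turns out to be the tight one, pggb's `-D` option (used as `-D /scratch` elsewhere in this thread) points temporary files at a different location.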

@GeorgeBGM
Author

Dear @subwaystation,

I will contact the administrator and try to run the latest Docker image. Thanks.

@AndreaGuarracino
Member

AndreaGuarracino commented Aug 30, 2023

@George-du, I was able to finish PGGB. It seems the problem is specific to your installation and/or cluster.

I've used

general:
  input-fasta:        /lizardfs/guarracino/bug_smoothxg/chr13.pan.new.fa.gz
  output-dir:         /lizardfs/guarracino/bug_smoothxg/xxx
  temp-dir:           /scratch
  resume:             false
  compress:           false
  threads:            48
  poa_threads:        48
wfmash:
  version:            v0.10.4-7-g0981b92
  segment-length:     50000
  block-length:       250000
  map-pct-id:         98
  n-mappings:         236
  no-splits:          false
  sparse-map:         false
  mash-kmer:          19
  mash-kmer-thres:    0.001
  exclude-delim:      false
  no-merge-segments:  false
seqwish:
  version:            v0.7.9-2-gf44b402
  min-match-len:      311
  sparse-factor:      0
  transclose-batch:   10000000
smoothxg:
  version:            v0.7.0-18-g4ff4cf2
  skip-normalization: false
  n-haps:             236
  path-jump-max:      0
  edge-jump-max:      0
  poa-length-target:  700,900,1100
  poa-params:         1,19,39,3,81,1
  poa_padding:        0.001
  run_abpoa:          false
  run_global_poa:     false
  pad-max-depth:      100
  write-maf:          false
  consensus-spec:     false
  consensus-prefix:   Consensus_
  block-id-min:       .9800
  block-ratio-min:    0
odgi:
  version:            v0.8.3-26-gbc7742ed
  viz:                true
  layout:             true
  stats:              false
gfaffix:
  version:            v0.1.5
  reduce-redundancy:  true
vg:
  version:            v1.50.1
  deconstruct:        false

@GeorgeBGM
Author

Wow, I'll reinstall the latest version of PGGB and test it out.
