Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to use Bgzip #167

Closed
mictadlo opened this issue Feb 12, 2024 · 32 comments
Closed

How to use Bgzip #167

mictadlo opened this issue Feb 12, 2024 · 32 comments
Labels
question Further information is requested

Comments

@mictadlo
Copy link

Description of feature

Hi,
I have 5 genomes and 5 FASTA files. How to use Bgzip?

Thank you in advance,

Best wishes,

Michal

@mictadlo mictadlo added the enhancement Improvement for existing functionality label Feb 12, 2024
@mictadlo
Copy link
Author

mictadlo commented Feb 12, 2024

I did the following steps:

cat ../genomes/ChinaLab/NbeHZ1_genome_1.0.fa ../genomes/JapanLab/Nbe_v1_scf.fa ../genomes/LAB360/NbLab360.genome.fasta ../genomes/UsaLab/Niben261_genome.fasta > ../genomes/chnVSjapVSauVSusa.fasta

bgzip -@ 4 ../genomes/chnVSjapVSauVSusa.fasta
samtools faidx ../genomes/chnVSjapVSauVSusa.fasta.gz

nextflow run nf-core/pangenome \
      -r 1.0.0 \
      --wfmash_chunks 20 \
      --input ../genomes/chnVSjapVSauVSusa.fasta.gz \
      --n_haplotypes 1 \
      --outdir results \
      -profile singularity \
      -resume

Unfortunately, I got the following error:

  error [nextflow.exception.ProcessFailedException]: Process `NFCORE_PANGENOME:PANGENOME:PGGB:WFMASH_MAP (chnVSjapVSauVSusa.fasta.gz)` terminated with an error exit status (1)
Feb-12 17:49:57.268 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'NFCORE_PANGENOME:PANGENOME:PGGB:WFMASH_MAP (chnVSjapVSauVSusa.fasta.gz)'

Caused by:
  Process `NFCORE_PANGENOME:PANGENOME:PGGB:WFMASH_MAP (chnVSjapVSauVSusa.fasta.gz)` terminated with an error exit status (1)

Command executed:

  wfmash \
      chnVSjapVSauVSusa.fasta.gz \
      chnVSjapVSauVSusa.fasta.gz \
       \
      --threads 6 \
       \
      -n 0 -s 5000 -p 90  -X  -l 25000 -k 19 -H 0.001   -m > chnVSjapVSauVSusa.fasta.gz.paf
  
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_PANGENOME:PANGENOME:PGGB:WFMASH_MAP":
      wfmash: $(echo $(wfmash --version 2>&1) | cut -f 1 -d '-' | cut -f 2 -d 'v')
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  [wfmash] ERROR, skch::parseandSave, the number of mappings to retain for each segment has to be grater than 0.

what did I miss?

@subwaystation
Copy link
Collaborator

Hi @mictadlo,

from my understanding you have 5 different haplotypes in your FASTA.
So please set --n_haplotypes 5. This should do the trick.

@mictadlo
Copy link
Author

Hi,
Thank you, I noticed that my FASTA files do not follow. PanSN-spec naming scheme. For each FASTA file I increased haplotype_id := number. Are those changes correct?

> grep ">" ChinaLab/NbeHZ1_genome_1.0.fa | head
>Chr01
>Chr02
>Chr03
>Chr04
>Chr05
>Chr06
>Chr07
>Chr08
>Chr09
>Chr10
grep ">" ChinaLab/NbeHZ1_genome_1.0.fa | tail
>scaffold_990
>scaffold_991
>scaffold_992
>scaffold_993
>scaffold_994
>scaffold_995
>scaffold_996
>scaffold_997
>scaffold_998
>scaffold_999

Changing to:

[NbeHZ1][#][1][#][Chr01]
...
[NbeHZ1][#][1][#][scaffold_999]

> grep ">" JapanLab/Nbe_v1_scf.fa | head
>Nbe.v1.s00010
>Nbe.v1.s00020
>Nbe.v1.s00030
>Nbe.v1.s00040
>Nbe.v1.s00050
>Nbe.v1.s00060
>Nbe.v1.s00070
>Nbe.v1.s00080
>Nbe.v1.s00090
>Nbe.v1.s00100
> grep ">" JapanLab/Nbe_v1_scf.fa | tail
>Nbe.v1.s16590
>Nbe.v1.s16600
>Nbe.v1.s16610
>Nbe.v1.s16620
>Nbe.v1.s16630
>Nbe.v1.s16640
>Nbe.v1.s16650
>Nbe.v1.s16660
>Nbe.v1.s16670
>Nbe.v1.s16680

Changing to:

[Nbe_v1][#][2][#][Nbe.v1.s00010]
...
[Nbe_v1][#][2][#][Nbe.v1.s16680]
> grep ">" LAB360/NbLab360.genome.fasta | head
>NbLab360C00 64506568 NbLab350C00
>NbLab360C01 NbLab350C08
>NbLab360C02 NbLab350C02
>NbLab360C03 NbLab350C06
>NbLab360C04 NbLab350C03
>NbLab360C05 NbLab350C05
>NbLab360C06 NbLab350C04
>NbLab360C07 NbLab350C07
>NbLab360C08 NbLab350C01
>NbLab360C09 NbLab350C09
> grep ">" LAB360/NbLab360.genome.fasta | tail
>NbLab360C10 NbLab350C10
>NbLab360C11 NbLab350C18
>NbLab360C12 NbLab350C12
>NbLab360C13 NbLab350C16
>NbLab360C14 NbLab350C13
>NbLab360C15 NbLab350C15
>NbLab360C16 NbLab350C14
>NbLab360C17 NbLab350C17
>NbLab360C18 NbLab350C11
>NbLab360C19 NbLab350C19

Changing to:

[NbLab360][#][3][#][NbLab360C01]
...
[NbLab360][#][3][#][NbLab360C00]
> grep ">" UsaLab/Niben261_genome.fasta | head
>Niben261Chr01
>Niben261Chr02
>Niben261Chr03
>Niben261Chr04
>Niben261Chr05
>Niben261Chr06
>Niben261Chr07
>Niben261Chr08
>Niben261Chr09
>Niben261Chr10
> grep ">" UsaLab/Niben261_genome.fasta | tail
>Niben261Scf17630
>Niben261Scf17631
>Niben261Scf17632
>Niben261Scf17633
>Niben261Scf17634
>Niben261Scf17635
>Niben261Scf17636
>Niben261Scf17637
>Niben261Scf17638
>Niben261Scf17639

Changing to:

[Niben261][#][4][#][Niben261Chr01]
...
[Niben261][#][4][#][Niben261Scf17639]
> abyss-fac ChinaLab/NbeHZ1_genome_1.0.fa JapanLab/Nbe_v1_scf.fa  LAB360/NbLab360.genome.fasta UsaLab/Niben261_genome.fasta

n	n:500	L50	min	N75	N50	N25	E-size	max	sum	name
2831	2831	10	1000	136e6	143.2e6	148.2e6	142.1e6	183.5e6	2.929e9	ChinaLab/NbeHZ1_genome_1.0.fa
1668	1668	10	1000	132.1e6	141.7e6	145e6	132.2e6	184.4e6	2.926e9	JapanLab/Nbe_v1_scf.fa
20	20	10	58.94e6	132.9e6	140.6e6	145.2e6	144.4e6	180.8e6	2.806e9	LAB360/NbLab360.genome.fasta
17639	17627	10	522	130.9e6	137.1e6	142.2e6	138.5e6	176.9e6	2.751e9	UsaLab/Niben261_genome.fasta

Thank you in advance,

Michal

@subwaystation
Copy link
Collaborator

Your abstract renaming schemes look good to me :)

@mictadlo
Copy link
Author

Thank you, but I am little confused with For instance, HG002#1#ctg1234 names ctg1234 on the first haplotype or phase group of HG002, while HG002#2#ctg9876 is contig ctg9876 on the other. . Does it mean that the haplotype_id should be 1 for each genome?

  • [sample_name][delim][haplotype_id][delim][contig_or_scaffold_name]
  • [NbeHZ1][#][1][#][Chr01]
  • [Nbe_v1][#][1][#][Nbe.v1.s00010]
  • [NbLab360][#][1][#][NbLab360C01]
  • [Niben261][#][4][#][Niben261Chr01]

@subwaystation
Copy link
Collaborator

Assuming you only have one haplotype (with bacteria for example, we always assume this), then setting the id to 1 should be fine here. Seeing your examples, I would do:

  • [NbeHZ1][#][1][#][Chr01]
  • [Nbe_v1][#][1][#][s00010]
  • [NbLab360][#][1][#][C01]
  • [Niben261][#][1][#][Chr01]

Does it make sense?

@mictadlo
Copy link
Author

mictadlo commented Mar 1, 2024

Hi,
Thank you. I could run the pipeline, but unfortunately, I did not get any VG results.

nextflow run nf-core/pangenome \
      -r 1.0.0 \
      --wfmash_chunks 200 \
      --input ../genomes/chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz \
      --n_haplotypes 5 \
      --outdir results \
      -profile singularity \
      -resume

Reading the documentation I noticed that I need to use --vcf_spec. However, I do not understand how to change "REF:DELIM[:LEN][,REF:DELIM:[LEN]]*" for my below genomes

NbeHZ1#1#Chr02
Nbe_v1#1#s00170
GCA_034376525_1#1#CM067333.1
NbLab360#1#C11

Could you please help me?
Michal

@subwaystation
Copy link
Collaborator

Hi @mictadlo ,

which haplotype would you like have as the reference? I assume NbeHz1?
Then I would set `--vcf-spec "NbeHz1". Please tell me if that worked for you.

Some more details from the PGGB Docs:

   [vg]
    -V, --vcf-spec SPEC         specify a set of VCFs to produce with SPEC = REF[:LEN][,REF[:LEN]]*
                                the paths matching ^REF are used as a reference, while the sample haplotypes
                                are derived from path names, assuming they match the PanSN; e.g. '-V chm13',
                                a path named HG002#1#ctg would be assigned to sample HG002 phase 1.
                                If LEN is specified and greater than 0, the VCFs are decomposed, filtering 
                                sites whose max allele length is greater than LEN. [default: off]

@mictadlo
Copy link
Author

Thank you, it finished successfully, but I can't see any VG (Variant Graph Output):

> ls -1 results/
FINAL_GFA
FINAL_ODGI
gfaffix
multiqc
odgi_build
odgi_draw
odgi_layout
odgi_stats
odgi_unchop
odgi_viz
pipeline_info
seqwish
smoothxg
split_approx_mappings_in_chunks
wfmash_align
wfmash_map

What did I miss?

@subwaystation
Copy link
Collaborator

Hi @mictadlo,
it could be that the VG_DECONSTRUCT process is buggy in the v1.0.0 (and v1.1.0). I recently pushed a fix to dev #183, but as it turns out, I still need to update bcftools to v1.19 in that process.
I will do that tomorrow. And directly make a new release. I will let you know, once this is done.

@subwaystation
Copy link
Collaborator

To get a glimpse of the fix, you can try out the current dev branch. Else a new release is close: #186.

@mictadlo
Copy link
Author

Hi, Thank you. I'm just running it and will let you know how it went.

@mictadlo
Copy link
Author

Hi,
I ran the dev version but I can't see any VG results:

> ls -1 results/
FINAL_GFA
FINAL_ODGI
gfaffix
multiqc
odgi_build
odgi_draw
odgi_layout
odgi_stats
odgi_unchop
odgi_viz
pipeline_info
seqwish
smoothxg
split_approx_mappings_in_chunks
wfmash_align
wfmash_map

I have got some tower errors:

Mar-15 13:04:45.953 [Tower-thread] WARN  i.s.tower.plugin.TowerJsonGenerator - Tower request field `tasks.script` exceeds expected size
 | offending value: `
    seqwish \
        --threads 6 \
        --paf-alns=chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.chunk_173.paf,chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.chunk_174.paf,ch
nVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.chunk_176.paf,chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.chunk_175.paf,chnVSjapVSkorVSauVSusa.
PanSN-1-19.fasta.gz.chunk_178.paf

Susa.PanSN-1-19.fasta.gz.chunk_90.paf,chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.chunk_94.paf,chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.g
z.chunk_87.paf,chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.chunk_96.paf,chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.chunk_99.paf,chnVSjap
VSkorVSauVSusa.PanSN-1-19.fasta.gz.chunk_23.paf,chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.chunk_80.paf \
        --seqs=chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz \
        --gfa=chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.seqwish.gfa \
        -k 19 -f 0.0 -B 10000000 -P

    cat <<-END_VERSIONS > versions.yml
    "NFCORE_PANGENOME:PANGENOME:PGGB:SEQWISH":
        seqwish: $(echo $(seqwish --version 2>&1) | cut -f 1 -d '-' | cut -f 2 -d 'v')
    END_VERSIONS
    `, size: 11710 (max: 10240)
Mar-15 13:04:46.083 [Actor Thread 30] INFO  nextflow.container.SingularityCache - Pulling Singularity image https://depot.galaxyproject.org/singularity/smoothxg:0.7.2--h40c17d1_0 [cache /home/lorencm/.nextflow/NXF_SINGULARITY_CACHEDIR/depot.galaxyproject.org-singularity-smoothxg-0.7.2--h40c17d1_0.img]

I am attaching the
nextflow.log.txt

What did I miss?

@subwaystation
Copy link
Collaborator

That's strange. Can you please share the content of your FASTA index, so I can get a better idea if you set --vcf_spec in the right way.

As far a I can see tower only gives out a warning ,right?

@subwaystation subwaystation added question Further information is requested and removed enhancement Improvement for existing functionality labels Mar 18, 2024
@mictadlo
Copy link
Author

Where can I find the Fasta index?

Yes, tower seems to have given only warnings. However, when I logged into tower it showed successful but when I clicked on the run it shows that some tasked failed.

@subwaystation
Copy link
Collaborator

Would it be possible for you to share the tower run somehow?

As documented in https://nf-co.re/pangenome/1.1.1/docs/output#input-check, it should be in the folder samtools_faidx.

@mictadlo
Copy link
Author

It seems I don't have the folder

results> find . -name "samtools_faidx"

I will ask how to share the tower run with our admins.

@subwaystation
Copy link
Collaborator

So you created FASTA.gz.fai before running the pipeline?

@mictadlo
Copy link
Author

mictadlo commented Mar 19, 2024

Completly, forgottten:

> ls ../genomes/chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz*
../genomes/chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz      
../genomes/chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.gzi
../genomes/chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.fai

Please find chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.fai.txt

nextflow run nf-core/pangenome \
      -r dev \
      --wfmash_chunks 200 \
      --input ../genomes/chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz \
      --n_haplotypes 5 \
      --outdir results \
      --vcf-spec NbLab360 \
      -profile singularity \
      -resume

@subwaystation
Copy link
Collaborator

I think you have to do --vcf_spec 'NbLab360'. There is an underscore, not a hyphen. Didn't you get a warning?

@mictadlo
Copy link
Author

I haven't seen any warnings about underscores.

@subwaystation
Copy link
Collaborator

subwaystation commented Mar 19, 2024

I mean something like

WARN: The following invalid input values have been detected:

* --vcf-spec: gi|345525392:5000-18402

@mictadlo
Copy link
Author

Yes, now I see it:

Mar-19 23:46:58.789 [main] WARN  nextflow.validation.SchemaValidator - The following invalid input values have been detected:

* --vcf-spec: NbLab360
* --vcfSpec: NbLab360

@subwaystation
Copy link
Collaborator

Maybe you can extract the whole log file from tower, so no need to share more.

@mictadlo
Copy link
Author

I will try. I ran it again with --vcf-spec 'NbLab360' but I still got the following warning:

Mar-20 00:00:31.128 [main] WARN  nextflow.validation.SchemaValidator - The following invalid input values have been detected:

* --vcf-spec: NbLab360
* --vcfSpec: NbLab360

@subwaystation
Copy link
Collaborator

I think you have to do --vcf_spec 'NbLab360'. There is an underscore, not a hyphen. Didn't you get a warning?

Please use a _ and not a -.

--vcf_spec 'NbLab360'

@mictadlo
Copy link
Author

Hi, Thank you. Now, I got a new error:

 The exit status of the task that caused the workflow execution to fail was: 1

Error executing process > 'NFCORE_PANGENOME:PANGENOME:VG_DECONSTRUCT (chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.gfaffix)'

Caused by:
  Process `NFCORE_PANGENOME:PANGENOME:VG_DECONSTRUCT (chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.gfaffix)` terminated with an error exit status (1)

Command executed:

  ref=$(echo "NbLab360" | cut -f 1 -d:)
  delim=$(echo "NbLab360" | cut -f 2 -d:)
  pop_length=$(echo "NbLab360" | cut -f 3 -d:)
  
  if [[ -z $pop_length ]]; then
      pop_length=0
  fi
  
  vcf="chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.gfaffix.unchop.Ygs.view.gfa".$(echo $ref | tr '/|' '_').vcf
  vg deconstruct -P $ref -H $delim -e -a -t "6" "chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.gfaffix.unchop.Ygs.view.gfa" > $vcf
  bcftools stats $vcf > $vcf.stats
  if [[ $pop_length -gt 0 ]]; then
      vcf_decomposed=chnVSjapVSkorVSauVSusa.PanSN-1-19.fasta.gz.gfaffix.unchop.Ygs.view.gfa.final.$(echo $ref | tr '/|' '_').decomposed.vcf
      vcf_decomposed_tmp=$vcf_decomposed.tmp.vcf
      bgzip -c -@ 6 $vcf > $vcf.gz
      vcfbub -l 0 -a $pop_length --input $vcf.gz | vcfwave -I 1000 -t 6 > $vcf_decomposed_tmp
      #TODO: to remove when vcfwave will be bug-free
      # The TYPE info sometimes is wrong/missing
      # There are variants without the ALT allele
      bcftools annotate -x INFO/TYPE $vcf_decomposed_tmp  | awk '$5 != "."' > $vcf_decomposed
      rm $vcf_decomposed_tmp $vcf.gz
      bcftools stats $vcf_decomposed > $vcf_decomposed.stats
  fi
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_PANGENOME:PANGENOME:VG_DECONSTRUCT":
      pggb: $(pggb --version 2>&1 | grep -o 'pggb .*' | cut -f2 -d ' ')
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  Warning [vg deconstruct]: -H is deprecated, and will be ignored
  .command.sh: line 13: NbLab360: unbound variable

Work dir:
  /mnt/hpccs01/scratch/waterhouse_team/pangenome/nf-dev02/work/09/6d70b3cf40e26972132b3d245452e0

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

@subwaystation
Copy link
Collaborator

Hi @mictadlo I think this is because you use an older version of the pipeline which was buggy with respect to the variant calling. I would recommend to use the latest one 1.1.2. Sorry for this.

@mictadlo
Copy link
Author

Hi @subwaystation, I got this error
Cannot find revision `1.1.2` -- Make sure that it exists in the remote repository `https://github.com/nf-core/pangenome

@subwaystation
Copy link
Collaborator

subwaystation commented Mar 28, 2024

I don't know why, but this also always happens to me when I run a new revision of a pipeline. Please try again. Nextflow is confusing things.

@mictadlo
Copy link
Author

mictadlo commented Apr 10, 2024

Hi, Thank you for all your help. It worked successfully.

@subwaystation
Copy link
Collaborator

Fantastic news, thanks for letting me know @mictadlo .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants