
RESTRUCTUREBUSCODIR needs to be more robust #126

Open
DLBPointon opened this issue Nov 22, 2024 · 8 comments

@DLBPointon

Description of the bug

Hi all,

I am running BTK as a nested pipeline in ASCC using the odCymConc1_PRIMARY pre-decontaminated genome. However, it crashed with the following:

  Command error:
    tar: odCymConc1_PRIMARY_filtered-bacteria_odb10-busco/*/run_*/busco_sequences: Cannot open: No such file or directory
    tar: Error is not recoverable: exiting now

The busco_sequences directory is missing because no genes were found when running against bacteria_odb10.

After talking to Matthieu, it sounds like RESTRUCTUREBUSCODIR could be made more resilient to deal with this.
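
As an illustration only (my own sketch of a possible fix, not the module's actual code; it assumes GNU tar and a single run_* directory, as in the generated script), the tar steps could fall back to writing empty archives when busco_sequences is absent:

# Hypothetical sketch: only archive busco_sequences subdirectories that exist,
# otherwise emit an empty archive so downstream steps still find the expected file
seqdir=$(echo odCymConc1_PRIMARY_filtered-bacteria_odb10-busco/*/run_*/busco_sequences)
for kind in single_copy multi_copy fragmented; do
    if [ -d "$seqdir/${kind}_busco_sequences" ]; then
        tar czf "bacteria_odb10/${kind}_busco_sequences.tar.gz" -C "$seqdir" "${kind}_busco_sequences"
    else
        tar czf "bacteria_odb10/${kind}_busco_sequences.tar.gz" --files-from /dev/null
    fi
done

The hmmer_output archive could presumably be guarded in the same way.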

Thanks.

Command used and terminal output

Command

bsub -Is -tty -e test.e -o test.log -n 2 -q oversubscribed -M1400 -R'select[mem>1400] rusage[mem=1400] span[hosts=1]' 'nextflow run sanger-tol/blobtoolkit \
      -r 0.6.0 \
      -profile  singularity,sanger \
      --input "$(realpath odCymConc1_PRIMARY_samplesheet.csv)" \
      --outdir odCymConc1_PRIMARY_btk_out \
      --fasta "$(realpath odCymConc1_PRIMARY_filtered.fasta)" \
      --busco_lineages metazoa_odb10 \
      --taxon 352914 \
      --taxdump "$(realpath new_taxdump)" \
      --blastp "$(realpath blastp.dmnd)" \
      --blastn "$(realpath current)" \
      --blastx "$(realpath reference_proteomes.dmnd)" \
      --blastx_outext "txt" \
      '

Output:

Command output:
  [bd/f360ee] SAN…CymConc1_PRIMARY_filtered) | 1 of 1 ✔
  [10/ed4583] SAN…CymConc1_PRIMARY_filtered) | 1 of 1 ✔
  Plus 12 more processes waiting for tasks…
  -[sanger-tol/blobtoolkit] Pipeline completed with errors-
  ERROR ~ Error executing process > 'SANGERTOL_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:RESTRUCTUREBUSCODIR (odCymConc1_PRIMARY_filtered_bacteria_odb10)'
 
  Caused by:
    Process `SANGERTOL_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:RESTRUCTUREBUSCODIR (odCymConc1_PRIMARY_filtered_bacteria_odb10)` terminated with an error exit status (2)
 
 
  Command executed:
 
    mkdir bacteria_odb10
 
    cp --dereference odCymConc1_PRIMARY_filtered-bacteria_odb10-busco.batch_summary.txt        bacteria_odb10/short_summary.tsv
    [ -n ""  ] && cp --dereference   bacteria_odb10/short_summary.txt
    [ -n "" ] && cp --dereference  bacteria_odb10/short_summary.json
 
    # Should we compress these ?
    [ -e odCymConc1_PRIMARY_filtered-bacteria_odb10-busco/*/run_*/full_table.tsv         ] && cp odCymConc1_PRIMARY_filtered-bacteria_odb10-busco/*/run_*/full_table.tsv         bacteria_odb10/
    [ -e odCymConc1_PRIMARY_filtered-bacteria_odb10-busco/*/run_*/missing_busco_list.tsv ] && cp odCymConc1_PRIMARY_filtered-bacteria_odb10-busco/*/run_*/missing_busco_list.tsv bacteria_odb10/
 
    tar czf bacteria_odb10/single_copy_busco_sequences.tar.gz -C odCymConc1_PRIMARY_filtered-bacteria_odb10-busco/*/run_*/busco_sequences single_copy_busco_sequences
    tar czf bacteria_odb10/multi_copy_busco_sequences.tar.gz  -C odCymConc1_PRIMARY_filtered-bacteria_odb10-busco/*/run_*/busco_sequences multi_copy_busco_sequences
    tar czf bacteria_odb10/fragmented_busco_sequences.tar.gz  -C odCymConc1_PRIMARY_filtered-bacteria_odb10-busco/*/run_*/busco_sequences fragmented_busco_sequences
    tar czf bacteria_odb10/hmmer_output.tar.gz --exclude=.checkpoint -C odCymConc1_PRIMARY_filtered-bacteria_odb10-busco/*/run_* hmmer_output
 
    cat <<-END_VERSIONS > versions.yml
    "SANGERTOL_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:RESTRUCTUREBUSCODIR"
        tar: $(tar --version| awk 'NR==1 {print $3}' )
    END_VERSIONS
 
  Command exit status:
    2
 
  Command output:
    (empty)
 
  Command error:
    tar: odCymConc1_PRIMARY_filtered-bacteria_odb10-busco/*/run_*/busco_sequences: Cannot open: No such file or directory
    tar: Error is not recoverable: exiting now
 
  Work dir:
    work/73/fc5fa080866bf10e4432ec700d9e5b
 
  Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
 
   -- Check '.nextflow.log' file for details


### Relevant files

Files can be found:

/nfs/treeoflife-01/teams/tola/users/dp24/ascc/work/ab/90a1d70939cb9294b193ee430148be$ ls
blastp.dmnd current odCymConc1_PRIMARY_btk_out odCymConc1_PRIMARY_samplesheet.csv reference_proteomes.dmnd test.log
BTK.yaml new_taxdump odCymConc1_PRIMARY_filtered.fasta odCymConc1_PRIMARY_sorted.bam test.e work


### System information

_No response_
DLBPointon added the bug (Something isn't working) label on Nov 22, 2024
@DLBPointon (Author)

Running the pipeline this morning, there is a related error:

  Command error:
    tar: hmmer_output/initial_run_results/6917at7147.out: file changed as we read it
    tar: hmmer_output/rerun_results/27070at7147.out: file changed as we read it

This used the same input command as before.

@muffato (Member) commented Nov 26, 2024

It's very strange because I haven't been able to reproduce the error!

I can see that BUSCO finds no genes for bacteria/archaea:

==> work/93/6916c600699c15d4dbf6f4d757cae6/odCymConc1_PRIMARY_filtered-metazoa_odb10-busco.batch_summary.txt <==
Input_file      Dataset Complete        Single  Duplicated      Fragmented      Missing n_markers       Scaffold N50    Contigs N50     Percent gaps    Number of scaffolds
odCymConc1_PRIMARY_filtered.fasta       metazoa_odb10   81.0    75.8    5.2     8.0     11.0    954     8696481 1428606 0.021%  287

==> work/a9/2d788f47a173ec6e738f8436ece0fc/odCymConc1_PRIMARY_filtered-bacteria_odb10-busco.batch_summary.txt <==
Input_file      Dataset Complete        Single  Duplicated      Fragmented      Missing n_markers       Scaffold N50    Contigs N50     Percent gaps    Number of scaffolds
odCymConc1_PRIMARY_filtered.fasta       No genes found  

==> work/d8/054576f713db51c01d3603b3ad6af9/odCymConc1_PRIMARY_filtered-archaea_odb10-busco.batch_summary.txt <==
Input_file      Dataset Complete        Single  Duplicated      Fragmented      Missing n_markers       Scaffold N50    Contigs N50     Percent gaps    Number of scaffolds
odCymConc1_PRIMARY_filtered.fasta       No genes found  

==> work/e0/3da0c525bb7c7dcb9abcddc8f29aa6/odCymConc1_PRIMARY_filtered-eukaryota_odb10-busco.batch_summary.txt <==
Input_file      Dataset Complete        Single  Duplicated      Fragmented      Missing n_markers       Scaffold N50    Contigs N50     Percent gaps    Number of scaffolds
odCymConc1_PRIMARY_filtered.fasta       eukaryota_odb10 91.0    84.7    6.3     5.1     3.9     255     8696481 1428606 0.021%  287

RESTRUCTUREBUSCODIR runs without issues because the busco_sequences output directories all exist (and are empty):

$ tree -a work/a9/2d788f47a173ec6e738f8436ece0fc/odCymConc1_PRIMARY_filtered-bacteria_odb10-busco/odCymConc1_PRIMARY_filtered.fasta/
work/a9/2d788f47a173ec6e738f8436ece0fc/odCymConc1_PRIMARY_filtered-bacteria_odb10-busco/odCymConc1_PRIMARY_filtered.fasta/
├── logs
│   ├── bbtools_err.log
│   └── bbtools_out.log
├── prodigal_output
│   ├── .checkpoint
│   └── predicted_genes
│       └── tmp
│           ├── prodigal_mode_meta_code_11_err.log
│           ├── prodigal_mode_meta_code_11.faa
│           ├── prodigal_mode_meta_code_11.fna
│           ├── prodigal_mode_meta_code_11_out.log
│           ├── prodigal_mode_meta_code_4_err.log
│           ├── prodigal_mode_meta_code_4.faa
│           ├── prodigal_mode_meta_code_4.fna
│           ├── prodigal_mode_meta_code_4_out.log
│           ├── prodigal_mode_single_code_11_err.log
│           ├── prodigal_mode_single_code_11.faa
│           ├── prodigal_mode_single_code_11.fna
│           ├── prodigal_mode_single_code_11_out.log
│           ├── prodigal_mode_single_code_4_err.log
│           ├── prodigal_mode_single_code_4.faa
│           ├── prodigal_mode_single_code_4.fna
│           └── prodigal_mode_single_code_4_out.log
└── run_bacteria_odb10
    ├── .bbtools_output
    │   └── .checkpoint
    ├── busco_sequences
    │   ├── fragmented_busco_sequences
    │   ├── multi_copy_busco_sequences
    │   └── single_copy_busco_sequences
    └── hmmer_output

11 directories, 20 files

Regarding `tar: hmmer_output/initial_run_results/6917at7147.out: file changed as we read it`, I don't know of anything in the pipeline or the underlying btk commands that uses the hmmer outputs 🤔
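
One possible mitigation for that warning (a sketch only, assuming GNU tar, where "file changed as we read it" gives exit status 1 while fatal errors give 2) would be to tolerate exit status 1 on the hmmer_output archive:

tar czf bacteria_odb10/hmmer_output.tar.gz --exclude=.checkpoint \
    -C odCymConc1_PRIMARY_filtered-bacteria_odb10-busco/*/run_* hmmer_output \
    || [ $? -eq 1 ]   # accept "file changed as we read it" (exit 1), still fail on real errors (exit 2)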

@muffato (Member) commented Nov 26, 2024

Hopefully unrelated: you should set the --busco parameter to /lustre/scratch123/tol/resources/busco/latest to avoid the pipeline re-downloading the BUSCO lineages over and over.

@DLBPointon (Author)

Hi Matthieu,
Such an odd error then!
Apologies for the long wait, partially due to running a number of tests.

So it looks like the --busco flag did indeed help, however, only by kicking the can down the road.

The pipeline command for ascc (/nfs/treeoflife-01/teams/tola/users/dp24/ascc) is:

bsub -Is -tty -e test.e -o test.log -n 2 -q normal -M1200 -R'select[mem>1200] rusage[mem=1200] span[hosts=1]' 'nextflow run ./main.nf --input assets/samplesheet.csv -params-file assets/test.yaml --outdir fcs_test -profile sanger,singularity -resume'

The command for the nested btk run is:

nextflow run sanger-tol/blobtoolkit \
    -r 0.6.0 \
    -profile  singularity,sanger \
    --input "$(realpath asccTinyTest_V2_PRIMARY_samplesheet.csv)" \
    --outdir asccTinyTest_V2_HAPLO_btk_out \
    --fasta "$(realpath asccTinyTest_V2_HAPLO_filtered.fasta)" \
    --busco /lustre/scratch123/tol/resources/busco/latest/lineages/ \
    --busco_lineages diptera_odb10,insecta_odb10 \
    --taxon 352914 \
    --taxdump "$(realpath new_taxdump)" \
    --blastp "$(realpath blastp.dmnd)" \
    --blastn "$(realpath blastdb)" \
    --blastx "$(realpath ascc_tinytest_diamond_db.dmnd)" \
    --use_work_dir_as_temp true

My new output error is:

busco \
        --cpu 8 \
        --in "$INPUT_SEQS" \
        --out asccTinyTest_V2_HAPLO_filtered-diptera_odb10-busco \
        --mode genome \
        --lineage_dataset diptera_odb10 \
        --download_path lineages --offline \
         \
        --force
 
    # clean up
    rm -rf "$INPUT_SEQS"
 
    # Move files to avoid staging/publishing issues
    mv asccTinyTest_V2_HAPLO_filtered-diptera_odb10-busco/batch_summary.txt asccTinyTest_V2_HAPLO_filtered-diptera_odb10-busco.batch_summary.txt
    mv asccTinyTest_V2_HAPLO_filtered-diptera_odb10-busco/*/short_summary.*.{json,txt} . || echo "Short summaries were not available: No genes were found."
 
    cat <<-END_VERSIONS > versions.yml
    "SANGERTOL_BLOBTOOLKIT:BLOBTOOLKIT:BUSCO_DIAMOND:BUSCO":
        busco: $( busco --version 2>&1 | sed 's/^BUSCO //' )
    END_VERSIONS
 
  Command exit status:
    0
 
  Command output:
    New AUGUSTUS_CONFIG_PATH=/tmp/nxf.hQcRAcZtzS/tmp.CbX2RD4BYp
    2024-12-06 09:10:03 INFO:   ***** Start a BUSCO v5.5.0 analysis, current time: 12/06/2024 09:10:03 *****
    2024-12-06 09:10:03 INFO:   Configuring BUSCO with local environment
    2024-12-06 09:10:03 INFO:   Mode is genome
    2024-12-06 09:10:03 INFO:   Running in batch mode. 1 input files found in /tmp/nxf.hQcRAcZtzS/input_seqs
    2024-12-06 09:10:03 INFO:   Input file is /tmp/nxf.hQcRAcZtzS/input_seqs/asccTinyTest_V2_HAPLO_filtered.fasta
    Short summaries were not available: No genes were found.
 
  Command error:
    2024-12-06 09:10:03 ERROR:  Unable to run BUSCO in offline mode. Dataset /tmp/nxf.hQcRAcZtzS/lineages/lineages/diptera_odb10 does not exist.
    mv: cannot stat 'asccTinyTest_V2_HAPLO_filtered-diptera_odb10-busco/*/short_summary.*.json': No such file or directory
    mv: cannot stat 'asccTinyTest_V2_HAPLO_filtered-diptera_odb10-busco/*/short_summary.*.txt': No such file or directory
 
  Work dir:
    work/1d/9817d38cadd0f36074dfd266152562
 
  Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`
 
   -- Check '.nextflow.log' file for details

Command error:
  Nextflow 24.10.2 is available - Please consider updating your version to it

Work dir:
  /nfs/treeoflife-01/teams/tola/users/dp24/ascc/work/49/df2c9f8115adcf46a3f81fa6c5dc7c

@muffato (Member) commented Dec 7, 2024

That error is because you've appended `lineages/` at the end of the `--busco` parameter.
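
BUSCO's offline mode looks for each dataset under <download_path>/lineages/<dataset>, which is why the error above shows the doubled lineages/lineages path. A quick sanity check (just a sketch, using the Sanger path above):

# sketch: confirm the dataset resolves where busco --offline will look for it
BUSCO_ROOT=/lustre/scratch123/tol/resources/busco/latest
ls -d "$BUSCO_ROOT/lineages/diptera_odb10" \
    && echo "OK: pass $BUSCO_ROOT (without the trailing lineages/) to --busco"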

The following works very neatly:

cp -a /nfs/treeoflife-01/teams/tola/users/dp24/ascc/work/49/df2c9f8115adcf46a3f81fa6c5dc7c/ .
nextflow pull sanger-tol/blobtoolkit -r 0.6.0
cd df2c9f8115adcf46a3f81fa6c5dc7c
rm -rf .nextflow .nextflow.log work asccTinyTest_V2_HAPLO_btk_out
cp .command.sh command.sh # and remove `lineages/` at the end of the `--busco` parameter
bsub -M1200 -R"select[mem>1200] rusage[mem=1200] span[hosts=1]" -n 1 -q yesterday -Is bash -euo pipefail $PWD/command.sh

(you can check the outputs in /lustre/scratch123/tol/teams/tolit/users/mm49/nextflow/df2c9f8115adcf46a3f81fa6c5dc7c.)

By the way, I've found a way of mitigating the risks of nextflow run sanger-tol/blobtoolkit. Remember that it assumes the pipeline is in the user's central cache and doesn't support having multiple versions in use at the same time. There's a parallel command, nextflow clone, that checks the pipeline out into a local copy. You could do this in your module:

nextflow clone sanger-tol/blobtoolkit -r 0.6.0
nextflow run ./blobtoolkit \
    -profile  singularity,sanger \
    (...)

@DLBPointon (Author)

I've run the odCymConc twice now and we are getting close!

Error is:

ERROR ~ Error executing process > 'SANGERTOL_BLOBTOOLKIT:BLOBTOOLKIT:COLLATE_STATS:BLOBTOOLKIT_WINDOWSTATS (odCymConc1_HAPLO_filtered)'
 
  Caused by:
    Missing output file(s) `*_window_stats*.tsv` expected by process `SANGERTOL_BLOBTOOLKIT:BLOBTOOLKIT:COLLATE_STATS:BLOBTOOLKIT_WINDOWSTATS (odCymConc1_HAPLO_filtered)`
 
 
  Command executed:
 
    btk pipeline window-stats \
            --in odCymConc1_HAPLO_filtered.tsv \
            --window 0.1 --window 0.01 --window 1 --window 100000 --window 1000000 \
            --out odCymConc1_HAPLO_filtered_window_stats.tsv
 
    cat <<-END_VERSIONS > versions.yml
    "SANGERTOL_BLOBTOOLKIT:BLOBTOOLKIT:COLLATE_STATS:BLOBTOOLKIT_WINDOWSTATS":
        blobtoolkit: $(btk --version | cut -d' ' -f2 | sed 's/v//')
    END_VERSIONS

WorkDir is: /nfs/treeoflife-01/teams/tola/users/dp24/ascc/work/8e/0e4caed82bcb329c93fed499a247fe

and process dir is: /nfs/treeoflife-01/teams/tola/users/dp24/ascc/work/8e/0e4caed82bcb329c93fed499a247fe/work/32/b8a090c5198313d5f4ce6c377e918e

@muffato (Member) commented Dec 13, 2024

Winding things back:

  • The input TSV (odCymConc1_HAPLO_filtered.tsv) only has a header, no actual data.
  • It comes from WINDOWSTATS_INPUT (in /nfs/treeoflife-01/teams/tola/users/dp24/ascc/work/8e/0e4caed82bcb329c93fed499a247fe/work/4c/fcc6967996b33d3dd19eb2f3ad1ddd), which seemingly runs fine. What's striking is that the input TSV files all have sequence IDs such as atg000021l or hap_ptg003667l_1, whereas odCymConc1_PRIMARY_T1.regions.bed.gz uses SCAFFOLD_*. Most likely the output is empty because the sequence IDs don't match.
  • I can trace that problem back to the input files of the pipeline itself: odCymConc1_HAPLO_filtered.fasta has atg00*l and hap_ptg00*l_1, whereas /nfs/treeoflife-01/teams/tola/users/dp24/ascc/work/0d/88eecfe3e42b9886390382147c31a9/odCymConc1_PRIMARY_sorted.bam uses SCAFFOLD_*. In other words, the pipeline is given a BAM file aligned to a different set of sequences from the input FASTA (a quick check for this is sketched below).
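
A quick way to confirm this kind of mismatch (a sketch only; it assumes samtools is available and uses the file names above):

# compare the sequence names in the input FASTA against the @SQ names in the BAM header
samtools faidx odCymConc1_HAPLO_filtered.fasta
cut -f1 odCymConc1_HAPLO_filtered.fasta.fai | sort > fasta_names.txt
samtools view -H odCymConc1_PRIMARY_sorted.bam \
    | awk '$1 == "@SQ" { sub(/^SN:/, "", $2); print $2 }' | sort > bam_names.txt
comm -3 fasta_names.txt bam_names.txt    # any output means the FASTA and BAM disagree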

@DLBPointon (Author)

Oh no, so I've been chasing you over what turn out to be race conditions!

I thought I had caught all of these. Turning ascc into a pipeline that can run multiple assemblies has uncovered a number of race conditions where files get mixed up even though they should be under different instances of the pipeline, e.g. the primary run and the haplo run.

Thank you very much for the help, I'll run some testing.
