mkfastq fails on AWS Batch #281

Closed
KallyopeComp opened this issue Oct 25, 2024 · 5 comments · Fixed by nf-core/modules#6932
Labels
bug Something isn't working

Comments

@KallyopeComp

Description of the bug

Thank you for preparing a helpful tool to standardize genomics workflows. There is a small issue with running the pipeline using AWS Batch with mkfastq.

Some of the output directories that the pipeline requires from CellRanger are empty, which causes the pipeline to raise an error. For the test data, *_outs/outs/fastq_path/Reports is empty. *_outs/outs/fastq_path/Stats may be empty in other cases (I'm not sure).

At a minimum, line 13 of modules/nf-core/cellranger/mkfastq/main.nf should be modified to read:
tuple val(meta), path("*_outs/outs/fastq_path/Reports") , optional:true, emit: reports
Possibly line 14 should be modified as well to:
tuple val(meta), path("*_outs/outs/fastq_path/Stats") , optional:true, emit: stats

With these changes, the pipeline completes successfully.

Command used and terminal output

$ nextflow run demultiplex -profile docker -config ../nextflow.config --skip_tools samshee,falco,fastp --input test_pipeline_samplesheet.csv --demultiplexer mkfastq --outdir {private s3 directory}

Relevant part of the terminal output:
ERROR ~ Error executing process > 'NFCORE_DEMULTIPLEX:DEMULTIPLEX:MKFASTQ_DEMULTIPLEX:CELLRANGER_MKFASTQ (test_sample.1)'

Caused by:
Missing output file(s) *_outs/outs/fastq_path/Reports expected by process NFCORE_DEMULTIPLEX:DEMULTIPLEX:MKFASTQ_DEMULTIPLEX:CELLRANGER_MKFASTQ (test_sample.1)

Relevant files

Archive includes the pipeline samplesheet (which specifies test data from 10X), a nextflow.config (which specifies AWS Batch executor), and the nextflow log
Archive.zip

System information

Nextflow Version: 24.04.4
Hardware: Cloud (AWS Batch with custom AMI built as described in the Nextflow documentation)
Executor: AWS Batch
Container engine: Docker
OS: Launched from machine Ubuntu 22.04, AMI uses Amazon Linux 2
Version of nf-core/demultiplex: Latest master branch, commit ebefeef

@KallyopeComp KallyopeComp added the bug Something isn't working label Oct 25, 2024
@alanmmobbs93
Contributor

Hello @KallyopeComp! I was not able to reproduce the error. Can you please try this:

nextflow run nf-core/demultiplex -latest -profile test_mkfastq,docker --skip_tools samshee,falco,fastp --outdir <your_s3_directory>

And let us know if the error persists.

@KallyopeComp
Author

Thanks for looking into it. Running the pipeline locally via the command you sent works as expected (with one small modification: I needed to manually specify -r 1.5.1). The issue only appears when running on AWS Batch:

nextflow run nf-core/demultiplex -r 1.5.1 -latest -profile test_mkfastq,docker -config /tmp/nextflow.config --skip_tools samshee,falco,fastp --outdir <s3_output_directory>

with the following config file to reproduce the error:

process.executor = 'awsbatch'
process.queue = '<aws_batch_queue>'
aws.region = 'us-east-1'
aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'
workDir = '<s3_work_bucket>'

The config file requires an AWS Batch queue configured in your AWS account, and an AMI configured with the CLI path as described here: https://www.nextflow.io/docs/latest/aws.html

Note that even when running locally, the output on s3 does not contain a cellranger-tiny-bcl-simple/L001/Reports directory as would be expected from line 13 of modules/nf-core/cellranger/mkfastq/main.nf, which is why I suggested making this output optional.

@nschcolnicov
Contributor

nschcolnicov commented Nov 1, 2024

Hi @KallyopeComp @alanmmobbs93, this is an interesting error you found! I was able to reproduce it by running the pipeline on AWS with a setup similar to yours. I also see this error:

Missing output file(s) `*_outs/outs/fastq_path/Reports` expected by process `NFCORE_DEMULTIPLEX:DEMULTIPLEX:MKFASTQ_DEMULTIPLEX:CELLRANGER_MKFASTQ (test_sample.1)`

However, if I run the exact same command locally, the error doesn’t occur. I verified that this folder was indeed not generated in the S3 workdir, but it was generated in the local workdir. My guess is that even though the folder gets created locally, it contains only subdirectories with no files inside, so maybe it’s not being created in the S3 bucket because it’s empty.

Since the input files are Illumina test files that only produce empty folders in the reports directory, it seems safe to mark this output as optional.
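
One way to check this guess locally is to count the regular files under the Reports directory of a task work directory. S3 stores objects by key and has no native notion of an empty directory, so a tree that holds only empty subdirectories leaves nothing to upload. The sketch below is only an illustration: it builds a throwaway tree that mimics that layout (the `sample_outs` prefix and the subdirectory names under Reports are made up, not taken from real CellRanger output) and counts what is inside.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for a task work directory.
# The names below Reports are illustrative only, not real CellRanger output.
workdir="$(mktemp -d)"
reports="$workdir/sample_outs/outs/fastq_path/Reports"
mkdir -p "$reports/html/lane1" "$reports/html/lane2"

# Count regular files anywhere below Reports.
file_count="$(find "$reports" -type f | wc -l | tr -d ' ')"
# Count directories below Reports, excluding Reports itself.
dir_count="$(find "$reports" -mindepth 1 -type d | wc -l | tr -d ' ')"

echo "directories=$dir_count files=$file_count"

if [ "$file_count" -eq 0 ]; then
    echo "Reports holds no files, so there are no objects to upload to S3"
fi

# Clean up the throwaway tree.
rm -rf "$workdir"
```

Running the same two `find` commands against the real `*_outs/outs/fastq_path/Reports` path in a local work directory would show whether it really contains only empty subdirectories, which would be consistent with the directory being absent from the S3 work directory.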

FYI @apeltzer @grst @atrigila

@KallyopeComp
Author

It is a bit unusual that the directory is there in the workdir when running locally...

Note that this isn't only an issue with the test files. I also get the error with real data.

@nschcolnicov
Contributor

@KallyopeComp @alanmmobbs93 fixed with #283


3 participants