Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sarek stops after alignment and does not perform variant calling #1313

Closed
sci-kai opened this issue Nov 6, 2023 · 16 comments
Closed

Sarek stops after alignment and does not perform variant calling #1313

sci-kai opened this issue Nov 6, 2023 · 16 comments
Labels
bug Something isn't working

Comments

@sci-kai
Copy link

sci-kai commented Nov 6, 2023

Description of the bug

Hi,
I recently run sarek with the UMI pipeline for WES data, it successfully finishs but only performs alignment and does not perform variant calling.
For the UMI pipeline I skipped markduplicates and baserecalibration and extract read names from FASTQ file according to issue #746 .
Also, this workflow is running in offline mode.
I do not know what configuration to change so the workflow also perform variant calling.

Command used and terminal output

nextflow run nf-core-sarek_3.3.2/3_3_2 \
-profile singularity \
-c config/run.config \
-params-file config/nf-params.json \
-plugins [email protected] \
-resume

The workflow performs software dump and multiQC after finishing the following process for each sample:
NFCORE_SAREK:SAREK:FASTQ_ALIGN_BWAMEM_MEM2_DRAGMAP_SENTIEON:BWAMEM2_MEM



### Relevant files

`run.config` file:

process {
    // extract UMIS from read names, also see https://github.com/nf-core/sarek/issues/746
    withName: 'FASTQTOBAM' {
        ext.args         = '--extract-umis-from-read-names'
    }

    // Increase memory for fgbio processes
    withName: 'GROUPREADSBYUMI' {
        publishDir       = [
            [   path: { "${params.outdir}/reports/umi/" },
                mode: params.publish_dir_mode,
                pattern: "*.{txt}"
            ]
        ]
        memory           = 90.GB
        ext.args         = '-Xmx80g'
    }

    withName: 'CALLUMICONSENSUS' {
        memory           = 90.GB
        ext.args         = '-Xmx80g -M 1 -S Coordinate'
        ext.prefix       = {"${meta.id}_umi-consensus"}
        publishDir       = [
            path: { "${params.outdir}/preprocessing/umi/${meta.sample}" },
            mode: params.publish_dir_mode,
            pattern: "*.{bam}"
        ]
    }
}


`nf-params.json file:

{
"input": "config/samplesheet.csv",
"step": "mapping",
"outdir": "results/",
"fasta": "db/references/Homo_sapiens/GATK/GRCh38/Sequence/WholeGenomeFasta/Homo_sapiens_assembly38.fasta",
"split_fastq": 50000000,
"wes": true,
"intervals": "data/Target_Region_Hg38_V8_chr.bed",
"nucleotides_per_second": 200000.0,
"no_intervals": false,
"tools": "freebayes,mutect2,strelka",
"skip_tools": "markduplicates,baserecalibrator",
"trim_fastq": false,
"clip_r1": 0,
"clip_r2": 0,
"three_prime_clip_r1": 0,
"three_prime_clip_r2": 0,
"trim_nextseq": 0,
"save_trimmed": false,
"umi_read_structure": "NA",
"group_by_umi_strategy": "Adjacency",
"save_split_fastqs": false,
"aligner": "bwa-mem2",
"save_mapped": true,
"save_output_as_bam": true,
"concatenate_vcfs": false,
"only_paired_variant_calling": true,
"joint_germline": false,
"joint_mutect2": false,
"ascat_min_base_qual": 20.0,
"ascat_min_counts": 10.0,
"ascat_min_map_qual": 35.0,
"cf_coeff": 0.05,
"cf_contamination_adjustment": false,
"cf_contamination": 0.0,
"cf_minqual": 0.0,
"cf_mincov": 0.0,
"cf_ploidy": "2",
"ignore_soft_clipped_bases": false,
"sentieon_haplotyper_emit_mode": "variant",
"vep_cache": "db/vep/",
"vep_include_fasta": false,
"vep_dbnsfp": false,
"dbnsfp_fields": "rs_dbSNP,HGVSc_VEP,HGVSp_VEP,1000Gp3_EAS_AF,1000Gp3_AMR_AF,LRT_score,GERP++_RS,gnomAD_exomes_AF",
"vep_loftee": false,
"vep_spliceai": false,
"vep_spliceregion": false,
"vep_custom_args": "--everything --filter_common --per_gene --total_length --offline --format vcf",
"use_annotation_cache_keys": false,
"vep_out_format": "vcf",
"genome": "GATK.GRCh38",
"save_reference": true,
"build_only_index": false,
"download_cache": false,
"igenomes_base": "db/references",
"igenomes_ignore": false,
"custom_config_version": "master",
"custom_config_base": "nf-core-sarek_3.3.2/configs",
"test_data_base": null,
"seq_platform": "ILLUMINA",
"max_cpus": 16,
"max_memory": "128.GB",
"max_time": "240.h",
"help": false,
"version": false,
"publish_dir_mode": "copy",
"plaintext_email": false,
"max_multiqc_email_size": "25.MB",
"monochrome_logs": false,
"validate_params": true,
"validationShowHiddenParams": false,
"validationFailUnrecognisedParams": false,
"validationLenientMode": false,
"snpeff_db": null,
}


### System information

Nextflow version: 23.04.3
Hardware: HPC
Executor: local
Container engine: singularity
version of sarek: 3.3.2
@sci-kai sci-kai added the bug Something isn't working label Nov 6, 2023
@asp8200
Copy link
Contributor

asp8200 commented Nov 6, 2023

I see you have "umi_read_structure": "NA" in your params-file. I wonder if that is valid.

I was able to run

nextflow run nf-core/sarek -r 3.3.2 -profile singularity,test_cache,umi -plugins [email protected] --outdir results --skip_tools markduplicates,baserecalibrator --tools freebayes,strelka

but when I added --umi_read_structure NA to that cmd, FastqToBam crashed with

  Argument 'read-structures' could not be constructed from string
  	...Could not build a 'ReadStructure' from a string: NA: null

@FriederikeHanssen
Copy link
Contributor

Can you please provide the .nextflow.log?

@sci-kai
Copy link
Author

sci-kai commented Nov 6, 2023

nextflow.log

Yes, I did not run the software version dump in this run since this produced another error probably unrelated to this.

@sci-kai
Copy link
Author

sci-kai commented Nov 6, 2023

@asp8200 I already discussed this setting in issue #802 and it worked in previous versions (but with other data and slightly different configuration). I added the flag --extract-umis-from-read-names to fgbio in my run.config file which extracts the UMIs from the read names. I need to give the 'NA' value to --umi_read_structure to trigger the run of the UMI workflow, otherwise it just skips this process.

@FriederikeHanssen
Copy link
Contributor

do you also have the pipeline_info/execution_trace.html?

@sci-kai
Copy link
Author

sci-kai commented Nov 6, 2023

execution_trace_2023-11-06_09-55-29.txt
Here you go

@FriederikeHanssen
Copy link
Contributor

Hey! I am unable to reproduce. I don't have any UMI "real" data at the moment. Any chance you can extract a few reads and send them as a minimal reproducible example?

@sci-kai
Copy link
Author

sci-kai commented Nov 6, 2023

I cannot share from the current data, but I try to reproduce the error with another minimal datasets which I can share.
I tried myself finding the problem and run Sarek without the UMI workflow, which still produces the same behaviour.
Changing the following parameters:

"umi_read_structure": null,
"group_by_umi_strategy": null,

and removing the FASTQTOBAM modification in run.config (and running for only one sample).
It still stops at the exact same timepoint:
nextflow.log
execution_trace_2023-11-06_11-50-07.txt

@FriederikeHanssen
Copy link
Contributor

very odd. I have non-umi data. I'll check if I see the same when I run it with that. I have a hunch actually, can you check what happens if you run with --split_fastq 0 ?

@sci-kai
Copy link
Author

sci-kai commented Nov 6, 2023

I tested the --split_fastq 0 option, still has the same result.
For the non-UMI workflow I also set --skip_tools null.
I prepared some reads from the newest GIAB HG008 tumor-normal Illumina WGS data which also reproduced the error, but did not run the UMI workflow.
For clarity I will provide all config files again for this minimal example:

sampledata (HG008, Illumina WGS, first 1000 lines (250 reads) extracted from here): HG008-Illumina-250reads.tar.gz
.nextflow.log: nextflow.log
execution_trace: execution_trace_2023-11-06_18-03-49.txt

nextflow run nf-core-sarek_3.3.2/3_3_2 \
-profile singularity \
-params-file config/nf-params-GIAB.json \
-plugins [email protected] \
-resume

samplesheet.csv:

patient,status,lane,sample,sex,fastq_1,fastq_2
HG008,0,HG008-normal,1,XX,data/GIAB_HG008/HG008-N_Illumina_R1.fastq.gz,data/GIAB_HG008/HG008-N_Illumina_R2.fastq.gz
HG008,1,HG008-tumor,1,XX,data/GIAB_HG008/HG008-T_Illumina_R1.fastq.gz,data/GIAB_HG008/HG008-T_Illumina_R2.fastq.gz

nf-params-GIAB.json:

{
    "input": "config/samplesheet-GIAB.csv",
    "step": "mapping",
    "outdir": "results/",
    "fasta": "db/references/Homo_sapiens/GATK/GRCh38/Sequence/WholeGenomeFasta/Homo_sapiens_assembly38.fasta",
    "split_fastq": 50000000,
    "wes": true,
    "intervals": "data/Target_Region_Hg38_V8_chr.bed",
    "nucleotides_per_second": 200000.0,
    "no_intervals": false,
    "tools": "freebayes,mutect2,strelka",
    "skip_tools": null,
    "trim_fastq": false,
    "clip_r1": 0,
    "clip_r2": 0,
    "three_prime_clip_r1": 0,
    "three_prime_clip_r2": 0,
    "trim_nextseq": 0,
    "save_trimmed": false,
    "umi_read_structure": null,
    "group_by_umi_strategy": null,
    "save_split_fastqs": false,
    "aligner": "bwa-mem2",
    "save_mapped": true,
    "save_output_as_bam": true,
    "concatenate_vcfs": false,
    "only_paired_variant_calling": true,
    "joint_germline": false,
    "joint_mutect2": false,
    "ascat_min_base_qual": 20.0,
    "ascat_min_counts": 10.0,
    "ascat_min_map_qual": 35.0,
    "cf_coeff": 0.05,
    "cf_contamination_adjustment": false,
    "cf_contamination": 0.0,
    "cf_minqual": 0.0,
    "cf_mincov": 0.0,
    "cf_ploidy": "2",
    "ignore_soft_clipped_bases": false,
    "sentieon_haplotyper_emit_mode": "variant",
    "vep_cache": "db/vep/",
    "vep_include_fasta": false,
    "vep_dbnsfp": false,
    "dbnsfp_fields": "rs_dbSNP,HGVSc_VEP,HGVSp_VEP,1000Gp3_EAS_AF,1000Gp3_AMR_AF,LRT_score,GERP++_RS,gnomAD_exomes_AF",
    "vep_loftee": false,
    "vep_spliceai": false,
    "vep_spliceregion": false,
    "vep_custom_args": "--everything --filter_common --per_gene --total_length --offline --format vcf",
    "use_annotation_cache_keys": false,
    "vep_out_format": "vcf",
    "genome": "GATK.GRCh38",
    "save_reference": true,
    "build_only_index": false,
    "download_cache": false,
    "igenomes_base": "db/references",
    "igenomes_ignore": false,
    "custom_config_version": "master",
    "custom_config_base": "nf-core-sarek_3.3.2/configs",
    "test_data_base": null,
    "seq_platform": "ILLUMINA",
    "max_cpus": 16,
    "max_memory": "128.GB",
    "max_time": "240.h",
    "help": false,
    "version": false,
    "publish_dir_mode": "copy",
    "plaintext_email": false,
    "max_multiqc_email_size": "25.MB",
    "monochrome_logs": false,
    "validate_params": true,
    "validationShowHiddenParams": false,
    "validationFailUnrecognisedParams": false,
    "validationLenientMode": false,
    "snpeff_db": null,
}

@itszhengan
Copy link

Same thing happened here.

@FriederikeHanssen
Copy link
Contributor

Can you test with the test_full profile to see if that works you?

@sci-kai
Copy link
Author

sci-kai commented Nov 10, 2023

I now also tried running it locally and a colleague at another institute tried the minimal example, both giving the same error as described.
The test profile is working as expected also running the variant calling, while the test_full is getting stucked and does not finish at all. I am not sure if it just needs a lot of time and this is expected?
Here is the nextflow.log and execution trace from the manually interrupted test_full run:
nextflow.log
execution_trace_2023-11-10_13-12-01.txt

@sci-kai
Copy link
Author

sci-kai commented Nov 13, 2023

I identified the problem:
Within the samplesheet, I put the wrong header for lane and sample. Hence, I specified two different lanes ("HG008-normal" and "HG008-tumor") for one sample ("1") which then does not performs somatic variant calling.
Fixing this with switching the header labels solved this issue.

@maxulysse
Copy link
Member

Is the issue solved for you then?

@sci-kai sci-kai closed this as completed Nov 13, 2023
@FriederikeHanssen
Copy link
Contributor

Fantastic solve 🚀 . I am not sure we can lint for this, but we should keep it in mind

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants