Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error with 'preprocess:hisat2' #116

Open
Ahmed-Shibl opened this issue Dec 13, 2020 · 29 comments
Open

Error with 'preprocess:hisat2' #116

Ahmed-Shibl opened this issue Dec 13, 2020 · 29 comments
Assignees
Labels
bug Something isn't working question Further information is requested

Comments

@Ahmed-Shibl
Copy link

Ahmed-Shibl commented Dec 13, 2020

Hi! I recently ran the pipeline and got an error where a *_summary.log file was not detected during the execution of HISAT2.

Brief background: I'm using a Prochlorococcus genome as input and RNAseq reads (4 replicates) that belong to the strain grown under two different temperatures.

Please find the details below.

This is the command I started with:

nextflow run hoelzer-lab/rnaflow --reads input.csv --genome fasta.csv --annotation gtf.csv --mode paired --cores 60 --memory 250 --permanentCacheDir ~/miniconda3/envs/rnaflow/nextflow-autodownload-databases --condaCacheDir ~/miniconda3/envs/rnaflow/conda

And this is the output/error:

R N A F L O W : R N A - S E Q  A S S E M B L Y  &  D I F F E R E N T I A L  G E N E  E X P R E S S I O N  A N A L Y S I S
= = = = = = =   = = = = = = =  = = = = = = = =  =  = = = = = = = = = = = =  = = = =  = = = = = = = = = =  = = = = = = = =
Output path:          results
Mode:                 paired
Strandness:           0
TPM threshold:        1
Comparisons:          all
executor >  local (53)
[e3/050e9b] process > concat_genome                                          [100%] 1 of 1 ✔
[7b/9214a3] process > concat_annotation                                      [100%] 1 of 1 ✔
[skipped  ] process > download_sortmerna:sortmernaGet                        [100%] 1 of 1, stored: 1 ✔
[bb/e6b70b] process > preprocess:fastqcPre (4)                               [100%] 8 of 8 ✔
[88/0eace8] process > preprocess:fastp (7)                                   [100%] 8 of 8 ✔
[92/adf65f] process > preprocess:fastqcPost (8)                              [100%] 8 of 8 ✔
[fd/346485] process > preprocess:extract_tar_bz2                             [100%] 1 of 1 ✔
[fb/e687ee] process > preprocess:sortmerna (8)                               [100%] 8 of 8 ✔
[0d/5b91e4] process > preprocess:hisat2index                                 [100%] 1 of 1 ✔
[4c/636c66] process > preprocess:hisat2 (8)                                  [ 75%] 6 of 8
[3b/6186ed] process > preprocess:index_bam (6)                               [100%] 6 of 6
[-        ] process > expression_reference_based:featurecounts               [  0%] 0 of 6
[78/8fe098] process > expression_reference_based:format_annotation_gene_rows [100%] 1 of 1 ✔
[75/729cb6] process > expression_reference_based:format_annotation           [100%] 1 of 1 ✔
[-        ] process > expression_reference_based:tpm_filter                  -
[-        ] process > expression_reference_based:deseq2                      -
[0b/c14a18] process > expression_reference_based:multiqc_sample_names        [100%] 1 of 1 ✔
[-        ] process > expression_reference_based:multiqc                     -
Error executing process > 'preprocess:hisat2 (7)'

Caused by:
  Missing output file(s) `22_rep4_summary.log` expected by process `preprocess:hisat2 (7)`

Command executed:

  hisat2 -x reference -1 22_rep4.R1.other.fastq.gz -2 22_rep4.R2.other.fastq.gz -p 60 --new-summary --summary-file 22_rep4_summary.log  | samtools view -bS | samtools sort -o 22_rep4.sorted.bam -T tmp --threads 60

Command exit status:
  0

Command output:
  (empty)

Command error:
  grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
  Aborted (core dumped)
executor >  local (53)
[e3/050e9b] process > concat_genome                                          [100%] 1 of 1 ✔
[7b/9214a3] process > concat_annotation                                      [100%] 1 of 1 ✔
[skipped  ] process > download_sortmerna:sortmernaGet                        [100%] 1 of 1, stored: 1 ✔
[bb/e6b70b] process > preprocess:fastqcPre (4)                               [100%] 8 of 8 ✔
[88/0eace8] process > preprocess:fastp (7)                                   [100%] 8 of 8 ✔
[92/adf65f] process > preprocess:fastqcPost (8)                              [100%] 8 of 8 ✔
[fd/346485] process > preprocess:extract_tar_bz2                             [100%] 1 of 1 ✔
[fb/e687ee] process > preprocess:sortmerna (8)                               [100%] 8 of 8 ✔
[0d/5b91e4] process > preprocess:hisat2index                                 [100%] 1 of 1 ✔
[4c/636c66] process > preprocess:hisat2 (8)                                  [100%] 7 of 7, failed: 1
[3b/6186ed] process > preprocess:index_bam (6)                               [100%] 6 of 6
[-        ] process > expression_reference_based:featurecounts               [  0%] 0 of 6
[78/8fe098] process > expression_reference_based:format_annotation_gene_rows [100%] 1 of 1 ✔
[75/729cb6] process > expression_reference_based:format_annotation           [100%] 1 of 1 ✔
[-        ] process > expression_reference_based:tpm_filter                  -
[-        ] process > expression_reference_based:deseq2                      -
[0b/c14a18] process > expression_reference_based:multiqc_sample_names        [100%] 1 of 1 ✔
[-        ] process > expression_reference_based:multiqc                     -
Error executing process > 'preprocess:hisat2 (7)'

Caused by:
  Missing output file(s) `22_rep4_summary.log` expected by process `preprocess:hisat2 (7)`

Command executed:

  hisat2 -x reference -1 22_rep4.R1.other.fastq.gz -2 22_rep4.R2.other.fastq.gz -p 60 --new-summary --summary-file 22_rep4_summary.log  | samtools view -bS | samtools sort -o 22_rep4.sorted.bam -T tmp --threads 60                

Command exit status:
  0

Command output:
  (empty)

Command error:
  grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
  Aborted (core dumped)
  (ERR): hisat2-align exited with value 134
  [bam_sort_core] merging from 0 files and 60 in-memory blocks...
  grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
  grep: warning: GREP_OPTIONS is deprecated; please use an alias or script

Work dir:
  /tmp/nextflow-work-as11798/9f/06b698f84634f12668a460b0102939

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

More info:
The contents of /tmp/nextflow-work-as11798/9f/06b698f84634f12668a460b0102939 are as follows, but there was not much more info about the error;

 .command.trace
 .exitcode
 .command.err
 .command.log
 22_rep4.sorted.bam
 .command.out
 reference.6.ht2 -> /tmp/nextflow-work-as11798/0d/5b91e46671d78f74e5bdf923db148e/reference.6.ht2
 reference.7.ht2 -> /tmp/nextflow-work-as11798/0d/5b91e46671d78f74e5bdf923db148e/reference.7.ht2
 reference.8.ht2 -> /tmp/nextflow-work-as11798/0d/5b91e46671d78f74e5bdf923db148e/reference.8.ht2
 reference.3.ht2 -> /tmp/nextflow-work-as11798/0d/5b91e46671d78f74e5bdf923db148e/reference.3.ht2
 reference.4.ht2 -> /tmp/nextflow-work-as11798/0d/5b91e46671d78f74e5bdf923db148e/reference.4.ht2
 reference.5.ht2 -> /tmp/nextflow-work-as11798/0d/5b91e46671d78f74e5bdf923db148e/reference.5.ht2
 reference.1.ht2 -> /tmp/nextflow-work-as11798/0d/5b91e46671d78f74e5bdf923db148e/reference.1.ht2
 reference.2.ht2 -> /tmp/nextflow-work-as11798/0d/5b91e46671d78f74e5bdf923db148e/reference.2.ht2
 reference.fa -> /tmp/nextflow-work-as11798/e3/050e9bf742cdd758fa7546a9ace250/reference.fa
 22_rep4.R1.other.fastq.gz -> /tmp/nextflow-work-as11798/ab/bc7fcf61fa424d2303859b020cb698 /22_rep4.R1.other.fastq.gz
 22_rep4.R2.other.fastq.gz -> /tmp/nextflow-work-as11798/ab/bc7fcf61fa424d2303859b020cb698/22_rep4.R2.other.fastq.gz
 .command.begin
 .command.run
 .command.sh

I noticed that the file 22_rep4.sorted.bam, which is the output of the command causing the error, is found in there.

Please let me know if you need any more info! Thanks!!

@hoelzer
Copy link
Contributor

hoelzer commented Dec 13, 2020

Hi @Ahmed-Shibl, thanks for reporting! Interesting, because it seems that most of your samples were successfully running through the hisat2 mapping step but 22_rep4 failed.

There is no more information in the .command.{err,log,out} files? Can you do a

ls -lah /tmp/nextflow-work-as11798/9f/06b698f84634f12668a460b0102939/22_rep4.sorted.bam
# and
samtools view -H /tmp/nextflow-work-as11798/9f/06b698f84634f12668a460b0102939/22_rep4.sorted.bam

just to check if there was a somewhat valid BAM file produced with some alignment information. Because it looks like hisat2 failed to produce the summary file with some mapping statistics for that specific input sample. Due to

(ERR): hisat2-align exited with value 134
[bam_sort_core] merging from 0 files and 60 in-memory blocks...

I also think that the whole mapping step failed. If so, it might be also good to check the FASTQ files after the trimming and rRNA depletion step:

ls -lah /tmp/nextflow-work-as11798/ab/bc7fcf61fa424d2303859b020cb698 /22_rep4.R1.other.fastq.gz
ls -lah /tmp/nextflow-work-as11798/ab/bc7fcf61fa424d2303859b020cb698 /22_rep4.R2.other.fastq.gz

Are they empty?

@hoelzer hoelzer added bug Something isn't working question Further information is requested labels Dec 13, 2020
@Ahmed-Shibl
Copy link
Author

Ahmed-Shibl commented Dec 13, 2020

Hello @hoelzer ,thanks for getting back to me on this.

.command.err and .command.log show lines of this:

grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script

.command.out is empty.

ls -lah /tmp/nextflow-work-as11798/9f/06b698f84634f12668a460b0102939/22_rep4.sorted.bam returns:
1.4G Dec 12 23:51 /tmp/nextflow-work-as11798/9f/06b698f84634f12668a460b0102939/22_rep4.sorted.bam

and

samtools view -H /tmp/nextflow-work-as11798/9f/06b698f84634f12668a460b0102939/22_rep4.sorted.bam returns:

@HD	VN:1.0	SO:coordinate
@SQ	SN:Prochlorococcus_RSP50	LN:1656033
@PG	ID:hisat2	PN:hisat2	VN:2.1.0	CL:"/home/as11798/miniconda3/envs/rnaflow/conda/hisat2-64237f901cd5f0ba59918f6eb289d712/bin/hisat2-align-s --wrapper basic-0 -x reference -p 60 --new-summary --summary-file 22_rep4_summary.log -1 /tmp/1928453.inpipe1 -2 /tmp/1928453.inpipe2"
@PG	ID:samtools	PN:samtools	PP:hisat2	VN:1.11	CL:samtools view -H /tmp/nextflow-work-as11798/9f/06b698f84634f12668a460b0102939/22_rep4.sorted.bam

Regarding the FASTQ files,

wc -l /tmp/nextflow-work-as11798/9f/06b698f84634f12668a460b0102939/22_rep4.R1.other.fastq.gz and
wc -l /tmp/nextflow-work-as11798/9f/06b698f84634f12668a460b0102939/22_rep4.R1.other.fastq.gz return

5311056 /tmp/nextflow-work-as11798/9f/06b698f84634f12668a460b0102939/22_rep4.R1.other.fastq.gz and
5255348 /tmp/nextflow-work-as11798/9f/06b698f84634f12668a460b0102939/22_rep4.R2.other.fastq.gz, respectively.

Shouldn't they have the same line numbers? do you think skipping the SortMeRNA step using --skip_sortmerna affects the analysis downstream?

@hoelzer
Copy link
Contributor

hoelzer commented Dec 13, 2020

The grep thing is just a warning so I don't think that this has an actual impact.

The BAM file has 1.4G so there is at least something mapped.

That the FASTQ files have differing line numbers is a bit strange, yes. So it seems R1 has 1,327,764 reads and R2 has 1,313,837 reads (line numbers divided by 4).

You could also check the integrity of your input FASTQ files via

conda create -n fastq_utils -c bioconda fastq_utils
conda activate fastq_utils
fastq_info file_1.fastq.gz file_2.fastq.gz pe

just to be sure that your input files are fine.

You can also try to --skip_sortmerna, maybe the SortMeRNA step introduces some error into these files but then it would be again strange while the others are working.

@Ahmed-Shibl
Copy link
Author

Ahmed-Shibl commented Dec 13, 2020

Incredible...
fastq_utils did find an error:
ERROR: Error in file /tmp/nextflow-work-as11798/ab/bc7fcf61fa424d2303859b020cb698/22_rep4.R1.other.fastq.gz: line 48180848: wrong header

I checked all the *.other.fastq.gz files and they were all good - just this one.

Is this something that sortmerna caused?

@hoelzer
Copy link
Contributor

hoelzer commented Dec 13, 2020

Can you please also check your input files? So the raw FASTQ? Are they fine? If they are fine, I would suggest deleting the folder /tmp/nextflow-work-as11798/ab/bc7fcf61fa424d2303859b020cb698 so that sortmerna will run again for this sample. Maybe there was really some error in sortmerna... which we never experienced before. but anyway we also plan to replace the tool due to its slow runtime in the future...

@Ahmed-Shibl
Copy link
Author

The raw FASTQ files were all also fine. I haven't yet deleted the folder /tmp/nextflow-work-as11798/ab/bc7fcf61fa424d2303859b020cb698 but I'll certainly try that next. In the meantime, what I did was run the command again using -resume and --skip_sortmerna. HISAT2 ran fine but I got another error at the featurecounts step:


Error executing process > 'expression_reference_based:featurecounts (2)'

Caused by:
  Process `expression_reference_based:featurecounts (2)` terminated with an error exit status (255)

Command executed:

  featureCounts -pBP -T 60 -s 0 -a annotation.gtf -o 30_rep2.counts.tsv 30_rep2.sorted.bam

Command exit status:
  255

Command output:
  (empty)

Command error:
  grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
  grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
  grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
  
          ==========     _____ _    _ ____  _____  ______          _____  
          =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
            =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
              ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
                ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
          ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
          v2.0.1
  
  //========================== featureCounts setting ===========================\\
  ||                                                                            ||
  ||             Input files : 1 BAM file                                       ||
  ||                           o 30_rep2.sorted.bam                             ||
  ||                                                                            ||
  ||             Output file : 30_rep2.counts.tsv                               ||
  ||                 Summary : 30_rep2.counts.tsv.summary                       ||
  ||              Annotation : annotation.gtf (GTF)                             ||
  ||      Dir for temp files : ./                                               ||
  ||                                                                            ||
  ||                 Threads : 60                                               ||
  ||                   Level : meta-feature level                               ||
  ||              Paired-end : yes                                              ||
  ||      Multimapping reads : not counted                                      ||
  || Multi-overlapping reads : not counted                                      ||
  ||   Min overlapping bases : 1                                                ||
  ||                                                                            ||
  ||          Chimeric reads : counted                                          ||
  ||        Both ends mapped : required                                         ||
  ||         Fragment length : 50 - 600                                         ||
  ||                                                                            ||
  \\============================================================================//
  
  //================================= Running ==================================\\
  ||                                                                            ||
  || Load annotation file annotation.gtf ...                                    ||
  ERROR: no features were loaded in format GTF. The annotation format can be specified by the '-F' option, and the required feature type can be specified by the '-t' option..
  The porgram has to terminate.

Work dir:
  /tmp/nextflow-work-as11798/ab/3b420cdd254c04c98b9b95e94aa180

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

Now, I thought this might be an issue with headers in the annotation.gtf and 30_rep2.sorted.bam files not matching, and so I looked at headers:
less annotation.gtf | head -n 5 gave:

Prochlorococcus_RSP50	Prodigal_v2.6.1	transcript	178	594	.	+	.	transcript_id "mrna.PspRS50_00001"; gene_id "mrna.PspRS50_00001";
Prochlorococcus_RSP50	Prodigal_v2.6.1	CDS	178	594	.	+	0	transcript_id "mrna.PspRS50_00001";
Prochlorococcus_RSP50	Prodigal_v2.6.1	transcript	771	1250	.	+	.	transcript_id "mrna.PspRS50_00002"; gene_id "mrna.PspRS50_00002";
Prochlorococcus_RSP50	Prodigal_v2.6.1	CDS	771	1250	.	+	0	transcript_id "mrna.PspRS50_00002";
Prochlorococcus_RSP50	Prodigal_v2.6.1	transcript	1551	1826	.	+	.	transcript_id "mrna.PspRS50_00003"; gene_id "mrna.PspRS50_00003";

and

samtools view 30_rep2.sorted.bam | head -n 5 gave:

K00235:229:HCYJYBBXY:6:1102:16133:9227	419	Prochlorococcus_RSP50	1	1	2S149M	=	100	252	CAGATATCTATACTCTTAAAAACAGTGAAGGTGGAACTTATTCTGACGATAGTTCTTCTCTATGGGATGTCACTGCTGCAAAACAGACTGGCAGTGGATTTGATGTCTTATTTGAAGGTAGCGATGGCTCGTATAGAGAAGGCTATAACTA	AAAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJFJJFFJJJJJJJJJJJ7JJJFJJJJJJJJJFJFJFJFFFJJJJJJJFJJJAJJFFFAAJJAJJJFFJJF	AS:i:-2	ZS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:149	YS:i:-6	YT:Z:CP	NH:i:2
K00235:229:HCYJYBBXY:6:1104:18304:29483	163	Prochlorococcus_RSP50	1	1	11S140M	=	121	282	TTGGCAGCTCAGATATCTATACTCTTAAAAACAGTGAAGGTGGAACTTATTCTGACGATAGTTCTTCTCTATGGGATGTCACTGCTGCAAAACAGACTGGCAGTGGATTTGATGTCTTATTTGAAGGTAGCGATGGCTCGTATAGAGAAGG	AAFFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJ	AS:i:-16	ZS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:140	YS:i:0	YT:Z:CP	NH:i:2
K00235:229:HCYJYBBXY:6:1114:23876:41000	163	Prochlorococcus_RSP50	1	0	18S133M	=	102	270	CAATTATTTGGCAGCTCAGATATCTATACTCTTAAAAACAGTGAAGGTGGAACTTATTCTGACGATAGTTCTTCTCTATGGGATGTCACTGCTGCAAAACAGACTGGCAGTGGATTTGATGTCTTATTTGAAGGTAGCGATGGCTCGTATA	AAFFFFJJJJJJJJJJJJJJJJJJJFJJFJFJJJJJJJJJJJJJFJJJJJFAFAJFJFJJFAJJFJFFFJJJJJJJJJFFJJJJJFAJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJFFJJJJJJJFJFJJJJJJJJJ	AS:i:-30	ZS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:133	YS:i:-6	YT:Z:CP	NH:i:2
K00235:229:HCYJYBBXY:6:1120:27011:18230	163	Prochlorococcus_RSP50	1	1	6S145M	=	133	289	AGCTCAGATATCTATACTCTTAAAAACAGTGAAGGTGGAACTTATTCTGACGATAGTTCTTCTCTATGGGATGTCACTGCTGCAAAACAGACTGGCAGTGGATTTGATGTCTTATTTGAAGGTAGCGATGGCTCGTATAGAGAAGGCTATA	AAFFFJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJAFFJJJJFJJJJJJFJJJJJJJJJJJJJFJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJF<JJFJJJJJJJJJF<JJJJJJJF	AS:i:-7	ZS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:145	YS:i:-6	YT:Z:CP	NH:i:2
K00235:229:HCYJYBBXY:6:1128:21846:2949	419	Prochlorococcus_RSP50	1	0	18S133M	=	135	303	CAATTATTTGGCAGCTCAGATATCTATACTCTTAAAAACAGTGAAGGTGGAACTTATTCTGACGATAGTTCTTCTCTATGGGATGTCACTGCTGCAAAACAGACTGGCAGTGGATTTGATGTCTTATTTGAAGGTAGCGATGGCTCGTATA	AAFFFJFJJJJJJJJJJJJJJJJJJJJFJJ<AAFJJJJJJJJJJJJFJJJAFJJJJJJJJ<FJJJJFFJJFJJJJJJJJFJFFJJJJJJJJFJJJJJFJJAJ<AFJJJJJFFJJFAAJ-FJJJJJ<FJJJFJFFJFJJFFFFJJJJJFFJ<	AS:i:-30	ZS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:133	YS:i:-9	YT:Z:CP	NH:i:2

Looking back at previous attempts with featureCounts (stand alone) on the same dataset, I realized I used the -t and -g options - which are not present in the command executed here:

featureCounts -pBP -T 60 -s 0 -a annotation.gtf -o 30_rep2.counts.tsv 30_rep2.sorted.bam

Thanks again!

@hoelzer
Copy link
Contributor

hoelzer commented Dec 14, 2020

@Ahmed-Shibl thanks for keeping on! :)

Okay, you are using annotation from Prokka. Actually, I think we did not test such an input until now @MarieLataretu .

The default output from Prokka is in GFF format and not GTF I think and normally does not have transcript and exon features (the later making sense for bacteria). But in your case, it looks like your annotation does only have transcript and CDS features? And we are performing analyses per default on gene feature level.

Can you provide the annotation that you are trying to run as an input for RNAflow?

I think you might have used the -t switch to use transcript as feature? And -g to point to transcript_id? Currently, we use the default that is

  • -t gene
  • -g gene_id

If this is the problem, we might able to solve this by adding a

  • --featurecounts '-t transcript -g transcript_id'

parameter to customize the counting for special input annotations. What do you think @MarieLataretu ?

Ps: I think this might not solve your previous problem about the deprecated fastq files but maybe let's solve this one first

@MarieLataretu
Copy link
Collaborator

The default output from Prokka is in GFF format and not GTF I think and normally does not have transcript and exon features (the later making sense for bacteria). But in your case, it looks like your annotation does only have transcript and CDS features? And we are performing analyses per default on gene feature level.

Yes, I agree. It looks like you don't have the gene -> transcript -> exon structure.

I'll implement an additional parameter for that.

I'm not sure with -t transcript in this case, though. Maybe -t CDS is better?

@MarieLataretu MarieLataretu self-assigned this Dec 14, 2020
@hoelzer
Copy link
Contributor

hoelzer commented Dec 14, 2020

@implementation of additional parameter: agree!

If we use -t CDS then we miss gene_id in the description column:

Prochlorococcus_RSP50	Prodigal_v2.6.1	transcript	178	594	.	+	.	transcript_id "mrna.PspRS50_00001"; gene_id "mrna.PspRS50_00001";
Prochlorococcus_RSP50	Prodigal_v2.6.1	CDS	178	594	.	+	0	transcript_id "mrna.PspRS50_00001";
Prochlorococcus_RSP50	Prodigal_v2.6.1	transcript	771	1250	.	+	.	transcript_id "mrna.PspRS50_00002"; gene_id "mrna.PspRS50_00002";
Prochlorococcus_RSP50	Prodigal_v2.6.1	CDS	771	1250	.	+	0	transcript_id "mrna.PspRS50_00002";
Prochlorococcus_RSP50	Prodigal_v2.6.1	transcript	1551	1826	.	+	.	transcript_id "mrna.PspRS50_00003"; gene_id "mrna.PspRS50_00003";

so one would need to set -g transcript_id as well. And I'm not sure what will happen subsequently when we do not extract the gene_id but some other ID :) But with the implementation of such a parameter, we could test it.

@Ahmed-Shibl
Copy link
Author

Hi @hoelzer and @MarieLataretu, thanks for your patience :)

Indeed my Prokka output was a GFF file and so I had to change it into a GTF file using gffread -T RSP50.gff -o RSP50.gtf

and yes when I used it before, I used -t CDS -g ID as options, but at the time featurecounts accepted the GFF input so the only problem I ran into back then was having different chromosome names than the .bam file.

This is the GFF file I used to get the GTF file in annotations.csv (changed extension to allow attachment)
The GTF file here is the one I used in a second attempt after changing the chromosome name from Prochlorococcus_RS50_WGC to Prochlorococcus_RSP50 to match the 30_rep2.sorted.bam file

Sorry if it's a bit messy - let me know if there's anything not clear.

RSP50 (GTF).txt
RSP50 (GFF).txt

@hoelzer
Copy link
Contributor

hoelzer commented Dec 14, 2020

@Ahmed-Shibl okay. @MarieLataretu implemented a hotfix. Because we don't have your genome and read files it's difficult for us to test w/ your GTF file. But what you can try now is:

git clone https://github.com/hoelzer-lab/rnaflow.git
cd rnaflow
git pull origin master # should be already up-to-date but just to be sure

# add the full path to the main.nf file in the rnaflow folder
nextflow run main.nf --reads input.csv --genome fasta.csv --annotation gtf.csv \
--mode paired --cores 60 --memory 250 \
--permanentCacheDir ~/miniconda3/envs/rnaflow/nextflow-autodownload-databases --condaCacheDir ~/miniconda3/envs/rnaflow/conda \
--featurecounts_additional_params '-t transcript -g transcript_id' 

Maybe add --output results_test if you don't want to overwrite your current output files. Let's see if that helps to count the transcript features in your GTF file properly and if the rest of the pipeline can handle the change.

MarieLataretu added a commit that referenced this issue Dec 14, 2020
partly fixes #116 - params for featurecounts
@MarieLataretu MarieLataretu reopened this Dec 14, 2020
@MarieLataretu
Copy link
Collaborator

@implementation of additional parameter: agree!

If we use -t CDS then we miss gene_id in the description column:

Prochlorococcus_RSP50	Prodigal_v2.6.1	transcript	178	594	.	+	.	transcript_id "mrna.PspRS50_00001"; gene_id "mrna.PspRS50_00001";
Prochlorococcus_RSP50	Prodigal_v2.6.1	CDS	178	594	.	+	0	transcript_id "mrna.PspRS50_00001";
Prochlorococcus_RSP50	Prodigal_v2.6.1	transcript	771	1250	.	+	.	transcript_id "mrna.PspRS50_00002"; gene_id "mrna.PspRS50_00002";
Prochlorococcus_RSP50	Prodigal_v2.6.1	CDS	771	1250	.	+	0	transcript_id "mrna.PspRS50_00002";
Prochlorococcus_RSP50	Prodigal_v2.6.1	transcript	1551	1826	.	+	.	transcript_id "mrna.PspRS50_00003"; gene_id "mrna.PspRS50_00003";

so one would need to set -g transcript_id as well. And I'm not sure what will happen subsequently when we do not extract the gene_id but some other ID :) But with the implementation of such a parameter, we could test it.

sorry, I meant: -t CDS -g transcript_id or -t transcript -g transcript_id - not sure what's better

@Ahmed-Shibl
Copy link
Author

Alright, I ran the following as suggested:

git clone https://github.com/hoelzer-lab/rnaflow.git
cd rnaflow
git pull origin master

nextflow run ~/miniconda3/envs/rnaflow/rnaflow/main.nf --output results_test --reads input.csv --genome fasta.csv --annotation gtf.csv --mode paired --cores 60 --memory 250 --permanentCacheDir ~/miniconda3/envs/rnaflow/nextflow-autodownload-databases --condaCacheDir ~/miniconda3/envs/rnaflow/conda --featurecounts_additional_params '-t transcript -g transcript_id' 

And got the earlier error related to the deprecated fastq.gz file as a result of the sortmerna step I guess. This time, though, it pointed out the error in more detail:

[13/4a568e] process > preprocess:sortmerna (8)                               [100%] 8 of 8 ✔
[56/d220c9] process > preprocess:hisat2index                                 [100%] 1 of 1 ✔
[f7/855d63] process > preprocess:hisat2 (3)                                  [ 29%] 2 of 7, failed: 1
[5d/e71a86] process > preprocess:index_bam (1)                               [100%] 1 of 1
[-        ] process > expression_reference_based:featurecounts               [  0%] 0 of 1
[22/57e8dc] process > expression_reference_based:format_annotation_gene_rows [100%] 1 of 1 ✔
[15/8dde0d] process > expression_reference_based:format_annotation           [100%] 1 of 1 ✔
[-        ] process > expression_reference_based:tpm_filter                  -
[-        ] process > expression_reference_based:deseq2                      -
[da/422fec] process > expression_reference_based:multiqc_sample_names        [100%] 1 of 1 ✔
[-        ] process > expression_reference_based:multiqc                     -
Error executing process > 'preprocess:hisat2 (2)'

Caused by:
  Missing output file(s) `22_rep4_summary.log` expected by process `preprocess:hisat2 (2)`

Command executed:

  hisat2 -x reference -1 22_rep4.R1.other.fastq.gz -2 22_rep4.R2.other.fastq.gz -p 60 --new-summary --summary-file 22_rep4_summary.log  | samtools view -bS | samtools sort -o 22_rep4.sorted.bam -T tmp --threads 60

Command exit status:
  0

Command output:
  (empty)

Command error:
  Error: Read AFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ has more quality values than read characters.
  terminate called after throwing an instance of 'int'
  Aborted (core dumped)
  (ERR): hisat2-align exited with value 134
  [bam_sort_core] merging from 0 files and 60 in-memory blocks...
  grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
  
Work dir:
  /tmp/nextflow-work-as11798/2f/4a5b7060530705c2697bdf3eec73a4

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

And so, I re-ran the same command above but added -resume and --skip_sortmerna:

nextflow run ~/miniconda3/envs/rnaflow/rnaflow/main.nf -resume --output results_test --skip_sortmerna --reads input.csv --genome fasta.csv --annotation gtf.csv --mode paired --cores 60 --memory 250 --permanentCacheDir ~/miniconda3/envs/rnaflow/nextflow-autodownload-databases --condaCacheDir ~/miniconda3/envs/rnaflow/conda --featurecounts_additional_params '-t transcript -g transcript_id'

Got an error at expression_reference_based:deseq2 below:

[ec/3956c1] process > concat_genome                                          [100%] 1 of 1, cached: 1 ✔
[9f/b14960] process > concat_annotation                                      [100%] 1 of 1, cached: 1 ✔
[skipped  ] process > download_sortmerna:sortmernaGet                        [100%] 1 of 1, stored: 1 ✔
[04/fa3a05] process > preprocess:fastqcPre (3)                               [100%] 8 of 8, cached: 8 ✔
[d9/49e136] process > preprocess:fastp (3)                                   [100%] 8 of 8, cached: 8 ✔
[2a/905d11] process > preprocess:fastqcPost (8)                              [100%] 8 of 8, cached: 8 ✔
[56/d220c9] process > preprocess:hisat2index                                 [100%] 1 of 1, cached: 1 ✔
[38/f44526] process > preprocess:hisat2 (4)                                  [100%] 8 of 8 ✔
[59/b3e428] process > preprocess:index_bam (8)                               [100%] 8 of 8 ✔
[71/170d7f] process > expression_reference_based:featurecounts (8)           [100%] 8 of 8 ✔
[22/57e8dc] process > expression_reference_based:format_annotation_gene_rows [100%] 1 of 1, cached: 1 ✔
[15/8dde0d] process > expression_reference_based:format_annotation           [100%] 1 of 1, cached: 1 ✔
[c8/a9880c] process > expression_reference_based:tpm_filter                  [100%] 1 of 1 ✔
[6f/88ca27] process > expression_reference_based:deseq2 (1)                  [100%] 2 of 2, failed: 2, retries: 1 ✘
[da/422fec] process > expression_reference_based:multiqc_sample_names        [100%] 1 of 1, cached: 1 ✔
[4e/e6ee1c] process > expression_reference_based:multiqc (1)                 [100%] 1 of 1 ✔
[f6/d77a9c] NOTE: Process `expression_reference_based:deseq2 (1)` terminated with an error exit status (1) -- Execution is retried (1)
Error executing process > 'expression_reference_based:deseq2 (1)'

Caused by:
  Process `expression_reference_based:deseq2 (1)` terminated with an error exit status (1)

Command executed:

  R CMD BATCH --no-save --no-restore '--args c(".") c("22_rep1.counts.filtered.formated.tsv","22_rep2.counts.filtered.formated.tsv","22_rep3.counts.filtered.formated.tsv","22_rep4.counts.filtered.formated.tsv","30_rep1.counts.filtered.formated.tsv","30_rep2.counts.filtered.formated.tsv","30_rep3.counts.filtered.formated.tsv","30_rep4.counts.filtered.formated.tsv") c("22C","22C","22C","22C","30C","30C","30C","30C") c("22_rep1","22_rep2","22_rep3","22_rep4","30_rep1","30_rep2","30_rep3","30_rep4") c("22C","30C") c("22C:30C") c("annotation.id2ensembl") c("annotation.gene.gtf") c("RSP50","RSP50","RSP50","RSP50","RSP50","RSP50","RSP50","RSP50") c("") c("regionReport_DESeq2Exploration_custom.Rmd") c(60)' deseq2.R                             

Command exit status:
  1

Command output:
  (empty)

Command error:
  grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
  

Work dir:
  /tmp/nextflow-work-as11798/6f/88ca27e4d376a753a84b9e456d6e82

The directory /tmp/nextflow-work-as11798/6f/88ca27e4d376a753a84b9e456d6e82 had these files in it:


   1 Dec 15 08:33 .exitcode
4.0K Dec 15 08:33 ./
 21K Dec 15 08:33 deseq2.Rout
 12K Dec 15 08:32 .command.err
 12K Dec 15 08:32 .command.log
   0 Dec 15 08:32 .command.trace
   0 Dec 15 08:32 .command.out
  72 Dec 15 08:32 improve_deseq_table.rb -> /home/as11798/miniconda3/envs/rnaflow/rnaflow/bin/improve_deseq_table.rb*
  82 Dec 15 08:32 refactor_reportingtools_table.rb -> /home/as11798/miniconda3/envs/rnaflow/rnaflow/bin/refactor_reportingtools_table.rb*
  97 Dec 15 08:32 30_rep4.counts.filtered.formated.tsv -> /tmp/nextflow-work-as11798/c8/a9880cc647679a1e21bed040b6c2ab/30_rep4.counts.filtered.formated.tsv
  80 Dec 15 08:32 annotation.gene.gtf -> /tmp/nextflow-work-as11798/22/57e8dce1cfa35f7e1d3781a9eadce3/annotation.gene.gtf
  82 Dec 15 08:32 annotation.id2ensembl -> /tmp/nextflow-work-as11798/15/8dde0d77d697c444cbee892a8d35cf/annotation.id2ensembl
  58 Dec 15 08:32 deseq2.R -> /home/as11798/miniconda3/envs/rnaflow/rnaflow/bin/deseq2.R*
  97 Dec 15 08:32 22_rep3.counts.filtered.formated.tsv -> /tmp/nextflow-work-as11798/c8/a9880cc647679a1e21bed040b6c2ab/22_rep3.counts.filtered.formated.tsv
  97 Dec 15 08:32 22_rep4.counts.filtered.formated.tsv -> /tmp/nextflow-work-as11798/c8/a9880cc647679a1e21bed040b6c2ab/22_rep4.counts.filtered.formated.tsv
  97 Dec 15 08:32 30_rep1.counts.filtered.formated.tsv -> /tmp/nextflow-work-as11798/c8/a9880cc647679a1e21bed040b6c2ab/30_rep1.counts.filtered.formated.tsv
  97 Dec 15 08:32 30_rep2.counts.filtered.formated.tsv -> /tmp/nextflow-work-as11798/c8/a9880cc647679a1e21bed040b6c2ab/30_rep2.counts.filtered.formated.tsv
  97 Dec 15 08:32 30_rep3.counts.filtered.formated.tsv -> /tmp/nextflow-work-as11798/c8/a9880cc647679a1e21bed040b6c2ab/30_rep3.counts.filtered.formated.tsv
  97 Dec 15 08:32 22_rep1.counts.filtered.formated.tsv -> /tmp/nextflow-work-as11798/c8/a9880cc647679a1e21bed040b6c2ab/22_rep1.counts.filtered.formated.tsv
  97 Dec 15 08:32 22_rep2.counts.filtered.formated.tsv -> /tmp/nextflow-work-as11798/c8/a9880cc647679a1e21bed040b6c2ab/22_rep2.counts.filtered.formated.tsv
  94 Dec 15 08:32 regionReport_DESeq2Exploration_custom.Rmd -> /home/as11798/miniconda3/envs/rnaflow/rnaflow/assets/regionReport_DESeq2Exploration_custom.Rmd
   0 Dec 15 08:32 .command.begin
 12K Dec 15 08:32 .command.run
 731 Dec 15 08:32 .command.sh

The last few lines in /tmp/nextflow-work-as11798/6f/88ca27e4d376a753a84b9e456d6e82/deseq2.Rout were this:

> #####################
> ## Read in ensembl ids, gene names and biotypes from a tab seperated table
> df.gene.anno <- as.data.frame( read.table(ensembl2genes, header=FALSE, sep="\t") )
Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': no lines available in input
Calls: as.data.frame -> read.table
Execution halted

@hoelzer
Copy link
Contributor

hoelzer commented Dec 15, 2020

Hi @Ahmed-Shibl !

[1]

  Error: Read AFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ has more quality values than read characters.

I still dont know why SortMeRNA seems to report a deprecated FASTQ file here. I think you checked the corresponding input FASTQ and they were fine? I think fastp (trimming step, before SortMeRNA) can also work with the input FASTQ because then it is counted by featurecounts and passed to DESeq2 in your second call.

[2]
Okay, this is something I kind of expected because we changed the way how we extract information from your GTF file. Can you please provide the following files:

mkdir troubleshoot && DIR=troubleshoot
cp /tmp/nextflow-work-as11798/22/57e8dce1cfa35f7e1d3781a9eadce3/annotation.gene.gtf $DIR/
cp /tmp/nextflow-work-as11798/15/8dde0d77d697c444cbee892a8d35cf/annotation.id2ensembl $DIR/
cp /tmp/nextflow-work-as11798/c8/a9880cc647679a1e21bed040b6c2ab/*.counts.filtered.formated.tsv $DIR/
cp /tmp/nextflow-work-as11798/6f/88ca27e4d376a753a84b9e456d6e82/.command.sh $DIR/command.sh
tar zcvf $DIR.tar.gz $DIR

I think that we can better investigate where we need to add changes to the scripts to work with your input. This also helps us to generalize the pipeline to other user inputs which is a good thing. thx!

@Ahmed-Shibl
Copy link
Author

Ahmed-Shibl commented Dec 15, 2020

Hi!

[1]
When I ran this earlier

fastq_info /tmp/nextflow-work-as11798/ab/bc7fcf61fa424d2303859b020cb698/22_rep4.R1.other.fastq.gz /tmp/nextflow-work-as11798/ab/bc7fcf61fa424d2303859b020cb698/22_rep4.R2.other.fastq.gz

it returned this (also mentioned above):

ERROR: Error in file /tmp/nextflow-work-as11798/ab/bc7fcf61fa424d2303859b020cb698/22_rep4.R1.other.fastq.gz: line 48180848: wrong header

All the other FASTQ files - including 22_rep4.R2.other.fastq.gz - were fine.

[2]
Before I provide the files, does it make sense that these two seem empty?

/tmp/nextflow-work-as11798/22/57e8dce1cfa35f7e1d3781a9eadce3/annotation.gene.gtf 
/tmp/nextflow-work-as11798/15/8dde0d77d697c444cbee892a8d35cf/annotation.id2ensembl

@hoelzer
Copy link
Contributor

hoelzer commented Dec 15, 2020

[1] and the raw input files were also fine (srry cant' remember but I think you checked this) ?

[2] ah! no, they should not be empty. I think that's the problem. annotation.gene.gtf is constructed by extracting all rows that have gene in column 3. @MarieLataretu we should set this to something like echo ${params.featurecounts_additionalparameter} | awk 'BEGIN{FS="-t"};{print $1}' | awk '{print $1}' or similar? so that we collect the necessary information based on the feature defined via -t? But if we do it like this, we should have a check if someone sets params.featurecounts_additionalparams = '' to just use the gene column as a default. I think when we solve the issue that annotation.gene.gtf is empty annotation.id2ensembl should be also have content. Btw we should also rename annotation.id2ensembl because the file does not necessarily hold ensembl ID information.

@Ahmed-Shibl
Copy link
Author

[1] Yes the raw FASTQ were also fine

[2] Thanks for explaining this - makes sense, because I guess currently it's looking for gene when it should be looking for transcript or whatever is defined by -t

Happy to run it again once you give me the greenlight. Thanks

@hoelzer
Copy link
Contributor

hoelzer commented Dec 15, 2020

[1] Yes the raw FASTQ were also fine

[2] Thanks for explaining this - makes sense, because I guess currently it's looking for gene when it should be looking for transcript or whatever is defined by -t

Happy to run it again once you give me the greenlight. Thanks

[1] Okay, then, for some reason, SortMeRNA destroys this sample as it seems.

[2] @MarieLataretu can you implement a fix please?

@hoelzer
Copy link
Contributor

hoelzer commented Dec 29, 2020

at [1]
I just experienced a similar error. For one FASTQ paired-end file, SortMeRNA produces a deprecated output that then fails afterward in the HISAT2 step. The raw FASTQ files look fine (checked with fastq_info) and are still fine after the fastp step. But the SortMeRNA step produces FASTQ files failing the fastq_info check.

What happens in the SortMeRNA step is

unpigz -f -p 36 lung_old_rep3.R1.trimmed.fastq.gz
unpigz -f -p 36 lung_old_rep3.R2.trimmed.fastq.gz
merge-paired-reads.sh lung_old_rep3.R1.trimmed.fastq lung_old_rep3.R2.trimmed.fastq lung_old_rep3.merged.fastq
sortmerna --ref ./rRNA_databases/silva-bac-16s-id90.fasta,./rRNA_databases/silva-bac-16s-id90:./rRNA_databases/silva-bac-23s-id98.fasta,./rRNA_databases/silva-bac-23s-id98:./rRNA_databases/silva-arc-16s-id95.fasta,./rRNA_databases/silva-arc-16s-id95:./rRNA_databases/silva-arc-23s-id98.fasta,./rRNA_databases/silva-arc-23s-id98:./rRNA_databases/silva-euk-18s-id95.fasta,./rRNA_databases/silva-euk-18s-id95:./rRNA_databases/silva-euk-28s-id98.fasta,./rRNA_databases/silva-euk-28s-id98:./rRNA_databases/rfam-5s-database-id98.fasta,./rRNA_databases/rfam-5s-database-id98:./rRNA_databases/rfam-5.8s-database-id98.fasta,./rRNA_databases/rfam-5.8s-database-id98     --reads lung_old_rep3.merged.fastq     --paired_in --aligned lung_old_rep3.aligned     --other lung_old_rep3.other_merged     --fastx --log --num_alignments 1 -v     -a 36
unmerge-paired-reads.sh lung_old_rep3.other_merged.fastq lung_old_rep3.R1.other.fastq lung_old_rep3.R2.other.fastq
pigz -p 36 lung_old_rep3.R1.other.fastq
pigz -p 36 lung_old_rep3.R2.other.fastq
rm lung_old_rep3.merged.fastq lung_old_rep3.aligned.fastq lung_old_rep3.other_merged.fastq lung_old_rep3.R1.trimmed.fastq lung_old_rep3.R2.trimmed.fastq

so could be that the problem lies in the 1) paired-end merge step, 2) sortmerna command, 3) unmerge step. I don't think the pigz step could produce deprecated archives of the fastqs?

I just deleted the nextflow work dir that holds the SortMeRNA output for the deprecated sample and -resume the workflow to see if that error does occur systematically (due to the composition of the specific sample).

@hoelzer
Copy link
Contributor

hoelzer commented Dec 30, 2020

Update: the same error occured again with the same input files (same sample).

@MarieLataretu I'm using this sample:

lung_old_rep3,/data/fass2/reads/fli/20200903_572_GG/Lung7_S15_R1_001.fastq.gz,/data/fass2/reads/fli/20200903_572_GG/Lung7_S15_R2_001.fastq.gz,lung_old,

and the SortMeRNA process seems to produce deprecated output FASTQs for some reason.

fastq_info tells:

└─> fastq_info lung_old_rep3.R1.other.fastq.gz
fastq_utils 0.24.1
DEFAULT_HASHSIZE=39000001
Scanning and indexing all reads from lung_old_rep3.R1.other.fastq.gz
CASAVA=1.8
61500000
ERROR: Error in file lung_old_rep3.R1.other.fastq.gz: line 246306760: wrong header

and for the R2 file:

└─> fastq_info lung_old_rep3.R2.other.fastq.gz
fastq_utils 0.24.1
DEFAULT_HASHSIZE=39000001
Scanning and indexing all reads from lung_old_rep3.R2.other.fastq.gz
CASAVA=1.8
61500000
ERROR: Error in file lung_old_rep3.R2.other.fastq.gz: line 246306760: wrong header FFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

Not sure what's going on and why sortmerna is failing here

@MarieLataretu
Copy link
Collaborator

Hi @Ahmed-Shibl ,

with release 1.2.0 we tackled the issues regarding the non-Ensembl annotation. --featurecounts_additional_params '-t transcript -g transcript_id' should work now.

For the corrupted fastq files, we speculate that screen or tmux is interfering. Can it be, that you started your run in a screen?

@MarieLataretu
Copy link
Collaborator

There might be some issues with the non-biomartr-compatible IDs. This should be fixed in release 1.2.1

@Ahmed-Shibl
Copy link
Author

Hi @MarieLataretu,

with release 1.2.0 we tackled the issues regarding the non-Ensembl annotation. --featurecounts_additional_params '-t transcript -g transcript_id' should work now.

That's great, thanks - I'll use the latest v1.2.1 and re-run the command I started with above and keep you posted.

For the corrupted fastq files, we speculate that screen or tmux is interfering. Can it be, that you started your run in a screen?

Yes, I definitely use tmux for runs that take time - do you recommend not to? is there an alternative?

@hoelzer
Copy link
Contributor

hoelzer commented Jan 16, 2021

@Ahmed-Shibl

great, thanks for re-running those analyses and we hope it works now!

Yes, I definitely use tmux for runs that take time - do you recommend not to? is there an alternative?

We experienced now a few times issues when executing Nextflow in a screen or via tmux but unfortunately we currently don't know what is happening here (and likely we can not fix that when it is a Nextflow issue). I also use screen quite regularly to run Nextflow (in particular for longer runs), detach, and attach to check the status which is actually quite handy. However, in a few cases, some strange errors occur. I will also report this to the Nextflow devs, maybe there is something known.

In the meantime, maybe it is also possible for you to run it w/o tmux?

@hoelzer
Copy link
Contributor

hoelzer commented Jan 18, 2021

I asked around and one of the Nextflow developers wrote:

I usually run it adding -bg > log and then tailing the log

-bg puts the run into the background. Maybe that helps somehow

@Ahmed-Shibl
Copy link
Author

@hoelzer @MarieLataretu

Okay I think there is some progress but I still ran into an error - this time with deseq2.

First I updated the pipeline with:
nextflow pull hoelzer-lab/rnaflow

The command I then ran (without tmux) was this:
Notice however that I used --skip_sortmerna (next time I can try without this flag)
nextflow run hoelzer-lab/rnaflow --output results_test_fixed --skip_sortmerna --reads input.csv --genome fasta.csv --annotation gtf.csv --mode paired --cores 60 --memory 250 --permanentCacheDir ~/miniconda3/envs/rnaflow/nextflow-autodownload-databases --condaCacheDir ~/miniconda3/envs/rnaflow/conda --featurecounts_additional_params '-t transcript -g transcript_id' -bg > log

The error read:

Jan-19 10:54:37.319 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 59; name: expression_reference_based:deseq2 (1); status: COMPLETED; exit: 1; error: -; workDir: /home/as11798/miniconda3/envs/rnaflow/RSP50/work/1c/426255bd997a680682966e1445e1ba]
Jan-19 10:54:37.348 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'expression_reference_based:deseq2 (1)'

Caused by:
  Process `expression_reference_based:deseq2 (1)` terminated with an error exit status (1)

Command executed:

  R CMD BATCH --no-save --no-restore '--args c(".") c("22_rep1.counts.filtered.formated.tsv","22_rep2.counts.filtered.formated.tsv","22_rep3.counts.filtered.formated.tsv","22_rep4.counts.filtered.formated.tsv","30_rep1.counts.filtered.formated.tsv","30_rep2.counts.filtered.formated.tsv","30_rep3.counts.filtered.formated.tsv","30_rep4.counts.filtered.formated.tsv") c("22C","22C","22C","22C","30C","30C","30C","30C") c("22_rep1","22_rep2","22_rep3","22_rep4","30_rep1","30_rep2","30_rep3","30_rep4") c("22C","30C") c("22C:30C") c("annotation.id2details") c("annotation.transcript.gtf") c("RSP50","RSP50","RSP50","RSP50","RSP50","RSP50","RSP50","RSP50") c("") c("regionReport_DESeq2Exploration_custom.Rmd") c(60) c("ensembl_gene_id")' deseq2.R

Command exit status:
  1

Command output:
  (empty)

Command error:
  grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
  grep: warning: GREP_OPTIONS is deprecated; please use an alias or script

Work dir:
  /home/as11798/miniconda3/envs/rnaflow/RSP50/work/1c/426255bd997a680682966e1445e1ba

The difference between this time and the time before is that the directory /home/as11798/miniconda3/envs/rnaflow/RSP50/work/1c/426255bd997a680682966e1445e1ba contained 1) the directories data and plots but were both empty...and 2) the files annotation.transcript.gtf and annotation.id2details had content in them, unlike last time when they were empty.

The last few lines in deseq2.Rout were this:

> ##########################################
> ## DESeq2 stuff
> ##########################################
> 
> #####################
> ## Create input object
> if (length(sources) > 0) {
+     sampleTable <- data.frame(sampleName = samples, fileName = samples, condition = conditions, type = col.labels, sources = sources, design = paste(sources, conditions, sep = ':'))
+     ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, design= ~ sources + condition)
+ } else {
+     sampleTable <- data.frame(sampleName = samples, fileName = samples, condition = conditions, type = col.labels, design = conditions)
+     ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, design= ~ condition)
+ }
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
  contrasts can be applied only to factors with 2 or more levels
Calls: DESeqDataSetFromHTSeqCount ... DESeqDataSetFromMatrix -> DESeqDataSet -> <Anonymous> -> contrasts<-
In addition: Warning message:
In DESeqDataSet(se, design = design, ignoreRank) :
  some variables in design formula are characters, converting to factors
Execution halted

Let me know if there files or more information to share. Thanks!!

@MarieLataretu
Copy link
Collaborator

Hmm, this DESeq error really sounds to be input-dependent... 🤔

Have you ever tried the small test example with nextflow run hoelzer-lab/rnaflow -profile test,conda,local --condaCacheDir ~/miniconda3/envs/rnaflow/conda?

@Ahmed-Shibl
Copy link
Author

Hmm, this DESeq error really sounds to be input-dependent... 🤔

Have you ever tried the small test example with nextflow run hoelzer-lab/rnaflow -profile test,conda,local --condaCacheDir ~/miniconda3/envs/rnaflow/conda?

Hi @MarieLataretu,

So, I ran the command you suggested above and it went well. Here's the output:

Output path:          results
Mode:                 single
Strandness:           0
TPM threshold:        1
Comparisons:          all
executor >  local (36)
[skipped  ] process > download_sortmerna:sortmernaGet                        [100%] 1 of 1, stored: 1 ✔
[33/158223] process > preprocess:fastqcPre (3)                               [100%] 4 of 4 ✔
[84/631843] process > preprocess:fastp (2)                                   [100%] 4 of 4 ✔
[5b/c95467] process > preprocess:fastqcPost (4)                              [100%] 4 of 4 ✔
[66/84e124] process > preprocess:extract_tar_bz2                             [100%] 1 of 1 ✔
[8c/de9b9e] process > preprocess:sortmerna (2)                               [100%] 4 of 4 ✔
[6c/89e82a] process > preprocess:hisat2index                                 [100%] 1 of 1 ✔
[d8/e8f61f] process > preprocess:hisat2 (4)                                  [100%] 4 of 4 ✔
[62/328a03] process > preprocess:index_bam (4)                               [100%] 4 of 4 ✔
[78/bdad3f] process > expression_reference_based:featurecounts (4)           [100%] 4 of 4 ✔
[cf/2cb693] process > expression_reference_based:format_annotation_gene_rows [100%] 1 of 1 ✔
[a4/0f478b] process > expression_reference_based:format_annotation           [100%] 1 of 1 ✔
[75/f20390] process > expression_reference_based:tpm_filter                  [100%] 1 of 1 ✔
[70/f1a4d6] process > expression_reference_based:deseq2 (1)                  [100%] 1 of 1 ✔
[7a/e95cad] process > expression_reference_based:multiqc_sample_names        [100%] 1 of 1 ✔
[37/e5163d] process > expression_reference_based:multiqc (1)                 [100%] 1 of 1 ✔
Completed at: 25-Jun-2021 17:40:02
Duration    : 11m 58s
CPU hours   : 0.3
Succeeded   : 36

How do you recommend I troubleshoot my data/files? Is there a way to get a template of the files used as input for DESeq2 in this test run? or would I have to trace further back?

Thanks,

@hoelzer
Copy link
Contributor

hoelzer commented Jun 26, 2021

Hi @Ahmed-Shibl thanks for catching up!

Here you can find the test data and structure that is run via the test command: https://github.com/hoelzer-lab/rnaflow/tree/master/test-data

Maybe that helps to figure out what's the difference in your data?

You can also look in the work dir: work/70/f1a4d6* to see how the files look like that actually go then in the DEseq2 process

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants