Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow adding of Illumina Casava 1.8 format entry to fastq headers #41

Closed
charlesfoster opened this issue Sep 4, 2024 · 4 comments
Closed
Labels
enhancement New feature or request
Milestone

Comments

@charlesfoster
Copy link

Hi,

Thanks for the useful tool. Would it be possible to add the option to modify the headers of the output fastq files? I can see in aligner.py that the (final) command for generating clean reads with paired-end data is:

f" | samtools fastq --threads 4 -c 6 -N -1 '{fastq1_out_path}' -2 '{fastq2_out_path}'"

This results in read headers in the form of "@<read_id>/1". It would be useful to have the option to output clean reads with the Illumina Casava 1.8 format entry, which is an option in samtools fastq:

  -i           add Illumina Casava 1.8 format entry to header (eg 1:N:0:ATCACG)
  <truncated>
  --barcode-tag TAG
               Barcode tag [BC]
  --quality-tag TAG
               Quality tag [QT]
  --index-format STR
               How to parse barcode and quality tags

      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

Minimally, the command could be:

f" | samtools fastq --threads 4 -c 6 -n -i --index-format "i*"  -1 '{fastq1_out_path}' -2 '{fastq2_out_path}'"

This results in read headers in the form of "@<read_id> 1:N:0:0".

In my case this is useful because my previous method of host decontamination (bowtie2 ... --un-conc-gz id.unmapped.fastq.gz ...) resulted in read headers in that format, and hence downstream read manipulation is based on that format. However, I can understand if this might not be a priority to implement. If not, could there be an option to save a *.bam file with clean reads, allowing users to also extract reads to file as they see fit?

Thanks

@bede bede added the enhancement New feature or request label Sep 9, 2024
@bede
Copy link
Owner

bede commented Sep 9, 2024

Hi Charles,
Thanks for raising this and helpfully suggesting an implementation. I will review this alongside some other planned improvements, which will be in the order of weeks. If you'd like to get this working in the meantime, I can offer guidance for improvising an interim solution in case you haven't already done so.

Thanks,
Bede

@charlesfoster
Copy link
Author

charlesfoster commented Sep 10, 2024

Great, thanks. I'll follow along :)

In the meantime I've just got the following hacky code in my Nextflow script which does the job:

## modify the file headers if need be
if [[ "\$artificial_cassava" == "true" ]]; then
    for file in *.clean_*.fastq.gz; do
        name=\$(basename "\$file")
        zcat "\$file" | awk '
        {
            if (NR % 4 == 1) {
                if (\$0 ~ /\\/1\$/) {
                    sub(/\\/1\$/, " 1:N:0:0")
                } else if (\$0 ~ /\\/2\$/) {
                    sub(/\\/2\$/, " 2:N:0:0")
                }
            }
            print
        }
        ' | gzip > tmp.fastq.gz
        mv tmp.fastq.gz "\$name"
    done
fi

@bede bede added this to the 1.2.0 milestone Dec 14, 2024
@bede
Copy link
Owner

bede commented Dec 16, 2024

I'll add this in the next release, hopefully this week

@bede
Copy link
Owner

bede commented Dec 19, 2024

Released in 2.0.0

@bede bede closed this as completed Dec 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants