Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support aggregate reporting for demultiplexed FASTQ files #124

Open
mtomko opened this issue Oct 17, 2023 · 2 comments
Open

Support aggregate reporting for demultiplexed FASTQ files #124

mtomko opened this issue Oct 17, 2023 · 2 comments

Comments

@mtomko
Copy link

mtomko commented Oct 17, 2023

Our group has long generated FastQC reports for a single lane of sequencing at a time. Our sequencing provider is now only providing demultiplexed FASTQs, which means that we need to look at hundreds of FastQC reports instead of just 2. We would be interested in an option to FastQC that generated one aggregate report for all of the demultiplexed FASTQ files, summarizing the overall quality of all of them. This would be akin to the report generated by simply concatenating all the FASTQ files and running FastQC on that.

I would consider implementing this myself if it would be welcome.

@mtomko
Copy link
Author

mtomko commented Oct 17, 2023

Ah, my coworker has pointed out that it's possible to do this by reading from standard in:

If you want to run fastqc on a stream of data to be read from standard input then you
can do this by specifing 'stdin' as the name of the file to be processed and then
streaming uncompressed fastq format data to the program. For example:

zcat *fastq.gz | fastqc stdin

If you want the results from a streamed analysis sent to a file with a name other than
stdin then you can add a colon and put the file name you want, for example:

zcat *fastq.gz | fastqc stdin:my_results

..would write results to my_result.html and my_results.zip.

@s-andrews
Copy link
Owner

You've found one option for this which will combine the full set of results. To be honest, if you're just looking at data quality then it's pretty unlikely that you'll see a difference in quality between the different split subsets of reads so any of the reports is likely to be representative.

The other option to consider is MultiQC (https://multiqc.info/) which you can run in a directory where you have multiple FastQC (and other programs) reports and it will aggregate them into a single combined report. We use this on the end of our sequencing pipelines and it works great for this purpose.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants