This page includes bioinformatics pipelines, software, and training material developed by the Sydney Informatics Hub, which is a Core Research Facility of the University of Sydney. The Sydney Informatics Hub is an official node of Australian BioCommons, and has worked in partnership with National Computational Infrastructure, Pawsey Supercomputing Research Centre, and QCIF to create command-line resources that make bioinformatics more accessible for life scientists.
Many of the resources available here focus on making large-scale data processing more accessible. To achieve this, we have developed optimised pipelines for national HPC infrastructures, along with resources for workflow development.
- Scalable data processing pipelines
- Reproducible code notebooks
- Supporting Nextflow
- Software and helper scripts
- Bio-toolkit resources
- Cite us to support us
Our pipelines have been optimised for compute platforms including the University of Sydney's HPC Artemis, the National Computational Infrastructure (NCI), Pawsey Supercomputing Research Centre's HPC Setonix and Nimbus cloud, the University of Queensland's (UQ's) HPC Flashlite, and AWS Cloud. You can find DOIs for all our pipelines at the Sydney Informatics Hub's WorkflowHub.
We also support the use of nf-core workflows. Check out the institutional configs we've built for Australian HPC and cloud infrastructures.
Category | Pipeline | Infrastructure | Description | Software |
---|---|---|---|---|
Quality control | fastqc-nf | Nextflow - NCI Gadi | QC of raw Illumina sequence reads | fastQC, multiqc |
Quality control | BamQC-nf | Nextflow - NCI Gadi, Pawsey Setonix, Pawsey Nimbus | Short read alignment file QC stats | samtools, mosdepth, qualimap, multiqc |
Genomics | Parabricks-Genomics | Nextflow - NCI Gadi | GPU-enabled, rapid whole genome sequence alignment and short variant calling against a reference genome | Parabricks, BWA-MEM, DeepVariant, GLnexus, VEP |
Genomics | Fastq-to-VCF | Optimised - NCI Gadi | High throughput whole genome sequence analysis and joint genotyping using cutting edge tools | FastQC, MultiQC, fastp, BWA-MEM2, SAMbamba, SAMblaster, SAMtools, DeepVariant, GLnexus, VEP |
Genomics | Fastq-to-BAM | Optimised - NCI Gadi | Whole genome sequence alignment to a reference genome following pre-processing recommendations by the BROAD Institute | bwa-kit, fastp, BWA-MEM, SAMbamba, SAMblaster, SAMtools, GATK4 |
Genomics | Germline-ShortV | Optimised - NCI Gadi | Germline short variant calling (joint calling) following the Germline short variant discovery (SNPs + Indels) Best Practices Workflow by the BROAD Institute | GATK4 |
Genomics | Bootstrapping-for-BQSR | Optimised - NCI Gadi | Bootstrapping a variant resource to enable GATK base quality score recalibration (BQSR) for non-model organisms that lack a publicly available variant resource. | GATK4 |
Genomics | Somatic-ShortV | Optimised - NCI Gadi | Somatic short variant calling (joint calling) following the Somatic short variant discovery (SNPs + Indels) Best Practices Workflow by the BROAD Institute for tumour-normal pairs | GATK4 |
Genomics | Somatic-ShortV-nf | Nextflow - NCI Gadi, Pawsey Setonix, Pawsey Nimbus | Currently under development | GATK4 |
Genomics | GermlineStructuralV-nf | Nextflow - NCI Gadi, Pawsey Setonix, Pawsey Nimbus | Germline structural variant calling with short read bam files | manta, smoove, tiddit, survivor, annotSV |
RNAseq | RNASeq-DE | Optimised - NCI Gadi | Process RNA sequencing data for differential expression, including fastQC, trimming, mapping with STAR and obtaining a raw count matrix | fastQC, multiQC, bbduk, STAR, RSeQC, HTSeq |
Metagenomics | Shotgun-Metagenomics-Analysis | Optimised - NCI Gadi | Analysis of metagenomic shotgun sequences including assembly, speciation, abundance, ARG discovery, functional profiling, gene prediction, insertion sequence annotation and estimation of the resistome. | abricate, bbtools, bracken, bwa, diamond, fastqc, gatk, humann2, kraken2, kronatools, megahit, metaphlan2, multiqc, nci-parallel, openmpi, prodigal, prokka, python3, sambamba, samtools, seqtk |
Transcriptomics | Gadi-Trinity | Optimised - NCI Gadi | Perform de novo transcriptome assembly with Trinity | Trinity |
Data preparation | IndexReferenceFasta-nf | Nextflow - NCI Gadi, Pawsey Setonix, Pawsey Nimbus | Create fasta file indexes | samtools, bwa, gatk |
Notebook | Description |
---|---|
RNAseq: differential expression | An R Markdown notebook to convert raw gene counts to functional enrichments |
Proteomics: differential abundance | Currently under development |
Metagenomics: taxonomic profiling | Currently under development |
We have created resources to support Nextflow workflow development and deployment on HPC infrastructures.
Tool | Description |
---|---|
Nextflow DSL2 template | A straightforward Nextflow workflow template generator. |
Nextflow ConfigBuilder | A simple custom config file generator. Under development. |
Institutional nf-core configs | Public config files for running nf-core pipelines at NCI and Pawsey infrastructures. |
We have created resources to support workflow development and deployment on HPCs, resource benchmarking, and flexible data visualisation.
Tool | Description |
---|---|
HPC usage reports | Pull resource usage data from HPC job logs into reports. |
NCI Gadi benchmarking template | Automated submission of identical benchmark tasks with increasing compute resources. |
IGVreport-nf | Generate IGV report for a set of variants. |
split-GeneWiz-fastq | Split GeneWiz 'combined' (concatenated) fastq files into correct flowcell-lane pairs. |
Fix-BAM-read-groups | Change the read group metadata within a BAM file. Operates on the header as well as the individual SAM output lines. |
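
To illustrate the kind of task the split-GeneWiz-fastq helper in the table above addresses, here is a minimal Python sketch that groups reads from a concatenated FASTQ by flowcell and lane. It is not the code used in that repository. It assumes standard Illumina (Casava 1.8+) read headers of the form `@instrument:run:flowcell:lane:tile:x:y`, and the function names `flowcell_lane` and `split_by_flowcell_lane` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def flowcell_lane(header: str) -> Tuple[str, str]:
    """Return (flowcell, lane) from an Illumina Casava 1.8+ read header.

    Assumed layout: @instrument:run:flowcell:lane:tile:x:y [comment]
    """
    # Drop the leading '@' and any comment after the first space.
    fields = header.lstrip("@").split(" ")[0].split(":")
    if len(fields) < 7:
        raise ValueError(f"Unexpected header format: {header!r}")
    return fields[2], fields[3]


def split_by_flowcell_lane(
    lines: Iterable[str],
) -> Dict[Tuple[str, str], List[str]]:
    """Group 4-line FASTQ records by the (flowcell, lane) in their header."""
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    record: List[str] = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            # A full record: header, sequence, separator, qualities.
            groups[flowcell_lane(record[0])].extend(record)
            record = []
    if record:
        raise ValueError("Truncated FASTQ: final record has fewer than 4 lines")
    return dict(groups)
```

Each value in the returned dictionary holds the FASTQ lines for one flowcell-lane pair and could be written to its own file. For paired-end data, the same grouping would need to be applied to the R1 and R2 files so that mates stay in matching order.
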
Acknowledgements (and co-authorship, where appropriate) are an important way for us to demonstrate the value we bring to your research. Your research outcomes are vital for ongoing funding of the Sydney Informatics Hub and national compute facilities. Please cite each pipeline repository that you have used. You can also find DOIs for all our pipelines at the Sydney Informatics Hub's WorkflowHub.
Suggested acknowledgements:
Sydney Informatics Hub
The authors acknowledge the technical assistance provided by the Sydney Informatics Hub, a Core Research Facility of the University of Sydney and the Australian BioCommons which is enabled by NCRIS via ARDC and Bioplatforms Australia.
NCI Gadi
The authors acknowledge the technical assistance provided by the Sydney Informatics Hub, a Core Research Facility of the University of Sydney and the Australian BioCommons which is enabled by NCRIS via Bioplatforms Australia. The authors acknowledge the use of the National Computational Infrastructure (NCI) supported by the Australian Government and the Sydney Informatics Hub HPC Allocation Scheme, supported by the Deputy Vice-Chancellor (Research), University of Sydney and the ARC LIEF, 2019: Smith, Muller, Thornber et al., Sustaining and strengthening merit-based access to National Computational Infrastructure (LE190100021).
USyd Artemis
The authors acknowledge the technical assistance provided by the Sydney Informatics Hub, a Core Research Facility of the University of Sydney and the Australian BioCommons which is enabled by NCRIS via Bioplatforms Australia. This research utilised the high performance computing service, Artemis, provided by the Sydney Informatics Hub, Core Research Facility, University of Sydney.