GitHub - spriggsy83/spriggsy-seq-tools: A collection of handy tools for use in bioinformatic sequence analysis

This is a collection of small C++ programs and Perl scripts that have been handy in bioinformatic sequence analysis. Most parse/manipulate fasta/fastq formatted DNA/RNA sequence files. Some read SAM format sequence alignment results. A couple deal specifically with results from the program 'biokanga align' (https://github.com/csiro-crop-informatics/biokanga).

C++ program prerequisites: Boost C++ Libraries (www.boost.org) and OpenMPI (www.open-mpi.org/)
Ubuntu preparation:

sudo apt-get install openmpi-bin openmpi-doc libopenmpi-dev
sudo apt-get install libboost-all-dev libboost-dev libboost-doc libboost-container-dev
sudo apt-get install zlib1g-dev zlib1g

Compile C++ programs on Unix systems with:
bash buildAllC.sh

Program	Use
getSubSeqs	Extract sub-sequences, given a list of coordinate ranges
extractSeqSubsets	Extract a subset of sequences, given rules (skip X, print every X, max X, etc.)
excludeSeqsBySAM	Extract a subset of sequences, excluding those with no alignment in a SAM file
filterSeqSize	Extract a subset of sequences, retaining those within a range of lengths
getBiokangaAlignStats.pl	Produces alignment statistics from log-file output of 'biokanga align'
getSeqCGstats	Profiles sequences for GC% and base counts
getSeqCountTable	Produces a table counting occurrences of individual sequences per input
getSeqQCStats	Profiles sequences for count, total/average/median bp length, av. GC%, fq phred scores...
getSeqSizeChart	Produces a table counting sequences of different lengths per inputs
getSeqSizeList	Print list of sequence bp lengths to stdout
getSeqSizeStats	Profiles sequences for total/average/median/min/max bp lengths
getSeqSizeStatsT	Transposed table alternate format of getSeqSizeStats
reverseComplement	Produces the reverse complements of sequences
splitInputs-snpTally-gz.pl	Companion to tallySNPs2, see README-tallySNPs.md
splitSeqsIntoXFiles	Will divide a sequence file up into multiple smaller sequence files
tallyGeneCoverageSamGZ	Produces a count of aligned reads per gene per sample
tallySNPs2	Counts aligned reads from different alleles at SNP positions, see README-tallySNPs.md
mergeKmerCounts	Merge Kmer count results from multiple samples into a multi-column table
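
To make two of the operations in the table concrete, here is a minimal sketch of reverse-complementing a DNA sequence and computing its GC%. It is an illustration only and is not taken from the repository's reverseComplement or getSeqCGstats sources. The function names (complementBase, reverseComplementOf, gcPercent) are hypothetical, and the handling of ambiguous bases (mapped to 'N', and counted in the GC% denominator) is an assumption that the real tools may not share.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper (not from the repository): complement of one DNA base.
// Assumption: anything other than A/C/G/T (either case) becomes 'N'.
char complementBase(char base) {
    switch (base) {
        case 'A': return 'T';
        case 'T': return 'A';
        case 'C': return 'G';
        case 'G': return 'C';
        case 'a': return 't';
        case 't': return 'a';
        case 'c': return 'g';
        case 'g': return 'c';
        default:  return 'N';
    }
}

// Reverse the sequence, then complement every base in place.
std::string reverseComplementOf(const std::string& seq) {
    std::string out(seq.rbegin(), seq.rend());
    std::transform(out.begin(), out.end(), out.begin(), complementBase);
    return out;
}

// GC content as a percentage of the full sequence length.
// Assumption: ambiguous bases count towards the length but not towards GC.
double gcPercent(const std::string& seq) {
    if (seq.empty()) {
        return 0.0;  // avoid dividing by zero for empty records
    }
    std::size_t gc = 0;
    for (char c : seq) {
        if (c == 'G' || c == 'g' || c == 'C' || c == 'c') {
            ++gc;
        }
    }
    return 100.0 * static_cast<double>(gc) / static_cast<double>(seq.size());
}
```

Under these assumptions, reverseComplementOf("ATGC") gives "GCAT" and gcPercent("ATGC") gives 50.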

SeqReader.cpp/.h is a useful library for building upon. It reads fasta- or fastq-formatted sequence files, including .gz-compressed inputs, allows easy record-by-record parsing through its nextSeq() function, and has various sequence manipulations built in.
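
The record-by-record parsing pattern can be sketched roughly as follows. This is not the SeqReader API: the class MiniFastaReader, the struct FastaRecord and the signature of its nextSeq() member are hypothetical stand-ins written for illustration, and the sketch covers only uncompressed fasta read from a std::istream. Consult SeqReader.h for the real interface, which also covers fastq and .gz inputs.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical record type for this sketch; not taken from SeqReader.h.
struct FastaRecord {
    std::string id;   // header line without the leading '>'
    std::string seq;  // sequence with line breaks removed
};

// Hypothetical minimal reader illustrating a nextSeq()-style loop.
class MiniFastaReader {
public:
    explicit MiniFastaReader(std::istream& in) : in_(in) {}

    // Fills rec with the next record; returns false when none remain.
    bool nextSeq(FastaRecord& rec) {
        std::string line;
        // Skip ahead to a header line unless one is already buffered.
        while (pendingHeader_.empty() && std::getline(in_, line)) {
            stripCarriageReturn(line);
            if (!line.empty() && line[0] == '>') {
                pendingHeader_ = line;
            }
        }
        if (pendingHeader_.empty()) {
            return false;  // end of input, no further records
        }
        rec.id = pendingHeader_.substr(1);
        rec.seq.clear();
        pendingHeader_.clear();
        // Append sequence lines until the next header or end of input.
        while (std::getline(in_, line)) {
            stripCarriageReturn(line);
            if (!line.empty() && line[0] == '>') {
                pendingHeader_ = line;  // keep it for the next call
                break;
            }
            rec.seq += line;
        }
        return true;
    }

private:
    // Tolerate Windows line endings.
    static void stripCarriageReturn(std::string& s) {
        if (!s.empty() && s.back() == '\r') {
            s.pop_back();
        }
    }

    std::istream& in_;
    std::string pendingHeader_;
};
```

A caller would construct the reader over an open stream (for example a std::ifstream or std::istringstream) and loop with while (reader.nextSeq(rec)), processing rec.id and rec.seq on each pass. Buffering the next header inside the reader keeps the calling loop free of look-ahead logic.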

Code by Andrew Spriggs, CSIRO Ag&Food (www.csiro.au)
[email protected]
github.com/spriggsy83
