C Library for reading multiple bioinformatics sequence file formats
https://github.com/noporpoise/seq_file
Isaac Turner [email protected]
21 July 2014, license: Public Domain
The aim is to provide a C library that allows programs to transparently read
sequence data without having to worry about the file format.
Pass a file to seq_open(path)
and then read sequences using seq_read(...)
.
File format is automatically detected.
Currently supports:
- SAM & BAM (if compiled with htslib with
make HTSLIB=
) - FASTA (& gzipped fasta)
- FASTQ (& gzipped fastq)
- 'plain' format (one sequence per line [.txt]) (& gzipped plain [.txt.gz])
seq_open_fh(...)
allows you to read through pipes and the command line.
seq_open2(...)
gives more options about how you'd like to read your input.
seq_file requires htslib for reading SAM/BAM files. To compile run with htslib:
make HTSLIB=PATH/TO/htslib
or without (you won't be able to read sam/bam files)
make
Also included is a tool for reading, converting amd manipulating sequence files: dnacat
Generate a random 20bp sequence and a 5bp sequence, print in fasta format
./bin/dnacat -n 20 -n 5 -F
Output:
>rand0
CTGATAAATCTTTTACGTCC
>rand1
GTAGC
Read three files, print as FASTQ:
./bin/dnacat -Q input.sam input.fq input.fa
Get some stats on an input file:
./bin/dnacat -s input.sam
Output:
File: test/reads.fq
Format: FASTQ (read with zlib)
Format QScores: Sanger (Phred+33), offset: 33, min: 33, max: 73, scores: [0,40]
QScore range in first 500bp: [35,65]
Total seq (bp): 1,275
Number of reads: 25
Shortest read (bp): 51
Longest read (bp): 51
Mean length (bp): 51
Interleave two files, print in FASTQ:
./bin/dnacat -Q -i reads.1.fq.gz reads.2.fq.gz > reads.interleaved.fq.gz
Read a file, print in FASTA, wrap lines to be no longer than 80bp:
./bin/dnacat -F -w 80 input.fq > input.wrapped.fa
Other useful one liners:
- Reverse complement a sequence:
./bin/dnacat -r - <<< AACGA
Output:TCGTT
- Reverse a sequence:
./bin/dnacat -R - <<< AACGA
Output:AGCAA
- Complement a sequence:
./bin/dnacat -C - <<< AACGA
Output:TTGCT
- Convert to uppercase:
./bin/dnacat -u in.fa
- Convert to lowercase:
./bin/dnacat -l in.fa
- Convert lowercase + non-ACGT bases to 'N':
./bin/dnacat -m in.fa
Note: in bash cmd <<< AACGA
is the same as 'echo AACGA | cmd'
Example code to read a file and print as a FASTA file using seq_file.
#include "seq_file.h"
int main(int argc, char **argv)
{
if(argc != 2) exit(EXIT_FAILURE);
seq_file_t *file = seq_open(argv[1]);
if(file == NULL)
exit(EXIT_FAILURE);
read_t *read = seq_read_new();
while(seq_read(file, read) > 0)
{
printf("@%s\n", read->name.b);
printf("%s\n", read->seq.b);
printf("+\n);
printf("%s\n", read->qual.b);
}
seq_close(file);
seq_read_free(read);
}
If the code above is pasted into a file test.c and the files seq_file.h
and
stream_buffer.h
are copied into the same directory, the program should compile
with the following command:
gcc -Wall -o test test.c -lz
Or to use htslib:
HTSLIB=PATH/TO/HTSLIB
gcc -Wall -o test test.c -D_USESAM=1 -I$(HTSLIB)/htslib -L$(HTSLIB)/htslib -lhts -lpthread -lz
You'll need to compile htslib first.
read_t* seq_read_new()
Returns a pointer to a new read_t
struct
void seq_read_free(read_t* r)
Free a read_t
struct
read_t* seq_read_alloc(read_t *r)
Allocate an object on the stack
void seq_read_dealloc(read_t* r)
Free an object on the stack
seq_file_t* seq_open(const char *path)
Open sequence file pointed to by path.
seq_file_t* seq_open2(const char *path, char sam_bam, char use_gzip, size_t buffer_size)
Parameters:
sam_bam
: if 1 opens as sam, if 2 opens as bam, 0 treats as otheruse_zlib
: if 0 opens with FILE, otherwise uses gzFilebuffer_size
: size of buffer to read into.
If buffer_size == 0
no buffer is used, with the exception of use_zlib=1
with newer versions of zlib where all input is buffered by the zlib library
seq_file_t* seq_open_fh(FILE *fh, char buffered)
Use a file handle that has already been opened. Pass stdin
to read from cmdline,
pipes etc. If buffered is passed, attempts to read with a gzip stream and a buffer.
Note: seq_open_fh can only read sam/bam files from stdin at the moment.
seq_close()
will close fh
so you shouldn't call fclose(fh)
. Returns NULL
on error, in which case fh
will not have been closed.
void seq_close(seq_file_t *sf)
Close a seq_file_t
int seq_read(seq_file_t *sf, read_t *r)
Read a read from the file into r
.
Returns 1 on success, 0 on eof, -1 if partially read / syntax error
void seq_read_reverse_complement(read_t *r)
Reverse complement a read. If the read has quality scores, they are also reversed.
These functions return 1 (if true), otherwise 0
seq_is_bam(seq_file_t *sf)
seq_is_sam(seq_file_t *sf)
seq_use_gzip(seq_file_t *sf) // is this seq_file reading through zlib?
The following require a read to have been read successfully using seq_read:
seq_is_fasta(seq_file_t *sf)
seq_is_fastq(seq_file_t *sf)
seq_is_plain(seq_file_t *sf)
seqtype_t seq_guess_filetype_from_extension(const char *path)
Returns one of: {IS_SAM, IS_BAM, IS_FASTQ, IS_FASTA, IS_PLAIN, IS_UNKNOWN}
int seq_guess_fastq_format(const char *path, int max_read_bases,
int *min_qual, int *max_qual)
Returns and number between 0 and 5 inclusive on success. These refer to:
- Sanger / Illumina 1.9+ (Phred+33) -- this is the most common
- Sanger (Phred+33)
- Solexa (Solexa+64)
- Illumina 1.5+ (Phred+64)
- Illumina 1.3+ (Phred+64)
- Illumina 1.8+ (Phred+33)
Returns -1 if it cannot read the file or the file has no quality scores (e.g it's in FASTA format).
Example usage:
int fmt = seq_guess_fastq_format("in.fq");
printf("in.fq format: %s, offset: %i, min: %i, max: %i, range: [%i,%i]\n",
FASTQ_FORMATS[fmt], FASTQ_MIN[fmt], FASTQ_MAX[fmt], FASTQ_OFFSET[fmt],
FASTQ_MIN[fmt]-FASTQ_OFFSET[fmt], FASTQ_MAX[fmt]-FASTQ_OFFSET[fmt]);
Outputs:
in.fq format: Sanger / Illumina 1.9+ (Phred+33), offset: 33, min: 126, max: 33, range: [0,93]
in.fq format: Sanger (Phred+33), offset: 33, min: 73, max: 33, range: [0,40]
in.fq format: Solexa (Solexa+64), offset: 59, min: 104, max: 64, range: [-5,40]
in.fq format: Illumina 1.3+ (Phred+64), offset: 64, min: 104, max: 64, range: [0,40]
in.fq format: Illumina 1.5+ (Phred+64), offset: 67, min: 104, max: 64, range: [3,40]
in.fq format: Illumina 1.8+ (Phred+33), offset: 33, min: 74, max: 33, range: [0,41]
To get highest and lowest quality characters in the first num
bases:
int seq_get_qual_limits(const char *path, int num, int *minptr, int *maxptr)
minptr
and maxptr
are set to the min and max values of the first num
base
quality scores. Returns 0 if no qual scores in the first num
bases,
1 on success, -1 if read error.
int seq_print_fasta(const read_t *r, FILE *fh, int linewrap)
int seq_print_fastq(const read_t *r, FILE *fh, int linewrap)
int seq_gzprint_fasta(const read_t *r, gzFile gz, int linewrap)
int seq_gzprint_fastq(const read_t *r, gzFile gz, int linewrap)
Write a read in FASTA or FASTQ format. If using FASTQ and the quality score and sequence have different lengths, the quality score is shortened or padded with '.' to make it the same length as the sequence. Return -1 on error, 0 on success.
static inline size_t seq_read_truncate_name(read_t *r)
Formally, FASTA/Q entry names stop at the first space character. This function truncates the read name and returns new length
static inline void seq_read_to_uppercase(read_t *r)
static inline void seq_read_to_lowercase(read_t *r)
Convert read sequence to upper or lower case. Does not affect read name or quality scores.
seq_read_looks_valid_dna(read_t *r)
seq_read_looks_valid_rna(read_t *r)
seq_read_looks_valid_protein(read_t *r)
Check that a read looks valid. Returns 1 if it appears valid, 0 otherwise. A read is invalid if it has:
- an invalid sequence character
- quality scores but of a different length than the sequence
- a quality score that is
<33
or>105
Valid sequence characters are (upper and lower case are valid):
- DNA: ACGTN
- RNA: ACGUN
- protein: ACDEFGHIKLMNOPQRSTUVWY
Public Domain. Use as you wish. No warranty. There may be bugs.