GitHub - bcthomas/pullseq: Utility program for extracting sequences from a fasta/fastq file

bcthomas / pullseq Public

Notifications You must be signed in to change notification settings
Fork 15
Star 36

Utility program for extracting sequences from a fasta/fastq file

36 stars 15 forks Branches Tags Activity

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 60 Commits
cmake		cmake
src		src
test		test
.gitignore		.gitignore
AUTHORS		AUTHORS
CMakeLists.txt		CMakeLists.txt
COPYING		COPYING
ChangeLog		ChangeLog
INSTALL		INSTALL
NEWS		NEWS
README		README

Repository files navigation

Summary:

Software to extract sequence from a fasta or fastq. Also filter
sequences by a minimum length or maximum length. Fast, written in C,
using kseq.h library.


Pullseq Summary:

  pullseq - extract sequences from a fasta/fastq file.  This program is
  fast, and can be useful in a variety of situations.  You can use it to
  extract sequences from one fasta/fastq file into a new file, given
  either a list of header ids to include or a regular expression
  pattern to match.  Results can be included (default) or excluded,
  and they can additionally be filtered with minimum / maximum sequence
  lengths.

  Additionally, it can convert from fastq to fasta or visa-versa and
  can change the length of the output sequence lines.

  NOTE: pullseq prints to standard out, so you need to use redirection
  (e.g. pullseq input.fasta -m 10 *>* output.fasta ) to create output files.

Synopsis:

 pullseq -i <input fasta/fastq file> -n <header names to select>

 pullseq -i <input fasta/fastq file> -m <minimum sequence length>

 pullseq -i <input fasta/fastq file> -g <regex name to match>

 pullseq -i <input fasta/fastq file> -m <minimum sequence length> -a <max sequence length>

 pullseq -i <input fasta/fastq file> -t

 cat <names to select from STDIN> | pullseq -i <input fasta/fastq file> -N

  Options:
    -i, --input,       Input fasta/fastq file (required)
    -n, --names,       File of header id names to search for
    -N, --names_stdin, Use STDIN for header id names
    -g, --regex,       Regular expression to match (PERL compatible; always case-insensitive)
    -m, --min,         Minimum sequence length
    -a, --max,         Maximum sequence length
    -l, --length,      Sequence characters per line (default 50)
    -c, --convert,     Convert input to fastq/fasta (e.g. if input is fastq, output will be fasta)
    -q, --quality,     ASCII code to use for fasta->fastq quality conversions
    -e, --excluded,    Exclude the header id names in the list (-n)
    -t, --count,       Just count the possible output, but don't write it
    -h, --help,        Display this help and exit
    -v, --verbose,     Print extra details during the run
    --version,         Output version information and exit

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Seqdiff Summary:
  seqdiff - compare two fasta (or fastq) files to determine overlap of
  sequences.  This overlap can be at the sequence level (are two
  sequences exactly the same in both files?) or at the header name
  level (do two sequences contain the same header name between the two
  files?).

Synopsis:
  seqdiff -1 first_file.fa -2 second_file.fa

Usage:
  seqdiff -1 <first input fasta/fastq file> -2 <second fasta/fastq file>

  Options:
    -1, --first,      First sequence file (required)
    -2, --second,     Second sequence file (required)
    -a, --a_output,   File name for uniques from first file
    -b, --b_output,   File name for uniques from second file
    -c, --c_output,   File name for common entries
    -d, --headers,    Compare headers instead of sequences (default: false)
    -s, --summary,    Just show summary stats? (default: false)
    -h, --help,       Display this help and exit
    -v, --verbose,    Print extra details during the run
    --version,        Output version information and exit

REQUIREMENTS:
  Pullseq/Seqdiff require a C compiler and has been tested to work with
  either GCC or clang. They also require (and include) kseq.h (Heng
  Li) and uthash.h (http://troydhanson.github.com/uthash/).

  kseq.h also requires Zlib (so your linker should be able to handle
  the '-lz' option).  You can obtain zlib from http://www.zlib.net/
  or commonly from your OS package manager (e.g. apt-get zlib or
  emerge zlib).

NEW INSTALL:
  Pullseq uses CMake, so you must have CMake installed on your system.

  git clone: https://github.com/bcthomas/pullseq.git
  cd pullseq
  mkdir build
  cd build
  cmake ..

  This will build binaries in build/src/
  > build/src/pullseq
  > build/src/seqdiff



OLD INSTALL:
  To install, do the following in a shell on your system...

  From Git:
  git clone https://github.com/bcthomas/pullseq.git # checkout the code using git
  cd pullseq
  ./bootstrap  # get set up for config/build after cloning
  ./configure  # configure the application based on your system
  make         # will build the application
  make install # will install in /usr/local by default

  From a Release file (tar or zip):
  tar xvf pullseq_version.tar.gz
  cd pullseq_version
  ./autoconf   # make sure configuration is set
  ./configure  # configure the application based on your system
  make         # will build the application
  make install # will install in /usr/local by default

  NOTE: If you have PCRE (perl-compatible regular expression library)
  installed in a non-standard location (e.g. on a mac using brew), the
  ./configure script will fail. You'll need to update your CFLAGS and
  LDFLAGS env settings to define where your PCRE library files were
  installed.

  For example, on a mac with pcre installed by brew, you can do this:

  pcre-config --cflags
  -I/usr/local/Cellar/pcre/8.39/include
  
  Then you can just add this to a env CFLAGS variable and run the
  configure command, like so...

  export CFLAGS="-I/usr/local/Cellar/pcre/8.39/include"
  ./configure

  If your pcre library is installed somewhere else, you just update
  the CFLAGS env variable accordingly.