-
Notifications
You must be signed in to change notification settings - Fork 23
calc_N50
[calc_N50] can be used to calculate the N50 of sequences in the stream. N50 is typically used to decribe sequence assemblies.
N50 is defined as the contig length such that using equal or longer contigs produces half the bases of the genome. The N50 size is computed by sorting all contigs from largest to smallest and by determining the minimum set of contigs whose sizes total 50% of the entire genome.
Read more here:
http://en.wikipedia.org/wiki/N50_statistic
... | calc_N50 [options]
[-? | --help] # Print full usage description.
[-x | --no_stream] # Do not emit records.
[-o <file> | --data_out=<file>] # Write result to file.
[-I <file!> | --stream_in=<file!>] # Read input from stream file - Default=STDIN
[-O <file> | --stream_out=<file>] # Write output to stream file - Default=STDOUT
[-v | --verbose] # Verbose output.
Consider the following FASTA entries in the file test.fna
:
>test1
ATGCACATTG
>test2
ATGCACATTGATGCACATTG
>test3
ATGCACATTGATGCACATTGATGCACATTG
>test4
ATGCACATTGATGCACATTGATGCACATTGATGCACATTG
>test5
ATGCACATTGATGCACATTGATGCACATTGATGCACATTGATGCACATTG
To find the N50 read in the sequences with read_fasta:
read_fasta -i test.fna | calc_N50
SEQ_NAME: test1
SEQ: ATGCACATTG
SEQ_LEN: 10
---
SEQ_NAME: test2
SEQ: ATGCACATTGATGCACATTG
SEQ_LEN: 20
---
SEQ_NAME: test3
SEQ: ATGCACATTGATGCACATTGATGCACATTG
SEQ_LEN: 30
---
SEQ_NAME: test4
SEQ: ATGCACATTGATGCACATTGATGCACATTGATGCACATTG
SEQ_LEN: 40
---
SEQ_NAME: test5
SEQ: ATGCACATTGATGCACATTGATGCACATTGATGCACATTGATGCACATTG
SEQ_LEN: 50
---
N50: 40
---
To obtain only the N50 use the -x
switch:
read_fasta -i test.fna | calc_N50 -x
N50: 40
---
And to output the N50 to a file use the -o
switch:
read_fasta -i test.fna | calc_N50 -o N50.txt -x
Martin Asser Hansen - Copyright (C) - All rights reserved.
January 2011
GNU General Public License version 2
http://www.gnu.org/copyleft/gpl.html
[calc_N50] is part of the Biopieces framework.