Skip to content

humberto-ortiz/algomolbiol

Repository files navigation

Grassembler - graph-based assembly of short read sequences.

Copyright 2012 - Humberto Ortiz-Zuazaga, Cassandra Schaening, Jose
Carlos Bonilla, Alejandro Vientos del Valle, Richard Garcia Lebron.

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program.  If not, see <http://www.gnu.org/licenses/>.

Introduction 

As a class project, build an eulerian assembler.

Input will be 5000 reads of length 100 bases from a random 50 Kb chunk
of NC_010278, Actinobacillus pleuropneumoniae serovar 3 str. JL03
chromosome, complete genome. The chromosome is 2242062 bases long, and circular.

Let's generate a random starting position:

>>> import ramdom
>>> random.randint(0, 2242062-50000)
896878

Now we can slice out the piece starting at 896878 (zero based).

$ python slice.py
$ head slice.fasta
>NC_010278.1 50000 bases starting at position 896878
GAGCTTACGCCGGAATTATTGGTCGATCTCCAATCCGGTAACTACGATCGTGGGATTGTA
ACCTTACCTTATGTACGTCAATCGGATAACCAAACCGTGTATATTCCGCAAAGCATTATC
GGCAACTTATTTGTTTCAAACGGTATGTCGGCTGGTAATACGAAAAACGAAGCGCGTGTA
CAAGGTTTGTCGGAAGTGTTCGAACGTTTTGTGAAAAACCGTATTATTACTGAAGCAATC
AGCTTACCGGAAATTCCGCAAAGCGTGATTGACGGCTATCCGACAATCAAAGCGTCTATC
GAAAAACTTGAGCAAGAAGGCTTCCCGATTTTATGCTATGACGCATCATTAGGCGGTGAA
TTCCCGGTTATTTGTGTGATCTTGTTAAATCCGAATAATGGTACTTGTTTCGCTTCATTC
GGTGCACACCCGAATTTCCAAGTGGCGTTTGAACGTACGGTAACCGAGTTATTACAAGGT
CGTAGCTTAAAAGATTTAGATGTATTTGCTCCGCCTTCATTTAATAATGAAGATGTGGCG

Just to check:

$ infoseq slice.fasta 
Display basic information about sequences
USA                      Database  Name           Accession      Type Length %GC    Organism            Description 
fasta::slice.fasta:NC_010278.1 -              NC_010278.1    -              N    50000  40.55                      50000 bases starting at position 896878

Now we basically want to pick 5000 random starting positions uniformly
from 0 to 50000 and 5000 read lengths from a normal distribution with
mean 100 and stddev 5. Slice those out and write to a file.

$ simulate-reads.py
$ head reads.fasta 
>NC_010278.1 50000 bases starting at position 896878 fragment 42221:42321
ATCACGCTGATAAACGAACGGTTCATTGCTTTACACATGATGTAAGACAGAATCGCACCT
GATGAACCTACTAACGCACCGGTTACAATTAACAAGTCAT
>NC_010278.1 50000 bases starting at position 896878 fragment 12945:13039
TGGCTCGGACGTATTTACGGCGAAAGAATTCGCCAGTTTCCGCTAATTCGTAATATTGTG
ACCGAAGAACGTTACCAAATGGTGCAGGAACGTT
>NC_010278.1 50000 bases starting at position 896878 fragment 25563:25655
TTACATCGCCGTATGCCAACTGTGCCAAGCTCACATCCAAATTACCTTGAATCGTTTTCT
CGGTATTTACTTTTCCTTTTGCGACTAAGTTT
>NC_010278.1 50000 bases starting at position 896878 fragment 15165:15269

Next step: read reads, build a graph for input into Eulerian path.

2012/04/03 - HOZ

Richard contributed a Graph class with some skeleton functions and
some test cases. He implemented manual graph constructors and a
breadth first search.

The test harness can be run from the command line:

$ python Graph.py 
[['B', 'B', 'B', 'W', 'W', 'W'], [0, 1, 1, (), (), ()], ['NIL', 0, 0, 'NIL', 'NIL', 'NIL']]
[['W', 'B', 'W', 'W', 'W', 'W'], [(), 0, (), (), (), ()], ['NIL', 'NIL', 'NIL', 'NIL', 'NIL', 'NIL']]
[['W', 'W', 'B', 'W', 'W', 'W'], [(), (), 0, (), (), ()], ['NIL', 'NIL', 'NIL', 'NIL', 'NIL', 'NIL']]
[['B', 'B', 'B', 'B', 'W', 'W'], [1, 2, 1, 0, (), ()], [3, 0, 3, 'NIL', 'NIL', 'NIL']]
[['B', 'B', 'B', 'B', 'B', 'W'], [1, 2, 2, 1, 0, ()], [4, 0, 3, 4, 'NIL', 'NIL']]
[['B', 'B', 'B', 'B', 'B', 'B'], [2, 3, 2, 1, 1, 0], [3, 0, 3, 5, 5, 'NIL']]
two  predecessor  zero
two  predecessor  one
three  predecessor  two
three  predecessor  one
fourth  predecessor  three
fourth  predecessor  two
five  predecessor  three
five  predecessor  fourth

Releases

No releases published

Packages

No packages published

Contributors 4

  •  
  •  
  •  
  •  

Languages