Algorithms for Molecular Biology
humberto-ortiz/algomolbiol
Grassembler - graph-based assembly of short read sequences.

Copyright 2012 - Humberto Ortiz-Zuazaga, Cassandra Schaening, Jose Carlos Bonilla, Alejandro Vientos del Valle, Richard Garcia Lebron.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.

Introduction

As a class project, build an Eulerian assembler. Input will be 5000 reads of about 100 bases each from a random 50 Kb chunk of NC_010278, Actinobacillus pleuropneumoniae serovar 3 str. JL03 chromosome, complete genome.

The chromosome is 2242062 bases long, and circular. Let's generate a random starting position:

    >>> import random
    >>> random.randint(0, 2242062-50000)
    896878

Now we can slice out the piece starting at 896878 (zero based).
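The source of slice.py is not shown in this README, so the following is only a rough sketch of what such a slicing step could look like. The function names (`pick_start`, `slice_chunk`, `format_fasta`) and the 60-column line width are assumptions made for illustration, and the genome is taken to be already loaded as a plain Python string.

```python
import random

GENOME_LENGTH = 2242062  # length of NC_010278 quoted above
CHUNK_LENGTH = 50000     # size of the piece to slice out


def pick_start(genome_length=GENOME_LENGTH, chunk_length=CHUNK_LENGTH, rng=random):
    """Pick a zero-based start so that the whole chunk fits before the end."""
    # randint is inclusive at both ends, so the last legal start is
    # genome_length - chunk_length.
    return rng.randint(0, genome_length - chunk_length)


def slice_chunk(sequence, start, length=CHUNK_LENGTH):
    """Return `length` bases of `sequence` beginning at zero-based `start`."""
    return sequence[start:start + length]


def format_fasta(header, sequence, width=60):
    """Format one FASTA record with sequence lines of at most `width` bases."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"
```

Because the start is drawn from 0 to 2242062-50000, the chunk never runs past the end of the sequence, so this sketch does not need to handle wrap-around on the circular chromosome.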
    $ python slice.py
    $ head slice.fasta
    >NC_010278.1 50000 bases starting at position 896878
    GAGCTTACGCCGGAATTATTGGTCGATCTCCAATCCGGTAACTACGATCGTGGGATTGTA
    ACCTTACCTTATGTACGTCAATCGGATAACCAAACCGTGTATATTCCGCAAAGCATTATC
    GGCAACTTATTTGTTTCAAACGGTATGTCGGCTGGTAATACGAAAAACGAAGCGCGTGTA
    CAAGGTTTGTCGGAAGTGTTCGAACGTTTTGTGAAAAACCGTATTATTACTGAAGCAATC
    AGCTTACCGGAAATTCCGCAAAGCGTGATTGACGGCTATCCGACAATCAAAGCGTCTATC
    GAAAAACTTGAGCAAGAAGGCTTCCCGATTTTATGCTATGACGCATCATTAGGCGGTGAA
    TTCCCGGTTATTTGTGTGATCTTGTTAAATCCGAATAATGGTACTTGTTTCGCTTCATTC
    GGTGCACACCCGAATTTCCAAGTGGCGTTTGAACGTACGGTAACCGAGTTATTACAAGGT
    CGTAGCTTAAAAGATTTAGATGTATTTGCTCCGCCTTCATTTAATAATGAAGATGTGGCG

Just to check:

    $ infoseq slice.fasta
    Display basic information about sequences
    USA                             Database  Name         Accession  Type  Length  %GC    Organism  Description
    fasta::slice.fasta:NC_010278.1  -         NC_010278.1  -          N     50000   40.55            50000 bases starting at position 896878

Now we want to pick 5000 random starting positions uniformly from 0 to 50000 and 5000 read lengths from a normal distribution with mean 100 and stddev 5. Slice those out and write them to a file.

    $ simulate-reads.py
    $ head reads.fasta
    >NC_010278.1 50000 bases starting at position 896878 fragment 42221:42321
    ATCACGCTGATAAACGAACGGTTCATTGCTTTACACATGATGTAAGACAGAATCGCACCT
    GATGAACCTACTAACGCACCGGTTACAATTAACAAGTCAT
    >NC_010278.1 50000 bases starting at position 896878 fragment 12945:13039
    TGGCTCGGACGTATTTACGGCGAAAGAATTCGCCAGTTTCCGCTAATTCGTAATATTGTG
    ACCGAAGAACGTTACCAAATGGTGCAGGAACGTT
    >NC_010278.1 50000 bases starting at position 896878 fragment 25563:25655
    TTACATCGCCGTATGCCAACTGTGCCAAGCTCACATCCAAATTACCTTGAATCGTTTTCT
    CGGTATTTACTTTTCCTTTTGCGACTAAGTTT
    >NC_010278.1 50000 bases starting at position 896878 fragment 15165:15269

Next step: read the reads and build a graph to use as input for finding an Eulerian path.

2012/04/03 - HOZ

Richard contributed a Graph class with some skeleton functions and some test cases. He implemented manual graph constructors and a breadth first search.
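For readers unfamiliar with breadth first search, here is a minimal textbook-style sketch. It is not Richard's Graph class: the `bfs` function, the adjacency-list input, and the use of `None` for "not reached" are all assumptions made for illustration. It returns three parallel lists (colour, distance from the source, and predecessor), one entry per vertex.

```python
from collections import deque

WHITE, BLACK = "W", "B"  # W = not yet discovered, B = discovered


def bfs(adjacency, source):
    """Breadth first search over a graph given as a list of neighbour lists.

    Returns (color, distance, predecessor), each indexed by vertex number.
    Vertices that cannot be reached keep colour "W" and None elsewhere.
    """
    n = len(adjacency)
    color = [WHITE] * n
    distance = [None] * n
    predecessor = [None] * n

    color[source] = BLACK
    distance[source] = 0
    queue = deque([source])

    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if color[v] == WHITE:
                # First time v is seen: record how we got here.
                color[v] = BLACK
                distance[v] = distance[u] + 1
                predecessor[v] = u
                queue.append(v)
    return color, distance, predecessor


# Hypothetical six-vertex graph in which vertex 0 points at vertices 1 and 2.
example_graph = [[1, 2], [], [], [], [], []]
example_result = bfs(example_graph, 0)
```

Marking a vertex as discovered when it is first enqueued, instead of when it is dequeued, keeps each vertex from entering the queue more than once.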
The test harness for the Graph class can be run from the command line:

    $ python Graph.py
    [['B', 'B', 'B', 'W', 'W', 'W'], [0, 1, 1, (), (), ()], ['NIL', 0, 0, 'NIL', 'NIL', 'NIL']]
    [['W', 'B', 'W', 'W', 'W', 'W'], [(), 0, (), (), (), ()], ['NIL', 'NIL', 'NIL', 'NIL', 'NIL', 'NIL']]
    [['W', 'W', 'B', 'W', 'W', 'W'], [(), (), 0, (), (), ()], ['NIL', 'NIL', 'NIL', 'NIL', 'NIL', 'NIL']]
    [['B', 'B', 'B', 'B', 'W', 'W'], [1, 2, 1, 0, (), ()], [3, 0, 3, 'NIL', 'NIL', 'NIL']]
    [['B', 'B', 'B', 'B', 'B', 'W'], [1, 2, 2, 1, 0, ()], [4, 0, 3, 4, 'NIL', 'NIL']]
    [['B', 'B', 'B', 'B', 'B', 'B'], [2, 3, 2, 1, 1, 0], [3, 0, 3, 5, 5, 'NIL']]
    two predecessor zero
    two predecessor one
    three predecessor two
    three predecessor one
    fourth predecessor three
    fourth predecessor two
    five predecessor three
    five predecessor fourth
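Looking ahead to the graph-building step, the sketch below shows one common way an Eulerian assembler can be put together: build a de Bruijn graph whose edges are the distinct k-mers in the reads, then walk an Eulerian path with Hierholzer's algorithm. This is not code from the repository. The function names, the choice of k, and the toy reads are assumptions, and the sketch ignores reverse complements, sequencing errors, and repeats, all of which a real assembler has to deal with.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it, one edge per distinct k-mer."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm. Assumes the graph is non-empty and has an Eulerian path."""
    in_degree = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_degree[target] += 1
    # Start where out-degree exceeds in-degree by one, if such a node exists;
    # otherwise the path is a cycle and any node with outgoing edges will do.
    start = min(graph)
    for node in sorted(graph):
        if len(graph[node]) - in_degree.get(node, 0) == 1:
            start = node
            break
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())             # dead end: emit the node
    path.reverse()
    return path


def spell(path):
    """Reconstruct the sequence spelled by a path of overlapping (k-1)-mers."""
    return path[0] + "".join(node[-1] for node in path[1:])


# Two hypothetical overlapping reads from the toy sequence ACGTTGCA.
toy_reads = ["ACGTTG", "GTTGCA"]
toy_contig = spell(eulerian_path(de_bruijn_edges(toy_reads, 4)))
```

With simulated reads from the 50 Kb slice, the graph will usually contain branches caused by repeated (k-1)-mers, so a single clean path like the one in this toy case should not be expected without further work.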