Lecture Notes: See PDF
Workshop Notes: See PDF
Using ecoli_0.25.contigs.fasta, write a script that reports:
- The number of contigs in the file
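- The shortest contig
- The longest contig
- The total contig length
- The L50 (the smallest number of contigs whose lengths sum to at least half of the total contig length)
- The N50 (the length of the shortest contig in that set)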
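One possible way to approach this exercise is sketched below. It is not a reference solution: the function names read_fasta and assembly_stats are made up for illustration, the sketch reports contig lengths rather than contig names, and the toy records stand in for the real file so that the example runs on its own.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def assembly_stats(lengths):
    """Return contig count, shortest, longest, total length, N50 and L50."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contigs found")
    total = sum(ordered)
    running = 0
    n50 = l50 = 0
    # Walk from the longest contig down until half the total is covered.
    for rank, length in enumerate(ordered, start=1):
        running += length
        if running * 2 >= total:
            n50, l50 = length, rank
            break
    return {
        "contigs": len(ordered),
        "shortest": ordered[-1],
        "longest": ordered[0],
        "total": total,
        "N50": n50,
        "L50": l50,
    }


# Toy input (hypothetical records) with contig lengths 8, 6, 4 and 2.
toy = [">c1", "ACGTACGT", ">c2", "ACGT", "AC", ">c3", "ACGT", ">c4", "AC"]
lengths = [len(seq) for _, seq in read_fasta(toy)]
print(assembly_stats(lengths))
# {'contigs': 4, 'shortest': 2, 'longest': 8, 'total': 20, 'N50': 6, 'L50': 2}
```

To run it on the workshop file, pass an open file handle instead of the toy list, for example: with open("ecoli_0.25.contigs.fasta") as fh: lengths = [len(seq) for _, seq in read_fasta(fh)].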
When running RepeatMasker or other software that identifies repetitive sequences in a genome, the resulting sequence can be "soft-masked": nucleotides that fall within repetitive elements are written in lower-case letters. Gaps in the genome, where the sequence is unresolved, are marked with "N"s.
NNNNNNNNNNNNCAGCAAAGACAAAcaaacaaatatacaaagacAAAAATTGCCACAGCAAAGACAAAGAGATAAATAAAAGGCACAAAATTGTCAC
For the following exercise, write a Python script that parses a FASTA file containing a D. melanogaster chromosome assembly and reports the following:
- How many contigs are there?
- Nucleotide content:
  - The number of each nucleotide, both masked (a, c, g, t) and unmasked (A, C, G, T)
- What proportion of the genome is made up of gaps?
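The sketch below shows one way these counts could be collected in a single pass. It assumes that gaps may appear as either "N" or "n" and that every header line starts with ">"; the function names and the toy record are hypothetical and only there to make the example runnable.

```python
from collections import Counter


def count_nucleotides(lines):
    """Return (number of records, Counter of every sequence character)."""
    counts = Counter()
    n_records = 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            n_records += 1
        else:
            # Counter.update on a string counts each character separately,
            # so upper-case and lower-case letters stay in separate bins.
            counts.update(line)
    return n_records, counts


def gap_fraction(counts):
    """Fraction of all sequence characters that are N or n."""
    total = sum(counts.values())
    gaps = counts["N"] + counts["n"]
    return gaps / total if total else 0.0


# Toy input (hypothetical record): 6 gap characters out of 18 in total.
toy = [">chr_demo", "NNNNACGTacgt", "ACgtNN"]
n_records, counts = count_nucleotides(toy)
print("records:", n_records)
for base in "ACGTacgt":
    print(base, counts[base])
print("gap fraction:", round(gap_fraction(counts), 3))
```

Reading line by line keeps memory use low, which matters for a file of this size; an open file handle can be passed in place of the toy list.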
The FASTA file for the exercise, D_melanogaster_genomic.fna, is large (see its size and the note on filtering below):
wc -l D_melanogaster_genomic.fna
449842 lines
grep -c ">" D_melanogaster_genomic.fna
448 sequences
Previously we used the Unix split command, which you can use again. Alternatively, write a Python script to output a filtered dataset.
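As a rough illustration of the Python route, the sketch below keeps only the first few records of a FASTA file. The filtering criterion (first N records), the function name first_records and the output file name subset.fna are all assumptions made for this example; the exercise does not specify how the dataset should be filtered.

```python
def first_records(lines, max_records):
    """Yield the FASTA lines that belong to the first max_records records."""
    seen = 0
    for line in lines:
        if line.startswith(">"):
            seen += 1
            if seen > max_records:
                break
        # Ignore anything that appears before the first header line.
        if seen:
            yield line


# Toy input (hypothetical records); lines keep their trailing newlines,
# as they would when read from an open file.
toy = [">a\n", "AC\n", ">b\n", "GG\n", "TT\n", ">c\n", "AA\n"]
print("".join(first_records(toy, 2)), end="")
```

On the real data, the same generator could feed a new file, for example: with open("D_melanogaster_genomic.fna") as src, open("subset.fna", "w") as out: out.writelines(first_records(src, 10)).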