Genome Assembly Workshop

Lecture Notes: See PDF
Workshop Notes: See PDF

Problem Set

Exercise 1: Genome Stats

Using ecoli_0.25.contigs.fasta, write a script that reports:

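The number of contigs in the file
The shortest contig
The longest contig
The total contig length
The L50 (the smallest number of contigs, taken from longest to shortest, whose combined length is at least half the total contig length)
The N50 (the length of the shortest contig in that set)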
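
As a starting point, the statistics above can be sketched in Python as follows. This is a minimal sketch and not the workshop's reference solution: the names `read_fasta` and `contig_stats` are made up for illustration, "shortest" and "longest" are reported as lengths, and N50/L50 are computed with the common definition (walk the contigs from longest to shortest until the running total reaches half of the total length).

```python
from typing import Dict, Iterator, List, Tuple


def read_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA file (hypothetical helper)."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def contig_stats(lengths: List[int]) -> Dict[str, int]:
    """Summarise contig lengths, including N50 and L50."""
    if not lengths:
        raise ValueError("no contigs to summarise")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Stop at the first contig that brings the running total to >= 50%.
        if running * 2 >= total:
            n50, l50 = length, count
            break
    return {
        "contigs": len(ordered),
        "shortest": ordered[-1],
        "longest": ordered[0],
        "total": total,
        "N50": n50,
        "L50": l50,
    }


# Toy lengths (not real data) so the sketch runs without any input file.
toy_lengths = [80, 70, 50, 40, 30, 20]
for name, value in contig_stats(toy_lengths).items():
    print(f"{name}\t{value}")
```

For the exercise itself, the lengths would come from the file, for example `contig_stats([len(seq) for _, seq in read_fasta("ecoli_0.25.contigs.fasta")])`.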

Exercise 2: Soft repeat-masked genome

When RepeatMasker or other software identifies repetitive sequences in a genome, the resulting sequence can be “soft-masked”. This means that the nucleotides contained in repetitive elements are written in lower-case letters. Any gaps in the genome, where the sequence is unresolved, are marked with “N”s.

NNNNNNNNNNNNCAGCAAAGACAAAcaaacaaatatacaaagacAAAAATTGCCACAGCAAAGACAAAGAGATAAATAAAAGGCACAAAATTGTCAC

For the following exercise, write a Python script that parses a FASTA file containing a D. melanogaster chromosome assembly and reports the following:

How many contigs?
Nucleotide content:
- number of each nucleotide both masked (a,c,g,t) and not (A,C,G,T)
What proportion of the genome consists of gaps?
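
The counts asked for above can be sketched with `collections.Counter`. This is a minimal sketch under a few assumptions: the function name `nucleotide_summary` is hypothetical, gaps are taken to be every `N` or `n` character, and the input is any iterable of sequence strings (one per contig), however you choose to parse them out of the FASTA file.

```python
from collections import Counter
from typing import Dict, Iterable


def nucleotide_summary(sequences: Iterable[str]) -> Dict[str, object]:
    """Count masked and unmasked nucleotides and the fraction of gap characters."""
    counts = Counter()
    n_contigs = 0
    for seq in sequences:
        n_contigs += 1
        counts.update(seq)  # tallies every character, case-sensitively
    total = sum(counts.values())
    gaps = counts["N"] + counts["n"]  # assumption: gaps are N (or n) only
    return {
        "contigs": n_contigs,
        "unmasked": {base: counts[base] for base in "ACGT"},
        "masked": {base: counts[base] for base in "acgt"},
        "gap_fraction": gaps / total if total else 0.0,
    }


# Toy sequences (not real data): a gap, unmasked bases, then a masked repeat.
summary = nucleotide_summary(["NNNNACGTacgt", "AAAAcccc"])
print(summary["contigs"])
print(summary["unmasked"])
print(summary["masked"])
print(f"{summary['gap_fraction']:.2%}")
```

Counting characters case-sensitively keeps masked and unmasked bases separate in a single pass, which matters for a file of this size.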

The FASTA file for the exercise, D_melanogaster_genomic.fna, is large (see the counts below and the note on filtering that follows):

wc -l D_melanogaster_genomic.fna
449842 lines

grep -c ">" D_melanogaster_genomic.fna
448 sequences

Previously we used the Unix split command, which you can use again. Alternatively, write a Python script that outputs a filtered dataset.
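
One possible way to do the filtering in Python is to keep only the first few records of the file. The sketch below is an illustration, not the workshop's method: `take_records` is a hypothetical helper, and the output name `subset.fna` and the cut-off of 10 records are arbitrary choices.

```python
from typing import Iterable, Iterator


def take_records(lines: Iterable[str], max_records: int) -> Iterator[str]:
    """Yield the lines that belong to the first max_records FASTA records."""
    seen = 0
    for line in lines:
        if line.startswith(">"):
            seen += 1
            if seen > max_records:
                break  # stop reading as soon as the next record starts
        if seen:
            yield line


# Toy input (not real data) to show the behaviour without touching a file.
toy = [">a\n", "ACGT\n", ">b\n", "GGCC\n", ">c\n", "TTAA\n"]
print("".join(take_records(toy, 2)), end="")

# Applied to the exercise file (file names and the count are assumptions):
# with open("D_melanogaster_genomic.fna") as src, open("subset.fna", "w") as dst:
#     dst.writelines(take_records(src, 10))
```

Because the function works line by line, the full file is never loaded into memory.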
