The Korf Lab performs bioinformatics experiments in a Linux command-line environment. In order for you to contribute to our scientific mission, you must have some experience in the following technologies:
- Linux - we do everything in Linux/Unix
- GitHub - where all of our code is maintained
- Markdown - how documents are formatted
- Conda - how external software gets installed
- Python - the default programming language, but not the only one
- FASTA, FASTQ - standard sequence formats
- GFF, BED, SAM - standard annotation and alignment formats
- TSV, CSV, JSON, YAML - standard interchange formats
This exam is designed to assess your current knowledge and skills. If you find yourself pulling answers from the Internet instead of your brain, you need more practice. Practice means repetition. Repetition means doing things over and over until they become automatic (like every other skill in life).
After passing the exam, you will be listed in the DEVELOPERS.md
document and
receive an official "KorfLab Bioinformatics Developer Certificate".
This exam assumes that you have a computer connected to the Internet with a
Linux CLI and have some experience with git
, python3
, and conda
. For
training in Linux, GitHub, and Python, you are encouraged to study the latest
MCB185 repository and scour the Internet.
Even though it's a good idea to collaborate with others any time you perform experiments or write code, this exam is meant to assess your personal skills and knowledge. Here are the rules of this game.
- You may not get help from anyone
- You may not import any modules that require installation
- You may look things up on the internet, but not copy-paste anything
- You may download files, but not source code
Create a GitHub repository called korflab_exam
and invite iankorf
as a
collaborator.
This task is designed to test your ability to work with text files of genomic
data using typical Linux CLI tools like grep
and sort
.
Clone this repo https://github.com/iankorf/E.coli
to get some data files for
the tasks ahead.
- According to the
*.faa
file, how many proteins does it encode? - According to the
*.gbff
file, how many coding sequences are there? - Reconcile the differences between the previous 2 questions.
- How many
tRNA
features are found in the*.gff
file? - How many of each type of features are there in the
*.gff
file?
Provide your answers in a document in your repo. Include the command lines and their outputs.
This task is designed to test your ability to install and run a program that may be unfamiliar to you.
For this task, your goal is to find the proteins in the E. coli proteome that are similar to E. coli protein NP_414608.1.
Using conda
, install blast-legacy
from the bioconda
channel. Then format
the blast database with formatdb
and search it with blastall -p blastp
. To
get the similar proteins, set the E-value to 1e-5. You may also like to use
tabular output to make it easier to get the matching protein names. Find these
options in the usage statment (of course).
Document this task in a text file that shows the command lines and the output.
Each program below will be graded on the following criteria:
- Beautiful - is the code simple, clear, and visually appealing?
- Robust - does the code handle unexpected values?
- Friendly - is the program easy to use?
- Correct - does the program solve the problem as stated?
- Efficient - does the program waste memory or time?
This task is designed to test your ability to make a simple sliding window algorithm that works like a standard Linux CLI program.
Write an entropy filter for DNA sequences.
Requirements:
- Input is a FASTA file (e.g. E.coli genome)
- Output is a FASTA file with low-entropy regions masked with Ns
- Must support multi-FASTA files
- There must be options and default parameters for
- window size (default 11)
- entropy threshold (default 1.4 bits)
- There is an option for soft masking (lowercase instead of Ns)
An example output for the E. coli genome is shown below.
>NC_000913.3 Escherichia coli str. K-12 substr. MG1655, complete genome
AGcttttcattctGACTGCAACGGGCAATATGTCTCTGTGTggattaaaaaaagagtgTC
TGATAGCAGCTTCTGAACTGGTTACCTGCCGTGagtaaattaaaattttattgaCTTAGG
TCACtaaatactttaaCCAATATAGGCATAGCGCACAGacagataaaaattacaGAGTac
acaacatccaTGAAACGCATTAGCACCACCATtaccaccaccatcaccaTTACCACAGGT
AACggtgcgggctgACGCGTACAGgaaacacagaaaaaagccCGCACCTGACAGTGCggg
ctttttttttcgaCCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAAGTTCGGCGGT
ACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC
AggcaggggcaggtggCCAccgtcctctctgcccccgccaaaatcaccaaccacctGGTG
GCGATgattgaaaaaaccattaGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAA
...
This task is designed to test your ability to solve problems creatively.
Write a program that reports the locations of k-mers in a FASTA file of DNA.
- Input is a FASTA file (just one sequence) and a value for k
- Output is a table of k-mers and their positions in the sequence
- There is an option to count both strands (use negative coordinates)
Given a FASTA file like this:
>seq
AAAAACGT
The k-mer locations for k=3 are the following (using 1-based coordinates)
AAA 1 2 3
AAC 4
ACG 5
CGT 6
This task is designed to test your ability to parse complex text data into a tidy data structure.
Write a program that creates a PWM of the Kozak concensus for the E. coli genome.
- The input file is
*.gbff
format - The output format is JSON
This task is designed to test your ability to write an algorithm that balances speed, memory usage, and complexity.
Write a program that reports features in file1 that overlap features in file2.
- Input files may be GFF or BED
- Output format is up to you
Schedule an appointment to meet with Ian to determine if you passed and what level of pass you achieved. You don't need to perform every task perfectly in order to pass the exam. You may take the exam more than once to turn a fail into a pass or to increase the level of pass.
- Bronze - you passed, but there are a few problems
- Silver - you passed, but there are some things that could be done better
- Gold - you passed, and everything is very good