added tutorials to _episodes

MGXlab · Jun 12, 2024 · 3f443e6
1 parent e3021db
commit 3f443e6
Showing 11 changed files with 238 additions and 0 deletions.
diff --git a/_episodes/1.1.0_introduction.md b/_episodes/1.1.0_introduction.md
@@ -0,0 +1,44 @@
+# Introduction to the module
+
+In this module you will learn how to analyse bacteriophage sequences with computational approaches. The theoretical part will be covered in the mornings and includes a series of questions. Document the questions and your answers in a lab book; these will also be considered in your final evaluation. The evaluation will be composed of:
+
+- Performance in the dry lab, lab journal (60%)
+- Presentation of the final project (40%)
+
+To write a good lab book:
+
+- Write in it every day, documenting your tasks;
+- Structure it in chronological order, so that each section corresponds to a different day.
+
+To stay organized in your practical work, name your folders in a logical order, for instance: 0_download, 1_quality_evaluation, 2_assembly. Always record the folder names in your lab book, so that you can find your files when reading it in the future.
+
+# Lab Book
+
+Documenting your work is crucial in Computational Biology/Bioinformatics. This way you can make sure your work is reproducible,
+you can transfer text to other documents, such as reports or papers, and you can send it to colleagues, so that they can access your work.
+
+You are required to write a Lab Book in markdown for this module, which will count as part of your evaluation. If you want, you can start a GitHub repository for the course and write your lab book there. References:
+
+- https://commonmark.org/help/
+- https://www.markdownguide.org/basic-syntax/
+
+You should divide the lab book into sections (days) and add subsections for different tasks.
+Each subsection should have a meaningful title, e.g. "Documenting module *Viromics-Bioinformatics*", "Downloading phage sequences", "Evaluating sequence quality". Include any relevant information, such as websites used in searches, folders
+where you can find files, tool names and versions, and bibliographic references. An example can be seen at:
+https://github.com/waltercostamb/course_viral-microbiology/blob/main/tutorials/lab-book.pdf. Start your lab book today and write in it
+whenever necessary during the module. On the last day, send a copy to your advisor.
+
+The paper "Ten Simple Rules for a Computational Biologist’s Laboratory Notebook", by Santiago Schnell (2015), offers interesting insights: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004385. Most relevant are rules 4, 6 and 10. Keeping a hard copy of your lab book is not necessary, but make sure you have it backed up.
+
+## Supplementary reference
+
+- "Ten Simple Rules for Reproducible Computational Research", by Geir Sandve and collaborators, 2013: https://app.dimensions.ai/details/publication/pub.1022987921
+
+# Access to draco
+
+[Draco](https://wiki.uni-jena.de/pages/viewpage.action?pageId=22453002) is a high-performance cluster created and maintained by the Universitätsrechenzentrum. It is [available for members of Thuringian Universities](http://sternb.gitpages.tpi.uni-jena.de/draco-101-2023-01/#5). To log in, you can use [ssh](http://sternb.gitpages.tpi.uni-jena.de/draco-101-2023-01/#15): 
+
+```
+ssh <fsuid>@login1.draco.uni-jena.de
+```
+
diff --git a/_episodes/1.1.1_introduction_viromics.md b/_episodes/1.1.1_introduction_viromics.md
@@ -0,0 +1,26 @@
+# Introduction to Viromics
+
+Watch the lecture below and write down at least 3 questions and/or discussion points about it:  
+
+- Click on the image to see lecture "Viral metagenomics: predicting phage-microbe interactions in the gut", by Prof Bas E. Dutilh:
+
+[![Viral metagenomics: predicting phage-microbe interactions in the gut](https://img.youtube.com/vi/xm2iEK4Jj90/0.jpg)](https://www.youtube.com/watch?v=xm2iEK4Jj90)
+
+Additional reading: https://www.sciencedirect.com/science/article/abs/pii/B9780128145159000345   
+
+Answer the following questions:  
+
+- What is metagenomics?
+- What is viromics?
+- What are bacteriophages? How do they fit into metagenomics?
+- How do you assess the viromic community of a biological sample?
+- Describe FASTA and FASTQ formats
+- What do you expect the difference in diversity between viromes and metagenomes to be?
+
+Objectives:
+
+- Understand what a metagenomics study is
+- Understand what bacteriophages are and how they relate to prokaryotic communities
+- Understand specifics of sequencing applied to the study of viromics data
+- Understand FASTA and FASTQ file formats
+- Understand diversity in microbial and viral communities
diff --git a/_episodes/1.1.2_sequencing_quality.md b/_episodes/1.1.2_sequencing_quality.md
@@ -0,0 +1,21 @@
+# Download sequences and evaluate sequencing quality
+
+Watch the lecture below and write down at least 3 questions and/or discussion points about it:
+
+https://www.youtube.com/watch?v=OMzshfgJ4PQ  
+
+- Click on the image to see lecture "Basic file formats in bioinformatics", by Prof Bas E. Dutilh:  
+[![Basic file formats in bioinformatics](https://img.youtube.com/vi/D4WDdAbZW1Y/0.jpg)](https://www.youtube.com/watch?v=D4WDdAbZW1Y)
+
+Copy viromics sequences following the teacher's directions. These will be used in your project.        
+
+Answer the questions below using bash. Useful bash commands, by Varada Khot, can be found here: https://github.com/vmkhot/useful-scripts/blob/main/Linux%20Commands%20Cheat%20Sheet.md  
+
+- How many lines does your FASTQ file have?
+- How many sequences are there in your FASTQ file?
+- Print the first 10 lines of your file in the terminal
+- Print the last 10 lines of your file in the terminal
+- Describe what you see in these lines and what they mean
+- What is the GC content of your files?
+- Create a plot of frequency (y-axis) versus GC content (x-axis) and describe the distribution. Use Python (seaborn histplot, matplotlib hist) or R (hist or ggplot2 geom_histogram)
+- Evaluate the sequence quality following: https://github.com/vmkhot/Metagenome-workflows/blob/main/Nanopore-Long-Reads/Quality%20Control.md
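The exercise asks for bash, but the logic behind the line count, sequence count and GC content questions can also be sketched in Python, which helps when preparing values for the GC plot. This is a minimal sketch and not part of the course material: it assumes every FASTQ record spans exactly four lines (no wrapped sequences), and the two reads in `example` are made up for illustration. The function names `read_fastq` and `gc_fraction` are hypothetical helpers.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a blank line) stops the iteration
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line is skipped
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality


def gc_fraction(sequence: str) -> float:
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


# Made-up two-read example standing in for a real FASTQ file (assumption).
example = io.StringIO(
    "@read1\nGGCC\n+\nIIII\n"
    "@read2\nATGC\n+\nIIII\n"
)
records = list(read_fastq(example))
n_records = len(records)
n_lines = 4 * n_records  # holds only under the four-lines-per-record assumption
per_read_gc = [gc_fraction(seq) for _, seq, _ in records]
print(n_lines, n_records, per_read_gc)
```

For a real file, replace the `io.StringIO` object with `open(path)`; the list `per_read_gc` is then the input for the histogram.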
diff --git a/_episodes/1.2.1_assembly_I.md b/_episodes/1.2.1_assembly_I.md
@@ -0,0 +1,32 @@
+# Assembly I
+
+Watch the lecture below and write down at least 3 questions and/or discussion points about it:
+
+- Click on the image to see the lecture "Assembly strategies for genomics and metagenomics", by Prof Bas E. Dutilh:
+
+[![Assembly strategies for genomics and metagenomics](https://img.youtube.com/vi/mHmMbPxKmn0/0.jpg)](https://www.youtube.com/watch?v=mHmMbPxKmn0)
+
+- Click on the image to watch the video:
+
+[![IMAGE ALT TEXT HERE](https://img.youtube.com/vi/MgdfZTA-J3o/0.jpg)](https://www.youtube.com/watch?v=MgdfZTA-J3o)
+
+Answer the following questions:
+
+- What is sequence assembly?
+- How would you detect mutations in a model organism after an evolutionary experiment?
+- How would you determine the genome sequence of an unknown organism?
+- What are the strengths and weaknesses of DBG and OLC?
+
+Objectives:
+
+- Understand what sequence assembly is
+- Understand the difference between assembly algorithms
+
+# Supplementary reference 
+
+The book "Computational Biology: Genomes, Networks, Evolution MIT course 6.047/6.878" by Prof. Manolis Kellis is part of a course on Computational Biology and contains several topics that are relevant for Bioinformatics: https://ocw.mit.edu/ans7870/6/6.047/f15/MIT6_047F15_Compiled.pdf 
+
+Read the following sections and summarize their key points in your lab book:  
+
+- "5.2 Genome Assembly I: Overlap-Layout-Consensus Approach" and "5.3 Genome Assembly II: String graph methods" (pages 93 to 102)
+
diff --git a/_episodes/1.2.2_assembly_II.md b/_episodes/1.2.2_assembly_II.md
@@ -0,0 +1,12 @@
+# Assembly II
+
+In this experimental part, use the tool [Flye](https://github.com/fenderglass/Flye) to assemble your reads following Varada Khot's directions [for Metagenomes](https://github.com/vmkhot/Metagenome-workflows/blob/main/Nanopore-Long-Reads/Nanopore%20Assembly.md#for-metagenomes).   
+
+- Assembly https://aldertzomer.github.io/Microbial-Genomics-2022/30_Assembly/index.html
+- Quality of assembly: https://github.com/vmkhot/Metagenome-workflows/blob/main/Illumina-Short-Reads/Assembly.md#how-good-is-my-assembly
+
+- Add questions to guide students
+  - Where do you expect the longest contigs? Take into account genome size and diversity
+  - Look in the files to check the format, understand the output and make sense of it
+  - What are the longest and shortest contigs?
+  - BLAST the longest contig online to see what it is
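One way to approach the contig-length questions is to read the assembly FASTA and tabulate the length of each record. The sketch below is illustrative only: it assumes a plain FASTA file in which sequences may be wrapped over several lines, and the two contigs in `example` are invented rather than taken from real Flye output. The helper name `contig_lengths` is hypothetical.

```python
import io
from typing import Dict, TextIO


def contig_lengths(handle: TextIO) -> Dict[str, int]:
    """Map each FASTA record name (first word of the header) to its length."""
    lengths: Dict[str, int] = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)  # wrapped sequence lines are summed
    return lengths


# Invented two-contig assembly standing in for a real FASTA file (assumption).
example = io.StringIO(
    ">contig_1 some description\nACGTACGT\nAC\n"
    ">contig_2\nACG\n"
)
lengths = contig_lengths(example)
longest = max(lengths, key=lengths.get)
shortest = min(lengths, key=lengths.get)
total = sum(lengths.values())
print(longest, lengths[longest], shortest, lengths[shortest], total)
```

With a real assembly, pass `open(path)` instead of the `io.StringIO` object; the name stored in `longest` identifies the record to extract for the BLAST search.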
diff --git a/_episodes/1.3.2_identify_viral_contigs.md b/_episodes/1.3.2_identify_viral_contigs.md
@@ -0,0 +1,25 @@
+# Identify viral contigs
+
+In this experimental part, you will identify viral contigs, assess genome completeness and filter only high-quality ones to use for subsequent analysis.
+
+Before running Jaeger  
+- The virome and metagenome (pre-assembled) should be in dir /path/to/dir
+- virome (small size fraction) and metagenome (larger size fraction)
+
+- Where do you expect the most hits? (Before running tool)
+
+Run Jaeger
+- Add command line
+
+After running Jaeger
+Questions to think about
+      - Take the longest contig from each dataset and BLAST it
+      - Plot the Jaeger score against genome length?
+      - How many viral contigs are there in each assembly?
+      - Why are not all contigs in the virome identified as viral contigs?
+      - Why are viral contigs also identified in the metagenome?
+      - Size differences between viral and non-viral contigs
+
+    - How would you rank Jaeger compared to other virus detection tools? (After running tool)
+
+Check completeness with CheckV
diff --git a/_episodes/1.4.1_gene_finding.md b/_episodes/1.4.1_gene_finding.md
@@ -0,0 +1,7 @@
+# Gene Finding
+
+YouTube links for gene annotation:
+
+- Rob: https://www.youtube.com/watch?v=ecJ1DqVvuFE&pp=ygUJcGhhbm90YXRl
+- Katelyn: https://www.youtube.com/watch?v=gvnPsA1S6GY&pp=ygUJcGhhbm90YXRl
+- Evelien Adriaenssens: https://www.youtube.com/watch?v=wO1w1Z1Or1w&pp=ygUJcGhhbm90YXRl
+
diff --git a/_episodes/1.4.2_gene_finding.md b/_episodes/1.4.2_gene_finding.md
@@ -0,0 +1,20 @@
+# Gene finding
+
+Use [Phanotate](https://github.com/deprekate/PHANOTATE) to annotate contigs and answer the following questions:
+
+Before running the tool:  
+
+- What is an ORF?
+- How and why do you predict ORFs?
+- Would you expect ORFs to overlap on the genome?
+- Would a tool for predicting ORFs in bacteria work for viruses?
+- How do you know an ORF is really a protein coding gene?
+
+After running the tool:
+
+- What does the output file look like?
+- How many ORFs did the tool predict?
+- What is the longest ORF?
+- Did you find any overlapping ORFs in a contig?
+
+Find similar contigs with a clustering method: AAI?
diff --git a/_episodes/2.1.2_host_prediction.md b/_episodes/2.1.2_host_prediction.md
@@ -0,0 +1,3 @@
+# Host prediction
+
+RaFAH: https://www.cell.com/patterns/fulltext/S2666-3899(21)00100-8
diff --git a/_episodes/2.2.2_taxonomy and Phylogeny.md b/_episodes/2.2.2_taxonomy and Phylogeny.md
@@ -0,0 +1,38 @@
+# Taxonomy and Phylogeny
+
+## Morning
+
+Check out the ICTV website: https://ictv.global/
+
+- What is ICTV?
+- What is a realm?
+- How many viral genera are there?
+
+## Afternoon
+
+VContact2, Genomad.
+
+You should use only high quality genomes for this analysis.
+
+- How does Genomad work?
+- Strengths vs weaknesses
+- Taxonomic annotation of genomes with Genomad
+- How many genomes are taxonomically annotated?
+- Describe the output
+
+- How does VContact2 work?
+- Strengths vs weaknesses
+- Make a VContact2 plot
+- Add predicted host to the plot
+- How many genomes are taxonomically annotated?
+- Describe the output
+
+# Phylogeny
+
+- How can you make a phylogeny of viruses?
+- Pros vs cons of making phylogeny with marker gene vs full genome
+- Choose a marker gene to create your phylogeny: what was the most frequently annotated PHROG?
+- Align your marker genes (MAFFT, clustalw, etc.)
+- Use iqtree to build the tree
+- Use iTOL or iroki https://www.iroki.net/ to visualize the tree
+
diff --git a/_episodes/2.3.1_research_question.md b/_episodes/2.3.1_research_question.md
@@ -0,0 +1,10 @@
+# Research Question
+
+When thinking of a new project and proposing scientific questions, the first step is to go to the scientific literature. You will get acquainted with the body of knowledge of that specific field and notice its limitations. Based on this you can think of your own questions. Use (one of) the articles below (or choose one yourself) as inspiration to create your own research question(s).
+
+- [A highly abundant bacteriophage discovered in the unknown sequences of human faecal metagenomes](https://www.nature.com/articles/ncomms5498), by Dutilh *et al* (2014)
+- [Bacteriophage evolution differs by host, lifestyle and genome](https://www.nature.com/articles/nmicrobiol2017112), by Mavrich and Hatfull (2017)
+- [Evolution of BACON Domain Tandem Repeats in crAssphage and Novel Gut Bacteriophage Lineages](https://www.mdpi.com/1999-4915/11/12/1085), by de Jonge *et al* (2019)
+- [Virus Bioinformatics](https://www.sciencedirect.com/science/article/abs/pii/B9780128145159000345), by Pappas *et al* (2021)
+- [High viral abundance and low diversity are associated with increased CRISPR-Cas prevalence across microbial ecosystems](https://www.sciencedirect.com/science/article/pii/S0960982221014615), by Meaden *et al* (2022)
+- [Viruses interact with hosts that span distantly related microbial domains in dense hydrothermal mats](https://www.nature.com/articles/s41564-023-01347-5), by Hwang *et al* (2023)