Included here are several tutorials in the form of 'Jupyter notebooks'.
The purpose of these tutorials is to help users familiarize themselves with cloud computing in the specific context of running bioinformatics workflows. Here is a link to a YouTube video that gives you an overview of the tutorials.
These tutorials do this by going step-by-step through specific workflows. These workflows cover a basic bioinformatics analysis from start to finish: downloading raw sequence data, running differential gene expression analysis, and producing common plots in R.
Requirements
Getting Started
Workflows
Data
Support Forum Thanks
Funding
These tutorials were designed to be used on cloud computing platforms, with the aim of requiring nothing but the files within this GitHub repository.
With this in mind, our tutorials use Jupyter Notebook files, which Google Cloud Platform, Amazon Web Services, and Microsoft Azure all support. The only requirements should therefore be creating a virtual machine on one of these providers and downloading this GitHub repository's files to that machine.
For more information on creating a virtual machine and downloading our GitHub repo to that machine, see the Getting Started section below. Currently, this section only covers how to do this using Google Cloud Platform.
This repository contains several notebook files which serve as bioinformatics workflow tutorials.
The steps below guide you through setting up a virtual machine on Google Cloud Platform, downloading our tutorial files, and running those files.
Accordingly, before starting, make sure you have a Google account and access to a Google Cloud Platform project.
Once you have these, you can begin by first navigating to https://console.cloud.google.com/ and logging in with your credentials. Then, in the top left of the screen, navigate to 'select a project', and choose the project you belong to.
This tutorial will cost just under $3.00, assuming an n1-standard-8 machine and assuming you delete the virtual machine and the storage bucket after you finish the tutorial. You can save some time by using an n2 machine, but it will cost a little more (though not much).
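The estimate above is essentially an hourly machine rate multiplied by running time, plus a small storage charge. A minimal sketch of that arithmetic follows. The function name is hypothetical, and the rate, runtime, and storage figures are placeholder assumptions, not quoted Google Cloud prices, so check the current pricing for your region and machine type.

```python
def estimate_cost(hourly_rate_usd, hours, storage_usd=0.0):
    """Approximate bill: compute time plus any storage charge, in USD."""
    if hourly_rate_usd < 0 or hours < 0 or storage_usd < 0:
        raise ValueError("rate, hours, and storage cost must be non-negative")
    return round(hourly_rate_usd * hours + storage_usd, 2)

# Placeholder numbers for illustration only: an assumed $0.50/hour machine
# running for 5 hours, plus an assumed $0.25 of bucket storage.
print(estimate_cost(0.50, 5, storage_usd=0.25))
```

Deleting the virtual machine and bucket when you finish is what keeps the `hours` and `storage_usd` terms small.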
Follow the steps highlighted here to create a new user-managed notebook in Vertex AI. Follow steps 1-8 and be especially careful to enable idle shutdown as highlighted in step 7. For this module you should select Debian 10 and R 4.2 in the Environment tab in step 5. In step 6 in the Machine type tab, select n1-standard-4 from the dropdown box.
To clone this repository, use the Git command git clone https://github.com/NIGMS/RNA-Seq-Differential-Expression-Analysis.git in the dropdown menu option in Jupyter Notebook. Please make sure you enter only the link for the repository that you want to clone; there are other bioinformatics-related learning modules available in the NIGMS repository. This should download our repo, and the tutorial files inside it, into a folder called 'RNA-Seq-Differential-Expression-Analysis'. Double-click this folder. Inside you will find all our tutorial files, which you can double-click to open and run.
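If you would rather clone from a notebook cell than from the dropdown menu, the same step can be sketched in Python. This is an illustration that assumes git is installed on the virtual machine; the helper names repo_folder and clone are hypothetical and not part of the tutorials.

```python
import subprocess
from pathlib import PurePosixPath
from urllib.parse import urlparse

REPO_URL = "https://github.com/NIGMS/RNA-Seq-Differential-Expression-Analysis.git"

def repo_folder(url):
    """Return the folder name git creates by default for a clone URL."""
    name = PurePosixPath(urlparse(url).path).name
    return name[:-4] if name.endswith(".git") else name

def clone(url):
    """Clone into the current directory and return the new folder name."""
    subprocess.run(["git", "clone", url], check=True)  # raises if git fails
    return repo_folder(url)

# Only the folder name is computed here; call clone(REPO_URL) to download.
print(repo_folder(REPO_URL))
```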
When you are finished running code, you can turn off your virtual machine to prevent unneeded billing or resource use by checking your notebook and pushing the Stop button.
Our tutorials are broken down into 'workflows'. Each notebook file covers a specific workflow, which contains written and visual commentary, as well as the actual step-by-step code for running that workflow analysis.
These notebooks were designed to be run using a virtual machine on a cloud computing provider. For more information on how to do this, see the Getting Started section. Feel free to explore and run the workflows in any order you like.
Workflow One: A short introduction to downloading and mapping sequences to a transcriptome using Trimmomatic and Salmon. Here is a link to the YouTube video demonstrating the tutorial: https://youtu.be/ChGfBR4do_Y.
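As a rough picture of what Workflow One does, the sketch below builds the three command lines involved: trimming reads with Trimmomatic, indexing a transcriptome with Salmon, and quantifying the trimmed reads. It assumes single-end reads and the conda-style trimmomatic launcher; the file names, thread count, and trimming steps are illustrative assumptions, and the notebook's actual commands may differ.

```python
import shlex

def trimmomatic_se(reads_in, reads_out, threads=2):
    """Trimmomatic single-end command; the trimming steps are example values."""
    return ["trimmomatic", "SE", "-threads", str(threads),
            reads_in, reads_out, "TRAILING:10", "MINLEN:36"]

def salmon_index(transcripts_fasta, index_dir):
    """Build a Salmon index from a transcriptome FASTA file."""
    return ["salmon", "index", "-t", transcripts_fasta, "-i", index_dir]

def salmon_quant(index_dir, reads, out_dir, threads=2):
    """Quantify single-end reads; '-l A' lets Salmon infer the library type."""
    return ["salmon", "quant", "-i", index_dir, "-l", "A",
            "-r", reads, "-p", str(threads), "-o", out_dir]

# Hypothetical file names, used only to show the order of the steps.
steps = [
    trimmomatic_se("sample.fastq", "sample_trimmed.fastq"),
    salmon_index("transcriptome.fa", "transcriptome_index"),
    salmon_quant("transcriptome_index", "sample_trimmed.fastq", "sample_quant"),
]
for step in steps:
    # Printing only; use subprocess.run(step, check=True) to execute a step.
    print(shlex.join(step))
```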
Extended Workflow One: An extended version of workflow one. Once you have got your feet wet, you can retry workflow one with this extended version, which covers the entire dataset and adds details such as using SRA tools to download sequences and running batches of FASTQ files through the pipeline. This workflow may take around an hour to run.
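The batch download mentioned above can be pictured as a loop over run accessions. The sketch below assumes the SRA Toolkit's prefetch and fasterq-dump commands are installed. The accession IDs are placeholders rather than the dataset's real run IDs, and the function name is hypothetical.

```python
def sra_download_commands(accessions, out_dir="fastq"):
    """Return one prefetch and one fasterq-dump command per SRA run accession."""
    commands = []
    for acc in accessions:
        commands.append(["prefetch", acc])  # fetch the .sra archive
        commands.append(["fasterq-dump", acc, "--outdir", out_dir])  # convert to FASTQ
    return commands

# Placeholder accessions: replace these with the run IDs used in the notebook.
runs = ["SRR0000001", "SRR0000002"]
for cmd in sra_download_commands(runs):
    print(" ".join(cmd))
```

Building the commands as lists first makes it easy to inspect the whole batch before running anything.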
Workflow One (Using Snakemake): Using Snakemake to run workflow one.
Workflow Two (DEG Analysis): Using DESeq2 and R to conduct clustering and differential gene expression analysis.
Workflow Three (Using Snakemake with Life Sciences API): Using Snakemake to run workflow three.
Workflow Three (Using Nextflow and Google Batch): Using Nextflow to run workflow three.
Bonus: Test your knowledge by filling in the blanks for key Cloud and bioinformatic tasks learned in the other submodules.
These tutorials use example sequence data from the Sally Molloy laboratory here at the University of Maine, which investigates transcriptome changes in prophage-infected versus non-prophage-infected M. chelonae bacteria. The corresponding article can be found here.
Support forum thanks. For almost any programming project, contributions to online databases and support forums are invaluable resources that are leaned upon heavily. Thank goodness for those people who take the time to answer silly questions.
Funded by the Maine INBRE Program (NIH/NIGMS P20 GM103423).
Text and materials are licensed under a Creative Commons CC-BY-NC-SA license. The license allows you to copy, remix and redistribute any of our publicly available materials, under the condition that you attribute the work (details in the license) and do not make profits from it. More information is available here.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License