
Merge pull request #32 from snayfach/dev
Dev
snayfach authored Oct 25, 2016
2 parents 5481b80 + 9053b0c commit f0ef5a7
Showing 15 changed files with 1,037 additions and 429 deletions.
10 changes: 6 additions & 4 deletions README.md
@@ -12,7 +12,7 @@ MIDAS is an integrated pipeline that leverages >30,000 reference genomes to esti


## Table of Contents
1. [Step-by-step tutorial](docs/tutorial.md)
2. [Install or update MIDAS](docs/install.md)
3. [Download default MIDAS database](docs/ref_db.md)
4. [Build your own custom database](docs/build_db.md)
@@ -24,15 +24,17 @@ MIDAS is an integrated pipeline that leverages >30,000 reference genomes to esti
* [Merge species abundance](docs/merge_species.md)
* [Merge gene content](docs/merge_cnvs.md)
* [Merge SNPs](docs/merge_snvs.md)
7. Example scripts for analyzing gene content and SNPs:
	* [Strain tracking](docs/strain_tracking.md)
	* [Population diversity](docs/snp_diversity.md)
	* [Core-genome phylogenetic trees](docs/snp_trees.md)
	* [Gene content dynamics](docs/compare_genes.md)


## Citation
If you use this tool, please cite:
S Nayfach, B Rodriguez-Mueller, N Garud, and KS Pollard. "An integrated metagenomics pipeline for strain profiling reveals novel patterns of transmission and global biogeography of bacteria". Genome Research 2016. doi:10.1101/gr.201863.115

## Pipeline
<img src="images/pipeline.jpg" width="600" align="middle"/>
56 changes: 56 additions & 0 deletions docs/compare_genes.md
@@ -0,0 +1,56 @@
## Gene Content Dynamics

These scripts will allow you to compare the gene content of a species between all pairs of metagenomic samples.

Before running these scripts, you'll need to have run:
`merge_midas.py genes` ([read more](https://github.com/snayfach/MIDAS/blob/master/docs/merge_snvs.md)).


#### Command usage:

```
gene_sharing.py --indir <PATH> --out <PATH> [options]
--indir PATH path to output from 'merge_midas.py genes' for one species
                    directory should be named according to a species_id and contains the files 'genes_*.txt'
--out PATH path to output file
```

#### Options:

```
--max_genes INT Maximum number of genes to use. Useful for quick tests (use all)
--max_samples INT Maximum number of samples to use. Useful for quick tests (use all)
--distance {jaccard,euclidean,manhattan}
Metric to use for computing distances (jaccard)
--dtype {presabs,copynum}
Data type to use for comparing genes (presabs)
--cutoff FLOAT Cutoff to use for determining presence absence (0.35)
```

#### Examples:
1) Run with defaults:
`gene_sharing.py --indir /path/to/species --out distances.txt`

2) Run a quick test:
`gene_sharing.py --indir /path/to/species --out distances.txt --max_genes 1000 --max_samples 10`

3) Use a different distance metric:
`gene_sharing.py --indir /path/to/species --out distances.txt --distance manhattan`

4) Use a lenient cutoff for determining gene presence absence:
`gene_sharing.py --indir /path/to/species --out distances.txt --cutoff 0.10`

5) Use a strict cutoff for determining gene presence absence:
`gene_sharing.py --indir /path/to/species --out distances.txt --cutoff 0.75`

#### Output format:
sample1: first sample identifier
sample2: second sample identifier
count1: number of present genes in sample1
count2: number of present genes in sample2
count_either: number of genes in sample1 or sample2
count_both: number of genes in sample1 and sample2
distance: dissimilarity between gene sets
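The computation described above can be sketched in Python. This is a minimal stand-in for what `gene_sharing.py` is described as doing, not its actual implementation; the function names and input dictionaries are hypothetical. Copy numbers are thresholded into presence/absence with the `--cutoff` value, then the Jaccard dissimilarity and the count fields from the output format are computed for one sample pair:

```python
# Minimal sketch of the gene-sharing computation described above;
# a simplified stand-in, not gene_sharing.py's actual implementation.

def presence_absence(copynum, cutoff=0.35):
    """Threshold gene copy numbers into presence (1) / absence (0)."""
    return {gene: int(cn >= cutoff) for gene, cn in copynum.items()}

def gene_sharing_row(genes1, genes2):
    """Jaccard dissimilarity plus the count fields from the output format."""
    present1 = {g for g, p in genes1.items() if p}
    present2 = {g for g, p in genes2.items() if p}
    both = len(present1 & present2)
    either = len(present1 | present2)
    return {
        'count1': len(present1), 'count2': len(present2),
        'count_both': both, 'count_either': either,
        'distance': 1 - both / either if either else 0.0,
    }

sample1 = presence_absence({'geneA': 1.2, 'geneB': 0.8, 'geneC': 0.1})
sample2 = presence_absence({'geneA': 0.9, 'geneB': 0.2, 'geneC': 0.6})
row = gene_sharing_row(sample1, sample2)
```

With the default 0.35 cutoff, geneC is absent in sample1 and geneB is absent in sample2, so only geneA is shared and the Jaccard distance is 1 - 1/3.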

33 changes: 23 additions & 10 deletions docs/install.md
@@ -12,13 +12,16 @@ Python modules (installed via setup.py):

### Download the software

Clone the latest version from GitHub (recommended):
`git clone https://github.com/snayfach/MIDAS`
This makes it easy to update the software in the future using `git pull` as bugs are fixed and features are added

Or, download the latest source code:
https://github.com/snayfach/MIDAS/archive/master.zip
And unpack the project: `unzip MIDAS-master.zip`

Or, download a stable release of the source code:
[https://github.com/snayfach/MIDAS/releases](https://github.com/snayfach/MIDAS/releases)

### Install dependencies

@@ -27,36 +30,46 @@ The dependencies can be installed by running the setup.py script:
`sudo python MIDAS/setup.py install` to install as a superuser, or
`python MIDAS/setup.py install --user` to install locally

Or, if you have pip installed, you can install these dependencies by running the following command:
`pip install -U numpy biopython pysam pandas` or
`sudo pip install -U numpy biopython pysam pandas` to install as a superuser, or
`pip install --user numpy biopython pysam pandas` to install locally

If you run into problems, check out these additional instructions:
[Installing MIDAS on Mac OS X 10.11 and later](https://github.com/snayfach/MIDAS/issues/31)
[Installing on CentOS/Ubuntu](install_other.md)

If you still have installation issues, please open a new [issue](https://github.com/snayfach/MIDAS/issues) on GitHub

### Update your environment variables

This will enable you to run MIDAS scripts:
`export PYTHONPATH=$PYTHONPATH:/path/to/MIDAS`
`export PATH=$PATH:/path/to/MIDAS/scripts`
`export MIDAS_DB=/path/to/midas_database`

Be sure to replace '/path/to' with the appropriate path on your system
Add these commands to your .bash_profile to avoid entering the commands in the future

To learn how to download the default MIDAS database, click [here](ref_db.md)
To learn how to build your own MIDAS database, click [here](build_db.md)

### Testing

Run the following command to test MIDAS:
`cd /path/to/MIDAS/test`
`python test_midas.py`

Be sure to replace '/path/to' with the appropriate path on your system.
This will take a few minutes to complete

### Update MIDAS
Move to installation directory, pull the latest version, and install:
`cd MIDAS`
`git pull`

Or, download the latest stable release of the source code:
[https://github.com/snayfach/MIDAS/releases](https://github.com/snayfach/MIDAS/releases)


## Next step
[Download the reference database](ref_db.md)
76 changes: 76 additions & 0 deletions docs/snp_trees.md
@@ -0,0 +1,76 @@
## Core-Genome Phylogenetic Trees

This script will allow you to build strain-level phylogenetic trees using consensus alleles found in the core genome.

The core genome of a species is identified directly from the data by looking for genomic sites in the representative genome that have high coverage across multiple metagenomes.

Before running this script, you'll need to have run `merge_midas.py snps` ([read more](https://github.com/snayfach/MIDAS/blob/master/docs/merge_snvs.md)).

### Step 1. Call consensus alleles

#### Command usage:

```
call_consensus.py --indir <PATH> --out <PATH> [options]
--indir PATH path to output from 'merge_midas.py snps' for one species
                    directory should be named according to a species_id and contains the files 'snps_*.txt'
--out PATH path to output file
```

#### Options for selecting samples:

```
--sample_depth FLOAT minimum average read depth per sample (0.0)
--sample_cov FLOAT fraction of reference sites covered by at least 1 read (0.0)
--max_samples INT maximum number of samples to process.
useful for quick tests (use all)
--keep_samples STR comma-separated list of samples to include
samples will still be subject to other filters
--exclude_samples STR
comma-separated list of samples to exclude.
samples will still be subject to other filters
```

#### Options for selecting genomic sites:

```
--site_depth INT minimum number of mapped reads per site (2)
--site_prev FLOAT site has at least <site_depth> coverage in at least <site_prev> proportion of samples (0.0)
                        a value of 1.0 will select sites that have sufficient coverage in all samples.
a value of 0.0 will select all sites, including those with low coverage in many samples
NAs recorded for included sites with less than <site_depth> in a sample
--site_maf FLOAT minimum average-minor-allele-frequency of site across samples (0.0)
setting this above zero (e.g. 0.01, 0.02, 0.05) will only retain variable sites
by default invariant sites are also retained.
--site_ratio FLOAT maximum ratio of site-depth to mean-genome-depth (None)
                        a value of 10 will filter genomic sites with 10x higher coverage than the genomic background
--max_sites INT maximum number of sites to include in output (use all)
useful for quick tests
```

#### Example command:
```
call_consensus.py --indir /path/to/snps --out consensus.fa --site_maf 0.01 --site_depth 5 --site_prev 0.90 --sample_depth 10 --sample_cov 0.40 --site_ratio 5
```

This command will build a multi-FASTA of core-genome sequences:
- core-genome sites defined as >=5 reads in >=90% of samples
- use only variable positions (>=1% minor allele frequency across samples)
- only include samples with sufficient data (>=10x mean-depth, >=40% of sites with >=1 mapped read)
- exclude sites with abnormal depth (>5x mean-depth or <1/5 mean-depth)
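The per-site consensus logic can be sketched in Python. This is a simplified stand-in for what `call_consensus.py` is described as doing, not its actual implementation; the function name and input structure are hypothetical. At each core-genome site, the majority allele is taken for the sample, and sites below the depth threshold are reported as missing data:

```python
# Minimal sketch of consensus-allele calling as described above;
# a simplified stand-in, not call_consensus.py's actual implementation.

def consensus_sequence(sites, site_depth=5):
    """Build one sample's consensus from per-site allele read counts.

    sites: list of dicts mapping allele -> read count at one genomic site.
    Sites with fewer than site_depth mapped reads are reported as 'N'.
    """
    seq = []
    for counts in sites:
        depth = sum(counts.values())
        if depth < site_depth:
            seq.append('N')  # insufficient coverage: missing data
        else:
            seq.append(max(counts, key=counts.get))  # majority allele
    return ''.join(seq)

sites = [
    {'A': 8, 'G': 1},  # well covered -> majority allele 'A'
    {'C': 2},          # depth 2 < 5  -> 'N'
    {'T': 4, 'C': 6},  # majority allele 'C'
]
consensus = consensus_sequence(sites)
```

Concatenating one such consensus string per sample yields the aligned multi-FASTA that the tree-building step consumes.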



### Step 2. Build phylogenetic tree
Now simply use your favorite tool to build the phylogenetic tree.

Example:
Download FastTree [here](http://www.microbesonline.org/fasttree)
And, run: `FastTree -gtr -nt < consensus.fa > consensus.tree`


### Step 3. Visualize tree
From a web-browser: [iTOL](http://itol.embl.de/)
Or, from R: [ape](https://cran.r-project.org/web/packages/ape)

53 changes: 29 additions & 24 deletions docs/strain_tracking.md
@@ -1,58 +1,63 @@
## Strain tracking

These scripts will allow you to identify rare SNPs that discriminate individual strains and to track these SNPs between hosts to elucidate transmission patterns.

Before running these scripts, you'll need to have run:
`merge_midas.py snps` ([read more](https://github.com/snayfach/MIDAS/blob/master/docs/merge_snvs.md)).


### Step 1: identify rare SNPs that discriminate individual strains of a particular species

* Scan across the entire genome of a particular species
* At each genomic site, compute the presence-absence of the four nucleotides across metagenomic samples
* Identify SNPs (particular nucleotide at a genomic site) that rarely occur across samples
* Because these SNPs are rare, they serve as good markers of individual strains
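The steps above can be sketched in Python. This is a simplified stand-in for what `strain_tracking.py id_markers` is described as doing, not its actual implementation; the function name and input structure are hypothetical. An allele is called present in a sample when it passes the read-count and frequency thresholds, and alleles seen in exactly one sample (the `--allele_freq 1` default) become markers:

```python
# Minimal sketch of marker-SNP discovery as described above; a simplified
# stand-in, not strain_tracking.py's actual implementation.
from collections import defaultdict

def id_markers(site_counts, min_freq=0.10, min_reads=3, allele_freq=1):
    """site_counts: {site_id: {sample_id: {allele: read_count}}}.

    Returns (site_id, allele) pairs observed in exactly allele_freq samples.
    """
    occurrences = defaultdict(set)
    for site, samples in site_counts.items():
        for sample, counts in samples.items():
            depth = sum(counts.values())
            for allele, reads in counts.items():
                # Call the allele present if it passes both thresholds
                if reads >= min_reads and reads / depth >= min_freq:
                    occurrences[(site, allele)].add(sample)
    return {marker for marker, found_in in occurrences.items()
            if len(found_in) == allele_freq}

counts = {
    'site1': {'s1': {'A': 10}, 's2': {'A': 9, 'T': 4}},  # 'T' only in s2
    'site2': {'s1': {'G': 8}, 's2': {'G': 7}},           # 'G' in both
}
markers = id_markers(counts)
```

Here only the 'T' allele at site1 is restricted to a single sample, so it is the one marker SNP retained.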

#### Command usage:

`strain_tracking.py id_markers --indir <PATH> --out <PATH> [options]`

#### Options:

<b>--sample_map PATH </b>
Path to mapping file that specifies groups of related sample_ids.
Specify this if there are groups of related samples likely to harbor the same strains (e.g. technical replicates, biological replicates, related individuals).
The file should be tab-delimited with no header and two columns. Example:

sample1 subject1
sample2 subject1
sample3 subject2
sample4 subject2
sample5 subject3

<b>--samples STR </b>
Comma-separated list of samples to use for training
By default, all samples are used

<b>--min_freq FLOAT </b>
Minimum allele frequency (proportion of reads) per site for SNP calling (0.10)

<b>--min_reads INT </b>
Minimum number of reads supporting allele per site for SNP calling (3)

<b>--allele_freq INT </b>
Maximum occurrences of allele across samples (1)
Setting this to 1 (default) will pick alleles found in exactly 1 sample

<b>--max_sites INT </b>
Maximum number of genomic sites to process (use all)
Useful for quick tests

#### Examples:
1) Use a subset of samples in the SNP matrix for training
`strain_tracking.py id_markers --indir merged_snps/species_id --out species.markers --samples sample1,sample2,sample3`

2) Run a quick test
`strain_tracking.py id_markers --indir merged_snps/species_id --out species.markers --max_sites 10000`

3) Use strict criteria for picking marker alleles:
`strain_tracking.py id_markers --indir indir --out outfile --min_freq 0.90 --min_reads 5 --allele_freq 1`

### Step 2: track rare SNPs between samples and determine transmission
* Compute the presence of marker SNPs (identified in Step 1) across metagenomic samples
* Quantify the number and fraction of marker SNPs that are shared between all pairs of metagenomic samples
* Based on a sharing cutoff (e.g. 5%), determine if a strain is shared or not
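The sharing computation can be sketched in Python. This is a simplified stand-in for what `strain_tracking.py track_markers` is described as doing, not its actual implementation; the function names are hypothetical, and using the smaller marker set as the denominator is an assumption of this sketch:

```python
# Minimal sketch of the marker-sharing comparison described above; a
# simplified stand-in, not strain_tracking.py's actual implementation.
# Denominator choice (smaller marker set) is an assumption of this sketch.

def marker_sharing(markers1, markers2):
    """Fraction of the smaller marker set that is present in both samples."""
    shared = len(markers1 & markers2)
    denom = min(len(markers1), len(markers2))
    return shared / denom if denom else 0.0

def same_strain(markers1, markers2, cutoff=0.05):
    """Call a shared strain when the sharing fraction reaches the cutoff."""
    return marker_sharing(markers1, markers2) >= cutoff

host_a = {('site%d' % i, 'A') for i in range(100)}  # 100 marker SNPs
host_b = set(list(host_a)[:10]) | {('siteX', 'T')}  # 10 shared + 1 unique
```

With a 5% cutoff, hosts a and b (sharing 10 of 11 markers in the smaller set) would be called as carrying the same strain.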

#### Command usage:

`strain_tracking.py track_markers --indir /path/to/snps/species_id --out species_id.marker_sharing --markers species_id.markers [options]`

#### Options:

<b>--min_freq FLOAT </b>
Minimum allele frequency (proportion of reads) per site for SNP calling (0.10)