From 9b429bc00cc2641f8b0c0fe1454fd0ad115ef23f Mon Sep 17 00:00:00 2001 From: Adam Orr Date: Mon, 20 Feb 2017 10:57:25 -0700 Subject: [PATCH] work on documentation. #23 --- README.md | 77 ++++++++++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 74 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index f3f5fc8..f0287e3 100644 --- a/README.md +++ b/README.md @@ -177,7 +177,7 @@ Usage: clean_reads.sh [-t THREADS] [-d DEST_DIRECTORY] [-k KMER_SIZE] [-f FILE_S * **-c:** max coverage to allow [40000] * **-i:** input directory to search for reads to clean -This script uses Rcorrector and Khmer to clean reads found in READ_DIRECTORY. +This script uses Rcorrector and Khmer to clean reads found in READ_DIRECTORY. Rcorrector is used to remove errors and Khmer is used to remove excessively repetitive reads. The -d option changes which directory to put output in and defaults to ./cleaned_reads . The -k option changes the kmer size to use and defaults to 32. The -f flag is a pattern that finds reads in READ_DIRECTORY matching that pattern; by default, it is '*.fastq' . @@ -186,6 +186,18 @@ The -m flag controls how much memory is allocated to khmer and defaults to 64e9. The -c flag is an approximate coverage cutoff to filter on using Khmer and defaults to 40000. The -i option specifies the directory to search for reads to correct. +###What it does +This script uses Rcorrector and Khmer to clean reads found in READ_DIRECTORY. Rcorrector is used to remove errors and Khmer is used to remove excessively repetitive reads. + +###Expected output +A file containing the khmer graph, called khmer_count.graph. + +Two folders: + * **corrected/** containing the corrected read pairs. + * **sliced/** containing the read pairs after filtration with Khmer, as well as reads which passed filtering but their mate did not. + +Output reads will have the same name as the input reads, but corrected reads will have a .cor.fq suffix and sliced reads will have a .cor_sliced.${FILE_SUFFIX} suffix. + ##slice-paired-reads-by-coverage.py Usage: slice-paired-reads-by-coverage.py [-m MIN_COVERAGE] [-M MAX_COVERAGE] INPUT_GRAPH INPUT_READS1 INPUT_READS2 OUTPUT_READS1 OUTPUT_READS2 [OUTPUT_SINGLETONS] @@ -200,6 +212,11 @@ Usage: slice-paired-reads-by-coverage.py [-m MIN_COVERAGE] [-M MAX_COVERAGE] INP This is a modified version of khmer's [slice-reads-by-coverage.py](https://github.com/dib-lab/khmer/blob/master/sandbox/slice-reads-by-coverage.py) script. It can handle paired-end reads. Check out [this blog post](http://ivory.idyll.org/blog/2014-slice-reads-by-coverage.html) for more details on how the script works. In summary, it will remove reads which have a high or low expected coverage level. This can be useful for removing reads which are likely to be errors or reads likely to map to repetitive regions. All positional arguments are required except the singleton output file. One of -m or -M must be specified. +###What it does +Removes reads which have a high or low expected coverage level using khmer. + +###Expected output +3 files: OUTPUT_READS1, OUTPUT_READS2, and OUTPUT_SINGLETONS, which are the first, second, and singleton reads. The singleton reads are those that passed filtering, but their mates did not. ##bwa_aligner.sh Usage: bwa_aligner.sh [-t THREADS] [-d TMPDIR] [-p RG_PLATFORM] [-q FILEPATTERN] [-1 FIRSTMATE] [-2 SECONDMATE] [-o out.bam] -r reference.fa -i data/ @@ -227,6 +244,12 @@ It will then match these files with their mates by substituting the value of -2. The pairs of reads should be in the same directory and have the same name except for the substitution of the value of -1 for the value of -2. For example, in a data directory with reads_R1.fastq and reads_R2.fastq, by default the script will identify the two files by their suffixes, identify the file containing R1 as the first set of reads and substitute R1 with R2 to identify the file containing the second set of reads. +###What it does +Uses BWA to align reads in the given directory to the given reference and outputs a bamfile. + +###Expected output +A bam file, specified by -o. + ##stampy_realigner.sh Usage: stampy_realigner.sh [-t THREADS] [-d TMPDIR] [-o out.bam] -r reference.fasta -i data.bam @@ -239,6 +262,12 @@ Usage: stampy_realigner.sh [-t THREADS] [-d TMPDIR] [-o out.bam] -r reference.fa This script uses Stampy to realign poorly mapped reads (from Stampy's --bamkeepgoodreads option) and outputs to `out.bam`. The given reference should be the same used to generate `data.bam`. A temporary directory to be used can be specified with -d. +###What it does +Remaps poorly aligned reads in a given bam file using Stampy. + +###Expected output +A bam file, specified by -o. + ##create_consensus.sh Usage: create_consensus.sh [-t THREADS] [-d TMPDIR] [-f FILTER] [-c CHAINFILE] [-b BCFTOOLSFILE] -o out.fa -r reference.fasta -i in.bam @@ -251,7 +280,15 @@ Usage: create_consensus.sh [-t THREADS] [-d TMPDIR] [-f FILTER] [-c CHAINFILE] [ * **-r:** reference used to create input BAM file. * **-i:** input BAM file to create consensus from. -This script uses bcftools and tabix to create a consensus in fasta format from a BAM file aligned to a reference. A filter is applied to the genotypes called by bcftools which can be changed with -f to give the desired behavior. A chain file that describes the differences between the reference and the output is generated and called consensus.chain by default. This can be changed with the -c flag. A file containing the genotype calls generated by BCFtools is generated and can be saved for use with the -b option. +This script uses bcftools and tabix to create a consensus in fasta format from a BAM file aligned to a reference. A filter is applied to the genotypes called by bcftools which can be changed with -f to give the desired behavior. A chain file that describes the differences between the reference and the output is generated with the same name as the output fasta file with an added .chain suffix by default. This can be changed with the -c flag. A file containing the genotype calls made by BCFtools is generated and can be saved for use with the -b option. + +###What it does +Uses bcftools and tabix to create a consensus sequence in fasta format from a BAM file. + +###Expected output + * A consensus reference specified by -o. + * A chainfile describing the rearrangements between the input reference and the output consensus file. By default, this has the same name as the output consensus but with an additional .chain suffix. + * When -b is used, a vcf file of variant calls used to generate the consensus. ##gatkcaller.sh Usage: gatkcaller.sh [-t THREADS] [-p PICARD_CMD] [-d TMPDIR] [-g GATK_PATH] [-b BEDFILE] [-o OUTFILE] -r ref.fa -i data.bam @@ -267,6 +304,12 @@ Usage: gatkcaller.sh [-t THREADS] [-p PICARD_CMD] [-d TMPDIR] [-g GATK_PATH] [-b A script implementing the [GATK best practices](https://software.broadinstitute.org/gatk/best-practices/) to call variants from the input BAM file. A BED file can be specified to exclude the regions described in it from variant calling, if desired. +###What it does +Calls variants from a BAM file using the GATK best practices. + +###Expected output +A vcf file specified by -o. + ##filt_with_replicates.pl Usage: filt_with_replicates.pl [-s | --strict] [-m | --majority] [-g | --grouped int] [-h | -? | --help] filtered.vcf @@ -276,6 +319,12 @@ Usage: filt_with_replicates.pl [-s | --strict] [-m | --majority] [-g | --grouped This script filters a vcf using replicate data from the VCF. By default, it will consider all samples to be replicates of one another that have the same sample name except for the last character. Using -g, one can specify that the replicates are grouped together in the VCF. For example, -g 3 will assume that the first 3 samples are replicates of one another, the 2nd 3 samples are replicates of one another, and so on. A group of replicates will not pass the filter unless the genotypes of all replicates match. The -m flag relaxes this stipulation, and allows a group to pass filtering if the majority of replicates match. In this case, all genotypes of the replicate will be changed to the majority. If a replicate fails to pass filtering, all samples of that replicate are changed to have a ./. genotype. The -s flag alters filtering behavior such that a site is only emitted if all replicates pass filtering at that site. +###What it does +Filter a vcf using replicate information. + +###Expected output +A filtered vcf file printed to stdout. + ##vcf2fa.sh Usage: vcf2fa.sh [-d tmp_dir/] [-i file.vcf] [-o out.fa] @@ -284,6 +333,13 @@ Usage: vcf2fa.sh [-d tmp_dir/] [-i file.vcf] [-o out.fa] * **-o:** output fasta file [stdout] This script takes sites in the input VCF and outputs them to a fasta alignment of all the sites using IUPAC ambiguity for heterozygous sites. +Note that any downstream analysis should be done using ascertainment bias correction if possible, unless you know exactly what you're doing. + +###What it does +Takes sites from a VCF file and generates a fasta alignment of these sites. + +###Expected output +A fasta alignment specified by -o. If -o is not used, prints to stdout. ##diploidify.py Usage: diploidify.py [-i INFILE] [-t IN_TYPE] [-o OUTFILE] [-p OUTTYPE] [-v] [-s] @@ -297,6 +353,12 @@ Usage: diploidify.py [-i INFILE] [-t IN_TYPE] [-o OUTFILE] [-p OUTTYPE] [-v] [-s A script to turn an alignment where each diploid genotype is represented as one site with IUPAC ambiguity codes into an alignment where each genotype is represented as 2 sites. This script also removes sites that require two mutations (like C/C -> T/T). It also removes sites that have 3 alleles. Using -s will keep these checks (ie remove triallelic sites and sites that require multiple mutations) but won't expand each site into two other sites, keeping the original notation. Using -v will output only variable sites. +###What it does +Takes a fasta alignment and doubles the number of sites, eliminating IUPAC ambiguity codes. This is useful if the ambiguity codes represent a diploid genotype and not actual ambiguity. + +###Expected output +An alignment specified by -o. If -o is not used, prints to stdout. + ##vcf2tree.sh Usage: vcf2tree.sh [-t THREADS] [-r raxmlHPC] [-i in.vcf] [-o out.nwk] -g GROUPBY @@ -308,6 +370,12 @@ Usage: vcf2tree.sh [-t THREADS] [-r raxmlHPC] [-i in.vcf] [-o out.nwk] -g GROUPB Takes a vcf file and filters it using the strict mode of `filt_with_replicates.pl`, creates a fasta alignment with `vcf2fa`, then creates a diploidified version of the alignment with `diploidify.py`, and finally creates a newick tree using RAxML. -g is the same argument used in `filt_with_replicates.pl`. Use -r to tell the script how to invoke RAxML on this system. +###What it does +Creates a newick tree from a VCF file. + +###Expected output +A newick tree specified by -o. If -o is not specified, prints the tree to stdout. + ##plot_tree.R Usage: Rscript plot_tree.R tree.nwk plot.pdf @@ -316,6 +384,9 @@ Usage: Rscript plot_tree.R tree.nwk plot.pdf R script to plot the given newick tree. The script will label the tips using the first string of numbers that appears in its name. Requires the ape and phytools R packages. +###What it does +Plots a newick tree. - +###Expected output +A pdf file, given as the second argument to the script.