Carnelian
This code is associated with the following manuscript. If you use any part of the source code, please cite us:

Sumaiya Nazeen, Yun William Yu, and Bonnie Berger*. "Carnelian uncovers hidden functional patterns across diverse study populations from whole metagenome sequencing reads." (Accepted for publication in Genome Biology.)

A preliminary version of the paper was presented at ECCB 2018 (Applications Track), and the bioRxiv preprint can be found at https://doi.org/10.1101/375121. Upon publication, further information can be found at http://carnelian.csail.mit.edu/.

0. Requirements

- Vowpal Wabbit 8.1.1
- scikit-learn
- R 3.3.2
- Python 2.7.13
- BioPython 1.70
- FragGeneScan

This code has been tested with GCC 6.3.0 on Ubuntu 17.04, running under Bash 4.4.7(1) on a server with an Intel Xeon E5-2695 v2 x86_64 2.40 GHz processor and 320 GB RAM. Using the EC-2010-DB dataset as the gold standard, Carnelian can be run comfortably on a machine with 16 GB RAM using 1 CPU.

1. Directory structure

- data/: EC-2010-DB dataset with gold standard EC labels.
- scripts/: R scripts for abundance estimation and analysis from read counts in functional bins.
- util/
  - ext/: external libraries.
  - test/: tests for drawfrag.c and fasta2skm.c.
  - drawfrag.c: draws fragments from fasta records.
  - fasta2skm.c: constructs features (spaced k-mer profiles) and converts them to VW input format.
  - ldpc.py: generates LSH functions using LDPC codes.
  - sequtil.py: splitting and merging utilities for fasta files.
  - merge_pairs.py: links paired-end read files using paired-end relationships.
  - reduce.py: translates sequences using reduced amino acid alphabets.
  - kseq.h: parses fasta files.
- tests/
  - demo_data/: data files required for unit and advanced tests.
  - basictest_carnelian.py: unit tests for Carnelian.
  - advancedtest_carnelian.py: end-to-end tests for Carnelian.
  - config.py: configures unit tests and advanced tests for Carnelian.
  - README.txt: instructions for running the tests.

2. Install and test

bash SETUP.sh

3. Usage

The default optional arguments (k-mer length, fragment size, number of hash functions, etc.) are set to work best with EC-2010-DB, as used in the manuscript. If you are going to train on a different dataset, be sure to tune these parameters. A minimal end-to-end sketch follows the mode descriptions below.

Modes:

1) ./carnelian.py frag [--optional-arguments] test_dir frag_dir [-h]

Looks for a fasta file in test_dir with a matching label file and randomly draws fragments of the length and coverage specified in the optional arguments (use "./carnelian.py frag -h" for details). Outputs these fragments, with corresponding labels, into frag_dir.

2) ./carnelian.py train [--optional-arguments] train_dir model_dir [-h]

Looks for a fasta file in train_dir with a matching label file. For each training batch, randomly draws fragments, generates feature vectors using Opal LDPC hashes, and trains a Vowpal Wabbit one-against-all classifier on all batches sequentially. To train classifiers in precise mode, use the "--precise" option, which makes the learned model store probabilities. Outputs the generated classifier model into model_dir.

3) ./carnelian.py retrain [--optional-arguments] old_model_dir new_model_dir new_exmpls_dir [-h]

Looks for a Vowpal Wabbit model with patterns and a dictionary file in old_model_dir, and a fasta file with matching labels in new_exmpls_dir. Starting with the old model, it updates the existing training model and merges the new labels with the old dictionary using the old LDPC patterns. Note that a model trained in default mode must be updated in default mode; the same holds for precise mode. Output model, dictionary, and pattern files are generated in new_model_dir.

4) ./carnelian.py translate [--optional-arguments] seq_dir out_dir fgsp_loc [-h]

Using the FragGeneScan program located in the fgsp_loc directory, tries to find coding sequences in the input reads fasta file in seq_dir, translates the coding sequences into possible ORFs, and outputs them as a fasta file in out_dir.

5) ./carnelian.py predict [--optional-arguments] model_dir test_dir predict_dir [-h]

Looks for a classifier model in model_dir and a fasta file of reads/fragments in test_dir. To make predictions with probabilities, run in precise mode using the "--precise" option and specify a probability cutoff using the "--cutoff <X>" option. Outputs the predictions into predict_dir as a fasta file with a corresponding label file.

6) ./carnelian.py eval reference_file predicted_labels [-h]

Evaluates prediction accuracy in terms of micro- and macro-averaged precision, sensitivity, and F1-score. If run in precise mode, it assumes the predicted_labels file has two tab-separated columns: <readID, predLabel>.

7) ./carnelian.py abundance in_dir out_dir mapping_file gs_file [-h]

Generates abundance estimates of functional terms. Looks for each sample's predicted labels in its own sub-directory of in_dir, with sample mapping information in mapping_file and the average protein length per label in gs_file. Please note that sample IDs must not start with digits. Outputs raw-count and effective-count matrices in out_dir.

8) ./carnelian.py simulate [--optional-arguments] test_dir train_dir out_dir [-h]

Runs a full performance-evaluation pipeline: trains on the data in train_dir, tests on the data in test_dir, and outputs fragments, model, and predictions under out_dir in the following directory structure:

- 1frag/: simulated test data (drawn fragments) are saved here (skipped if --do-not-fragment is used).
- 2model/: the classifier is saved here.
- 3predict/: fragment classifications are saved here.

9) ./carnelian.py annotate [--optional-arguments] sample_dir model_dir out_dir fgsp_loc [-h]

Annotates the input nucleotide reads: first performs gene finding and translation on the reads fasta file in sample_dir using the FragGeneScan program located in the fgsp_loc directory, then classifies the predicted ORFs using the model in model_dir, and outputs the labels into out_dir.
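For orientation, here is a minimal sketch of a typical run using the defaults above. All sample, output, and file names in it (my_sample, annotations/, predicted_labels.txt, my_mapping.txt, my_gs.txt, and the FragGeneScan path) are hypothetical placeholders, not fixed names the tool expects:

```bash
# Minimal sketch of a typical run; all paths other than data/ are
# hypothetical placeholders, and EC-2010-DB defaults are assumed.

# Train a one-against-all model on the gold-standard data
# (add --precise if you want probability-aware predictions later).
./carnelian.py train data/EC-2010-DB model

# Annotate a sample of raw nucleotide reads: FragGeneScan gene finding
# and translation, then ORF classification with the trained model.
./carnelian.py annotate my_sample model annotations/my_sample /path/to/FragGeneScan

# Evaluate predictions against gold-standard labels.
./carnelian.py eval my_sample_true_labels.txt annotations/my_sample/predicted_labels.txt

# Summarize per-sample predictions (one sub-directory per sample under
# annotations/) into raw and effective count matrices.
./carnelian.py abundance annotations counts my_mapping.txt my_gs.txt
```

Each mode accepts -h for its full list of optional arguments; tune them if your data differ from EC-2010-DB.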
The steps to follow in a typical workflow are given in the workflow.txt file.

To replicate our classification performance analysis, the code in performance_analysis.R can be used. Before running the script, the following R packages need to be installed: caret, pROC, ROCR, cvAUC, randomForest (a one-line install sketch is given at the end of this README).

Contact

Sumaiya Nazeen, [email protected]

Acknowledgement

This implementation of Carnelian is adapted from the source code of the following papers:

- Yunan Luo, Y. William Yu, Jianyang Zeng, Bonnie Berger, and Jian Peng. "Metagenomic binning through low-density hashing." Bioinformatics (2018), bty611, https://doi.org/10.1093/bioinformatics/bty611
- K. Vervier, P. Mahé, M. Tournoud, J.-B. Veyrieras, and J.-P. Vert. "Large-scale machine learning for metagenomics sequence classification." Technical report HAL-01151453, May 2015.
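As a convenience, the R prerequisites for performance_analysis.R listed above can be installed non-interactively; a minimal sketch, assuming Rscript is on your PATH and CRAN's cloud mirror is acceptable:

```bash
# Install the R packages required by performance_analysis.R.
# The mirror choice is an assumption; substitute your preferred CRAN mirror.
Rscript -e 'install.packages(c("caret","pROC","ROCR","cvAUC","randomForest"), repos="https://cloud.r-project.org")'
```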