Microbiome Encoder Decoder

A suite of models for doing time-series predictions of the human gut microbiome. At the moment the models are:

Feed-forward network (FFN)
Long short-term memory (LSTM)
Encoder-decoder. As of right now the encoder-decoder is the most successful, and all reporting is done on that model.
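
For readers new to this kind of task, here is a rough sketch of how a time-ordered abundance series might be cut into (input, target) pairs for sequence prediction. The function name, the window lengths, and the toy values are all assumptions for illustration; the models in this repository may frame their inputs differently.

```python
def make_windows(series, input_len, target_len):
    """Split a time-ordered series into (input, target) pairs.

    Each input is input_len consecutive samples; its target is the
    target_len samples that immediately follow it.
    """
    pairs = []
    last_start = len(series) - input_len - target_len
    for start in range(last_start + 1):
        split = start + input_len
        pairs.append((series[start:split], series[split:split + target_len]))
    return pairs


# Hypothetical relative abundances of one strain over five time points.
samples = [0.10, 0.12, 0.15, 0.11, 0.09]
pairs = make_windows(samples, input_len=3, target_len=1)
```

A series shorter than input_len + target_len simply yields no pairs.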

Data

Data sources

The input data come from QIITA (a repository for microbiome data). The studies used are:

Study_id=11052. This is a different set of Rob Knight's time-series data.
Study_id=2202. This is another time-series study with 2 other individuals.
Study_id=10283. This is Larry Smarr's pre- and post-op data for a colonoscopy. The study also contains data from a few other women.
Study_id=1015. This is a dataset of Rob Knight's and a few other researchers' data from a trip abroad. Rob's data from this study is excluded because it is already present in Study 11052.

Data preprocessing

After studies have been downloaded from QIITA (in the *.biom format), they need to be cleaned up before being fed to the model. The metadata for each study should be downloaded from QIITA at the same time. A taxonomy file is also needed (this can be generated with QIIME).

The general workflow is to use the scripts in the data_preprocessing directory in this order:

metadata_taxonomy_adder
biom_combiner (if necessary)
host_site_separator_time_sorting

At this point each sampling site or host is in its own file (as opposed to combined into one *.biom file), so only the sites or hosts of interest need to be carried through the remaining steps:

sum_truncate_sort_taxonomy
filtering_normalization_completion
top_N_strains

How to use each of these scripts is described in the respective file.
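
The scripts document their own usage, so the following is only a rough illustration of what the last two stages appear to do, judging by their names: normalizing raw counts to relative abundances and keeping the N most abundant strains. The function names, the data layout (a dict mapping strain name to a list of per-time-point counts), and the ranking rule (mean relative abundance) are assumptions made for this sketch, not the repository's actual code.

```python
def normalize_columns(counts):
    """Convert raw counts to relative abundances at each time point.

    counts: dict mapping strain name -> list of counts, one per time point.
    """
    strains = list(counts)
    n_points = len(next(iter(counts.values())))
    totals = [sum(counts[s][t] for s in strains) for t in range(n_points)]
    return {
        s: [counts[s][t] / totals[t] if totals[t] else 0.0 for t in range(n_points)]
        for s in strains
    }


def top_n_strains(abundances, n):
    """Keep the n strains with the highest mean relative abundance."""
    ranked = sorted(
        abundances,
        key=lambda s: sum(abundances[s]) / len(abundances[s]),
        reverse=True,
    )
    return {s: abundances[s] for s in ranked[:n]}


# Hypothetical toy table: three strains sampled at two time points.
raw = {
    "strain_a": [80, 60],
    "strain_b": [15, 30],
    "strain_c": [5, 10],
}
relative = normalize_columns(raw)
top2 = top_n_strains(relative, 2)
```

Here top2 keeps strain_a and strain_b and drops strain_c, the least abundant strain on average.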

The output of the data preprocessing pipeline is available in input_data. The directory summed_completed_no_chloro was used to train the models reported on in this study. The _no_chloro refers to the fact that an organism identified as part of a chloroplast was manually removed from those datasets.
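
The chloroplast removal was done by hand, but a comparable filter could be sketched as below. It assumes each row carries a taxonomy label in which chloroplast-derived sequences are marked with the word "Chloroplast" (as in Greengenes-style strings); the function name, the row layout, and the example labels are hypothetical.

```python
def drop_chloroplast(rows):
    """Remove rows whose taxonomy label mentions a chloroplast.

    rows: list of (taxonomy_label, values) tuples.
    """
    return [
        (label, values)
        for label, values in rows
        if "chloroplast" not in label.lower()
    ]


# Hypothetical rows: a taxonomy label plus per-time-point abundances.
rows = [
    ("k__Bacteria; p__Firmicutes; c__Clostridia", [0.6, 0.7]),
    ("k__Bacteria; p__Cyanobacteria; c__Chloroplast", [0.1, 0.0]),
    ("k__Bacteria; p__Bacteroidetes; c__Bacteroidia", [0.3, 0.3]),
]
kept = drop_chloroplast(rows)
```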

Training a model

Given a directory of data produced as described above (let's call it input_dir), training is very simple. Each model type has its own directory, dev/models/<model_type>/, which contains a file called params.py where the pertinent training parameters can be altered.

To train a model do:

python dev/models/<some_model>/trainer.py -d input_data/summed_completed_no_chloro/all_strains_top_N

If you also want a testing dataset, move the relevant dataset CSV(s) out of the directory listed above and into their own directory (e.g., test). The command above then becomes:

python dev/models/<some_model>/trainer.py -d input_data/summed_completed_no_chloro/all_strains_top_N -t input_data/summed_completed_no_chloro/test
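
Moving the held-out CSVs and assembling the command can also be scripted. This is a minimal sketch: the helper names move_to_test_dir and trainer_command are hypothetical, and only the trainer.py path pattern and the -d and -t flags come from the commands above.

```python
import shutil
import sys
from pathlib import Path


def move_to_test_dir(train_dir, test_dir, filenames):
    """Move the named CSV files from train_dir into test_dir."""
    train_dir, test_dir = Path(train_dir), Path(test_dir)
    test_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for name in filenames:
        source = train_dir / name
        if source.suffix != ".csv" or not source.is_file():
            raise FileNotFoundError(f"not a CSV file in {train_dir}: {name}")
        shutil.move(str(source), str(test_dir / name))
        moved.append(test_dir / name)
    return moved


def trainer_command(model_type, train_dir, test_dir=None):
    """Build the trainer invocation as an argument list.

    The -t flag is added only when a test directory is given.
    """
    command = [
        sys.executable,
        f"dev/models/{model_type}/trainer.py",
        "-d",
        str(train_dir),
    ]
    if test_dir is not None:
        command += ["-t", str(test_dir)]
    return command
```

The resulting list can be passed to subprocess.run(..., check=True) from the repository root, with model_type set to the name of one of the directories under dev/models/.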

Evaluating a trained model

The IPython notebook at notebooks/Model Evaluator.ipynb has all of the tools necessary for assessing model quality.

Evaluating the input data

The IPython notebook at notebooks/Input Analysis has all of the tools necessary for assessing trends in the data.

michaelwiest/microbiome-rnn
