Compositional data analysis in Python.
This package is meant to extend the methods of scikit-bio and serve as a pythonic alternative (not replacement) to some functionalities within propr.
Compatible for Python 3.
Required:
- pandas
- numpy
- scipy
Optional:
- scikit-bio
- gneiss
- ete[2/3]
- matplotlib
- seaborn
- scitkit-learn
# Stable release (Preferred)
pip install compositional
# Developmental release
pip install git+https://github.com/jolespin/compositional
- scikit-bio: A package providing data structures, algorithms and educational resources for bioinformatics
- gneiss: a compositional data analysis toolbox designed for analyzing high dimensional proportions (Jamie Morton)
-
Jin, S., Notredame, C. and Erb, I., 2022. Compositional Covariance Shrinkage and Regularised Partial Correlations. arXiv preprint arXiv:2212.00496.
-
Quinn T, Richardson MF, Lovell D, Crowley T (2017) propr: An R-package for Identifying Proportionally Abundant Features Using Compositional Data Analysis. Scientific Reports 7(16252): doi:10.1038/s41598-017-16520-0
-
Espinoza JL. compositional: Compositional data analysis in Python (2020). https://github.com/jolespin/compositional
-
Jin, S., Notredame, C. and Erb, I., 2022. Compositional Covariance Shrinkage and Regularised Partial Correlations. arXiv preprint arXiv:2212.00496.
-
Erb, I., 2020. Partial correlations in compositional data analysis. Applied Computing and Geosciences, 6, p.100026.
-
Quinn TP, Erb I, Gloor G, Notredame C, Richardson MF, Crowley TM (2019) A field guide for the compositional analysis of any-omics data. GigaScience 8(9). doi:10.1093/gigascience/giz107
-
Quinn T, Erb I, Richardson MF, Crowley T (2018) Understanding sequencing data as compositions: an outlook and review. Bioinformatics 34(16): doi:10.1093/bioinformatics/bty175
-
Erb I, Quinn T, Lovell D, Notredame C (2017) Differential Proportionality - A Normalization-Free Approach To Differential Gene Expression. Proceedings of CoDaWork 2017, The 7th Compositional Data Analysis Workshop; available under bioRxiv 134536: doi:10.1101/134536
-
Erb I, Notredame C (2016) How should we measure proportionality on relative gene expression data? Theory in Biosciences 135(1): doi:10.1007/s12064-015-0220-8
-
Lovell D, Pawlowsky-Glahn V, Egozcue JJ, Marguerat S, Bahler J (2015) Proportionality: A Valid Alternative to Correlation for Relative Data. PLoS Computational Biology 11(3): doi:10.1371/journal.pcbi.1004075
-
Morton, J.T., Sanders, J., Quinn, R.A., McDonald, D., Gonzalez, A., Vázquez‐Baeza, Y., et al . (2017) Balance trees reveal microbial niche differentiation. mSystems: e00162‐16. doi: 10.1128/mSystems.00162-16
- Espinoza JL., Shah N, Singh S, Nelson KE., Dupont CL. Applications of weighted association networks applied to compositional data in biology. https://doi.org/10.1111/1462-2920.15091
Each function operates on either 2D pd.DataFrame
or np.array
objects and output either pandas
or numpy
objects, respectively.
Transformation functions (e.g., transform_clr
) output the equivalent object with the same shape.
Pairwise functions either output a redundant form or non-redundant form. If a numpy
object is input, then either a 2D redundant form or 1D non-redundant form np.array
object will be output. If a pd.DataFrame
is input then there are 2 types of output that can be returned. If redundant_form=True
, then a square pd.DataFrame
will be returned. If redundant_form=False
, then a pd.Series
will be returned and the index will contain frozenset
objects that have the combinations.
For the operations in logspace, a pseudocount of 1 is added to avoid -inf values for log(0).
For usage, we are going to load data from oral microbiome 16S amplicon data from Gomez and Espinoza et al. 2017.
import compositional as coda
import pandas as pd
# Load abundances (Gomez and Espinoza et al. 2017)
X = pd.read_csv("https://github.com/jolespin/projects/raw/main/supragingival_plaque_microbiome/16S_amplicons/Data/X.tsv.gz",
sep="\t",
index_col=0,
compression="gzip",
)
# Load metadata
Y = pd.read_csv("https://github.com/jolespin/projects/raw/main/supragingival_plaque_microbiome/16S_amplicons/Data/Y.tsv.gz",
sep="\t",
index_col=0,
compression="gzip",
).loc[X.index]
# print("X.shape: (n={} samples, m={} OTUs)")
# X.shape: (n=473 samples, m=481 OTUs)
# Classes
classes = pd.Series(((Y["Caries_enamel"] == "YES").astype(int) + (Y["Caries_dentine"] == "YES").astype(int)).map(lambda x: {True:"Caries", False:"Caries-free"}[x > 0]), name="Diagnosis")
class_colors = {"Caries-free":"black", "Caries":"red"}
Filtering functions to preprocess data. Example use case: (1) Remove all samples with less than 10,000 total counts; (2) then all features that aren't in at least 50% of the samples, and then (3) samples that don't have at least 50 detected components.
X_filtered = coda.filter_data_highpass(
X=X,
minimum_total_counts=10000,
minimum_prevalence=0.5,
minimum_components=50,
)
X.shape, X_filtered.shape
# ((473, 481), (401, 93))
Summary metrics for compositional data.
# Sparsity
s = coda.sparsity(X)
print("Ratio of zeros in dataset: {:.3f}".format(s))
# Ratio of zeros in dataset: 0.776
# Total number of components per composition (i.e., richness)
coda.number_of_components(X).head()
# S-1409-45.B_RD1 111
# 1104.2_RD1 84
# S-1409-42.B_RD1 142
# 1073.1_RD1 101
# A-1504-100.B_RD1 95
# Prevalence of components across compositions
coda.prevalence_of_components(X).head()
# Otu000514 470
# Otu000001 473
# Otu000038 472
# Otu000003 473
# Otu000326 432
All pairwise operations support either a redundant form or non-redundant form using the redundant_form
argument.
# Pairwise Aitchison distance (redundant form)
aitchison_distances = coda.pairwise_aitchison_distance(X + 1, redundant_form=True)
# print(aitchison_distances.iloc[:4,:4])
# S-1409-45.B_RD1 1104.2_RD1 S-1409-42.B_RD1 1073.1_RD1
# S-1409-45.B_RD1 0.000000 25.384218 21.573635 23.455055
# 1104.2_RD1 25.384218 0.000000 27.811292 21.942080
# S-1409-42.B_RD1 21.573635 27.811292 0.000000 26.734435
# 1073.1_RD1 23.455055 21.942080 26.734435 0.000000
# Pairwise Aitchison distance (non-redundant form)
aitchison_distances = coda.pairwise_aitchison_distance(X + 1, redundant_form=False)
# print(aitchison_distances)
# aitchison_distance
# (S-1409-45.B_RD1, 1104.2_RD1) 25.384218
# (S-1409-45.B_RD1, S-1409-42.B_RD1) 21.573635
# (S-1409-45.B_RD1, 1073.1_RD1) 23.455055
# (S-1409-45.B_RD1, A-1504-100.B_RD1) 21.330042
# (S-1409-45.B_RD1, 2053.2_RD1) 22.531754
# ...
# (S-1410-40.B_RD1, M-1507-132.A_RD1) 23.247654
# (M-1507-132.A_RD1, C-1504-92.B_RD1) 20.422768
# (S-1410-40.B_RD1, 2005.1_RD1) 22.294198
# (2005.1_RD1, C-1504-92.B_RD1) 21.323598
# (S-1410-40.B_RD1, C-1504-92.B_RD1) 21.073093
# Length: 111628, dtype: float64
# Pairwise variance log-ratio
vlr = coda.pairwise_vlr(X + 1)
# print(vlr.iloc[:4,:4])
# Otu000514 Otu000001 Otu000038 Otu000003
# Otu000514 0.000000 0.764679 1.844322 1.869921
# Otu000001 0.764679 0.000000 1.299599 1.230553
# Otu000038 1.844322 1.299599 0.000000 2.207001
# Otu000003 1.869921 1.230553 2.207001 0.000000
# Pairwise rho from Erb et al. 2016
rhos = coda.pairwise_rho(X + 1)
# print(rhos.iloc[:4,:4])
# Otu000514 Otu000001 Otu000038 Otu000003
# Otu000514 1.000000 0.708325 0.304007 0.298552
# Otu000001 0.708325 1.000000 0.355895 0.394880
# Otu000038 0.304007 0.355895 1.000000 -0.070423
# Otu000003 0.298552 0.394880 -0.070423 1.000000
# Pairwise phi from Erb et al. 2016
phis = pairwise_phi(X + 1)
# print(phis.iloc[:4,:4])
# Otu000514 Otu000001 Otu000038 Otu000003
# Otu000514 0.000000 0.470005 1.133602 1.149336
# Otu000001 0.470005 0.000000 1.306492 1.237079
# Otu000038 1.133602 1.306492 0.000000 2.157470
# Otu000003 1.149336 1.237079 2.157470 0.000000
# Pairwise partial correlation with basis shrinkage from Erb et al. 2020 and Jin et al. 2022
pcorr = coda.pairwise_partial_correlation_with_basis_shrinkage(X + 1)
# print(pcorr.iloc[:4,:4])
# Otu000514 Otu000001 Otu000038 Otu000003
# Otu000514 1.000000 0.256310 -0.022194 -0.005131
# Otu000001 0.256310 1.000000 0.105960 0.222187
# Otu000038 -0.022194 0.105960 1.000000 -0.042785
# Otu000003 -0.005131 0.222187 -0.042785 1.000000
# Isometric log-ratio
X_ilr_without_tree = coda.transform_ilr(X + 1)
# print(X_ilr_without_tree.iloc[:4,:4])
# 0 1 2 3
# S-1409-45.B_RD1 -2.663112 -0.139161 -1.098112 6.023297
# 1104.2_RD1 -2.094331 3.804032 -4.579665 2.357939
# S-1409-42.B_RD1 -1.909313 -0.023536 -0.018245 5.614873
# 1073.1_RD1 -1.879929 2.322184 -2.717553 2.426881
import requests
from io import StringIO
from skbio import TreeNode
# Get newick tree
url = "https://github.com/jolespin/projects/raw/main/supragingival_plaque_microbiome/16S_amplicons/Data/otus.alignment.fasttree.nw"
newick = requests.get(url).text
tree = TreeNode.read(StringIO(newick), convert_underscores=False)
tree.bifurcate()
# Name internal nodes
intermediate_node_index = 1
for node in tree.traverse():
if not node.is_tip():
node.name = "y{}".format(intermediate_node_index)
intermediate_node_index += 1
# Isometric log-ratio transform
X_ilr_with_tree = coda.transform_ilr(X + 1, tree)
# print(X_ilr_with_tree.iloc[:4,:4])
# y1 y2 y480 y3
# S-1409-45.B_RD1 -1.039407 1.655538 -2.464164e-17 6.189481e-16
# 1104.2_RD1 -0.673964 1.073470 4.192522e-18 3.923163e-16
# S-1409-42.B_RD1 -1.326432 2.112703 3.851113e-17 8.306736e-16
# 1073.1_RD1 -0.979605 1.560287 5.023995e-18 5.907717e-16
Let's color the samples by a continuous variable (e.g., age in months).
sample_labels = pd.Index(X.sum(axis=1).sort_values().index[:4].tolist())
fig, g, df = coda.plot_compositions(X, colors=Y.loc[X.index,"age (months)"], sample_labels=sample_labels, title="Caries", figsize=(8,5))
Now color the samples by each class (e.g., phenotype).
fig, g, df = coda.plot_compositions(X, classes=classes, class_colors=class_colors, log_scale=True, title="Caries", style="ggplot", vertical_lines=[1, 1000,5000])
To identify a threshold to remove low prevalence components/features let's plot a prevalence curve where the x-axis shows the prevalence and y-axis shows the number of components are prevalent in x samples.
First, let's look at the prevalence globally. We want to see number of OTUs that are prevalent in at least 1 sample, 2 samples, half the samples, and all the samples.
There are 462 OTUs that in are in at least 1 sample, 392 OTUs that are in at least 2 samples (i.e., 462 - 392 = 70 singleton OTUs), and 11 OTUs that are in all the samples.
fig, ax, prevalence_distribution = coda.plot_prevalence(X, component_type="OTUs", show_prevalence=[1,2,0.5,1.0])
Now, let's look at the prevalence for each classes separately.
fig, ax, prevalence_distribution = coda.plot_prevalence(X, classes=classes, class_colors=class_colors, component_type="OTUs", show_prevalence=[1,2,0.5,1.0])
- Versions prior to v2020.12.16 used
ddof=0
for all variance except during thevlr
calculation. This was becausepandas._libs.algos.nancorr
usesddof=1
and notddof=0
. This caused specificrho
values not to be bound by [-1,1]. To retain the performance ofnancorr
, I've set allddof=1
to matchnancorr
. - The partial correlation with basis shrinkage is implemented exactly the same as
propr
as the backend algorithm in thecorpcor
package uses an updated the Ledoit-Wolf shrinkage approach from Opgen-Rhein, R., and K. Strimmer. 2007 and Schafer, J., and K. Strimmer. 2005.
- Thomas Quinn for insightful explanations of compositional data analysis and Jamie Morton for help in understanding isometric log-ratio transformations.
- Ionas Erb and Suzanne Jin for their help in understanding partial correlation with basis shrinkage.