High-level Ensemble Network implementations in Python. Built on top of NetworkX and Pandas.
Compatible for Python 3.
pandas
numpy
scipy
networkx
matplotlib
soothsayer_utils
compositional
- Nabwera HM+, Espinoza JL+, Worwui A, Betts M, Okoi C, Sesay AK, Bancroft R, Agbla SC, Jarju S, Bradbury RS, Colley M, Jallow AT, Liu J, Houpt ER, Prentice AM, Antonio M, Bernstein RM, Dupont CL+, Kwambana-Adams BA+. Interactions between fecal gut microbiome, enteric pathogens, and energy regulating hormones among acutely malnourished rural Gambian children. EBioMedicine. 2021 Oct 22;73:103644. doi: 10.1016/j.ebiom.2021.103644. PMID: 34695658.
- Josh L Espinoza, Manolito Torralba, Pamela Leong, Richard Saffery, Michelle Bockmann, Claire Kuelbs, Suren Singh, Toby Hughes, Jeffrey M Craig, Karen E Nelson, Chris L Dupont, Differential network analysis of oral microbiome metatranscriptomes identifies community scale metabolic restructuring in dental caries, PNAS Nexus, Volume 1, Issue 5, November 2022, pgac239, doi: 10.1093/pnasnexus/pgac239
# Stable release (Preferred)
pip install ensemble_networkx
# Current developmental release
pip install git+https://github.com/jolespin/ensemble_networkx
- Migrated from
soothsayer
-
Compositional data (e.g., counts data, NGS, etc.)
- Do not use Pearson, Spearman, Kendall-Tau, Biweight Midcorrelation for compositional data.
- Compositionally-valid association metrics:
- Partial correlation with basis shrinkage (
pcorr_bshrink
) (Jin et al. 2022, Erb 2020) - Proportionality (
rho
) (Lovell et al. 2015, Erb 2016) - Proportionality (
phi
) (Lovell et al. 2015, Erb 2016)
- Partial correlation with basis shrinkage (
-
Binary data (e.g., detected vs. not-detected)
- Matthew's Correlation Coefficient (
mcc
)
- Matthew's Correlation Coefficient (
-
Miscellaneous data:
- Pearson's correlation (
pearson
) - Spearman's correlation (
spearman
) - Kendall-Tau (
kendall
) - Biweight midcorrelation (
bicor
)
- Pearson's correlation (
Documentation will be released upon version 1.0 once API is stabilized.
import ensemble_networkx as enx
Simple case of an iris dataset ensemble network
Here we randomly sample 100 times, calculate the associations for each draw, and calculate summary statistics for the distributons of association values (i.e., edge weights).
If you choose sampling_size
float between 0 < x ≤ 1.0
then the number of samples drawn will be x * n
where n
is the number of samples. If an integer is used then it will grab that number of samples for each draw.
If you use this method, please cite:
Josh L Espinoza, Manolito Torralba, Pamela Leong, Richard Saffery, Michelle Bockmann, Claire Kuelbs, Suren Singh, Toby Hughes, Jeffrey M Craig, Karen E Nelson, Chris L Dupont, Differential network analysis of oral microbiome metatranscriptomes identifies community scale metabolic restructuring in dental caries, PNAS Nexus, Volume 1, Issue 5, November 2022, pgac239, https://doi.org/10.1093/pnasnexus/pgac239
import soothsayer_utils as syu
import numpy as np
from scipy import stats
# Load in data
X,y = syu.get_iris_data(["X", "y"])
# Create ensemble network
ens = enx.EnsembleAssociationNetwork(name="Iris", node_type="leaf measurement", edge_type="association", observation_type="specimen")
ens.fit(X=X, metric="pearson", n_iter=100, sampling_size=1.0, with_replacement=True, stats_summary=[np.median, stats.median_abs_deviation], stats_tests=[stats.normaltest], copy_ensemble=True)
print(ens)
======================================================
EnsembleAssociationNetwork(Name:Iris, Metric: pearson)
======================================================
* Number of nodes (leaf measurement): 4
* Number of edges (association): 6
* Observation type: specimen
--------------------------------------------------
| Parameters
--------------------------------------------------
* n_iter: 100
* sampling_size: 150
* random_state: 0
* with_replacement: True
* transformation: None
* memory: 15.238 KB
--------------------------------------------------
| Data
--------------------------------------------------
* Features (n=150, m=4, memory=9.930 KB)
* Ensemble (memory=4.812 KB)
* Statistics (['median', 'median_abs_deviation', 'CI(5%)', 'CI(95%)', 'normaltest|stat', 'normaltest|p_value'], memory=508 B)
Let's look at the "ensemble" which includes all of the associations for each of the permutations.
# View ensemble
print(ens.ensemble_.head())
Edges (sepal_length, sepal_width) (petal_length, sepal_length) \
Iterations
0 -0.061921 0.877956
1 -0.096871 0.868054
2 -0.274915 0.881042
3 -0.044374 0.837341
4 -0.010991 0.887051
Now let's look at the summary statistics that were calculated for each of the edges:
# View statistics
print(ens.stats_.head())
Statistics median median_abs_deviation CI(5%) \
Edges
(sepal_length, sepal_width) -0.111914 0.048581 -0.227990
(petal_length, sepal_length) 0.872573 0.011590 0.845797
(petal_width, sepal_length) 0.814497 0.010882 0.785130
(petal_length, sepal_width) -0.432412 0.036563 -0.522004
(petal_width, sepal_width) -0.370789 0.034571 -0.457290
Statistics CI(95%) normaltest|stat normaltest|p_value
Edges
(sepal_length, sepal_width) -0.015495 2.649309 0.265895
(petal_length, sepal_length) 0.893086 3.434098 0.179595
(petal_width, sepal_length) 0.844199 8.300075 0.015764
(petal_length, sepal_width) -0.327368 5.811593 0.054705
(petal_width, sepal_width) -0.261532 11.567172 0.003078
Simple case of an ensemble network for binary data using Matthew's Correlation Coefficient (MCC)
Pearson correlation isn't designed for binary data (i.e., True/False or 0/1) so you can use MCC
instead.
# Create ensemble network using MCC for binary data
n,m = 1000, 100
X_binary = pd.DataFrame(
data=np.random.RandomState(0).choice([0,1], size=(n,m)),
index=map(lambda i: f"sample_{i}", range(n)),
columns=map(lambda j:f"feature_{j}", range(m)),
)
ens_binary = enx.EnsembleAssociationNetwork(name="Binary", edge_type="association")
ens_binary.fit(X=X_binary, metric="mcc", n_iter=100, stats_summary=[np.mean,np.var], copy_ensemble=True)
print(ens_binary)
====================================================
EnsembleAssociationNetwork(Name:Binary, Metric: mcc)
====================================================
* Number of nodes (None): 100
* Number of edges (association): 4950
* Observation type: None
------------------------------------------------
| Parameters
------------------------------------------------
* n_iter: 100
* sampling_size: 1000
* random_state: 0
* with_replacement: True
* transformation: None
* memory: 4.969 MB
------------------------------------------------
| Data
------------------------------------------------
* Features (n=1000, m=100, memory=821.352 KB)
* Ensemble (memory=3.777 MB)
* Statistics (['mean', 'var', 'CI(5%)', 'CI(95%)', 'normaltest|stat', 'normaltest|p_value'], memory=399.742 KB)
We are going to create a differential between setosa and not-setosa samples.
If you use this method, please cite:
Josh L Espinoza, Manolito Torralba, Pamela Leong, Richard Saffery, Michelle Bockmann, Claire Kuelbs, Suren Singh, Toby Hughes, Jeffrey M Craig, Karen E Nelson, Chris L Dupont, Differential network analysis of oral microbiome metatranscriptomes identifies community scale metabolic restructuring in dental caries, PNAS Nexus, Volume 1, Issue 5, November 2022, pgac239, https://doi.org/10.1093/pnasnexus/pgac239
X,y = syu.get_iris_data(["X", "y"])
y_setosaornot = y.map(lambda x: {True:"setosa", False:"not_setosa"}[x == "setosa"])
# Differential network between setosa and not setosa
dn = enx.DifferentialEnsembleAssociationNetwork(name="Iris")
dn.fit(X, y_setosaornot, reference="setosa", treatment="not_setosa", metric="rho")
print(dn)
========================================================================================================
DifferentialEnsembleAssociationNetwork(Name:Iris, Reference: setosa, Treatment: not_setosa, Metric: rho)
========================================================================================================
* Number of nodes (None): 4
* Number of edges (None): 6
* Observation type: None
----------------------------------------------------------------------------------------------------
| Parameters
----------------------------------------------------------------------------------------------------
* n_iter: 1000
* sampling_size: 50
* random_state: 0
* with_replacement: True
* transformation: None
* memory: 105.539 KB
----------------------------------------------------------------------------------------------------
| Data
----------------------------------------------------------------------------------------------------
* Features (n=150, m=4, memory=9.930 KB)
----------------------------------------------------------------------------------------------------
| Intermediate
----------------------------------------------------------------------------------------------------
* Reference Ensemble (memory=47.496 KB)
* Treatment Ensemble (memory=47.496 KB)
----------------------------------------------------------------------------------------------------
| Terminal
----------------------------------------------------------------------------------------------------
* Initial Statistics (['median', 'median_abs_deviation', 'CI(5%)', 'CI(95%)', 'normaltest|stat', 'normaltest|p_value'])
* Comparative Statistics (['wasserstein_distance', 'mannwhitneyu|stat', 'mannwhitneyu|p_value'], memory=364 B)
* Differential Statistics (['median'], memory=268 B)
We can look at individual stats for each class:
dn.ensemble_reference_.stats_
Statistics median median_abs_deviation CI(5%) CI(95%) normaltest|stat normaltest|p_value
Edges
(sepal_width, sepal_length) 0.786607 0.043967 0.655605 0.866409 125.652750 5.186232e-28
(petal_length, sepal_length) 0.460913 0.090514 0.236955 0.662517 10.686368 4.780625e-03
(petal_width, sepal_length) -0.600863 0.019191 -0.644792 -0.548529 33.238974 6.056873e-08
(sepal_width, petal_length) 0.261425 0.095604 0.012081 0.483333 11.792565 2.749648e-03
(sepal_width, petal_width) -0.599378 0.031315 -0.673050 -0.514876 25.564313 2.810476e-06
(petal_length, petal_width) -0.513017 0.033226 -0.592368 -0.425776 7.823623 2.000423e-02
dn.ensemble_treatment_.stats_
Statistics median median_abs_deviation CI(5%) CI(95%) normaltest|stat normaltest|p_value
Edges
(sepal_width, sepal_length) 0.341311 0.051657 0.217615 0.460566 1.117749 0.571852
(petal_length, sepal_length) -0.019897 0.054220 -0.151335 0.109942 0.923136 0.630295
(petal_width, sepal_length) -0.781452 0.017277 -0.820192 -0.729637 25.201274 0.000003
(sepal_width, petal_length) -0.596316 0.037117 -0.678457 -0.493624 23.866671 0.000007
(sepal_width, petal_width) -0.627446 0.027775 -0.687832 -0.555307 11.590521 0.003042
(petal_length, petal_width) 0.046047 0.043560 -0.063424 0.154860 0.228881 0.891865
We can also look stats that compare the distributions of each edge between the 2 conditions:
dn.stats_comparative_
wasserstein_distance mannwhitneyu|stat mannwhitneyu|p_value
Edges
(sepal_width, sepal_length) 0.434245 999756.0 0.000000e+00
(petal_length, sepal_length) 0.477957 998548.0 0.000000e+00
(petal_width, sepal_length) 0.179624 999994.0 0.000000e+00
(sepal_width, petal_length) 0.851155 1000000.0 0.000000e+00
(sepal_width, petal_width) 0.028351 675085.0 7.047032e-42
(petal_length, petal_width) 0.556989 0.0 0.000000e+00
Lastly, we can get the differentials between 2 statistics calculated for each conditions. We would use this as our resulting differential network:
dn.stats_differential_
median
Edges
(sepal_width, sepal_length) -0.445296
(petal_length, sepal_length) -0.480810
(petal_width, sepal_length) -0.180589
(sepal_width, petal_length) -0.857740
(sepal_width, petal_width) -0.028068
(petal_length, petal_width) 0.559064
Simple case of creating sample-specific perturbation networks for compositional data using Rho Proportionality and confidence interval of [2.5, 97.5]
Iris data is NOT compositional but this is for demonstration since they are positive values.
Sampling size here is in relation to the reference
class.
# Create ensemble network
sspn_rho = enx.SampleSpecificPerturbationNetwork(name="Iris", node_type="leaf measurement", edge_type="association", observation_type="specimen")
sspn_rho.fit(X=X, y=y, metric="rho", reference="setosa", n_iter=100, confidence_interval=97.5, copy_ensemble=True)
print(sspn_rho)
==============================================================================================
SampleSpecificPerturbationNetwork(Name:Iris, Reference: Reference(setosa[clone]), Metric: rho)
==============================================================================================
* Number of nodes (leaf measurement): 4
* Number of edges (association): 6
* Observation type: specimen
------------------------------------------------------------------------------------------
| Parameters
------------------------------------------------------------------------------------------
* n_iter: 100
* sampling_size: 50
* random_state: 0
* with_replacement: True
* transformation: None
* memory: 787.684 KB
------------------------------------------------------------------------------------------
| Data
------------------------------------------------------------------------------------------
* Features (n=200, m=4, memory=9.930 KB)
------------------------------------------------------------------------------------------
| Intermediate
------------------------------------------------------------------------------------------
* Reference Ensemble (memory=220 B)
* Sample-specific Ensembles (memory=32.227 KB)
------------------------------------------------------------------------------------------
| Terminal
------------------------------------------------------------------------------------------
* Ensemble (memory=703.125 KB)
* Statistics (['median', 'median_abs_deviation', 'CI(2.5%)', 'CI(97.5%)', 'normaltest|stat', 'normaltest|p_value'], memory=42.188 KB)
# View ensemble
print(*repr(sspn_rho.ensemble_).split("\n")[-4:], sep="\n")
Coordinates:
* Samples (Samples) <U8 'iris_0' 'iris_1' ... 'iris_148' 'iris_149'
* Iterations (Iterations) int64 0 1 2 3 4 5 6 7 8 ... 92 93 94 95 96 97 98 99
* Edges (Edges) object frozenset({'sepal_length', 'sepal_width'}) ......
# View statistics
print(*repr(sspn_rho.stats_).split("\n")[-4:], sep="\n")
Coordinates:
* Samples (Samples) object 'iris_0' 'iris_1' ... 'iris_148' 'iris_149'
* Edges (Edges) object frozenset({'sepal_length', 'sepal_width'}) ......
* Statistics (Statistics) <U20 'median' ... 'normaltest|p_value'
# View SSPN for a particular sample
graph = sspn_rho.to_networkx("iris_50")
list(graph.edges(data=True))[0]
'sepal_length',
'sepal_width',
{'median': 0.02589165067420246,
'median_abs_deviation': 0.018200489248854534,
'CI(2.5%)': -0.01811839255779411,
'CI(97.5%)': 0.09388843207188755,
'normaltest|stat': 4.196194170296813,
'normaltest|p_value': 0.12268967426224149})
Now let's output the perturbation matrix which includes all SSPNs across all the samples using the median values of the perturbation distributions for the weight.
X_perturbation = sspn_rho.to_perturbation(weight='median')
X_perturbation.head()
# Edges (sepal_length, sepal_width) (sepal_length, petal_length) (sepal_length, petal_width) (sepal_width, petal_length) (sepal_width, petal_width) (petal_width, petal_length)
# Samples
# iris_0 0.000757 -0.001263 -0.000486 -0.001056 -0.000663 0.001352
# iris_1 -0.007569 0.001436 -0.000439 -0.003043 0.001497 -0.000325
# iris_2 0.000189 -0.001334 -0.000158 -0.000911 -0.000124 0.000483
# iris_3 0.000011 -0.005670 0.000814 -0.005226 0.001058 -0.000984
# iris_4 -0.000990 -0.000612 0.000142 -0.002192 -0.001285 0.001501
Sample-specific perturbation networks for compositional data using partial correlation with basis shrinkage
You can also now use partial_correlation_with_basis_shrinkage
from the compositional
package (Jin et al. 2022 and Erb 2020).
sspn_bshrink = enx.SampleSpecificPerturbationNetwork(name="Iris", node_type="leaf measurement", edge_type="association", observation_type="specimen")
sspn_bshrink.fit(X=X, y=y, metric="pcorr_bshrink", reference="setosa", n_iter=100, confidence_interval=97.5, copy_ensemble=True)
print(sspn_bshrink)
========================================================================================================
SampleSpecificPerturbationNetwork(Name:Iris, Reference: Reference(setosa[clone]), Metric: pcorr_bshrink)
========================================================================================================
* Number of nodes (leaf measurement): 4
* Number of edges (association): 6
* Observation type: specimen
----------------------------------------------------------------------------------------------------
| Parameters
----------------------------------------------------------------------------------------------------
* n_iter: 100
* sampling_size: 50
* random_state: 0
* with_replacement: True
* transformation: None
* memory: 787.684 KB
----------------------------------------------------------------------------------------------------
| Data
----------------------------------------------------------------------------------------------------
* Features (n=200, m=4, memory=9.930 KB)
----------------------------------------------------------------------------------------------------
| Intermediate
----------------------------------------------------------------------------------------------------
* Reference Ensemble (memory=220 B)
* Sample-specific Ensembles (memory=32.227 KB)
----------------------------------------------------------------------------------------------------
| Terminal
----------------------------------------------------------------------------------------------------
* Ensemble (memory=703.125 KB)
* Statistics (['median', 'median_abs_deviation', 'CI(2.5%)', 'CI(97.5%)', 'normaltest|stat', 'normaltest|p_value'], memory=42.188 KB)
Here we specify a custom function for the associations which is the inverse kullback leibler divergence.
# Custom association function
def inverse_kullbackleibler(a,b, base=2):
return 1/stats.entropy(pk=a, qk=b, base=base)
# Create ensemble network
sspn_kl = enx.SampleSpecificPerturbationNetwork(name="Iris", node_type="leaf measurement", edge_type="association", observation_type="specimen")
sspn_kl.fit(X=X, y=y, metric=inverse_kullbackleibler, reference="setosa", n_iter=100, function_is_pairwise=False, copy_ensemble=True)
print(sspn_kl)
==================================================================================================================
SampleSpecificPerturbationNetwork(Name:Iris, Reference: Reference(setosa[clone]), Metric: inverse_kullbackleibler)
==================================================================================================================
* Number of nodes (leaf measurement): 4
* Number of edges (association): 6
* Observation type: specimen
--------------------------------------------------------------------------------------------------------------
| Parameters
--------------------------------------------------------------------------------------------------------------
* n_iter: 100
* sampling_size: 50
* random_state: 0
* with_replacement: True
* transformation: None
* memory: 787.684 KB
--------------------------------------------------------------------------------------------------------------
| Data
--------------------------------------------------------------------------------------------------------------
* Features (n=200, m=4, memory=9.930 KB)
--------------------------------------------------------------------------------------------------------------
| Intermediate
--------------------------------------------------------------------------------------------------------------
* Reference Ensemble (memory=220 B)
* Sample-specific Ensembles (memory=32.227 KB)
--------------------------------------------------------------------------------------------------------------
| Terminal
--------------------------------------------------------------------------------------------------------------
* Ensemble (memory=703.125 KB)
* Statistics (['median', 'median_abs_deviation', 'CI(5%)', 'CI(95%)', 'normaltest|stat', 'normaltest|p_value'], memory=42.188 KB)
We are going to run Leiden community detection but since it is stochastic and not deterministic, we are going to use 100 different random seeds and only consider clusters that consistent (i.e., minimum_cooccurrence_rate=1.0
).
If you use this method, please cite:
Josh L Espinoza, Manolito Torralba, Pamela Leong, Richard Saffery, Michelle Bockmann, Claire Kuelbs, Suren Singh, Toby Hughes, Jeffrey M Craig, Karen E Nelson, Chris L Dupont, Differential network analysis of oral microbiome metatranscriptomes identifies community scale metabolic restructuring in dental caries, PNAS Nexus, Volume 1, Issue 5, November 2022, pgac239, https://doi.org/10.1093/pnasnexus/pgac239
# Get graph
graph = enx.convert_network(X.T.corr(), nx.Graph)
# Cluster graph
cn = enx.ClusteredNetwork(name="Iris", node_type="leaf measurement", edge_type="pearson")
# Fit model
cn.fit(graph, n_iter=100, algorithm="leiden", minimum_cooccurrence_rate=1.0)
print(cn)
===========================================
ClusteredNetwork(Name:Iris, weight_dtype: float64)
===========================================
* Algorithm: leiden
* Number of iterations: 100
* Minimum edge cooccurrence rate: 1.0
* Number of nodes clustered (leaf measurement): 150 (100.00%)
* Number of edges clustered (pearson): 6126 (54.82%)
---------------------------------------
| Cluster Sizes (N = 2)
---------------------------------------
Leiden_1 99
Leiden_2 51
Let's take a look at the cluster assignments:
cn.node_to_cluster_.head()
Nodes[Clustered]
iris_1 Leiden_2
iris_0 Leiden_2
iris_2 Leiden_2
iris_3 Leiden_2
iris_4 Leiden_2
Name: Clusters[Expanded], dtype: object
We can also get a cluster to nodes dictionary:
cn.cluster_to_nodes_
Leiden_1 {iris_117, iris_76, iris_59, iris_126,...
Leiden_2 {iris_0, iris_12, iris_43, iris_4, ...
Name: Clusters[Collapsed], dtype: object
We can also get the clustered graph in other formats:
# Default
graph = cn.graph_clustered_
weights = cn.to_pandas_series()
df_adjacency = cn.to_pandas_dataframe(vertical=False)
df_edgelist = cn.to_pandas_dataframe(vertical=True)
sym = cn.to_symmetric()
In associations networks, we often have both positive and negative associations which should be analyzed separately. With the BiDirectionalClusteredNetwork
we can split positive and negative associations into separate subgraphs, perform community detection, then merge the clustered graphs.
If you use this method, please cite:
Josh L Espinoza, Manolito Torralba, Pamela Leong, Richard Saffery, Michelle Bockmann, Claire Kuelbs, Suren Singh, Toby Hughes, Jeffrey M Craig, Karen E Nelson, Chris L Dupont, Differential network analysis of oral microbiome metatranscriptomes identifies community scale metabolic restructuring in dental caries, PNAS Nexus, Volume 1, Issue 5, November 2022, pgac239, https://doi.org/10.1093/pnasnexus/pgac239
from sklearn.datasets import make_classification
# Create dataset
n_samples=1000
X, y = make_classification(n_samples=n_samples, n_features=10, n_classes=2, n_clusters_per_class=1, random_state=0)
X = pd.DataFrame(X)
# Compute associations
df_corr = X.corr()
# Convert associations to graph
graph = enx.convert_network(df_corr, nx.Graph, tol=1e-10)
# Cluster positive and negative associations then merge into single clustered graph
bdcn = enx.BiDirectionalClusteredNetwork()
bdcn.fit(graph)
bdcn
# Detecting communities via `leiden` algorithm: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 8034.76it/s]
# Converting clustered graph
# Computing node connectivity
# Computing cluster connectivity
# Detecting communities via `leiden` algorithm: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 12542.40it/s]
# Converting clustered graph
# Computing node connectivity
# Computing cluster connectivity
# ========================================================================
# BiDirectionalClusteredNetwork(Name:None, NodeType: None, EdgeType: None)
# ========================================================================
# * Algorithm: leiden
# * Minimum edge cooccurrence rate: 1.0
# * Number of iterations: 100
# * Number of nodes clustered (Union): 10 (100.00%)
# * Number of nodes clustered (Intersection): 8 (80.00%)
# -------------------
# ClusteredNetwork(+)
# -------------------
# * Number of nodes clustered: 8 (80.00%)
# * Number of edges clustered: 4 (19.05%)
# --------------------------------------------------------------------
# | Cluster Sizes (N = 4)
# --------------------------------------------------------------------
# Leiden[+]_1 2
# Leiden[+]_2 2
# Leiden[+]_3 2
# Leiden[+]_4 2
# -------------------
# ClusteredNetwork(-)
# -------------------
# * Number of nodes clustered: 10 (100.00%)
# * Number of edges clustered: 17 (70.83%)
# --------------------------------------------------------------------
# | Cluster Sizes (N = 2)
# --------------------------------------------------------------------
# Leiden[+]_1 8
# Leiden[+]_2 2
Let's engineer some categories by collapsing by some predefined category. Check out Phylogenomic Functional Categories
in Espinoza et al. 2022 for how these are used in practice.
If you use this method, please cite:
Josh L Espinoza, Manolito Torralba, Pamela Leong, Richard Saffery, Michelle Bockmann, Claire Kuelbs, Suren Singh, Toby Hughes, Jeffrey M Craig, Karen E Nelson, Chris L Dupont, Differential network analysis of oral microbiome metatranscriptomes identifies community scale metabolic restructuring in dental caries, PNAS Nexus, Volume 1, Issue 5, November 2022, pgac239, https://doi.org/10.1093/pnasnexus/pgac239
from soothsayer_utils import get_iris_data
from scipy import stats
import pandas as pd
import ensemble_networkx as enx
X, y = get_iris_data(["X","y"])
# Usage
cef = enx.CategoricalEngineeredFeature(name="Iris", observation_type="sample")
# Add categories
category_1 = pd.Series(X.columns.map(lambda x:x.split("_")[0]), X.columns)
cef.add_category(
name_category="leaf_type",
mapping=category_1,
)
# Optionally add scaling factors, statistical tests, and summary statistics
# Compile all of the data
cef.compile(scaling_factors=X.sum(axis=0), stats_tests=[stats.normaltest])
# Unpacking engineered groups: 100%|██████████| 1/1 [00:00<00:00, 2974.68it/s]
# Organizing feature sets: 100%|██████████| 4/4 [00:00<00:00, 17403.75it/s]
# Compiling synopsis [Basic Feature Info]: 100%|██████████| 2/2 [00:00<00:00, 32768.00it/s]
# Compiling synopsis [Scaling Factor Info]: 100%|██████████| 2/2 [00:00<00:00, 238.84it/s]
# View the engineered features
cef.synopsis_
# initial_features number_of_features leaf_type(level:0) scaling_factors sum(scaling_factors) mean(scaling_factors) sem(scaling_factors) std(scaling_factors)
# leaf_type
# sepal [sepal_width, sepal_length] 2 sepal [458.6, 876.5] 1335.1 667.55 208.95 208.95
# petal [petal_length, petal_width] 2 petal [563.7, 179.90000000000003] 743.6 371.8 191.9 191.9
# Transform a dataset using the defined categories
cef.fit_transform(X, aggregate_fn=np.sum)
# leaf_type sepal petal
# sample
# iris_0 8.6 1.6
# iris_1 7.9 1.6
# iris_2 7.9 1.5
# iris_3 7.7 1.7