Skip to content

Python package for association, differential, clustered, and sample-specific ensemble networks. Direct support for compositional data analysis. Additional capabilities for feature engineering.

License

Notifications You must be signed in to change notification settings

jolespin/ensemble_networkx

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

32 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Ensemble NetworkX

High-level Ensemble Network implementations in Python. Built on top of NetworkX and Pandas.

Dependencies:

Compatible for Python 3.

pandas
numpy
scipy
networkx
matplotlib
soothsayer_utils
compositional

Citations:

  • Nabwera HM+, Espinoza JL+, Worwui A, Betts M, Okoi C, Sesay AK, Bancroft R, Agbla SC, Jarju S, Bradbury RS, Colley M, Jallow AT, Liu J, Houpt ER, Prentice AM, Antonio M, Bernstein RM, Dupont CL+, Kwambana-Adams BA+. Interactions between fecal gut microbiome, enteric pathogens, and energy regulating hormones among acutely malnourished rural Gambian children. EBioMedicine. 2021 Oct 22;73:103644. doi: 10.1016/j.ebiom.2021.103644. PMID: 34695658.
  • Josh L Espinoza, Manolito Torralba, Pamela Leong, Richard Saffery, Michelle Bockmann, Claire Kuelbs, Suren Singh, Toby Hughes, Jeffrey M Craig, Karen E Nelson, Chris L Dupont, Differential network analysis of oral microbiome metatranscriptomes identifies community scale metabolic restructuring in dental caries, PNAS Nexus, Volume 1, Issue 5, November 2022, pgac239, doi: 10.1093/pnasnexus/pgac239

Install:

# Stable release (Preferred)
pip install ensemble_networkx

# Current developmental release
pip install git+https://github.com/jolespin/ensemble_networkx

Source:

Supported metrics:

Case studies, tutorials and usage:

Documentation will be released upon version 1.0 once API is stabilized.

import ensemble_networkx as enx

Simple case of an iris dataset ensemble network

Here we randomly sample 100 times, calculate the associations for each draw, and calculate summary statistics for the distributons of association values (i.e., edge weights).

If you choose sampling_size float between 0 < x ≤ 1.0 then the number of samples drawn will be x * n where n is the number of samples. If an integer is used then it will grab that number of samples for each draw.

If you use this method, please cite:

Josh L Espinoza, Manolito Torralba, Pamela Leong, Richard Saffery, Michelle Bockmann, Claire Kuelbs, Suren Singh, Toby Hughes, Jeffrey M Craig, Karen E Nelson, Chris L Dupont, Differential network analysis of oral microbiome metatranscriptomes identifies community scale metabolic restructuring in dental caries, PNAS Nexus, Volume 1, Issue 5, November 2022, pgac239, https://doi.org/10.1093/pnasnexus/pgac239

import soothsayer_utils as syu
import numpy as np
from scipy import stats

# Load in data
X,y = syu.get_iris_data(["X", "y"])

# Create ensemble network
ens = enx.EnsembleAssociationNetwork(name="Iris", node_type="leaf measurement", edge_type="association", observation_type="specimen")
ens.fit(X=X, metric="pearson",  n_iter=100, sampling_size=1.0, with_replacement=True, stats_summary=[np.median, stats.median_abs_deviation], stats_tests=[stats.normaltest], copy_ensemble=True)
print(ens)

======================================================
EnsembleAssociationNetwork(Name:Iris, Metric: pearson)
======================================================
    * Number of nodes (leaf measurement): 4
    * Number of edges (association): 6
    * Observation type: specimen
    --------------------------------------------------
    | Parameters
    --------------------------------------------------
    * n_iter: 100
    * sampling_size: 150
    * random_state: 0
    * with_replacement: True
    * transformation: None
    * memory: 15.238 KB
    --------------------------------------------------
    | Data
    --------------------------------------------------
    * Features (n=150, m=4, memory=9.930 KB)
    * Ensemble (memory=4.812 KB)
    * Statistics (['median', 'median_abs_deviation', 'CI(5%)', 'CI(95%)', 'normaltest|stat', 'normaltest|p_value'], memory=508 B)

Let's look at the "ensemble" which includes all of the associations for each of the permutations.

# View ensemble
print(ens.ensemble_.head())
Edges       (sepal_length, sepal_width)  (petal_length, sepal_length)  \
Iterations                                                              
0                             -0.061921                      0.877956   
1                             -0.096871                      0.868054   
2                             -0.274915                      0.881042   
3                             -0.044374                      0.837341   
4                             -0.010991                      0.887051  

Now let's look at the summary statistics that were calculated for each of the edges:

# View statistics
print(ens.stats_.head())

Statistics                      median  median_abs_deviation    CI(5%)  \
Edges                                                                    
(sepal_length, sepal_width)  -0.111914              0.048581 -0.227990   
(petal_length, sepal_length)  0.872573              0.011590  0.845797   
(petal_width, sepal_length)   0.814497              0.010882  0.785130   
(petal_length, sepal_width)  -0.432412              0.036563 -0.522004   
(petal_width, sepal_width)   -0.370789              0.034571 -0.457290   

Statistics                     CI(95%)  normaltest|stat  normaltest|p_value  
Edges                                                                        
(sepal_length, sepal_width)  -0.015495         2.649309            0.265895  
(petal_length, sepal_length)  0.893086         3.434098            0.179595  
(petal_width, sepal_length)   0.844199         8.300075            0.015764  
(petal_length, sepal_width)  -0.327368         5.811593            0.054705  
(petal_width, sepal_width)   -0.261532        11.567172            0.003078  

Simple case of an ensemble network for binary data using Matthew's Correlation Coefficient (MCC)

Pearson correlation isn't designed for binary data (i.e., True/False or 0/1) so you can use MCC instead.

# Create ensemble network using MCC for binary data
n,m = 1000, 100
X_binary = pd.DataFrame(
    data=np.random.RandomState(0).choice([0,1], size=(n,m)),
    index=map(lambda i: f"sample_{i}", range(n)),
    columns=map(lambda j:f"feature_{j}", range(m)),
)
ens_binary = enx.EnsembleAssociationNetwork(name="Binary", edge_type="association")
ens_binary.fit(X=X_binary, metric="mcc",  n_iter=100, stats_summary=[np.mean,np.var], copy_ensemble=True)
print(ens_binary)

====================================================
EnsembleAssociationNetwork(Name:Binary, Metric: mcc)
====================================================
    * Number of nodes (None): 100
    * Number of edges (association): 4950
    * Observation type: None
    ------------------------------------------------
    | Parameters
    ------------------------------------------------
    * n_iter: 100
    * sampling_size: 1000
    * random_state: 0
    * with_replacement: True
    * transformation: None
    * memory: 4.969 MB
    ------------------------------------------------
    | Data
    ------------------------------------------------
    * Features (n=1000, m=100, memory=821.352 KB)
    * Ensemble (memory=3.777 MB)
    * Statistics (['mean', 'var', 'CI(5%)', 'CI(95%)', 'normaltest|stat', 'normaltest|p_value'], memory=399.742 KB)

Differential ensemble association networks

We are going to create a differential between setosa and not-setosa samples.

If you use this method, please cite:

Josh L Espinoza, Manolito Torralba, Pamela Leong, Richard Saffery, Michelle Bockmann, Claire Kuelbs, Suren Singh, Toby Hughes, Jeffrey M Craig, Karen E Nelson, Chris L Dupont, Differential network analysis of oral microbiome metatranscriptomes identifies community scale metabolic restructuring in dental caries, PNAS Nexus, Volume 1, Issue 5, November 2022, pgac239, https://doi.org/10.1093/pnasnexus/pgac239

X,y = syu.get_iris_data(["X", "y"])
y_setosaornot = y.map(lambda x: {True:"setosa", False:"not_setosa"}[x == "setosa"])

# Differential network between setosa and not setosa
dn = enx.DifferentialEnsembleAssociationNetwork(name="Iris")
dn.fit(X, y_setosaornot, reference="setosa", treatment="not_setosa", metric="rho")
print(dn)

========================================================================================================
DifferentialEnsembleAssociationNetwork(Name:Iris, Reference: setosa, Treatment: not_setosa, Metric: rho)
========================================================================================================
    * Number of nodes (None): 4
    * Number of edges (None): 6
    * Observation type: None
    ----------------------------------------------------------------------------------------------------
    | Parameters
    ----------------------------------------------------------------------------------------------------
    * n_iter: 1000
    * sampling_size: 50
    * random_state: 0
    * with_replacement: True
    * transformation: None
    * memory: 105.539 KB
    ----------------------------------------------------------------------------------------------------
    | Data
    ----------------------------------------------------------------------------------------------------
    * Features (n=150, m=4, memory=9.930 KB)
    ----------------------------------------------------------------------------------------------------
    | Intermediate
    ----------------------------------------------------------------------------------------------------
    * Reference Ensemble (memory=47.496 KB)
    * Treatment Ensemble (memory=47.496 KB)
    ----------------------------------------------------------------------------------------------------
    | Terminal
    ----------------------------------------------------------------------------------------------------
    * Initial Statistics (['median', 'median_abs_deviation', 'CI(5%)', 'CI(95%)', 'normaltest|stat', 'normaltest|p_value'])
    * Comparative Statistics (['wasserstein_distance', 'mannwhitneyu|stat', 'mannwhitneyu|p_value'], memory=364 B)
    * Differential Statistics (['median'], memory=268 B)

We can look at individual stats for each class:

dn.ensemble_reference_.stats_

Statistics	median	median_abs_deviation	CI(5%)	CI(95%)	normaltest|stat	normaltest|p_value
Edges						
(sepal_width, sepal_length)	0.786607	0.043967	0.655605	0.866409	125.652750	5.186232e-28
(petal_length, sepal_length)	0.460913	0.090514	0.236955	0.662517	10.686368	4.780625e-03
(petal_width, sepal_length)	-0.600863	0.019191	-0.644792	-0.548529	33.238974	6.056873e-08
(sepal_width, petal_length)	0.261425	0.095604	0.012081	0.483333	11.792565	2.749648e-03
(sepal_width, petal_width)	-0.599378	0.031315	-0.673050	-0.514876	25.564313	2.810476e-06
(petal_length, petal_width)	-0.513017	0.033226	-0.592368	-0.425776	7.823623	2.000423e-02

dn.ensemble_treatment_.stats_

Statistics	median	median_abs_deviation	CI(5%)	CI(95%)	normaltest|stat	normaltest|p_value
Edges						
(sepal_width, sepal_length)	0.341311	0.051657	0.217615	0.460566	1.117749	0.571852
(petal_length, sepal_length)	-0.019897	0.054220	-0.151335	0.109942	0.923136	0.630295
(petal_width, sepal_length)	-0.781452	0.017277	-0.820192	-0.729637	25.201274	0.000003
(sepal_width, petal_length)	-0.596316	0.037117	-0.678457	-0.493624	23.866671	0.000007
(sepal_width, petal_width)	-0.627446	0.027775	-0.687832	-0.555307	11.590521	0.003042
(petal_length, petal_width)	0.046047	0.043560	-0.063424	0.154860	0.228881	0.891865

We can also look stats that compare the distributions of each edge between the 2 conditions:

dn.stats_comparative_

	wasserstein_distance	mannwhitneyu|stat	mannwhitneyu|p_value
Edges			
(sepal_width, sepal_length)	0.434245	999756.0	0.000000e+00
(petal_length, sepal_length)	0.477957	998548.0	0.000000e+00
(petal_width, sepal_length)	0.179624	999994.0	0.000000e+00
(sepal_width, petal_length)	0.851155	1000000.0	0.000000e+00
(sepal_width, petal_width)	0.028351	675085.0	7.047032e-42
(petal_length, petal_width)	0.556989	0.0	0.000000e+00

Lastly, we can get the differentials between 2 statistics calculated for each conditions. We would use this as our resulting differential network:

dn.stats_differential_

	median
Edges	
(sepal_width, sepal_length)	-0.445296
(petal_length, sepal_length)	-0.480810
(petal_width, sepal_length)	-0.180589
(sepal_width, petal_length)	-0.857740
(sepal_width, petal_width)	-0.028068
(petal_length, petal_width)	0.559064

Simple case of creating sample-specific perturbation networks for compositional data using Rho Proportionality and confidence interval of [2.5, 97.5]

Iris data is NOT compositional but this is for demonstration since they are positive values.

Sampling size here is in relation to the reference class.

# Create ensemble network
sspn_rho = enx.SampleSpecificPerturbationNetwork(name="Iris", node_type="leaf measurement", edge_type="association", observation_type="specimen")
sspn_rho.fit(X=X, y=y, metric="rho", reference="setosa", n_iter=100, confidence_interval=97.5, copy_ensemble=True)

print(sspn_rho)

==============================================================================================
SampleSpecificPerturbationNetwork(Name:Iris, Reference: Reference(setosa[clone]), Metric: rho)
==============================================================================================
    * Number of nodes (leaf measurement): 4
    * Number of edges (association): 6
    * Observation type: specimen
    ------------------------------------------------------------------------------------------
    | Parameters
    ------------------------------------------------------------------------------------------
    * n_iter: 100
    * sampling_size: 50
    * random_state: 0
    * with_replacement: True
    * transformation: None
    * memory: 787.684 KB
    ------------------------------------------------------------------------------------------
    | Data
    ------------------------------------------------------------------------------------------
    * Features (n=200, m=4, memory=9.930 KB)
    ------------------------------------------------------------------------------------------
    | Intermediate
    ------------------------------------------------------------------------------------------
    * Reference Ensemble (memory=220 B)
    * Sample-specific Ensembles (memory=32.227 KB)
    ------------------------------------------------------------------------------------------
    | Terminal
    ------------------------------------------------------------------------------------------
    * Ensemble (memory=703.125 KB)
    * Statistics (['median', 'median_abs_deviation', 'CI(2.5%)', 'CI(97.5%)', 'normaltest|stat', 'normaltest|p_value'], memory=42.188 KB)

    

# View ensemble
print(*repr(sspn_rho.ensemble_).split("\n")[-4:], sep="\n")

Coordinates:
  * Samples     (Samples) <U8 'iris_0' 'iris_1' ... 'iris_148' 'iris_149'
  * Iterations  (Iterations) int64 0 1 2 3 4 5 6 7 8 ... 92 93 94 95 96 97 98 99
  * Edges       (Edges) object frozenset({'sepal_length', 'sepal_width'}) ......
  

# View statistics
print(*repr(sspn_rho.stats_).split("\n")[-4:], sep="\n")

Coordinates:
  * Samples     (Samples) object 'iris_0' 'iris_1' ... 'iris_148' 'iris_149'
  * Edges       (Edges) object frozenset({'sepal_length', 'sepal_width'}) ......
  * Statistics  (Statistics) <U20 'median' ... 'normaltest|p_value'

# View SSPN for a particular sample
graph = sspn_rho.to_networkx("iris_50")
list(graph.edges(data=True))[0]

'sepal_length',
 'sepal_width',
 {'median': 0.02589165067420246,
  'median_abs_deviation': 0.018200489248854534,
  'CI(2.5%)': -0.01811839255779411,
  'CI(97.5%)': 0.09388843207188755,
  'normaltest|stat': 4.196194170296813,
  'normaltest|p_value': 0.12268967426224149})

Now let's output the perturbation matrix which includes all SSPNs across all the samples using the median values of the perturbation distributions for the weight.

X_perturbation = sspn_rho.to_perturbation(weight='median')
X_perturbation.head()
# Edges	(sepal_length, sepal_width)	(sepal_length, petal_length)	(sepal_length, petal_width)	(sepal_width, petal_length)	(sepal_width, petal_width)	(petal_width, petal_length)
# Samples						
# iris_0	0.000757	-0.001263	-0.000486	-0.001056	-0.000663	0.001352
# iris_1	-0.007569	0.001436	-0.000439	-0.003043	0.001497	-0.000325
# iris_2	0.000189	-0.001334	-0.000158	-0.000911	-0.000124	0.000483
# iris_3	0.000011	-0.005670	0.000814	-0.005226	0.001058	-0.000984
# iris_4	-0.000990	-0.000612	0.000142	-0.002192	-0.001285	0.001501

Sample-specific perturbation networks for compositional data using partial correlation with basis shrinkage

You can also now use partial_correlation_with_basis_shrinkage from the compositional package (Jin et al. 2022 and Erb 2020).

sspn_bshrink = enx.SampleSpecificPerturbationNetwork(name="Iris", node_type="leaf measurement", edge_type="association", observation_type="specimen")
sspn_bshrink.fit(X=X, y=y, metric="pcorr_bshrink", reference="setosa", n_iter=100, confidence_interval=97.5, copy_ensemble=True)
print(sspn_bshrink)

========================================================================================================
SampleSpecificPerturbationNetwork(Name:Iris, Reference: Reference(setosa[clone]), Metric: pcorr_bshrink)
========================================================================================================
    * Number of nodes (leaf measurement): 4
    * Number of edges (association): 6
    * Observation type: specimen
    ----------------------------------------------------------------------------------------------------
    | Parameters
    ----------------------------------------------------------------------------------------------------
    * n_iter: 100
    * sampling_size: 50
    * random_state: 0
    * with_replacement: True
    * transformation: None
    * memory: 787.684 KB
    ----------------------------------------------------------------------------------------------------
    | Data
    ----------------------------------------------------------------------------------------------------
    * Features (n=200, m=4, memory=9.930 KB)
    ----------------------------------------------------------------------------------------------------
    | Intermediate
    ----------------------------------------------------------------------------------------------------
    * Reference Ensemble (memory=220 B)
    * Sample-specific Ensembles (memory=32.227 KB)
    ----------------------------------------------------------------------------------------------------
    | Terminal
    ----------------------------------------------------------------------------------------------------
    * Ensemble (memory=703.125 KB)
    * Statistics (['median', 'median_abs_deviation', 'CI(2.5%)', 'CI(97.5%)', 'normaltest|stat', 'normaltest|p_value'], memory=42.188 KB)

Create a SSPN using a custom association function

Here we specify a custom function for the associations which is the inverse kullback leibler divergence.

# Custom association function
def inverse_kullbackleibler(a,b, base=2):
    return 1/stats.entropy(pk=a, qk=b, base=base)

# Create ensemble network
sspn_kl = enx.SampleSpecificPerturbationNetwork(name="Iris", node_type="leaf measurement", edge_type="association", observation_type="specimen")
sspn_kl.fit(X=X, y=y, metric=inverse_kullbackleibler, reference="setosa", n_iter=100, function_is_pairwise=False, copy_ensemble=True)

print(sspn_kl)

==================================================================================================================
SampleSpecificPerturbationNetwork(Name:Iris, Reference: Reference(setosa[clone]), Metric: inverse_kullbackleibler)
==================================================================================================================
    * Number of nodes (leaf measurement): 4
    * Number of edges (association): 6
    * Observation type: specimen
    --------------------------------------------------------------------------------------------------------------
    | Parameters
    --------------------------------------------------------------------------------------------------------------
    * n_iter: 100
    * sampling_size: 50
    * random_state: 0
    * with_replacement: True
    * transformation: None
    * memory: 787.684 KB
    --------------------------------------------------------------------------------------------------------------
    | Data
    --------------------------------------------------------------------------------------------------------------
    * Features (n=200, m=4, memory=9.930 KB)
    --------------------------------------------------------------------------------------------------------------
    | Intermediate
    --------------------------------------------------------------------------------------------------------------
    * Reference Ensemble (memory=220 B)
    * Sample-specific Ensembles (memory=32.227 KB)
    --------------------------------------------------------------------------------------------------------------
    | Terminal
    --------------------------------------------------------------------------------------------------------------
    * Ensemble (memory=703.125 KB)
    * Statistics (['median', 'median_abs_deviation', 'CI(5%)', 'CI(95%)', 'normaltest|stat', 'normaltest|p_value'], memory=42.188 KB)

Clustered networks using Leiden or Louvain community detection

We are going to run Leiden community detection but since it is stochastic and not deterministic, we are going to use 100 different random seeds and only consider clusters that consistent (i.e., minimum_cooccurrence_rate=1.0).

If you use this method, please cite:

Josh L Espinoza, Manolito Torralba, Pamela Leong, Richard Saffery, Michelle Bockmann, Claire Kuelbs, Suren Singh, Toby Hughes, Jeffrey M Craig, Karen E Nelson, Chris L Dupont, Differential network analysis of oral microbiome metatranscriptomes identifies community scale metabolic restructuring in dental caries, PNAS Nexus, Volume 1, Issue 5, November 2022, pgac239, https://doi.org/10.1093/pnasnexus/pgac239

# Get graph
graph = enx.convert_network(X.T.corr(), nx.Graph)

# Cluster graph
cn = enx.ClusteredNetwork(name="Iris", node_type="leaf measurement", edge_type="pearson")

# Fit model
cn.fit(graph, n_iter=100, algorithm="leiden", minimum_cooccurrence_rate=1.0)
print(cn)

===========================================
ClusteredNetwork(Name:Iris, weight_dtype: float64)
===========================================
    * Algorithm: leiden
    * Number of iterations: 100
    * Minimum edge cooccurrence rate: 1.0
    * Number of nodes clustered (leaf measurement): 150 (100.00%)
    * Number of edges clustered (pearson): 6126 (54.82%)
    ---------------------------------------
    | Cluster Sizes (N = 2)
    ---------------------------------------
    Leiden_1    99
    Leiden_2    51

Let's take a look at the cluster assignments:

cn.node_to_cluster_.head()

Nodes[Clustered]
iris_1    Leiden_2
iris_0    Leiden_2
iris_2    Leiden_2
iris_3    Leiden_2
iris_4    Leiden_2
Name: Clusters[Expanded], dtype: object

We can also get a cluster to nodes dictionary:

cn.cluster_to_nodes_

Leiden_1    {iris_117, iris_76, iris_59, iris_126,...
Leiden_2    {iris_0, iris_12, iris_43, iris_4, ...
Name: Clusters[Collapsed], dtype: object

We can also get the clustered graph in other formats:

# Default
graph = cn.graph_clustered_

weights = cn.to_pandas_series()

df_adjacency = cn.to_pandas_dataframe(vertical=False)

df_edgelist = cn.to_pandas_dataframe(vertical=True)

sym = cn.to_symmetric()

Bidirectional clustered networks using Leiden or Louvain community detection

In associations networks, we often have both positive and negative associations which should be analyzed separately. With the BiDirectionalClusteredNetwork we can split positive and negative associations into separate subgraphs, perform community detection, then merge the clustered graphs.

If you use this method, please cite:

Josh L Espinoza, Manolito Torralba, Pamela Leong, Richard Saffery, Michelle Bockmann, Claire Kuelbs, Suren Singh, Toby Hughes, Jeffrey M Craig, Karen E Nelson, Chris L Dupont, Differential network analysis of oral microbiome metatranscriptomes identifies community scale metabolic restructuring in dental caries, PNAS Nexus, Volume 1, Issue 5, November 2022, pgac239, https://doi.org/10.1093/pnasnexus/pgac239

from sklearn.datasets import make_classification

# Create dataset
n_samples=1000
X, y = make_classification(n_samples=n_samples, n_features=10, n_classes=2, n_clusters_per_class=1, random_state=0)
X = pd.DataFrame(X)

# Compute associations
df_corr = X.corr()

# Convert associations to graph
graph = enx.convert_network(df_corr, nx.Graph, tol=1e-10)

# Cluster positive and negative associations then merge into single clustered graph
bdcn = enx.BiDirectionalClusteredNetwork()
bdcn.fit(graph)

bdcn
# Detecting communities via `leiden` algorithm: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 8034.76it/s]
# Converting clustered graph
# Computing node connectivity
# Computing cluster connectivity
# Detecting communities via `leiden` algorithm: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 12542.40it/s]
# Converting clustered graph
# Computing node connectivity
# Computing cluster connectivity
# ========================================================================
# BiDirectionalClusteredNetwork(Name:None, NodeType: None, EdgeType: None)
# ========================================================================
#     * Algorithm: leiden
#     * Minimum edge cooccurrence rate: 1.0
#     * Number of iterations: 100
#     * Number of nodes clustered (Union): 10 (100.00%)
#     * Number of nodes clustered (Intersection): 8 (80.00%)
# -------------------
# ClusteredNetwork(+)
# -------------------
#     * Number of nodes clustered: 8 (80.00%)
#     * Number of edges clustered: 4 (19.05%)
#     --------------------------------------------------------------------
#     | Cluster Sizes (N = 4)
#     --------------------------------------------------------------------
#     Leiden[+]_1    2
#     Leiden[+]_2    2
#     Leiden[+]_3    2
#     Leiden[+]_4    2
# -------------------
# ClusteredNetwork(-)
# -------------------
#     * Number of nodes clustered: 10 (100.00%)
#     * Number of edges clustered: 17 (70.83%)
#     --------------------------------------------------------------------
#     | Cluster Sizes (N = 2)
#     --------------------------------------------------------------------
#     Leiden[+]_1    8
#     Leiden[+]_2    2

Feature engineering using categories

Let's engineer some categories by collapsing by some predefined category. Check out Phylogenomic Functional Categories in Espinoza et al. 2022 for how these are used in practice.

If you use this method, please cite:

Josh L Espinoza, Manolito Torralba, Pamela Leong, Richard Saffery, Michelle Bockmann, Claire Kuelbs, Suren Singh, Toby Hughes, Jeffrey M Craig, Karen E Nelson, Chris L Dupont, Differential network analysis of oral microbiome metatranscriptomes identifies community scale metabolic restructuring in dental caries, PNAS Nexus, Volume 1, Issue 5, November 2022, pgac239, https://doi.org/10.1093/pnasnexus/pgac239

from soothsayer_utils import get_iris_data
from scipy import stats
import pandas as pd
import ensemble_networkx as enx

X, y = get_iris_data(["X","y"])
# Usage
cef = enx.CategoricalEngineeredFeature(name="Iris", observation_type="sample")

# Add categories
category_1 = pd.Series(X.columns.map(lambda x:x.split("_")[0]), X.columns)
cef.add_category(
    name_category="leaf_type", 
    mapping=category_1,
)
# Optionally add scaling factors, statistical tests, and summary statistics
# Compile all of the data
cef.compile(scaling_factors=X.sum(axis=0), stats_tests=[stats.normaltest])
# Unpacking engineered groups: 100%|██████████| 1/1 [00:00<00:00, 2974.68it/s]
# Organizing feature sets: 100%|██████████| 4/4 [00:00<00:00, 17403.75it/s]
# Compiling synopsis [Basic Feature Info]: 100%|██████████| 2/2 [00:00<00:00, 32768.00it/s]
# Compiling synopsis [Scaling Factor Info]: 100%|██████████| 2/2 [00:00<00:00, 238.84it/s]

# View the engineered features
cef.synopsis_
#   initial_features        number_of_features      leaf_type(level:0)      scaling_factors sum(scaling_factors)    mean(scaling_factors)   sem(scaling_factors)    std(scaling_factors)
# leaf_type                                                         
# sepal     [sepal_width, sepal_length]     2       sepal   [458.6, 876.5]  1335.1  667.55  208.95  208.95
# petal     [petal_length, petal_width]     2       petal   [563.7, 179.90000000000003]     743.6   371.8   191.9   191.9

# Transform a dataset using the defined categories
cef.fit_transform(X, aggregate_fn=np.sum)
# leaf_type sepal   petal
# sample            
# iris_0    8.6     1.6
# iris_1    7.9     1.6
# iris_2    7.9     1.5
# iris_3    7.7     1.7

About

Python package for association, differential, clustered, and sample-specific ensemble networks. Direct support for compositional data analysis. Additional capabilities for feature engineering.

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages