CorrMapper is an online research tool for the integration and visualisation of complex biomedical and omics datasets.
It allows users to:
- map clinical metadata onto the omics datasets using an automatically generated dashboard interface,
- perform feature selection on the omics datasets using one of the clinical metadata variables,
- infer robust correlations between the selected features of one or two omics datasets,
- visualise and analyse the networks of these correlations using highly interactive modules.
CorrMapper is a data exploration and hypothesis generation tool. It does not try to automate statistical inference or provide predictive models. It simply makes the simultaneous exploration of omics datasets and clinical metadata easier by reducing the number of predictors to the clinically relevant ones, and by providing novel and interactive visualisation modules.
Very importantly, if you would like to use the features that were selected by CorrMapper for modelling, you must ensure that the model is built and validated on new data that was not included in the feature selection. Otherwise the generalisation error of your model will be underestimated. See Chapter 7.10.2, "The Wrong and Right Way to Do Cross-validation", in The Elements of Statistical Learning (page 245).
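As a minimal sketch of the warning above, the samples can be split into two disjoint sets before anything is uploaded: one set is used for feature selection in CorrMapper, the other is kept aside for building and validating the model. The function name, the `selection_fraction` parameter and the sample IDs are hypothetical and only serve as an illustration; this is not part of CorrMapper itself.

```python
import random


def split_for_selection_and_modelling(sample_ids, selection_fraction=0.5, seed=0):
    """Split sample IDs into two disjoint lists.

    The first list is meant for feature selection, the second one for
    building and validating a model on data the selection never saw.
    (Hypothetical helper, not a CorrMapper function.)
    """
    ids = list(sample_ids)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    cut = int(round(len(ids) * selection_fraction))
    return ids[:cut], ids[cut:]


# Example with made-up sample IDs.
selection_ids, modelling_ids = split_for_selection_and_modelling(
    ["patient_%02d" % i for i in range(10)]
)
```

Only the rows belonging to `selection_ids` would then be uploaded for feature selection, while the rows in `modelling_ids` stay untouched until the modelling step.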
CorrMapper's overall flowchart
The documentation of all modules and functions of CorrMapper lives here[!!! ref]. Here's a rough overview of the organisation of the code in this project. There are two main bodies of code: frontend and backend.
The frontend holds the whole website (HTML, CSS, JS) and the Flask app that handles the views, the database models, the forms and their validation scripts. This part of the application relies very heavily on ScienceFlask. Please make sure to read its README and deployment notes. The forms, models and views are largely similar to those of ScienceFlask, but they have CorrMapper-specific components:
- more detailed forms and form checking logic
- a much richer data model for the study/analysis tables
- extra views and extended views that are specific to CorrMapper as a research tool.
In the following part we quickly go through the frontend folder structure, and point out the obvious differences between CorrMapper and ScienceFlask.
- dashboard: This is the main difference from ScienceFlask's folder
structure. It holds Python scripts for the automatic and programmatic generation
of CorrMapper's interactive metadata explorer.
- dashboard.py: This holds the main functions for transforming the user's metadata files, calculating the PCA scores of the scatter-plots, saving and loading the dashboard object.
- write_dashboard_js.py: This holds one monumental function (which badly needs to be refactored). It generates the 600-700 lines of JavaScript and dc.js code required to drive the dashboard, based on the user's metadata file.
- static: Holds all the CSS, JavaScript, JPG and font assets of the project.
These are all very similar to what you'll find in ScienceFlask.
- demo: This folder contains the JavaScript files of the three demo projects displayed on the opening page of corrmapper.com.
- templates: These are the HTML templates that Flask uses with the Jinja2
templating engine to build the various views of the website. These are mostly
similar to ScienceFlask's templates folder, except for the following:
- dashboard.html: This is the template for the metadata explorer.
- demo_dashboard.html: Template for one of the three demo apps.
- demo_genomic.html: Template for one of the three demo apps.
- demo_network.html: Template for one of the three demo apps.
- vis.html: This is the template for the general network explorer of CorrMapper.
- vis_genomic.html: This is the template for the genomic network explorer of CorrMapper.
The backend folder is where CorrMapper's core scientific algorithms and pipeline live. The whole point of the ScienceFlask project is to make the lives of other researchers easier by allowing them to wrap their scientific tool within the ScienceFlask template. Therefore, the folders and Python scripts within the backend folder are all specific to CorrMapper. The majority of these functions are well documented, so make sure to check out the docs of CorrMapper here[!!! ref].
Here's what the backend of CorrMapper is actually doing to the uploaded data:
- bins: Submodule for the binning of genomic datasets according to the
chromosomal map of the studied species.
- binning.py: Stores functions for binning a set of genomic features using one of the species specific chromosomal maps and the annotation files of the genomic features (as provided/uploaded by the user).
- get_chromosomes_from_UCSC: Holds a function to download chromosomal map information for the most common model organisms.
- chromosome_files: Holds the length of each chromosome for a given species.
- bin_files: Contains the genome of a given species split into 300 equidistant bins.
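To illustrate the idea of equidistant bins, here is a minimal sketch that splits a sequence of a given length into 300 (almost) equally wide bins and looks up the bin of a genomic position. It assumes 0-based positions and half-open bins, and it treats a single length rather than a full chromosomal map. The function names are hypothetical and this is not the code in binning.py.

```python
from bisect import bisect_right


def bin_edges(length, n_bins=300):
    """Return n_bins + 1 integer edges that split [0, length) into bins.

    Bin i covers the half-open interval [edges[i], edges[i + 1]).
    (Hypothetical helper, assumes length >= n_bins.)
    """
    if n_bins <= 0 or length < n_bins:
        raise ValueError("need n_bins > 0 and length >= n_bins")
    return [length * i // n_bins for i in range(n_bins + 1)]


def assign_bin(position, edges):
    """Return the index of the bin that contains a 0-based position."""
    if not edges[0] <= position < edges[-1]:
        raise ValueError("position lies outside the binned range")
    # bisect_right finds the first edge strictly greater than position.
    return bisect_right(edges, position) - 1


# Example with a made-up length of 3000 bases: every bin is 10 bases wide.
edges = bin_edges(3000)
first_bin = assign_bin(0, edges)
last_bin = assign_bin(2999, edges)
```

Looking up the bin through the edge list, rather than recomputing it with a division, keeps the assignment consistent with the edges even when the length is not a multiple of the number of bins.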
- corr: Submodule for the estimation of sparse conditional independence
networks, Spearman correlations and their permuted p-values. Furthermore it also
has functions for finding modules within unipartite and bipartite networks and
saving this network information for JavaScript to be used by CorrMapper's
frontend.
- gpd: Submodule for the precise estimation of permuted p-values through the use of extreme value approximation with Generalised Pareto Distribution.
- bivar_modules.R: Implementation of the bivariate module finding algorithm through label propagation, by Steven Beckett.
- corr.py: Main function for the calculation of conditional independence networks, Spearman correlations with permuted p-values and finding modules.
- hugeR.py: Python wrapper functions around the huge R package. This is used for the non-paranormal extension of the Graphical Lasso algorithm and the StARS network selection/regularisation method.
- network.py: Various functions for the creation and plotting of unipartite and bipartite network objects using networkX.
- pairplots.py: Functions for the generation of informative scatter-plots in the general and genomic network explorer interfaces of CorrMapper.
- permutation.py: Contains functions for the efficient (vectorised) and parallelized calculation of Spearman p-values through permutation testing and the correction of these using the GPD method.
- utils.py: Numerous helper functions for the ordering and manipulation of heatmap and network objects, and also for module finding.
- write_js_vars.py: Has functions for writing all the variables needed for the generation of the general network explorer's visualisations (network, heatmap).
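The permutation testing mentioned above can be sketched in a few lines of plain Python. This is a deliberately simple, non-vectorised illustration of a two-sided permutation p-value for a Spearman correlation. It is not the implementation in permutation.py, and all function names are hypothetical.

```python
import random


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / (sxx * syy) ** 0.5


def spearman_permutation_test(x, y, n_perm=1000, seed=0):
    """Return (Spearman rho, two-sided permutation p-value)."""
    rx, ry = average_ranks(x), average_ranks(y)
    observed = pearson(rx, ry)  # Spearman rho is Pearson on the ranks
    rng = random.Random(seed)
    shuffled = ry[:]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the pairing between x and y
        if abs(pearson(rx, shuffled)) >= abs(observed) - 1e-12:
            hits += 1
    # Add one to both counts so the p-value can never be exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


rho, p_value = spearman_permutation_test(
    list(range(10)), [v ** 3 for v in range(10)]
)
```

With this estimator the smallest attainable p-value is 1 / (n_perm + 1), so very small p-values would require a very large number of permutations. This is the limitation that a tail approximation such as the GPD method in the gpd submodule addresses.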
- fs: Submodule for feature selection.
- fs.py: Holds the main wrapper function for the feature selection algorithms, for both categorical and continuous outcome variables. Univariate FDR and Boruta are called from here.
- lsvc_cv.py: Holds functions for FS with LinearSVC, which uses CV and adaptive grid expansion to find the right regularisation parameter.
- mi.py: Methods for calculating Mutual Information in an embarrassingly parallel way.
- mifs.py: Parallelized Mutual Information based Feature Selection module.
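For readers unfamiliar with FDR-based univariate selection, the sketch below shows the Benjamini-Hochberg step-up procedure applied to a list of per-feature p-values. Whether fs.py uses exactly this procedure is an assumption, and the function name is hypothetical; the sketch only illustrates how an FDR threshold turns p-values into a set of selected features.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking the features kept at FDR level alpha.

    (Hypothetical helper illustrating the Benjamini-Hochberg procedure.)
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= alpha * k / m.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff = rank
    # Keep every feature whose rank is at most that cutoff.
    selected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff:
            selected[idx] = True
    return selected


# Example with made-up p-values for four features.
keep = benjamini_hochberg([0.6, 0.001, 0.04, 0.008])
```

Note that the procedure is step-up: a feature can be kept even if its own p-value misses its rank-specific threshold, as long as a feature with a larger p-value passes its threshold.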
- genomic: Submodule for generating and saving the variables for the
genomic network explorer of CorrMapper.
- genomic.py: Contains wrapper for the functions in write_network.py.
- write_network.py: Functions for saving the network object for JavaScript.
- write_tables.py: Functions for saving the table objects for JavaScript.
- util: Submodule of utility functions
- check_uplaoded_files: Contains functions for sanity checking the uploaded data, metadata and annotation files.
- io_params: Has functions for loading and saving the params file, which is used by CorrMapper internally to keep track of the state of an analysis.
This simply collects analyses that failed during the execution of CorrMapper's data integration pipeline. This is mainly for the admin of the app and for debugging purposes.
This is where CorrMapper and celery (the distributed task queue used by CorrMapper for scheduling the jobs of users) will write their log files.
Contains code for reproducing the benchmarking experiments of the upcoming CorrMapper paper. Please read this folder's README for more information.
The folder where the uploaded datasets and performed analyses of the users are saved.