protemalyze

Description

protemalyze is an R package designed to analyze protein embeddings derived from protein language models (pLMs). Protein embeddings are numerical representations of protein sequences, where each protein is mapped to a fixed-length vector in a high-dimensional space (Chandra et al., 2023). pLMs have been shown to capture essential information about protein sequences, and predictive models trained on protein embeddings often perform comparably to models trained on other encodings, even though the embeddings have fewer dimensions (Yang et al., 2018). These embeddings are particularly useful for downstream tasks, such as identifying structurally or functionally similar proteins (Elnaggar et al., 2021). Despite their potential, these tools remain underutilized in the scientific community, primarily due to a knowledge barrier. protemalyze aims to lower this barrier by giving researchers a more accessible entry point for exploring data from pLMs without having to engage with the technical details of the underlying methods.

The protemalyze package was developed using R version 4.4.1 (2024-06-14) on the aarch64-apple-darwin20 (64-bit) platform, running under macOS Sonoma 14.5.


Installation

You can install the development version of protemalyze like so:

install.packages("devtools")
library("devtools")
devtools::install_github("katarinaavucic/protemalyze", build_vignettes = TRUE)
library("protemalyze")

To run the Shiny app:

runProtemalyze()

Overview

To list all the functions available in the package:

ls("package:protemalyze")

To list the datasets in the package:

data(package = "protemalyze") 

To access tutorials for the package:

browseVignettes("protemalyze")

protemalyze contains 12 functions.

  1. loadEmbeddings for loading an embedding matrix from a “csv”, “tsv”, or “h5” file.

  2. processData for removing NULL and duplicate values from an embedding matrix.

  3. generateDistanceMatrix for computing a distance matrix from an embedding matrix.

  4. generateRankMatrix for computing a rank matrix from an embedding matrix.

  5. getClosestPair for retrieving the closest pair for each protein in a ranked matrix.

  6. getFarthestPair for retrieving the farthest pair for each protein in a ranked matrix.

  7. getDistanceByMapping for retrieving the embedding distances from a distance matrix for protein pairs in a mapping.

  8. getRankByMapping for retrieving the embedding ranks from a rank matrix for protein pairs in a mapping.

  9. visualizeEmbeddingUMAP for visualizing the embedding matrix with an interactive plot of the UMAP created from the embedding matrix.

  10. visualizeDistanceDistribution for visualizing the distribution of embedding distances from a distance matrix according to a mapping.

  11. visualizeRankDistribution for visualizing the distribution of embedding ranks from a rank matrix according to a mapping.

  12. runProtemalyze for launching the Shiny app which provides an interactive interface for visualizing and analyzing protein embeddings.

The package also contains an embedding matrix from Escherichia coli (E. coli), called eColiEmbeddingMatrix, and a mapping of the paralogs in E. coli, called eColiParalogMapping. It also contains an embedding matrix from Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), called SARSCoV2EmbeddingMatrix, and an example mapping of the proteins in SARS-CoV-2, called SARSCoV2Mapping. The raw data is also available in inst/extdata/ for external use. Refer to the package vignettes and help pages for more details. An overview of the package is illustrated below.


Contributions

The creator and maintainer of this package is Katarina Vucic, who wrote all of the functions.

  1. loadEmbeddings

Author: Katarina Vucic

The loadEmbeddings function is responsible for loading an embedding matrix from a file, which can be in “csv”, “tsv”, or “h5” format. It uses different file-importing techniques to read the data and store it in a matrix. The function makes use of the readr, rhdf5, and dplyr R packages. The readr package is used for reading CSV and TSV files. The rhdf5 package is used for loading data stored in HDF5 format, especially for larger datasets. The dplyr package is used for organizing the imported data into a readable format. The rhdf5 vignette (Fischer & Smith, 2024) was referenced to read the h5 file.
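As a rough illustration of the CSV case, the sketch below reads a small table and turns it into a numeric matrix with protein IDs as row names. It uses base R's read.csv as a stand-in for readr, and the file layout (IDs in the first column, one column per embedding dimension) is an assumption made for the example, not the documented input format of loadEmbeddings.

```r
# Hypothetical example file: the first column holds protein IDs and the
# remaining columns hold embedding dimensions (an assumed layout).
csvPath <- tempfile(fileext = ".csv")
writeLines(c("protein,dim1,dim2,dim3",
             "P1,0.10,0.20,0.30",
             "P2,0.40,0.50,0.60"), csvPath)

# Read the table, then move the ID column into the row names of a
# numeric matrix so each row is one protein embedding.
rawTable <- read.csv(csvPath, stringsAsFactors = FALSE)
embeddingMatrix <- as.matrix(rawTable[, -1])
rownames(embeddingMatrix) <- rawTable[[1]]

dim(embeddingMatrix)
```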

  2. processData

Author: Katarina Vucic

The processData function is responsible for cleaning the embedding matrix by removing NULL values and duplicates. It uses standard data manipulation techniques to ensure the matrix is free from missing values or duplicate proteins. The function makes use of the dplyr and tibble R packages. The dplyr package is used for data manipulation tasks such as removing NULL and duplicate values. The tibble package is used for organizing the cleaned data into a tibble format.
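The cleaning step can be sketched in base R as follows. This is not the package's dplyr-based implementation; it assumes that "NULL values" correspond to NA entries in R and that duplicates are identified by repeated protein IDs in the row names.

```r
# Toy embedding matrix with one incomplete row and one repeated protein ID.
embeddingMatrix <- rbind(
  P1 = c(0.1, 0.2, 0.3),
  P2 = c(0.4, NA, 0.6),   # contains a missing value
  P3 = c(0.7, 0.8, 0.9),
  P1 = c(0.1, 0.2, 0.3)   # duplicate protein ID
)

# Keep only rows without missing values.
completeRows <- stats::complete.cases(embeddingMatrix)
cleaned <- embeddingMatrix[completeRows, , drop = FALSE]

# Keep the first occurrence of each protein ID.
cleaned <- cleaned[!duplicated(rownames(cleaned)), , drop = FALSE]

rownames(cleaned)
```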

  3. generateDistanceMatrix

Author: Katarina Vucic

The generateDistanceMatrix function computes a distance matrix from the provided embedding matrix. It calculates pairwise distances between protein embeddings, which can be used for further analysis. The function makes use of the parallelDist R package, which performs the underlying pairwise distance computation with parallel processing so that large datasets can be handled efficiently.
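To make the idea of a pairwise distance matrix concrete, here is a minimal base R sketch using stats::dist instead of parallelDist. The Euclidean metric is an assumption chosen for the example; the text above does not state which metric the package uses.

```r
# Three toy proteins embedded in two dimensions.
embeddingMatrix <- rbind(
  P1 = c(0, 0),
  P2 = c(3, 4),
  P3 = c(6, 8)
)

# dist() returns a compact "dist" object; as.matrix() expands it into a
# full symmetric protein-by-protein matrix with zeros on the diagonal.
distanceMatrix <- as.matrix(stats::dist(embeddingMatrix, method = "euclidean"))

distanceMatrix["P1", "P2"]
```

Entry [i, j] of the result is the distance between protein i and protein j, so the matrix grows quadratically with the number of proteins.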

  4. generateRankMatrix

Author: Katarina Vucic

The generateRankMatrix function computes a rank matrix from the embedding matrix. It ranks the distances between proteins, providing a relative measure of how close or far proteins are in the embedding space. The function makes use of the matrixStats R package, which performs the underlying rank computation efficiently for large datasets.
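Row-wise ranking of distances can be sketched in base R as shown below. This stands in for the matrixStats-based computation and makes two assumptions that the text does not confirm: self-comparisons on the diagonal are excluded from the ranking, and ties receive the smallest shared rank.

```r
# Toy symmetric distance matrix for three proteins.
ids <- c("P1", "P2", "P3")
distanceMatrix <- matrix(
  c( 0, 5, 10,
     5, 0,  2,
    10, 2,  0),
  nrow = 3, byrow = TRUE, dimnames = list(ids, ids)
)

# Assumption: a protein is not ranked against itself.
diag(distanceMatrix) <- NA

# Rank each row separately. apply() returns one column per input row,
# so the result is transposed back to a protein-by-protein layout.
rankMatrix <- t(apply(distanceMatrix, 1, rank,
                      na.last = "keep", ties.method = "min"))

rankMatrix
```

In this layout, rankMatrix["P1", "P2"] is the rank of P2 among the neighbours of P1, with 1 meaning closest. Unlike the distance matrix, the rank matrix is generally not symmetric.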

  5. getClosestPair

Author: Katarina Vucic

The getClosestPair function retrieves the closest protein pair for each protein in the ranked matrix. It identifies the most similar protein for each protein based on the calculated ranks. The function makes use of the base R package. The which() function is used to determine the columns with a rank of 1.

  6. getFarthestPair

Author: Katarina Vucic

The getFarthestPair function retrieves the farthest protein pair for each protein in the ranked matrix. It identifies the most dissimilar protein for each protein based on the calculated ranks. The function makes use of the base R package. The which() function is used to determine the columns with a rank equal to the number of rows.
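The which()-based lookups described for getClosestPair and getFarthestPair can be illustrated with a small sketch. It assumes a rank matrix whose diagonal is NA, so the farthest partner is the one with the largest rank present in each row; how the package itself treats self-comparisons is not stated here, and the variable names are illustrative only.

```r
# Toy rank matrix: entry [i, j] is the rank of protein j among the
# neighbours of protein i (1 = closest); self-comparisons are NA.
ids <- c("P1", "P2", "P3")
rankMatrix <- matrix(
  c(NA,  1,  2,
     2, NA,  1,
     2,  1, NA),
  nrow = 3, byrow = TRUE, dimnames = list(ids, ids)
)

# Closest partner: the column holding rank 1 in each row.
closest <- vapply(ids, function(p) {
  colnames(rankMatrix)[which(rankMatrix[p, ] == 1)[1]]
}, character(1))

# Farthest partner: the column holding the largest rank in each row.
farthest <- vapply(ids, function(p) {
  rowRanks <- rankMatrix[p, ]
  colnames(rankMatrix)[which(rowRanks == max(rowRanks, na.rm = TRUE))[1]]
}, character(1))

data.frame(protein = ids, closest = closest, farthest = farthest)
```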

  7. getDistanceByMapping

Author: Katarina Vucic

The getDistanceByMapping function retrieves the embedding distances from the distance matrix for protein pairs in a provided mapping. It returns the pairwise distances for proteins based on a predefined mapping, such as paralog mappings. The function makes use of the base R package. The match() function is used to find the indices of protein pairs in the distance matrix. The cbind() function is used to index the specific positions in the distance matrix for the protein pairs.
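The match() and cbind() combination works because indexing a matrix with a two-column matrix of (row, column) positions returns one value per pair. A minimal sketch, assuming a mapping stored as a two-column data frame (the column names proteinA and proteinB are hypothetical):

```r
# Toy distance matrix for three proteins.
ids <- c("P1", "P2", "P3")
distanceMatrix <- matrix(
  c( 0, 5, 10,
     5, 0,  2,
    10, 2,  0),
  nrow = 3, byrow = TRUE, dimnames = list(ids, ids)
)

# Hypothetical mapping of protein pairs, for example paralogs.
mapping <- data.frame(
  proteinA = c("P1", "P2"),
  proteinB = c("P3", "P3"),
  stringsAsFactors = FALSE
)

# match() converts protein IDs into row and column positions.
rowIdx <- match(mapping$proteinA, rownames(distanceMatrix))
colIdx <- match(mapping$proteinB, colnames(distanceMatrix))

# A two-column index matrix extracts one entry per (row, column) pair.
mapping$distance <- distanceMatrix[cbind(rowIdx, colIdx)]

mapping
```

The same indexing pattern applies to a rank matrix; only the matrix being indexed changes.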

  8. getRankByMapping

Author: Katarina Vucic

The getRankByMapping function retrieves the embedding ranks from the rank matrix for protein pairs in a provided mapping. It returns the pairwise ranks for proteins based on a predefined mapping, such as paralog mappings. The function makes use of the base R package. The match() function is used to find the indices of protein pairs in the rank matrix. The cbind() function is used to index the specific positions in the rank matrix for the protein pairs.

  9. visualizeEmbeddingUMAP

Author: Katarina Vucic

The visualizeEmbeddingUMAP function visualizes the embedding matrix by reducing the data to a lower-dimensional space and creating an interactive UMAP plot. It allows the user to explore the positions of proteins in the embedding space. The function makes use of the umap and plotly R packages. The umap package is used for dimensionality reduction, specifically for creating the UMAP projection of the embedding matrix. The plotly package is used for creating interactive plots of the UMAP results. The plotly scatter and line plots in R and marker styling documentation (Plotly Group, n.d.) were referenced during the creation of the function.

  10. visualizeDistanceDistribution

Author: Katarina Vucic

The visualizeDistanceDistribution function visualizes the distribution of embedding distances from the distance matrix according to a given mapping. It helps in analyzing how the distances between proteins are distributed. The function makes use of the ggplot2 R package, which is used for creating visualizations of the distance distribution. The ggplot2 geom_histogram examples (Wickham, n.d.) were referenced throughout the creation of this function. The Geeks for Geeks example on adding the mean line (GeeksforGeeks, 2024) was referenced to add the median line.
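A histogram with a median line of the kind described above could look roughly like this. The data are simulated and the axis labels, bin count, and line style are assumptions for illustration, not the package's actual plot settings. The plot is only built when ggplot2 is installed.

```r
# Simulated distances standing in for the values retrieved for a mapping.
set.seed(1)
distances <- data.frame(distance = runif(200))

# The median is computed once and reused for the vertical reference line.
medianDistance <- stats::median(distances$distance)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  distancePlot <- ggplot2::ggplot(distances, ggplot2::aes(x = distance)) +
    ggplot2::geom_histogram(bins = 30) +
    ggplot2::geom_vline(xintercept = medianDistance, linetype = "dashed") +
    ggplot2::labs(x = "Embedding distance", y = "Number of protein pairs")
}
```

Printing distancePlot renders the figure. The same layout works for ranks by swapping the distance column for a rank column.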

  11. visualizeRankDistribution

Author: Katarina Vucic

The visualizeRankDistribution function visualizes the distribution of embedding ranks from the rank matrix according to a given mapping. It helps in analyzing how the ranks of proteins are distributed. The function makes use of the ggplot2 R package, which is used for creating visualizations of the rank distribution. The ggplot2 geom_histogram examples (Wickham, n.d.) were referenced throughout the creation of this function. The Geeks for Geeks example on adding the mean line (GeeksforGeeks, 2024) was referenced to add the median line.

  12. runProtemalyze

Author: Katarina Vucic

The runProtemalyze function is responsible for launching the Shiny app for the protemalyze package. It provides an interactive interface for visualizing and analyzing protein embeddings. The function makes use of the shiny R package. The shiny package is used for creating the web application framework and handling the app’s user interface and server logic. The Shiny documentation on Tabsets was used as a reference for designing the overall tabbed layout (Chang et al., 2022), and the shiny fluidPage documentation was referenced to create the layout structure. The DT package is used for displaying interactive data tables within the app, as described in the Shiny article on DataTables (Xie, 2017). The plotly package is used for creating interactive plots, allowing users to explore the data dynamically. File uploads were implemented with the fileInput function using the Shiny documentation on file upload, and file downloads are handled by downloadHandler, as described in the Shiny reference for downloadHandler (Chang et al., 2022). Additionally, techniques from Mastering Shiny (Wickham, 2018) were referenced to manage file uploads, set an upload limit of 50 MB, and retrieve the file type.

Raw per-protein embeddings for E. coli, stored in eColiEmbeddingMatrix, were generated by UniProt (Bateman et al., 2022). The one-to-many paralog mapping for E. coli, stored in eColiParalogMapping, was generated using the Orthologous Matrix (OMA) Browser (Altenhoff et al., 2018). Raw per-protein embeddings for SARS-CoV-2, stored in SARSCoV2EmbeddingMatrix, were generated by UniProt (Bateman et al., 2022). The one-to-many paralog mapping for SARS-CoV-2, stored in SARSCoV2ParalogMapping, is example data and was generated by Katarina Vucic.

All code in the package was written by the creator and maintainer, Katarina Vucic, using the documentation for the various packages listed above. Perplexity.ai was used on occasion to find specific packages or functions when a simple Google search did not retrieve any useful information. For example, “Is there a R package to rank the values in every row of an R data.frame for very large matrices?” returned matrixStats, along with links to sources on where to access relevant functions from the package. However, no code was used from these sources, and all code was written by Katarina Vucic using the documentation or other examples as listed above.

References

Acknowledgements

This package was developed as part of an assessment for the 2024 BCB410H: Applied Bioinformatics course at the University of Toronto, Toronto, Canada. protemalyze welcomes issues, enhancement requests, and other contributions. To submit an issue, use GitHub issues.