Questions about Hetionet and computations on Hetionet #8

Closed
semihsalihoglu-uw opened this issue May 30, 2018 · 6 comments
@semihsalihoglu-uw

Hi,

I sent an email to Daniel asking three broad questions about Hetionet that I was interested in for my research. He asked me to post them as a GitHub issue, so I'm writing them here.

  1. What kind of queries and graph computations do you run on Hetionet? Did you (or other people) run any clustering algorithms on Hetionet? As another example, I see from Daniel's PhD defense presentation that he ran some ML algorithms too. What were these?

  2. What kind of software do you use to run these queries? From Daniel's slides, I see Neo4j, for example, as graph software.

  3. Were there any features that you think were missing from the software that you were using, or things that were difficult to do? One specific feature we had in mind: since your graph is quite heterogeneous, I was curious if you extract simpler, more homogeneous graphs out of Hetionet, say of only gene-gene interactions, then store each as a separate graph and do computations on it? If so, how many different "homogeneous graphs" inside Hetionet do you think you have extracted so far?

Thanks!

Semih

@dhimmel
Member

dhimmel commented May 30, 2018

Thanks @semihsalihoglu-uw for your questions. For those who don't know, @semihsalihoglu-uw is currently conducting a survey on graph database usage, which I believe anyone is free to take. They have previously explored this topic in a work titled The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing. For some reason, the DOI for that article has not been registered (shame on the publisher).

Anyways, here are my answers to the questions:

What kind of queries and graph computations do you run on hetionet?

The primary computation we run is to compute degree-weighted path counts (DWPCs, initially described here). DWPCs measure the extent of connectivity between two nodes along a given type of path (metapath). They are related to path counts (the number of paths), but have an adjustment for node degree to downweight paths through high degree nodes.
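
To make the degree adjustment concrete, here is a minimal sketch on a toy hetnet. It is not the hetio implementation: the edge data, the function names, and the damping exponent of 0.4 are assumptions for illustration. It also assumes that each edge along a path contributes the product of its source and target degrees (counted within that relationship type) raised to the negative damping exponent.

```python
from collections import defaultdict

# Toy hetnet: edges grouped by relationship type (hypothetical data).
edges = {
    "Compound-binds-Gene": [("c1", "g1"), ("c1", "g2"), ("c2", "g2")],
    "Gene-associates-Disease": [("g1", "d1"), ("g2", "d1"), ("g2", "d2")],
}


def build_index(pairs):
    """Return forward adjacency plus per-type source and target degrees."""
    adjacency = defaultdict(list)
    source_degree = defaultdict(int)
    target_degree = defaultdict(int)
    for source, target in pairs:
        adjacency[source].append(target)
        source_degree[source] += 1
        target_degree[target] += 1
    return adjacency, source_degree, target_degree


def dwpc(source, target, metapath, damping=0.4):
    """Sum damped path-degree products over all paths following metapath."""
    indexes = [build_index(edges[edge_type]) for edge_type in metapath]

    def walk(node, step, weight, visited):
        if step == len(indexes):
            return weight if node == target else 0.0
        adjacency, source_degree, target_degree = indexes[step]
        total = 0.0
        for neighbor in adjacency.get(node, []):
            if neighbor in visited:  # a path may not repeat a node
                continue
            # Downweight edges that touch high-degree nodes.
            edge_weight = (source_degree[node] * target_degree[neighbor]) ** -damping
            total += walk(neighbor, step + 1, weight * edge_weight, visited | {neighbor})
        return total

    return walk(source, 0, 1.0, {source})


metapath = ["Compound-binds-Gene", "Gene-associates-Disease"]
score = dwpc("c1", "d1", metapath)
path_count = dwpc("c1", "d1", metapath, damping=0)
```

With `damping=0` every path gets weight 1, so the result reduces to the plain path count (2 for `c1` to `d1` in this toy graph). With a positive exponent, the path through the better-connected gene `g2` contributes less than the path through `g1`.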

We also ran some one-off queries to investigate specific questions. These usually rely on Cypher queries against Hetionet in Neo4j (see examples).

What kind of software do you use to run these queries?

We have three implementations for computing the DWPC. Here they are in order of both date created and sophistication:

  1. Using a function in the hetio Python package which takes a hetio.hetnet.Graph object. This requires the whole graph to be read into memory.

  2. Using a Cypher implementation (background) that computes DWPCs from a Neo4j database. This has the advantages that the graph can be stored on disk and concurrent queries are possible. Generally, we still use functions from the hetio package to template these queries.

  3. Matrix multiplication approaches we're currently developing for the hetmech project. This approach stores hetnets as matrices (one adjacency matrix for each relationship type). We can achieve massive efficiency gains by computing DWPCs with matrix multiplication. The two downsides are that this method doesn't track which paths connect nodes (just how many) and that excluding duplicate nodes in a path is tricky. We have built considerable Python infrastructure to do this. The hetmech infrastructure still uses parts of the hetio package.

So as you can see, each new implementation builds off the previous ones and often depends on parts of the existing codebases.

I was curious if you extract simpler, more homogeneous graphs out of Hetionet, say of only gene-gene interactions

We implement a get_subgraph method for hetio.hetnet.Graph objects. However, we have mostly used this to generate sub-hetnets (usually to create testing networks) rather than homogeneous networks. Since I feel that hetnets are underutilized compared to homonets, I don't spend much time working on approaches for homonets.

Somewhat related, in hetmech, we've created a HetMat data structure that stores hetnets on disk. Each adjacency matrix is a different file (exported from numpy or scipy). In this way, users interested in only certain parts of the hetnet don't have to read all relationship matrices.
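
The one-file-per-relationship idea can be sketched in a few lines. This is not the HetMat API: `save_hetnet`, `load_relationship`, the relationship abbreviations, and the file naming are all hypothetical, and only plain `numpy.save` and `numpy.load` are used.

```python
import tempfile
from pathlib import Path

import numpy as np


def save_hetnet(directory, matrices):
    """Write each relationship's adjacency matrix to its own .npy file."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    for edge_type, matrix in matrices.items():
        np.save(directory / f"{edge_type}.npy", matrix)


def load_relationship(directory, edge_type):
    """Read only the requested relationship's matrix from disk."""
    return np.load(Path(directory) / f"{edge_type}.npy")


# Hypothetical toy matrices keyed by relationship abbreviation.
matrices = {
    "CbG": np.array([[1, 1], [0, 1]]),
    "GaD": np.array([[1, 0], [1, 1]]),
}

with tempfile.TemporaryDirectory() as tmp:
    save_hetnet(tmp, matrices)
    stored_files = sorted(path.name for path in Path(tmp).iterdir())
    # Only the GaD file is read here; CbG stays on disk untouched.
    gad = load_relationship(tmp, "GaD")
```

A real implementation would likely use sparse formats for large relationships, but the access pattern is the same: a computation that needs one relationship type opens one file.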

Were there any features that you think were missing from the software that you were using, or things that were difficult to do?

I think visualization of hetnets is still a pain point, especially visualizing large numbers of nodes and relationships. Of course, visualizing 50 thousand nodes and millions of relationships won't tell you much about specific nodes or relationships, but these views help communicate the network generally. We've used Cytoscape here, but even this became unwieldy and was very manual.

Feel free to follow up with any additional questions. Or if you have nothing else to ask, you can close the issue.

@semihsalihoglu-uw
Author

semihsalihoglu-uw commented May 31, 2018

Thank you very much for the detailed response. The Cypher queries here are especially useful. Two follow-up questions:

  1. Are you building hetmech for performance reasons only? Or were there computations you thought were simply much easier to express as matrix multiplications?
  2. There are several well-developed graph libraries that provide adjacency matrix representations and operations of graphs, such as networkx. Would these serve your needs or did you choose to build hetmech from scratch because there were specific computations they would not satisfy?

And I think VLDB might be registering the DOIs in September when the conference is held. They always do but I'm not sure of the exact timeline.

@dhimmel
Member

dhimmel commented May 31, 2018

Are you building hetmech for performance reasons only?

Hetmech and the related HetMat data structure are motivated primarily by performance. Personally, I don't find matrices and their dot products an intuitive data structure for hetnets. To me, it's much more intuitive to use a data structure that more closely resembles a network and that more easily allows nodes/edges to be annotated with properties. However, the performance improvement from calculating path counts via matrix multiplication turns out to be too compelling to ignore. Matrix multiplication is more efficient than path traversal algorithms in two important ways:

  1. Computation time scales linearly with path length because matrix multiplication does not track which paths arrive at a certain destination. Path traversal methods blow up with increasing path length.
  2. The matrix multiplication approaches compute DWPCs for all source-target node pairs. We usually are interested in all pairs, so it's a huge bonus to get all the DWPCs in a single output matrix.

Together these factors lead to an efficiency improvement of several orders of magnitude.
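
Both points show up in a small numpy sketch. It is an illustration under stated assumptions, not the hetmech code: the function names and toy matrices are hypothetical, the damping exponent of 0.4 is assumed, and each adjacency matrix is weighted by its row and column degrees before multiplying. It does not exclude paths that repeat a node, which matters for metapaths that revisit a node type.

```python
import numpy as np


def degree_weight(adjacency, damping=0.4):
    """Scale each entry by (row degree * column degree) ** -damping."""
    adjacency = np.asarray(adjacency, dtype=float)
    row_degree = adjacency.sum(axis=1)
    col_degree = adjacency.sum(axis=0)
    # Zero-degree rows and columns contain only zeros, so any finite scale works.
    row_scale = np.where(row_degree > 0, row_degree, 1.0) ** -damping
    col_scale = np.where(col_degree > 0, col_degree, 1.0) ** -damping
    return adjacency * row_scale[:, None] * col_scale[None, :]


def dwpc_matrix(adjacencies, damping=0.4):
    """Multiply the degree-weighted matrices along the metapath."""
    result = degree_weight(adjacencies[0], damping)
    for adjacency in adjacencies[1:]:  # one multiplication per extra step
        result = result @ degree_weight(adjacency, damping)
    return result


compound_gene = np.array([[1, 1], [0, 1]])  # rows c1, c2; columns g1, g2
gene_disease = np.array([[1, 0], [1, 1]])   # rows g1, g2; columns d1, d2

# Entry [i, j] is the score between compound i and disease j.
scores = dwpc_matrix([compound_gene, gene_disease])
# With no damping the product reduces to plain path counts.
path_counts = dwpc_matrix([compound_gene, gene_disease], damping=0.0)
```

Each extra step in the metapath adds one matrix product, and the output holds scores for every source-target pair at once. What is lost is the identity of the individual paths behind each entry.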

There are several well-developed graph libraries that provide adjacency matrix representations and operations of graphs, such as networkx. Would these serve your needs or did you choose to build hetmech from scratch because there were specific computations they would not satisfy?

We are always on the lookout for larger projects that provide functionality for hetnets and did consider alternatives before creating HetMat. networkx isn't a good option as its hetnet support is mediocre: the MultiGraph supports relationship types but not node types and doesn't really provide first-class type support. While networkx can export to adjacency matrices, it doesn't really have the functionality we needed to perform computations on them. Building the HetMat data structure from scratch allowed us to do some really cool things:

  1. implement an on-disk data structure for hetnets
  2. enable types of on-disk caching
  3. enable additional types of in-memory caching and optimizations

Currently, our hetnet stack consists of the following tools:

  1. hetio.hetnet.MetaGraph objects for storing the hetnet schema (metagraph)
  2. hetmech.hetmat.HetMat for storing hetnets as matrices
  3. Neo4j for enabling custom Cypher queries, interactive visualization, and path traversal operations

So as our research has progressed, it seems like we're using more tools, since we're finding where each tool excels and using it for just those applications. For some projects, we do use networkx (see obonet for example), just less so for our hetnet work.

@semihsalihoglu-uw
Author

Great, thank you very much for the detailed answer again.

One final question: When integrating data into the hetnet, or even after that when using it for research, did you ever have to do "graph cleaning"? For example, did you notice that some nodes or edges looked suspicious and then have to remove them? Or was each network dataset that you integrated already clean?

@dhimmel
Member

dhimmel commented May 31, 2018

When integrating data into the hetnet, or even after that when using it for research, did you ever have to do "graph cleaning"?

We released an initial version of Hetionet quite a bit before we released version 1.0, which had several additions, changes, and improvements. First, every relationship type required preprocessing, which is where the cleaning occurred. In general, each resource had its own repository where most of the preprocessing took place. Then we had a single notebook that integrated the data from all of the source repositories.

Graph cleaning was an iterative approach. Neo4j was super helpful here because it provided a visual way for us to quickly explore and sanity check the networks. Of course, you often will notice bugs or possible improvements. For example, certain metadata may be missing, certain things may be misspelled, or additional processing may be required. Cleaning the data as well as mapping everything to common standardized identifiers was a laborious process, as were the legal issues surrounding the data reuse.

See this table with all of the resources we integrated and citations to the related supplementary materials.

@semihsalihoglu-uw
Author

Great, thank you very much! I'm closing the issue.

2 participants