Questions about Hetionet and computations on Hetionet #8
Thanks @semihsalihoglu-uw for your questions. For those who don't know, @semihsalihoglu-uw is currently conducting a survey on graph database usage, which I believe anyone is free to take. They have previously explored this topic in a work titled "The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing". For some reason, the DOI for that article has not been registered (shame on the publisher). Anyway, here are my answers to the questions:
The primary computation we run is to compute degree-weighted path counts (DWPCs, initially described here). DWPCs measure the extent of connectivity between two nodes along a given type of path (metapath). They are related to path counts (the number of paths), but have an adjustment for node degree to downweight paths through high degree nodes. We also ran some one-of-a-kind queries to investigate specific questions. These usually rely on Cypher queries to Hetionet in Neo4j (see examples).
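To make the DWPC definition concrete, here is a minimal Python sketch on a toy hetnet. It is not the code from any of the implementations discussed in this thread: the node names, the `build_index` and `dwpc` helpers, and the default damping exponent of 0.4 are all assumptions chosen for illustration. Each path contributes the product of its damped node degrees, and the DWPC is the sum over paths; with a damping exponent of 0 it reduces to the plain path count.

```python
from collections import defaultdict

# Toy hetnet as (source, metaedge, target) triples. All names are made up.
EDGES = [
    ("compound_A", "binds", "gene_1"),
    ("compound_A", "binds", "gene_2"),
    ("compound_B", "binds", "gene_2"),
    ("gene_1", "associates", "disease_X"),
    ("gene_2", "associates", "disease_X"),
    ("gene_2", "associates", "disease_Y"),
]


def build_index(edges):
    """Return outgoing adjacency lists and metaedge-specific node degrees."""
    out_adj = defaultdict(list)
    degree = defaultdict(int)
    for source, metaedge, target in edges:
        out_adj[(source, metaedge)].append(target)
        degree[(source, metaedge)] += 1
        degree[(target, metaedge)] += 1
    return out_adj, degree


def dwpc(source, target, metapath, out_adj, degree, damping=0.4):
    """Return (path count, DWPC) for paths from source to target along metapath."""
    path_count = 0
    total = 0.0
    # Each stack entry: (current node, next metapath step, visited nodes, weight so far)
    stack = [(source, 0, (source,), 1.0)]
    while stack:
        node, step, visited, weight = stack.pop()
        if step == len(metapath):
            if node == target:
                path_count += 1
                total += weight
            continue
        metaedge = metapath[step]
        for neighbor in out_adj.get((node, metaedge), []):
            if neighbor in visited:
                continue  # a path may not revisit a node
            # Downweight the edge by the degrees of both endpoints for this metaedge.
            edge_weight = (degree[(node, metaedge)] * degree[(neighbor, metaedge)]) ** -damping
            stack.append((neighbor, step + 1, visited + (neighbor,), weight * edge_weight))
    return path_count, total


out_adj, degree = build_index(EDGES)
count, score = dwpc("compound_A", "disease_X", ["binds", "associates"], out_adj, degree)
print(count, round(score, 3))
```

The sketch assumes every metaedge connects two different node types, so a single degree per (node, metaedge) pair is enough. The path through `gene_2` contributes less than the path through `gene_1` because `gene_2` has a higher degree on both metaedges.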
We have three implementations for computing the DWPC. Here they are in order of both date created and sophistication:
So as you can see, each new implementation builds off the previous ones and often depends on parts of the existing codebases.
Somewhat related: in hetmech, we've created a HetMat data structure that stores hetnets on disk. Each adjacency matrix is a different file (exported from numpy or scipy). In this way, users interested in only certain parts of the hetnet don't have to read all relationship matrices.
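The one-file-per-matrix idea can be sketched as follows. This is only an illustration of the layout described above, not hetmech's actual API or on-disk format: the `save_matrices` and `load_matrix` helpers, the `.npy` file naming, and the `CbG`/`GaD` metaedge abbreviations are assumptions.

```python
import tempfile
from pathlib import Path

import numpy as np


def save_matrices(directory, matrices):
    """Write each adjacency matrix to its own file, named after its metaedge."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    for metaedge, matrix in matrices.items():
        np.save(str(directory / f"{metaedge}.npy"), np.asarray(matrix))


def load_matrix(directory, metaedge):
    """Read a single adjacency matrix without touching the other files."""
    return np.load(str(Path(directory) / f"{metaedge}.npy"))


# Hypothetical toy matrices: rows are source nodes, columns are target nodes.
matrices = {
    "CbG": np.array([[1, 1], [0, 1]]),  # compound-binds-gene
    "GaD": np.array([[1, 0], [1, 1]]),  # gene-associates-disease
}

with tempfile.TemporaryDirectory() as tmp:
    save_matrices(tmp, matrices)
    stored_files = sorted(path.name for path in Path(tmp).iterdir())
    cbg = load_matrix(tmp, "CbG")  # only CbG.npy is read here

print(stored_files)
print(cbg)
```

For large, sparse relationship types a sparse format would likely be a better fit than dense `.npy` files; dense arrays are used here only to keep the sketch short.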
I think visualization of hetnets is still a pain point, especially visualizing large numbers of nodes and relationships. Of course, visualizing 50 thousand nodes and millions of relationships won't tell you much about specific nodes or relationships, but these views help communicate the network generally. We've used Cytoscape here, but even this became unwieldy and was very manual. Feel free to follow up with any additional questions. Or if you have nothing else to ask, you can close the issue.
Thank you very much for the detailed response. The Cypher queries here are especially useful. Two follow-up questions:
And I think VLDB might be registering the DOIs in September when the conference is held. They always do, but I'm not sure of the exact timeline.
Hetmech and the related HetMat data structure are motivated primarily by performance. Personally I don't find matrices and their dot products an intuitive data structure for hetnets. To me, it's much more intuitive to use a data structure that more closely resembles a network and that more easily allows nodes/edges to be annotated with properties. However, the performance improvement from calculating path counts via matrix multiplication turns out to be too compelling to ignore. The matrix multiplication is faster than path traversal algorithms in two important ways:
Together these factors lead to a several orders of magnitude efficiency improvement.
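As a rough illustration of why matrix multiplication is attractive here, the sketch below computes path counts and DWPCs for every source-target pair at once on a toy hetnet. It is not hetmech's implementation: the matrices, the `degree_weight` helper, and the damping exponent of 0.4 are assumptions for illustration.

```python
import numpy as np

# Hypothetical toy adjacency matrices.
# CbG: rows are compounds [A, B], columns are genes [1, 2].
# GaD: rows are genes [1, 2], columns are diseases [X, Y].
CbG = np.array([[1, 1], [0, 1]])
GaD = np.array([[1, 0], [1, 1]])


def degree_weight(matrix, damping=0.4):
    """Scale each entry by (row degree * column degree) ** -damping."""
    matrix = np.asarray(matrix, dtype=float)
    row_degree = matrix.sum(axis=1)
    col_degree = matrix.sum(axis=0)
    # Leave the scale at zero for nodes without edges to avoid dividing by zero.
    row_scale = np.zeros_like(row_degree)
    col_scale = np.zeros_like(col_degree)
    row_scale[row_degree > 0] = row_degree[row_degree > 0] ** -damping
    col_scale[col_degree > 0] = col_degree[col_degree > 0] ** -damping
    return row_scale[:, None] * matrix * col_scale[None, :]


# Path counts for the compound-binds-gene-associates-disease metapath:
# entry [i, j] is the number of paths from compound i to disease j.
path_counts = CbG @ GaD

# DWPCs for the same metapath: multiply the degree-weighted matrices instead.
dwpc_matrix = degree_weight(CbG) @ degree_weight(GaD)

print(path_counts)
print(dwpc_matrix.round(3))
```

One caveat: a matrix product counts walks rather than paths. The two agree for this metapath because no node type repeats, but a metapath that revisits a node type would need a correction for walks that pass through the same node twice.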
We are always on the lookout for larger projects that provide functionality for hetnets and did consider alternatives before creating HetMat.
Currently, our hetnet stack consists of the following tools:
So as our research has progressed, it seems like we're using more tools, since we're finding where each tool excels and using it for just those applications. For some projects, we do use networkx (see obonet for example), just less so for our hetnet work.
Great, thank you very much for the detailed answer again. One final question: When integrating data into the hetnet, or even after that when using it for research, did you ever have to do "graph cleaning"? For example, did you notice that some nodes or edges looked suspicious and then have to remove them? Or was each network dataset that you integrated already clean?
We released an initial version of Hetionet quite a bit before we released version 1.0, which had several additions, changes, and improvements. First, every relationship type required preprocessing, which is where the cleaning occurred. In general, each resource had its own repository where most of the preprocessing took place. Then we had a single notebook that integrated the data from all of the source repositories. Graph cleaning was an iterative approach. Neo4j was super helpful here because it provided a visual way for us to quickly explore and sanity check the networks. Of course, you often will notice bugs or possible improvements. For example, certain metadata may be missing, certain things may be misspelled, or additional processing may be required. Cleaning the data as well as mapping everything to common standardized identifiers was a laborious process, as were the legal issues surrounding the data reuse. See this table with all of the resources we integrated and citations to the related supplementary materials.
Great, thank you very much! I'm closing the issue.
Hi,
I sent an email to Daniel asking three broad questions about Hetionet that I was interested in for my research. He asked me to post this as a GitHub issue, so I'm writing them here.
What kind of queries and graph computations do you run on Hetionet? Did you (or other people) run any clustering algorithms on Hetionet? As another example, I see from Daniel's PhD defense presentation that he ran some ML algorithms too. What were these?
What kind of software do you use to run these queries? From Daniel's slides, I see Neo4j, for example, as graph software.
Were there any features that you think were missing from the software that you were using, or things that were difficult to do? One specific feature we had in mind: since your graph is quite heterogeneous, I was curious whether you extract simpler, more homogeneous graphs out of Hetionet, say of only gene-gene interactions, then store them as separate graphs and do computations on them. If so, how many different "homogeneous graphs" inside Hetionet do you think you have extracted so far?
Thanks!
Semih