In #324, we implemented a performance/stability tweak that caps the total number of entities held in memory during query execution. This issue tracks a further potential enhancement to our performance/stability efforts.
Imagine a three-node, two-hop predict query that looks something like `fanconi anemia [disease] - [gene] - [gene]` (Example FA query below). The first hop includes many genes that are highly specific to FA (including the canonical genes FANCA, FANCB, FANCC, etc.), but it also includes many "promiscuous" genes like TP53 that explode into many results in the second hop. We expect that results passing through TP53 will be down-prioritized in the subsequent sorting and ranking step based on something like the Normalized Google Distance. But having to track these entities, and all the entities they are linked to, does affect performance and stability -- the query below does not complete on my local machine, possibly due to this issue. Beyond promiscuous genes, there are also many promiscuous diseases (e.g., "cancer"), promiscuous drugs (e.g., "acetaminophen"), anatomical entities (e.g., "brain"), etc.
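For concreteness, the two-hop pattern above could be written as a TRAPI-style query graph roughly like the following. This is a hedged sketch, not the actual collapsed example query: the node/edge IDs and the MONDO CURIE for Fanconi anemia are illustrative.

```python
# Hedged sketch of a two-hop TRAPI query graph for
# fanconi anemia [disease] - [gene] - [gene].
# Node/edge IDs and the disease CURIE are illustrative, not taken
# from the actual "Example FA query" attached to this issue.
fa_query_graph = {
    "nodes": {
        "n0": {"ids": ["MONDO:0019391"], "categories": ["biolink:Disease"]},
        "n1": {"categories": ["biolink:Gene"]},  # first hop: FA-associated genes
        "n2": {"categories": ["biolink:Gene"]},  # second hop: genes linked to those
    },
    "edges": {
        "e01": {"subject": "n0", "object": "n1"},
        "e12": {"subject": "n1", "object": "n2"},
    },
}
```

The second hop (`e12`) is where a promiscuous `n1` binding like TP53 fans out into a very large number of `n2` candidates.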
Here, I suggest we add the ability to optionally filter out promiscuous nodes in the course of query execution. I don't know the exact mechanism for implementing this feature, so it probably deserves some brainstorming. Naively, I propose two options:
- comparing against an explicitly enumerated list of "excluded entities" (either centrally maintained or user-specified, with different pros and cons)
- dynamically assessing promiscuity and removing nodes via node attribute filters (filter on node attributes #174); we could query PubMed or our SEMMEDDB API as data sources to score promiscuity
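To make the two options concrete, here is a minimal sketch that combines them: an explicit exclusion list plus a crude degree-based promiscuity score with a cutoff. The threshold value and the idea of using a precomputed degree index (e.g., derived from SEMMEDDB/PubMed co-occurrence counts) are assumptions for illustration, not an agreed design.

```python
# Hedged sketch: prune "promiscuous" nodes during query execution.
# PROMISCUITY_THRESHOLD and the degree_index source are assumptions.
from typing import Collection, Dict, Iterable

PROMISCUITY_THRESHOLD = 1000  # illustrative cutoff; would need tuning per use case


def promiscuity_score(node_id: str, degree_index: Dict[str, int]) -> int:
    """Crude promiscuity score: the node's known degree (0 if unseen)."""
    return degree_index.get(node_id, 0)


def filter_promiscuous(
    candidates: Iterable[str],
    degree_index: Dict[str, int],
    threshold: int = PROMISCUITY_THRESHOLD,
    excluded: Collection[str] = (),  # option 1: explicit exclusion list
) -> list:
    """Drop nodes that are on the exclusion list or score above the threshold."""
    return [
        n
        for n in candidates
        if n not in excluded and promiscuity_score(n, degree_index) <= threshold
    ]


# Illustrative degree counts: TP53 is linked to far more entities than FANCA.
degrees = {"NCBIGene:7157": 25000, "NCBIGene:2175": 350}  # TP53, FANCA
kept = filter_promiscuous(["NCBIGene:7157", "NCBIGene:2175"], degrees)
# TP53 exceeds the threshold and is pruned; FANCA survives.
```

Applied between hops, a filter like this would keep TP53 out of the second-hop expansion entirely, rather than relying on downstream ranking to bury its results.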
Beyond the question of how we will calculate a promiscuity score, we also need to decide how user intent can be expressed in a TRAPI query (since the use of this filter would likely be use-case dependent). Is there a place where optional parameters can be specified in a TRAPI query?
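One possibility, hedged heavily: if the TRAPI `Query` object tolerates additional properties (an assumption worth verifying against the TRAPI schema), a use-case-dependent option bag could ride alongside the message. The `promiscuity_filter` field and its parameters below are invented for illustration; nothing like them exists in TRAPI today.

```python
# Hypothetical sketch only: "promiscuity_filter" is NOT a real TRAPI field.
# It illustrates where an optional, use-case-dependent parameter could live
# if the top-level Query object accepts extra properties (to be verified).
trapi_query = {
    "message": {"query_graph": {"nodes": {}, "edges": {}}},  # query graph elided
    "promiscuity_filter": {
        "enabled": True,
        "max_degree": 1000,                  # illustrative threshold
        "excluded_ids": ["NCBIGene:7157"],   # e.g., TP53
    },
}
```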
Example FA query