Add network-based ranking to search results for enhanced relevancy #4436
I did some investigation on this: the ability to store a script in a file to tweak relevance was removed in Elastic 6.0. It was replaced by stored scripts, as explained in the documentation.
Using this feature, we would need to create a stored script via the API. Then, in the query, we could reference the stored script by its id (e.g., in a script_score query).
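A minimal sketch of how that could look, using Python and requests; the script id (citation-boost), the index name (opinions), the field names, and the shape of the params are all assumptions for illustration, not the actual implementation:

```python
import requests

ES = "http://localhost:9200"  # assumed local cluster, for illustration only

# 1. Create the stored script. The Painless source multiplies the text
#    relevance score by a per-document boost taken from params, falling back
#    to 1.0 when a document has no stored boost. 'id' is assumed to be an
#    indexed keyword field on the documents.
requests.put(
    f"{ES}/_scripts/citation-boost",
    json={
        "script": {
            "lang": "painless",
            "source": (
                "double b = params.boosts.containsKey(doc['id'].value) "
                "? params.boosts[doc['id'].value] : 1.0; "
                "return _score * b;"
            ),
        }
    },
)

# 2. Reference the stored script by id from a script_score query, passing
#    the (potentially huge) per-document boosts map as params.
requests.post(
    f"{ES}/opinions/_search",
    json={
        "query": {
            "script_score": {
                "query": {"match": {"text": "qualified immunity"}},
                "script": {
                    "id": "citation-boost",
                    "params": {"boosts": {"12345": 2.4, "67890": 1.1}},
                },
            }
        }
    },
)
```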
However, we should assess the performance of this approach, as I assume the stored script would need to handle millions of parameters (one score per document). Another downside is that new documents won't be included in the stored script's parameters, so they won't be boosted by the relevance script until the parameters are updated. An alternative approach for implementing this feature in ES could be to use a rank_feature field and query. This is the method ES recommends for applying static signals (similar to PageRank) to enhance relevance, and according to the documentation it is the best approach in terms of performance.
However, this approach would require adding a new rank_feature field to our mappings, and a rank_feature query clause to apply the boost.
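A rough sketch of that shape, again with requests; the field name citation_score, the index name opinions, and the pivot value are placeholders:

```python
import requests

ES = "http://localhost:9200"  # assumed local cluster

# Add a rank_feature field to the mapping. Values must be positive numbers;
# Elasticsearch indexes them so they can boost scoring very cheaply.
requests.put(
    f"{ES}/opinions/_mapping",
    json={"properties": {"citation_score": {"type": "rank_feature"}}},
)

# Combine the text match with a rank_feature clause in a bool query. The
# saturation function with a pivot keeps a handful of extremely high-scoring
# cases from drowning out text relevance.
requests.post(
    f"{ES}/opinions/_search",
    json={
        "query": {
            "bool": {
                "must": [{"match": {"text": "qualified immunity"}}],
                "should": [
                    {
                        "rank_feature": {
                            "field": "citation_score",
                            "saturation": {"pivot": 10},
                        }
                    }
                ],
            }
        }
    },
)
```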
One downside to this approach is that it requires adding a new field to the documents and populating it with values. If we only add this field to the parent documents, we could potentially update the new values relatively quickly using an asynchronous task. However, I'm unsure whether the ranking algorithm we'll use can compute scores for new documents at indexing time, or if it would require re-evaluating the entire corpus to compute fresh scores for all documents when new ones are added or removed. If the latter is the case, we'd need to perform a whole reindex for this field each time the scores are recomputed. Let me know if you need more details about any of these approaches.
Interesting. Sounds like the second approach is probably the way to go. @mattdahl can correct me, but I think we'd want to update this value once a month or so, which would be a bummer if it meant updating every value in all 10 million decisions, particularly since the vast majority of the cases don't get cited at all, let alone in a given month. I haven't seen how his algorithm works, but it'd be really neat if we had a way of re-running it every month and then only updating some of the cases, like those that got new inbound citations, say, or that changed by more than a certain threshold. If it's like PageRank though, I worry that one new citation will affect the entire network of ranks. Do you have any thoughts, Matt?
Looks like we've been talking about network ranking algos for a while and we've got some other research on such things over here: #643. Before we solve this one, we should be sure to check out that work and make sure we're making the right move. Perhaps @mattdahl or @idc9 will know off the top of their head though!
Sorry for my slow response! Here is my paper. Can you clarify what you think the bottleneck will be? Is it (1) computing the scores in the first place, or (2) getting the updated scores into Elasticsearch and the database?
If the former, then Mike is correct that under my algorithm (which is based on PageRank), any changes to the network will require re-computation of all the scores. So there is no way to only compute scores for the new cases that have just been added in e.g. the last month. I also don't have a good sense of how long the computation would take on the CL network, which is huge. (My paper only looks at the SCOTUS network: cases n = ~27k, citations n = ~200k.) However, my code uses graph-tool, which is written in C++ and is faster than the other main network packages.

If monthly computation is too expensive, it might be possible to simply initialize the scores for new cases (as they get added) at some fixed score, and leave the existing cases' scores intact for longer periods of time. In the paper there is some analysis of what happens to the scores as the network is permuted, which suggests that the scores are actually pretty stable as new citations are added (Table 2). So we may not lose much by not re-calculating everything very frequently. (Though, again, my paper only looks at the SCOTUS network, and the full CL network might exhibit other properties. The Table 2 analysis also doesn't capture the scenario of new cases being added, only new citations.)

If the bottleneck is number 2, I have no idea how ES works and can't really speak to that problem. If it's cheaper to only update in the database cases whose scores have changed by more than some delta, I see no reason not to do that.

On the topic of choosing a network algorithm in the first place, the primary target of my paper is the Kleinberg HITS (or hubs and authorities) algorithm, which was popularized in the legal literature by Fowler et al. and has been widely used. I show that HolmesRank outperforms that, as well as simpler metrics like out-degree and in-degree (Table 1). It's not in the paper, but HolmesRank is also better than PageRank (I think everyone agrees that PageRank is ill-suited to case law because of its temporal bias). I also just re-read Iain's paper. He looks at some additional algorithms that aren't in my paper, the best of which is CiteRank, a time-aware version of PageRank. He also notes the surprising performance of out-degree. HolmesRank builds in both of these things (though it justifies them on different theoretical grounds: for me, the idea is to capture the notion of Shepardizing in the legal research process). But we don't have direct comparisons here. And if all of this is too expensive to compute anyway, it might make sense to just do something simple (like in-degree or out-degree).
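For a sense of what the computation step involves, here is a minimal sketch using graph-tool, with plain PageRank as a stand-in (HolmesRank modifies the recursion, so this only illustrates the shape and cost of the job; the toy edge list stands in for the real citation table):

```python
import graph_tool.all as gt

# Toy (citing_id, cited_id) pairs; the real CourtListener network would have
# millions of edges loaded from the citation table.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "C")]

g = gt.Graph(directed=True)
# hashed=True maps arbitrary ids to vertices and returns the id property map.
ids = g.add_edge_list(edges, hashed=True)

# Plain PageRank as a cost proxy for a PageRank-family algorithm. Any change
# to the network requires rerunning this over the whole graph.
pr = gt.pagerank(g, damping=0.85)

scores = {ids[v]: pr[v] for v in g.vertices()}
print(scores)
```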
Thanks for the notes, Matt. This is such an interesting problem space. I haven't read your paper, but if it's anything like PageRank, I think both computing the scores and propagating them into Elastic will be expensive. Doing that once is fine; doing it over and over would make us sad. The ideal solution is one that we could do on a realtime basis, but that's probably impossible. For calculating the scores, if I recall correctly from when we last did PageRank scores, there are two CPU- and memory-expensive steps: first building the citation network in memory, then running the algorithm over it to compute the scores.

For propagating into the search engine, with Solr, the output was a CSV file, so that part was easy. With Elastic, it seems like it'll be a lot harder, since we'll need to make 10M update requests to the index each time. Ouch. So I guess I'm left trying to think of hacks.
Or maybe we just use out-degree? This is just the number of items a case cites and that's it, right? If we use this, then I suppose we can easily do it in realtime, which is pretty great.
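If out-degree wins on simplicity, it really can be computed per document at indexing time, with no pass over the whole network; a sketch, where cited_opinion_ids is a hypothetical field name:

```python
def citation_score_from_out_degree(opinion_doc: dict) -> int:
    """Out-degree is just the count of opinions this case cites, so it can be
    set on each document as it is indexed, with no global recomputation.
    'cited_opinion_ids' is a hypothetical field name, for illustration."""
    return len(opinion_doc.get("cited_opinion_ids", []))
```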
This is a helpful way to think about it! I just ran a quick test, and it's good news for us, because the "hack" you suggested is available to us.
This is possible in Elasticsearch. However, this still leaves the problem of step 2, calculating the scores. I do not know of a way to easily update pre-existing scores, so I think we have three options.
At least as a first pass (and possibly as a permanent one), I lean toward option 1. The main objection would be that it is atheoretical, because there is little reason to believe that the number of precedents cited would be a good measure of importance. However, both Iain's work and my own show that out-degree is actually surprisingly okay. In terms of accuracy, my paper shows: HolmesRank > In-Degree > HITS > Out-Degree > Binary Expert Judgments. (My paper is also published now for anyone interested: https://doi.org/10.1111/jels.12401.) Finally, there is step 3: propagating the scores to ES. I don't have any insight about how to do this efficiently.
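On step 3, one plausible route is the Bulk API with partial-document updates, sending only the cases whose score moved by more than some delta; a sketch, with the index and field names as placeholders:

```python
import json
import requests

ES = "http://localhost:9200"  # assumed local cluster

def bulk_update_scores(changed_scores, index="opinions", chunk_size=5000):
    """Push partial updates for changed scores via the Bulk API, in chunks.

    changed_scores maps document id -> new citation_score; only documents
    whose score actually changed should be included, to keep request volume
    far below the full 10M corpus.
    """
    items = list(changed_scores.items())
    for start in range(0, len(items), chunk_size):
        lines = []
        for doc_id, score in items[start : start + chunk_size]:
            lines.append(json.dumps({"update": {"_index": index, "_id": doc_id}}))
            lines.append(json.dumps({"doc": {"citation_score": score}}))
        requests.post(
            f"{ES}/_bulk",
            data="\n".join(lines) + "\n",
            headers={"Content-Type": "application/x-ndjson"},
        )
```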
Again, this is all so helpful, @mattdahl.

I think the way forward is to start experimenting with larger data sets and see how the computation and the updates perform at scale.

Let's start by making a small script that can measure some of these things. Maybe you tell it the max edge count to compute, we run it, it measures, we take the next step after that.
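A starting point for that measuring script might look like the following, assuming a text file of citing_id/cited_id pairs and graph-tool for the ranking step (all names here are hypothetical):

```python
import sys
import time

import graph_tool.all as gt

def benchmark(edge_file: str, max_edges: int) -> None:
    """Time graph construction and a PageRank run on the first max_edges
    citations, so we can watch cost grow with network size."""
    edges = []
    with open(edge_file) as f:
        for i, line in enumerate(f):
            if i >= max_edges:
                break
            citing_id, cited_id = line.split()
            edges.append((citing_id, cited_id))

    t0 = time.perf_counter()
    g = gt.Graph(directed=True)
    g.add_edge_list(edges, hashed=True)
    t1 = time.perf_counter()
    gt.pagerank(g)  # stand-in for the real ranking algorithm
    t2 = time.perf_counter()

    print(f"{len(edges)} edges: build {t1 - t0:.1f}s, rank {t2 - t1:.1f}s")

if __name__ == "__main__":
    benchmark(sys.argv[1], int(sys.argv[2]))
```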
Over in #4312 (comment), we have a little discussion about a new network ranking algorithm that @mattdahl is working on.
He says:
I think this would be great. In the past, with Solr, we computed pagerank once/month, and it cranked out a CSV file that we used to boost results. It worked fine until we moved Solr to its own server and it became too difficult to move the CSV over to that server from the one that computed it.
I suspect Elastic will have a better solution. @albertisfu, do you know if Elastic has a way of pulling boosting scores from a file? Any ideas if it can be done from S3 or something like that?
Matt, if this is doable technically, I'd love to consider doing it. Can you share your paper? I've been wanting a better network rank for a while.
I think if we put this in place, we'd wind up with quite a few relevance signals layered together.
Any others I'm forgetting?
That's feeling like a lot, but maybe it's fine. Maybe we wouldn't need the time decay with this algo?