Add "Create an Alert" for RECAP searches #612
I thought about this, but I haven't done it for the moment. The thing that's slowing me down is that most of the PACER alert systems (like Docket Alarm) will go and check a docket for you on some sort of regular basis. I'm afraid that if we create an alert system, people will expect that kind of service. I don't think this is hard though, since we already have alerts for two object types (oral args and opinions). I also want to create alerts for dockets themselves. This would use the same system as the regular search, just filtered to a particular query.
So this turns out to have two complicated problems:
I think what we can do, to fix both of these problems, is to only do alerts for the text of documents. What we can do then is limit our alerts to documents that got text since the last time the alerts ran — in other words, only trigger on PDF text. We don't have to think about the problem that the docket name might have changed, and we don't have to think about the gazillions of docket entry descriptions that we otherwise would be searching against. It also solves problem 1, because all of those files could be kept in a sidecar index or even in a little database table that could be a lot smaller. This limits our alerts a bit, yep, but it isn't horrible and it simplifies them a bunch. Apologies if this is a bit of confused rambling. Working through this is busting my brain a bit.
I pondered this one some more. My first solution was:
That was weak, but solved part of the problem. New idea:
That will suck when the following happens:
In other words, whenever we create the item with a subset of the fields, we won't trigger on those fields until we get the PDF associated with that document. This is...not great, but it's also not terrible. Some fields will be left out of alerts part of the time. Another solution is to keep a Solr index containing the diff of new content whenever we get it. There's probably a way to identify which fields are new when we get something and to only store those into Solr. This way, alerts would only trigger on new content, not on old, and we'd have a weird Solr index with a bunch of partial objects. For example, say we had three fields representing pizza toppings:
In other words, this index only keeps track of new information since the last time the item was updated. It's a diff index, if you will. Next, we run alerts, and we search for "any pizza with onions". We run it against all of the fields, topping1, topping2, and topping3, and we learn that pizza #1 is a match! Great. We return that result. This seems like it could work, assuming that doing the diffs as I describe them here isn't too terribly difficult. I don't love it because it's complicated, but nothing is ever easy.
Another idea that could solve this neatly: we keep a table of the documents that have been triggered for a given alert. So in essence, you can only get an alert for a document once, no matter how many times it gets new pieces of data. I think this is the solution to this issue I've been looking for.
Here's a service selling these alerts and calling them "economical" at about $10 each per month: https://www.courtalert.com/Business-Development-Realtime-Federal-Complaints.asp We should really get this figured out.
Features:
Lots more discussion on this today. A few things to note and reiterate:
So the plan going forward is:
@mlissner Here is a summary of the different features and requirements of this project we've been discussing, a brief overview of the architecture we could use, and some questions so we can agree on the approach and start working on this project. Since the percolator doesn't support join queries such as parent-child queries, the plan is to use separate percolator indexes for Dockets and RECAPDocuments.
This index will include all the Docket fields that are currently part of the
This index will include the same fields as We won't need to perform an additional flattening process either for the percolator mapping or the documents ingested before percolating them. Following this approach, when a new Docket is created/updated, it will be percolated into the When a new RECAPDocument is added/updated, the However, here are the limitations I found with this approach. Consider this query: It returns 2 Dockets and 17 Docket entries in the frontend. But the It matches the 17 RECAPDocuments. This is possible because the The problem is the following: Initial Docket Then we receive an upload that updates the Docket We'll percolate the Docket into It'll be matched and trigger the alert; that's good. However, now all the RECAPDocuments have also been updated with the new Docket So, it's possible that many of those
To do this, we'll need to percolate every
But this query only allows percolating one document at a time. So we'll need to repeat this query 10,000 times in the example scenario (and there can be worse scenarios with many more RDs that would need to be percolated). We can also use the multi-search API to execute many requests at once. I'll need to measure the performance of this approach, but I think it won't be as performant as we wanted and it could use a lot of resources, considering
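To make the multi-search idea concrete, here's a minimal sketch (the index name and document fields are hypothetical) of how a batch of percolate requests could be composed for the `_msearch` API, one document per request:

```python
import json

def build_percolate_msearch_body(index, documents):
    """Compose an Elasticsearch multi-search (msearch) body that
    percolates each document in its own request, since this percolate
    query accepts one document per search."""
    lines = []
    for doc in documents:
        lines.append(json.dumps({"index": index}))  # msearch header line
        lines.append(json.dumps({
            "query": {"percolate": {"field": "query", "document": doc}}
        }))
    # msearch bodies are newline-delimited JSON, ending with a newline
    return "\n".join(lines) + "\n"

# Illustrative index name and documents
body = build_percolate_msearch_body(
    "recap_percolator",
    [{"docket_id": 1, "description": "motion to dismiss"},
     {"docket_id": 1, "description": "order granting"}],
)
```

Batching this way saves round trips, but as noted above the cluster still evaluates each document against the saved queries separately, so it doesn't reduce the fundamental work.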
Can match a This problem could be solved only by applying a join query, but unfortunately, they're not supported by the percolator. So maybe we could just inform users that an Alert involving a query with a party filter won't work the same as in the frontend or the API. Basically, if they want consistent results similar to the ones they could get in the frontend or the API, they should only mix party filters with other Docket fields and avoid using a query string, since a query string could also match Or we could also identify whether an Alert query has a party field and alert the user during its creation about its limitations, or avoid creating it. Changes in the UI: We'll need a UI where users can save RECAP Search Alerts. During the creation, they can decide if they want to match Something like: So we could create one or two alerts with the same query, one for the Docket alert and/or one for the RECAPDocument alert. The query version we'll store in the percolator will be the one specific to the document type, excluding all the join queries. We already have these queries that are used to get the Docket and RECAPDocument counts separately. So these queries can be indexed to their percolator index, either Avoid triggering duplicate alerts: We need to avoid an alert being triggered more than once by the same Docket.
To do that, we planned to use a bloom filter that will keep track of the alerts that have been sent so they're not triggered more than once. However, I think the bloom filter is possibly not the right approach. We could have a global bloom filter to store Docket-Alert pairs so we can know when one has already been triggered and avoid triggering it again. The problem with this global filter is that it'll grow too fast, since new elements will be added for every alert that is triggered. So it'd be better to have one bloom filter for each Alert in the database so it can store the But the problem I see with the bloom filter is:
The problem is false positives. A false positive would make us believe an alert had already been sent when it hadn't, so we'd skip it. Since false negatives are not possible, there is no possibility of duplicate alerts, which is good. But I recall we discussed that it's more important not to miss any alerts. We could reduce the probability of getting a false positive by selecting a big bloom filter size and a good hash function, but possibly it's better to just use a SET. So the alternative approach is just to create a Redis SET for each alert and store each
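For reference, the false-positive tradeoff being weighed here can be quantified with the standard Bloom filter approximation; the numbers below are illustrative, not sized for our data:

```python
import math

def bloom_false_positive_rate(n_items, m_bits, k_hashes):
    """Approximate false-positive probability of a Bloom filter:
    p ~= (1 - e^(-k*n/m))^k, where n items were inserted into a
    filter of m bits using k hash functions."""
    return (1 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes

# e.g. 10,000 triggered IDs in a 1,000,000-bit filter with 7 hashes
p = bloom_false_positive_rate(10_000, 1_000_000, 7)
```

Even a tiny probability here means occasionally skipping an alert that should have been sent, which is why a plain SET (no false positives at all) is the safer choice for this use case.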
Adding new elements to the set or checking if an ID is already in the set can be done in constant time.
So if an ID is already in the SET, we just skip sending the alert. Grouping alerts: Another requirement is grouping RT alerts whenever possible, according to #3102. As described in the issue, I think the only way to achieve this if we end up using the percolator (with the inverse query this won't be required) is to add a wait before sending the alerts, so that if more alerts are matched during the waiting time, we can send them in a single email.
Let me know what you think.
About percolators and parent-child queries... This is a real bummer, and you're right that it comes with a bunch of tradeoffs. From a design perspective, I want this to be as seamless as possible. Ideally, people do a query in the front end, create an alert, and it works with some minor tradeoffs or imperfections. I'm afraid that where we're headed is:
If that's where we land, I think we're in trouble, so we have some work ahead of us to sort this out. I did a little research on the parent-child percolator, and one person said they could use nested queries against the percolator. Is that a crazy idea?
This is really not performant, and I want organizations to be able to make 10,000 alerts each, creating millions of alerts. If each alert takes 1s, it'll never work, or at best it'll take a huge number of servers. It's also a bummer that it's not actually real time. About changes in the UI... I'd like to avoid users even thinking about dockets vs. documents when they make alerts, but it could be the solution we need. Maybe instead of one button to create alerts, we offer two:
If we do that, I bet most of our users would be satisfied, and it'd be clear that cross-object queries are going to work poorly. I think if we do this, we remove the docket-related stuff from the RECAP percolator (no case name, etc.). Q: Can we robustly identify when somebody is making a cross-object query? On not sending dups... Yeah, bloom filters would have been fun. Someday. Redis sets it is. About grouping alerts... Spec:
Thank you!
In this case, we need to convert the parent-child queries to nested queries and evaluate whether they match the same documents or whether this conversion results in some false positives or false negatives. However, this approach might still have some performance issues to evaluate. For instance, to percolate a document against the nested query percolator, we need a document structured as parent-nested-child, which means creating a JSON document in memory representing the I'll do some tests around this idea to measure its performance.
Great, I like the idea of offering two different buttons to create these alerts. I'll propose some ideas about where and how we can place these buttons in the UI instead of the current bell icon.
Does this mean that if we can't find a good solution to percolate the original frontend query without the problems described above, we'll end up offering Docket Alerts that only match Docket fields and Document Alerts that only match RECAPDocument fields?
I'm afraid this is not possible. While we can robustly identify whether the query contains combined parent or child filters, or even within a string query if the user is using advanced syntax: The problem is that we cannot identify a cross-object query in simple query strings. For instance: This query can match the string within some Docket fields or RECAPDocument fields. For example, it can match a docket with part of the string in the In cases like these, it's impossible to know (without performing the actual query) whether the query can match only Dockets, only RECAPDocuments, or both.
Great! Thanks for your answers and suggestions to explore. |
At first, I was thinking that if you got changes to a docket, you could just percolate only the docket info, without any documents at all, and that if you got changes to a document, you could just nest that one document within the docket. But now I'm realizing that if you have a query like:
You might get this information today:
You wouldn't send an alert, because the docket_name doesn't match. But tomorrow the name might be updated to match the query. I think that implies that:
If that's right, I think we're getting close to a solution. There is one other strategy that we can use here, which is to create a new index each day, and to use that for a sweep (so many sweeps, lately!). The idea here is that querying 500M items is really hard and slow. The only thing you really need to query is the new stuff of the day. So, what you do is:
If we do that in addition to the nested queries, we'd be sure to get everything, and we'd have a somewhat performant solution, since we'd only be querying against a couple hundred thousand items. Most alerts would be real time. Some cross-object ones even would be, and the corner case would be covered. What do you think?
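As a rough sketch of how that sweep could run (all names are hypothetical, and the predicate stands in for a saved Elasticsearch query), assuming we only evaluate the day's new or changed documents and skip anything already alerted on:

```python
def run_daily_sweep(todays_docs, cross_object_alerts, already_sent):
    """Sketch of the midnight sweep: check only the day's new/changed
    documents against the cross-object alert queries, skipping any
    (alert_id, doc_id) pair that has already triggered.

    `cross_object_alerts` maps alert_id -> a predicate standing in for
    the saved search query; `already_sent` is the dedup set."""
    hits = []
    for alert_id, matches in cross_object_alerts.items():
        for doc in todays_docs:
            key = (alert_id, doc["id"])
            if key in already_sent:
                continue  # never alert twice for the same document
            if matches(doc):
                already_sent.add(key)
                hits.append(key)
    return hits

# Illustrative usage: one alert, two documents from today's index
sent = set()
hits = run_daily_sweep(
    [{"id": 1, "text": "apple pie"}, {"id": 2, "text": "banana"}],
    {"alert-7": lambda doc: "apple" in doc["text"]},
    sent,
)
```

The point of the design is the inner loop's size: a couple hundred thousand documents from the day, not the full 500M-item index.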
Yes. Kind of lame though.
So that would be considered a cross-object query, because it queries across more than one object type.
Yeah, exactly. The problem is directly related to docket field updates that can impact cross-object queries.
Yeah, the nesting you have in mind (nesting the document into the docket and percolating it) is to allow us to match any cross-object query, including parties, correct? Because most of the docket fields (except parties) are indexed into each
Yeah, that's correct.
This is a pretty good idea! Just some questions:
So we'll need to categorize the alerts into two types:
Got it. I think we can use the same set in Redis proposed to avoid duplicates. So we'll have one set per alert that will store either One question here is how we're going to tag/schedule alerts sent at midnight.
In the percolator approach in OA, we do the following: We trigger webhooks in real-time for all the rates.
I think we can do the same for alerts that are matched in real time by the percolator. But what would happen, for instance, with RT cross-object alerts that were missed during the day? Once they're a hit at midnight, will we group all the alerts missed during the day for a user and send a single email? If the missed alerts belong to the daily rate, maybe we could execute the midnight sweep and see if some of the daily alerts had hits, then append those hits to the ones scheduled during the day via the percolator and send a single daily email. For weekly and monthly rates, I think it can work similarly: use the midnight sweep to store and schedule the hits according to the rate so they can be sent every week or month alongside the ones scheduled by the percolator.
Yes.
I've been thinking about this for years, but I was hoping not to have to do this, so hadn't mentioned it. But here we are. :)
Yeah, I think so, but if we do a sloppy job that says some docket-only or document-only alerts are actually cross-object, that'd be fine, right? We'd run an extra query, but wouldn't send extra alerts. So long as we err in that direction, we should be fine?
Pretty simple. We run our sweep, and send an email with the sweep results. We put extra words in the subject and body to explain what it's about. We continue doing everything with the daily, weekly, and monthly alerts same as before.
Sure, or you can send them in separate payloads. Whatever is easier. I assume it's easier to keep these processes separate.
Real time, and then we document the situation by saying:
What else??? :)
Hmm, I think in that scenario we'd miss alerts. If we mistakenly tag On the other hand, if we mistakenly tag docket-only or document-only queries as cross-object, we'll run extra queries, but we won't send duplicates. So, we should be careful when categorizing the queries or run the sweep over all the queries.
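A conservative classifier along these lines might look like the sketch below. The field lists are illustrative, and any free-text portion forces the cross-object bucket, since, as discussed, a simple query string can't be classified without running the query:

```python
# Hypothetical field lists; the real ones would come from the Docket
# and RECAPDocument Elasticsearch mappings.
DOCKET_FIELDS = {"caseName", "docketNumber", "court", "party"}
DOCUMENT_FIELDS = {"description", "plain_text", "document_number"}

def classify_alert_query(fielded_terms, has_free_text):
    """Conservatively classify a saved query as 'docket', 'document',
    or 'cross-object'. Unfielded (free-text) terms can match either
    object type, so any free text forces 'cross-object'. Erring in
    that direction only costs an extra sweep query, never a missed
    alert."""
    uses_docket = any(f in DOCKET_FIELDS for f in fielded_terms)
    uses_document = any(f in DOCUMENT_FIELDS for f in fielded_terms)
    if has_free_text or (uses_docket and uses_document):
        return "cross-object"
    if uses_docket:
        return "docket"
    if uses_document:
        return "document"
    return "cross-object"  # unknown shape: err toward cross-object
```

Anything tagged cross-object just gets picked up by the midnight sweep in addition to (or instead of) real-time percolation, which is the safe direction to err in.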
Perfect, this is for the RT rate, right?
Got it. So, in this case, to continue doing everything for the daily rate as before, we'd just need to ensure the normal daily send is triggered after the midnight sweep so those hits can be included in the daily send. For the weekly and monthly rates, if we want to include the results of that day as well, they should also run after the midnight sweep. We just need to confirm that the sending time is okay, because if the midnight sweep runs at 12:00 and takes 15 minutes to complete, we'll need to send the Daily, Weekly, or Monthly emails after 12:15. If that's not okay, they can be included the next day for the daily rate, or the next week or month for the other rates.
Yeah, we want to err on the side of saying something is cross-object if we have any doubt. I agree.
Yes, exactly.
Yeah, that's fine. Nobody cares if their daily/weekly/monthly alerts are exactly at midnight. I'd suggest making this one command that does both the sweep and the daily/monthly/weekly alerts, so that it does one task, then the other, without having to schedule things and hope the sweep is done before the other one triggers.
Excellent! I think we now have a good plan to work on. Thank you!
And you. Epic! Would it make sense to do two PRs? One for regular alerts and one for the sweep?
Yeah, I agree, two PRs make sense for the project!
@mlissner working on adding the Percolator index for RECAP, I have a couple of new questions that can impact the Percolator and the sweep index design. We plan to percolate RD documents nested within a Docket document to trigger alerts for RECAPDocuments reliably or percolate only Docket documents without any nested RD for triggering Docket-only alerts.
Considering we'll solve the issue related to cross-object queries on document updates by using the daily midnight sweep, should we still divide alerts into two types? Alternatively, we could have only one type of alert, For example, consider the following scenario:
Does it still make sense, for user needs, to divide alerts for Dockets and RDs? I think it would still make sense to split alerts into two types if users want to know which type of object triggered the hit in the alert they're receiving. The following question is also a bit related, regarding the alert structure and also the percolator design. What would be the structure of the emails if we end up offering two types of alerts to users? Or in case we go with only one alert type for RECAP.
Using nested queries (or a plain approach I'm experimenting with) and the midnight sweep, we'll be able to send alerts for non-cross-object and cross-object queries. I imagine the email for a document alert (RD) like this: In this case, imagine the If this is the expected behavior, it will be important to show the Docket fields similarly to the frontend, so users can understand why those documents are being matched even if the keywords don't appear directly within the RDs. Or should the behavior be that only Depending on the expected behavior, the design of the daily sweep index will change. If RDs can be matched by Docket fields, we could simply mirror the current RECAP search index. If the second option is preferred, we would need to switch to an index with a nested-documents approach and expect RDs to be matched independently of Docket fields. And the Docket-only alert (in case we still need to split alerts) can be as follows: The main difference is that it will include only Dockets without any entries.
No. If we can avoid the two alert types, we really should. That was just an idea if we couldn't find a better way forward.
I think the emails should try to match the search results as much as possible. So when there's a docket result, it just shows dockets, when it's a document result, it shows the nested document inside the correct docket. To the user, it should be seamless and they shouldn't think about documents vs. dockets when making or receiving alerts (just like they don't when doing a query).
I don't think that's ideal, but if it matches the front end, it's OK. Ideally, the email would just have a docket if it only matched on docket fields (and the front end too, I guess).
That is better, yes.
I think this just depends on how hard it is. We'd like to go for the ideal, correct solution at first. How much more time would you estimate it would take? If it's just a little bit, then let's go for it. If it's more than a few days, maybe it's better to do it as an enhancement down the road?
Got it. Yeah, I agree, this seems like the better approach.
Yeah, this is how the frontend currently behaves. However, I don't think it's an issue in the frontend because the documents matched by docket fields don't affect the meaning of the search; they're just "extra documents." However, in alerts, I can see how it could be confusing because users might think those documents are directly related to the keywords in the query when the only relation is that they belong to the docket.
Well, going for the correct solution, which involves only matching One of the things we should take care of when doing this is ensuring that this new approach follows the results in the frontend as closely as possible without missing anything, except for matching |
Great. If it's only two days, let's go for it.
I don't understand what you mean here. Can you explain it for me?
Perfect! Already working on it.
Sure, I meant that a nested document will look something like this:
The first issue is related to the number of documents nested within the parent document. The more nested documents there are, the more memory is required to handle them within the cluster. The documentation states that the default limit is 10,000 to prevent performance issues: Since we'll only add documents created or modified during the day, I expect the number of nested documents in a Docket to not be too large and to always remain below 10,000. The other issue concerns indexing and updates. A document with a nested field is treated as a single unit, so in order to change a parent field or add/update a nested document, Elasticsearch requires performing a complete reindexing of the document. Thus, if a Docket contains too many documents for the day, and we continue adding/updating it, the cluster internally performs a full reindex of this document every time it is changed. I think we have two options to handle this process:
Yes, that's a very safe assumption. The worst cases are bankruptcy cases, which can have something like 100 docs in a day, but that's still not common. For indexing performance, it sounds like there are three options:
Number 1 is least performant, but simplest. Number 2 saves some bandwidth, but doesn't help the cluster ("internally the cluster will perform the complete reindex for each of these requests"). Number 3 will save the Elastic cluster some effort at the cost of the database and batching everything at the end. My vote is for number 1 because we should avoid premature optimizations, and it seems simplest. I also always prefer processes that spread performance over the day instead of doing big pulls all at once, which also favors number 1. So I'd suggest we go that direction, and if it isn't fast enough we can upgrade to a better solution?
Got it. Yeah, I agree, option 1 is the simpler solution, and we can perform optimizations if they're required. Just a note about option 1: every time we get a Docket or RD add/update during the day, we'll need to create a JSON file holding the updated state of the case (Docket fields + RDs) for that day. Therefore, a database query that filters out the
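A sketch of that per-day assembly might look like this; the field names are illustrative, and it assumes a `date_modified` timestamp on each RD, with the filtering done here in Python standing in for the real database query:

```python
from datetime import date, datetime

def build_sweep_document(docket, recap_documents, day):
    """Assemble the document indexed into the daily sweep index: the
    docket's fields plus only the child documents created or modified
    on `day` (all field names are illustrative)."""
    return {
        "docket_id": docket["id"],
        "caseName": docket["caseName"],
        "recap_documents": [
            rd for rd in recap_documents
            if rd["date_modified"].date() == day
        ],
    }

# Illustrative usage: only the RD touched today is included
docket = {"id": 1, "caseName": "Lorem v. Ipsum"}
rds = [
    {"id": 10, "date_modified": datetime(2024, 5, 1, 9, 0)},
    {"id": 11, "date_modified": datetime(2024, 5, 2, 9, 0)},
]
doc = build_sweep_document(docket, rds, date(2024, 5, 2))
```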
I thought we had already solved all the important issues here and had a solid plan, but while working on it, more problems and questions have surfaced. In #4127, I created the RECAP Search Alerts sweep index based on the nested approach, along with a compatible nested query approach, and tested them. I found the following: Most of the tests that involved docket-only field text queries, RECAP-only field text queries, or any combination of filters (only docket, only RECAPDocuments, or combined fields on filters) worked well, with no difference from the parent-child approach used in the frontend and the API. However, tests related to cross-object field text queries are failing. One of the reasons we decided to try the nested index approach was to avoid sending false-positive alerts when ingesting RECAPDocuments that belong to a docket that could trigger alerts involving only docket fields (which should be triggered only by a docket ingestion). Since those fields are indexed in the regular index into each RECAPDocument, documents could trigger alerts in those cases. In fact, the nested index approach helps to prevent the problem described above. When using a nested query, it can only reach fields in the child documents, and the parent query component can only reach parent fields. However, this feature of nested documents is also causing cross-object text queries not to work. For instance, consider the following case document:
Now consider the query: In the current RECAP Search this query will return:
This is possible because parent fields like This also allows fielded text queries to work properly: Will return:
However, I found that this type of cross-object query is failing in the nested index approach because, in the nested approach, the document looks like this:
Parent fields are not indexed into each nested document. So a query like: Looks like:
So the whole phrase "Motion Ipsum America" is not found in any of the child documents or parent documents within their local fields. It is also not possible to query a parent field from the nested query or, conversely, a nested field within the parent query context. The solution would be the same as we used in the parent-child approach: index parent fields into each nested document. However, this brings us back to where we began, as we wouldn't be able to avoid triggering alerts for docket-only queries when ingesting any RECAPDocument that contains the docket fields indexed. In brief:
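The scoping difference described here can be illustrated with a toy simulation, using plain substring matching to stand in for Elasticsearch's analysis and the "Motion Ipsum America" example from above:

```python
def flat_child_match(docket_fields, child_fields, terms):
    """Parent-child style: docket fields are denormalized into each
    child document, so a query's terms may be satisfied by a mix of
    parent and child text within one child."""
    text = " ".join(
        list(docket_fields.values()) + list(child_fields.values())
    ).lower()
    return all(t.lower() in text for t in terms)

def nested_match(docket_fields, child_fields, terms):
    """Nested style: all terms must be found either within the
    parent's own fields or within a single child's own fields; the
    two scopes never mix."""
    parent_text = " ".join(docket_fields.values()).lower()
    child_text = " ".join(child_fields.values()).lower()
    return (all(t.lower() in parent_text for t in terms)
            or all(t.lower() in child_text for t in terms))

docket = {"caseName": "United States of America v. Lorem"}
entry = {"description": "Motion Ipsum"}
terms = ["Motion", "Ipsum", "America"]
```

Here the flat (denormalized) form matches because "America" comes from the docket and "Motion Ipsum" from the entry, while the nested form finds no single scope containing all three terms.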
So the proposed solution and its trade-offs are explained in the following tables: Sweep index:
Percolator:
In summary, the proposed solution considering the trade-offs above will be as follows: Create the sweep index using the same structure as the regular search index for RECAP.
However, in this approach, we’ll still have a partial issue regarding Docket indexing and cross-object queries. We’ll be able to trigger alerts for cross-object queries via the sweep index but only for those For instance, consider the following example:
Original case:
During the day, the Docket is updated to Also during the day, we get an upload for
At midnight, the sweep index runs the query: Final questions and considerations:
Thanks for all the details, and shoot, I guess it's back to plan A. Using the highlighting to do alert filtering is a great and novel idea. Nice one. Let's do that.
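A sketch of what that highlight-based filtering could look like, with illustrative field names, checking whether a sweep hit matched exclusively on docket fields:

```python
# Illustrative set of docket-level fields that can appear in the
# `highlight` section of a search hit.
DOCKET_HIGHLIGHT_FIELDS = {"caseName", "docketNumber"}

def matched_only_docket_fields(hit):
    """Inspect the `highlight` section of a search hit to see whether
    the query matched exclusively on docket fields, in which case a
    docket-style alert (no entries) could be sent instead of one that
    includes unrelated documents."""
    highlighted = set(hit.get("highlight", {}))
    return bool(highlighted) and highlighted <= DOCKET_HIGHLIGHT_FIELDS

# Illustrative hit whose only highlighted field is the case name
hit = {"highlight": {"caseName": ["<em>Lorem</em> v. Ipsum"]}}
```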
You're right, it should be included in this case and we can't document our way out of it, so when this is the case, we'll just have to do the batch updating. A few thoughts:
Yeah, I think this is better. Just collect all the dockets that changed during the day and index them at the end of the day into the sweep index along with all their child documents.
Sure, I think this is a perfect task for the Reindex API. We have used it in the past to migrate an entire index, but it's possible to use it with a query that selects which documents should be moved. We could just select dockets with a Thanks!
An update/question here. We're going to use a Redis For instance, if The SET will be updated as: So, if the alert is triggered again by the same Docket ID 400, it won't be sent again. However, I noticed that we'll need to keep track of the RD IDs because RD-only alerts or cross-object queries can also be triggered by RDs. So, it is possible that an alert is triggered by an RD in the case, and then it can also be triggered by a different RD in the same case. If we only store the Docket ID that triggered the alert, we won't be able to trigger the alert for different RDs in the same case. Therefore, I'm thinking of updating the SET to store Docket or RD IDs, so it'll look like this:
or holding a
This way, we can keep track of the Dockets or RDs that triggered the alert independently. Does that sound right to you? Can alerts be triggered by different RDs in the same case?
Yes, this is exactly right. I think two keys per alert look tidier, but I'd suggest something more like
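A sketch of the per-alert dedup with typed members; the `d:`/`r:` prefixes are purely illustrative stand-ins (the exact key naming wasn't settled here), and the class simulates Redis SADD semantics in memory:

```python
class AlertDedup:
    """In-memory stand-in for the per-alert Redis SET. A real
    implementation would call SADD, which returns 1 when the member
    is new and 0 when it already exists."""

    def __init__(self):
        self.sets = {}  # alert_id -> set of typed member strings

    def should_send(self, alert_id, object_type, object_id):
        # e.g. "d:400" for a Docket, "r:1002" for a RECAPDocument;
        # the prefixes are hypothetical.
        member = f"{object_type}:{object_id}"
        triggered = self.sets.setdefault(alert_id, set())
        if member in triggered:
            return False  # already alerted for this docket/document
        triggered.add(member)
        return True

# Illustrative usage: a second RD in the same case still triggers
dedup = AlertDedup()
```

Typing the members is what allows a different RD in the same case to still trigger the alert, while a repeat of the same Docket or RD is suppressed.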
Following up on the question raised during the RECAP Search Alerts architecture review regarding the Percolator's lack of support for parent-child queries and the possibility of contributing to a solution. According to elastic/elasticsearch#2960 (comment), the main issue they describe with adding support for parent-child queries is the need to store documents in memory to percolate them one by one. The approach they seem to be considering involves percolating a parent document. Since the document can only trigger queries involving parent fields, it would be necessary to retrieve all child documents belonging to the parent (from the main documents index), store them in memory, and percolate each one individually to match This approach would be resource intensive, especially regarding memory, and would not scale well with parent documents that have a high cardinality of child documents.
This is now in beta. We're working on the pricing for it and experimenting with it. |