-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
terms facet gives wrong count with n_shards > 1 #1305
Comments
+1 for this bug. I have reproduced the problem using just documents with one field.
Got:
|
After more experimenting I'd say that it's caused by naïve top-N facet merging. Something like Phase three here should be added: Or something smarter: http://netcins.ceid.upatras.gr/papers/Klee_VLDB.pdf |
Right, the way top N facets work now is by getting the top N from each shard, and merging the results. This can give inaccurate results. The phase 3 thingy is not really a solution, will read the paper though :) |
Right. The "phase 3 thingy" only solves the second problem:
But while the first problem can easily stay unnoticed, the second one leads to quite a lousy user experience in very common faceting scenarios: The web displays top 10 commenters with number of their comments in parentheses. You click the last one with 30 comments and Voilà, it shows her 48 comments. This is hard not to notice. Meanwhile a slow and not-really-always-working workaround: |
@kimchy: why do you think so? isn't it a similar approach as query then fetch? now only for facets and its counts instead of docs and its score? (it even could be implemented in the second query to avoid traffic, no?) Another workaround would be 'routing': every facet value should go to the same shard. But then the data should have a lot facet values and all of them should have not a too high count to avoid wrong balancing ... |
Is this on the roadmap? We discovered this same problem with regular fields and large datasets with multiple nodes and shards. This is problematic for us since the faceted counts can fluctuate wildly based on what filters we apply. Is there a workaround? |
I still see this bug in 19.4, and the wrong counts are also a show-stopper for us. Adding "150 or another magical constant" only helps for tiny datasets; does there exist a more robust workaround until this is fixed properly? |
+1 for a better workaround or fix The only completely robust workaround I know of is to limit yourself to one shard. Obviously that pretty much takes the 'elastic' out of 'elasticsearch', but if accurate counts are critical to your app and your index isn't too big you might get away with it... |
Any update on this? I do have an application where counts are critically important. As you are saying to limit to one shard for a particular entity which suppose to undergo such queries. |
@jmchambers Not really -- you can slice the data into many one-shard indices, with a weekly (daily, hourly, ...) rotation. You can use aliases (or wildcards, ...) to query them in a reasonable way from your application. Would mulitple one-shard indices work around the facet problem? |
@karmi: only if the indices are large enough. To my knowledge, the scoring stats (IDF etc) are computed per-index (actually, per-shard by default). So comparing relevancy scores across indices is like comparing apples to oranges, even with But once the indices are large enough, this doesn't matter anymore as the stats converge (assuming identical doc/word distribution in each index, for the stats to actually converge). |
+1 for a fix. The workaround (one-shard, multiple one-shard index, etc.) don't sound convincing. |
I'm a bit confused on this issue: I have five shards and the top value occurs on each shard as #1, by a large margin. Still, when I facet, I get a 20% smaller number than when I simply query for the value directly. |
Won't you run into the same problem when doing a faceted count over multiple indices? |
Any news about this issue? |
Any hope of this being fixed in the near future? |
For what it's worth, the multiple index/single shard approach is working for us. Our application has a new index per week of data anyway, so it's not actually too painful for our current usage. |
+1 for a fix |
+1 for a fix Terms Stats Facet seems to be affected with the same issue as well. |
It seems there is no fix forthcoming; how about at least making the most common scenario less annoying/more palatable? What I mean by that: when ES returns let's say 20 facets (with the wrong counts...), it could automatically run a second request against all shards asking for an accurate count only for these exact 20 facets. Then add up these accurate counts, and only return that as a result. Would that make sense? |
+1 for a fix. Incorrect facet counts is resulting in a bad user experience |
Seems this bug did not fix completely I did a facet query and sort by count, the results were right: but if sort by total (same query), the results dose not right: |
+1 |
Hi, I am trying to run a search query using multiple facets, and I want the facet numbers to update in the same way as the site Zappos does. For example, when I search for Nike in the men's shoe section, I get facets for Brand and Color. [cid:93D592E2-8A16-49EC-81A2-5F5804ED1836] When I narrow my search by Brand (by selecting 'Nike'), the numbers within the Brand facet do not change, but the numbers within the Color facet change to reflect the narrowed search results. I can widen my search by selecting 'Nike Action', and still no numbers within the Brand facet are updated, but numbers in the Color facet have been updated to reflect the additional results. [cid:56423869-2538-47FA-8F43-D6355DC5AA48] I can see the same expected results if I select a term within the Color facet: [cid:BDC2DCEB-B7A5-4E86-BFF6-3DDD68E7ADED] I can think of two ways to do this using Elastic Search, and I'm looking for any guidance/suggestions as to the best way to implement this.
Which of the two options would perform better? Is there a better option that I'm not thinking of? I can add example JSON if it will help explain my thoughts. Thanks! Bill |
@finn1317, this is something you should ask on the elasticsearch mailing list. I don't think it's related to the issue being discussed here. |
+1 for a fix. |
Question, could this bug be related to the following result i'm getting. query uses the |
@jpountz Thank you for the explanation. I agree with @piskvorky that a parameter like "favor_accuracy_over_perfofmance" (or whatever name) will be a good idea moving forward. but for now my question is: is there some sort of a formula, or rule of thumb as to what the shard_size parameter should be? as I specified in my question on the mailing list -- https://groups.google.com/d/msg/elasticsearch/xYRaJa04wfc/lcmEvP53rR4J -- I have 5 shards in my index -- and I see this problem when my agg result is less than 50 -- is that because also, for some reason my |
@igal-getrailo Did you read the post above from @karmi "Check out the new cardinality aggregation added in #5426." Did you try to us aggregations instead of using facets? There is a nice presentation from ElasticSearch team here: https://speakerdeck.com/bleskes/deep-dive-into-aggregations |
@megastef yes, I am using the aggregations and am getting the same problem (didn't try the cardinality aggs yet but will check it out). I only found this bug report here after trying to get a response in the mailing list for over a week, but as can be seen in the example I posted at -- https://groups.google.com/forum/#!msg/elasticsearch/xYRaJa04wfc -- I am passing and I actually went over Boaz's slides a couple of days ago, but I guess without the audio of the presentation the slides are not that helpful. anyway, at first glance Cardinality Aggregations look promising. I will look deeper into it. thanks! |
@megastef it looks like the the docs were updated since I last checked them, and apparently setting the thank you and @jpountz for your help. |
There is no formula, it depends on what your data looks like. Terms that are missing or have inaccurate counts in the final response are those that are not in the top |
@jpountz Thank you! |
It seems that aggregations are the future and facets are on the way out. However, I want faceting functionality and I am trying to determine whether I should use facets with a set |
Have you considered using the aggregation feature within the facets feature to fix this issue? |
It would be nice to know if there are plans to review/work on this in near future? |
Is this issue likely to be "fixed" given that it appears facets, en masse, are deprecated? See the warning on http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-facets.html. |
@lloy0076 it definitely won't be fixed in facets. However, the same problem exists in aggregations, due the to nature of distributed systems. We'll leave this issue open as a reminder - at some stage we hope to provide a slower but more accurate version. |
Yes, if you use the "scatter/gather" pattern you naturally hit the limits of a two-phase approach. When building Crate.IO* (a SQL Database using Elasticsearch, https://github.com/crate/crate) we decided to extend it to support a distributed map/reduce algorithm to provide accurate aggregations. |
@jodok Because @clintongormley used words "slower" and "more accurate" in "we hope to provide a slower but more accurate version", can you comment on the performance and accuracy of your implementation? Specifically, @clintongormley is implying numbers will be closer to the truth, but not necessarily 100% accurate, and I'm wondering if in case of CrateIO they are always 100% accurate? And, of course, I'm wondering if you've compared performance of CrateIO approach vs. ES approach (at scale)? Thanks. |
@otisg the performance of "normal" queries that don't need the 2-phase map/reduce approach is the same. but when it comes to exact aggregations - they need one more cluster roundtrip. instead of 1. collect data (all shards), 2. merge result (one node) - an additional step is added: 1. collect (all shards), 2. reshuffle data (to multiple nodes) and do aggregation, 3. merge (one node). this requires one more network roundrip in the cluster, but usually this is even less than a ms. this method scales horizontally, performance heavily depends on cardinality and the size of the result set. happy to help running tests on sampledata on a demo cluster. |
@clintongormley @kimchy couldn't ES "borrow" the the approach/idea that Crate or Solr(Cloud) use? Both give 100% accurate counts. |
Facets have been deprecated for a while now and have been removed from master/2.0 (#7337) with aggregations being the replacement. Time to close this ticket? |
Agreed |
I think it is worth to keep open this ticket as the aggregations framework suffers from the same issue with terms aggregates on more than 1 shard. |
I've opened #12316 instead |
I'm working with nested documents and have noticed that my faceted search interface is giving the wrong counts when I have more than one shard. To be more specific, I'm working with RDF triples (entity > attribute > value) and I'm nesting the attributes (called predicates in my example):
I've created a shell script (https://gist.github.com/1196986) that recreates the problem with a fresh index. The created data set has these totals:
With only one shard the following query gives the correct counts no matter what the size parameter is set to:
However, with more than one shard the size parameter affects the accuracy of the counts. If it is equal to or greater than the number of terms returned by the facet query (5 in this case) then it works fine. However, the terms at the bottom of the list start to display low counts as you reduce the size parameter:
With "size" : 4
With "size" : 3
With "size" : 2
So it looks like the sub-totals from some of the shards aren't being included for some reason. BTW I'm on ubuntu and the problem seems to affect all versions of ES I've tried (17.0, 17.1 and 17.6). Any ideas...?
P.S. absolutely loving ES - it's made my life a lot easier :)
The text was updated successfully, but these errors were encountered: