terms facet gives wrong count with n_shards > 1 #1305

jmchambers · 2011-09-06T09:32:06Z

I'm working with nested documents and have noticed that my faceted search interface is giving the wrong counts when I have more than one shard. To be more specific, I'm working with RDF triples (entity > attribute > value) and I'm nesting the attributes (called predicates in my example):

{
  "_id" : "512a2c022f0b4e3daa341e6c8bcf6c2f",
  "url": "http://dbpedia.org/resource/Alan_Shepard",
  "predicates": [
    {
      "type": "type",
      "string_value": ["thing", "person", "astronaut"]
    }, {
      "type": "label",
      "string_value": ["Alan Shepard"]
    }, {
      "type": "time in space",
      "float_value": [216.950]
    },
    ... lots more
  ]
}

I've created a shell script (https://gist.github.com/1196986) that recreates the problem with a fresh index. The created data set has these totals:

thing (30)
creative work (20)
video game (10)
tv show (10)
people (10)

With only one shard the following query gives the correct counts no matter what the size parameter is set to:

{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "facets": {
    "type_counts": {
      "terms": {
        "field": "string_value",
        "size": 5
      },
      "nested": "predicates",
      "facet_filter": {
        "term": {
          "type": "type"
        }
      }
    }
  }
}

However, with more than one shard the size parameter affects the accuracy of the counts. If it is equal to or greater than the number of terms returned by the facet query (5 in this case) then it works fine. However, the terms at the bottom of the list start to display low counts as you reduce the size parameter:

With "size" : 4

thing (30)
creative work (20)
video game (10)
tv show (9)

With "size" : 3

thing (30)
creative work (15)
video game (9)

With "size" : 2

thing (30)
creative work (15)

So it looks like the sub-totals from some of the shards aren't being included for some reason. BTW I'm on ubuntu and the problem seems to affect all versions of ES I've tried (17.0, 17.1 and 17.6). Any ideas...?

P.S. absolutely loving ES - it's made my life a lot easier :)

The text was updated successfully, but these errors were encountered:

losomo · 2011-10-17T15:48:52Z

+1 for this bug. I have reproduced the problem using just documents with one field.
Complete test (as run on version 0.17.8):
as Perl script: https://gist.github.com/1292897
generated shell test: https://gist.github.com/1292912
The first error can be seen on line 951.
Expected:

"terms" : [
 {
 "count" : 10,
 "term" : "user 10"
 }
 ],

Got:

"terms" : [
 {
 "count" : 7,
 "term" : "user 9"
 }
 ],

losomo · 2011-10-17T18:02:37Z

After more experimenting I'd say that it's caused by naïve top-N facet merging. Something like Phase three here should be added:

http://wiki.apache.org/solr/DistributedSearchDesign#Phase_3:_REFINE_FACETS_.28only_for_faceted_search.29

Or something smarter: http://netcins.ceid.upatras.gr/papers/Klee_VLDB.pdf

kimchy · 2011-10-18T00:17:47Z

Right, the way top N facets work now is by getting the top N from each shard, and merging the results. This can give inaccurate results. The phase 3 thingy is not really a solution, will read the paper though :)

losomo · 2011-10-18T08:01:05Z

Right. The "phase 3 thingy" only solves the second problem:

The recall is not 100% (some values may be simply missing)
The numbers are often lower than they should be.

But while the first problem can easily stay unnoticed, the second one leads to quite a lousy user experience in very common faceting scenarios: The web displays top 10 commenters with number of their comments in parentheses. You click the last one with 30 comments and Voilà, it shows her 48 comments. This is hard not to notice.

Meanwhile a slow and not-really-always-working workaround:
Set size to what you want + 150 or another magical constant when requesting the facets and then filter out the top size facets on the client side.

karussell · 2011-10-19T07:23:21Z

The phase 3 thingy is not really a solution

@kimchy: why do you think so? isn't it a similar approach as query then fetch? now only for facets and its counts instead of docs and its score? (it even could be implemented in the second query to avoid traffic, no?)

Another workaround would be 'routing': every facet value should go to the same shard. But then the data should have a lot facet values and all of them should have not a too high count to avoid wrong balancing ...

agnellvj · 2012-05-24T20:09:28Z

Is this on the roadmap? We discovered this same problem with regular fields and large datasets with multiple nodes and shards. This is problematic for us since the faceted counts can fluctuate wildly based on what filters we apply. Is there a workaround?

piskvorky · 2012-06-03T15:21:26Z

I still see this bug in 19.4, and the wrong counts are also a show-stopper for us.

Adding "150 or another magical constant" only helps for tiny datasets; does there exist a more robust workaround until this is fixed properly?

jmchambers · 2012-06-03T16:44:53Z

+1 for a better workaround or fix

The only completely robust workaround I know of is to limit yourself to one shard. Obviously that pretty much takes the 'elastic' out of 'elasticsearch', but if accurate counts are critical to your app and your index isn't too big you might get away with it...

tarunjangra · 2012-06-24T20:33:21Z

Any update on this? I do have an application where counts are critically important. As you are saying to limit to one shard for a particular entity which suppose to undergo such queries.
Does this make sense to route entities by entity type in corresponding shards?

karmi · 2012-07-17T07:11:56Z

The only completely robust workaround I know of is to limit yourself to one shard. Obviously that pretty much takes the 'elastic' out of 'elasticsearch' (...)

@jmchambers Not really -- you can slice the data into many one-shard indices, with a weekly (daily, hourly, ...) rotation. You can use aliases (or wildcards, ...) to query them in a reasonable way from your application. Would mulitple one-shard indices work around the facet problem?

piskvorky · 2012-07-17T08:07:53Z

@karmi: only if the indices are large enough. To my knowledge, the scoring stats (IDF etc) are computed per-index (actually, per-shard by default). So comparing relevancy scores across indices is like comparing apples to oranges, even with dfs_query_then_fetch.

But once the indices are large enough, this doesn't matter anymore as the stats converge (assuming identical doc/word distribution in each index, for the stats to actually converge).

ajhalani · 2012-08-09T20:59:22Z

+1 for a fix. The workaround (one-shard, multiple one-shard index, etc.) don't sound convincing.

Downchuck · 2012-08-27T01:50:49Z

I'm a bit confused on this issue: I have five shards and the top value occurs on each shard as #1, by a large margin. Still, when I facet, I get a 20% smaller number than when I simply query for the value directly.

jrydberg · 2012-09-01T19:21:26Z

@karmi

you can slice the data into many one-shard indices, with a weekly (daily, hourly, ...) rotation

Won't you run into the same problem when doing a faceted count over multiple indices?

giamma · 2012-10-29T13:10:40Z

Any news about this issue?

tgruben · 2012-11-08T12:55:10Z

Any hope of this being fixed in the near future?

danfairs · 2012-11-22T09:29:56Z

For what it's worth, the multiple index/single shard approach is working for us. Our application has a new index per week of data anyway, so it's not actually too painful for our current usage.

markwaddle · 2012-11-27T08:16:14Z

+1 for a fix
My application has 7m+ docs of varying sizes (600GB index) and growing so 1 shard is not feasible.
For my application I am willing to trade performance (and/or hardware resources) for accurate facet counts.

webmusing · 2013-04-05T07:14:47Z

+1 for a fix

Terms Stats Facet seems to be affected with the same issue as well.

piskvorky · 2013-04-07T11:40:29Z

It seems there is no fix forthcoming; how about at least making the most common scenario less annoying/more palatable?

What I mean by that: when ES returns let's say 20 facets (with the wrong counts...), it could automatically run a second request against all shards asking for an accurate count only for these exact 20 facets. Then add up these accurate counts, and only return that as a result. Would that make sense?

vincentpoon · 2013-05-10T18:46:04Z

+1 for a fix. Incorrect facet counts is resulting in a bad user experience

songday · 2013-07-05T08:56:46Z

Seems this bug did not fix completely

I did a facet query and sort by count, the results were right:
"terms" : [ {
"term" : "AAA",
"count" : 59,
"total_count" : 59,
"min" : 1.0,
"max" : 54.0,
"total" : 391.0,
"mean" : 6.627118644067797
}, {
"term" : "BBB",
"count" : 55,
"total_count" : 55,
"min" : 1.0,
"max" : 17.0,
"total" : 154.0,
"mean" : 2.8
}]

but if sort by total (same query), the results dose not right:
"terms" : [ {
"term" : "AAA",
"count" : 56, //this is not right
"total_count" : 56,
"min" : 1.0,
"max" : 54.0,
"total" : 388.0,
"mean" : 6.928571428571429
}, {
"term" : "BBB",
"count" : 56, //this is not right
"total_count" : 56,
"min" : 1.0,
"max" : 17.0,
"total" : 171.0,
"mean" : 3.0535714285714284
}]

tommymonk · 2013-07-23T09:42:17Z

+1

HeyBillFinn · 2013-08-02T14:20:42Z

Hi,

I am trying to run a search query using multiple facets, and I want the facet numbers to update in the same way as the site Zappos does. For example, when I search for Nike in the men's shoe section, I get facets for Brand and Color.

[cid:93D592E2-8A16-49EC-81A2-5F5804ED1836]

When I narrow my search by Brand (by selecting 'Nike'), the numbers within the Brand facet do not change, but the numbers within the Color facet change to reflect the narrowed search results.
[cid:8C58B26B-E7AC-44F3-8DDA-23A0BBD0AD0B]

I can widen my search by selecting 'Nike Action', and still no numbers within the Brand facet are updated, but numbers in the Color facet have been updated to reflect the additional results.

[cid:56423869-2538-47FA-8F43-D6355DC5AA48]

I can see the same expected results if I select a term within the Color facet:

[cid:BDC2DCEB-B7A5-4E86-BFF6-3DDD68E7ADED]

I can think of two ways to do this using Elastic Search, and I'm looking for any guidance/suggestions as to the best way to implement this.

Filtered query using fairly complex facet filters within each facet, including global = true flag.
Top-level filter (which I understand does not affect the facet results) with slightly less complex facet filters within each facet

Which of the two options would perform better? Is there a better option that I'm not thinking of? I can add example JSON if it will help explain my thoughts.

Thanks!

Bill

danfairs · 2013-08-02T14:26:50Z

@finn1317, this is something you should ask on the elasticsearch mailing list. I don't think it's related to the issue being discussed here.

peakx · 2013-08-08T13:42:07Z

+1 for a fix.

SanderDemeester · 2013-08-09T09:14:10Z

Question, could this bug be related to the following result i'm getting.
If i request the first json document where the value for field x is y i get a result back.
But when If i request all distict values for field x i get a list back with values. Byt 'y' is not included.

query uses the all_terms and i match use match_all.

igal-getrailo · 2014-05-03T17:32:39Z

@jpountz Thank you for the explanation. I agree with @piskvorky that a parameter like "favor_accuracy_over_perfofmance" (or whatever name) will be a good idea moving forward.

but for now my question is: is there some sort of a formula, or rule of thumb as to what the shard_size parameter should be? as I specified in my question on the mailing list -- https://groups.google.com/d/msg/elasticsearch/xYRaJa04wfc/lcmEvP53rR4J -- I have 5 shards in my index -- and I see this problem when my agg result is less than 50 -- is that because shard_size has a default of 10 so num_shards * shard_size ?

also, for some reason my size parameter is ignored and I get 10 values no matter what size I pass.

megastef · 2014-05-03T18:55:55Z

@igal-getrailo Did you read the post above from @karmi "Check out the new cardinality aggregation added in #5426." Did you try to us aggregations instead of using facets? There is a nice presentation from ElasticSearch team here: https://speakerdeck.com/bleskes/deep-dive-into-aggregations
As far I understood aggregations will replace facets in the future. So I think the ES team listens to the users.

igal-getrailo · 2014-05-03T21:19:59Z

@megastef yes, I am using the aggregations and am getting the same problem (didn't try the cardinality aggs yet but will check it out).

I only found this bug report here after trying to get a response in the mailing list for over a week, but as can be seen in the example I posted at -- https://groups.google.com/forum/#!msg/elasticsearch/xYRaJa04wfc -- I am passing "aggs": {"available_tags": {"terms": {"field": "tags"}}, "size": 20} with the request.

and I actually went over Boaz's slides a couple of days ago, but I guess without the audio of the presentation the slides are not that helpful.

anyway, at first glance Cardinality Aggregations look promising. I will look deeper into it. thanks!

igal-getrailo · 2014-05-04T00:32:40Z

@megastef it looks like the cardinality aggregation can do the trick when used as a sub-agg of the terms aggregation. in my case, however, I was placing the size param in the wrong scope, and it was therefore ignored. it should have been instead "aggs": {"available_tags": {"terms": {"field": "tags", "size": 20}}}

the docs were updated since I last checked them, and apparently setting the size param properly to a value that's higher than shard_size also sets shard_size, so in my specific case (I don't have a high-cardinality data set) setting the size param properly resolves the issue.

thank you and @jpountz for your help.

jpountz · 2014-05-04T17:08:41Z

but for now my question is: is there some sort of a formula, or rule of thumb as to what the shard_size parameter should be?

There is no formula, it depends on what your data looks like. Terms that are missing or have inaccurate counts in the final response are those that are not in the top shard_size terms on one shard or more. So if your shards are uniform, even a slight increase of the shard size might improve the accuracy of counts a lot.

igal-getrailo · 2014-05-04T18:16:40Z

@jpountz Thank you!

speedplane · 2014-06-12T08:00:52Z

It seems that aggregations are the future and facets are on the way out. However, I want faceting functionality and I am trying to determine whether I should use facets with a set shard_size or a cardinality aggregation and set the precision_threshold. Is there any documentation on the performance / accuracy comparison between those two related options? Is one always better than the other, and if not, what factors influence that design decision?

eliasah · 2014-08-25T07:52:09Z

Have you considered using the aggregation feature within the facets feature to fix this issue?

ajhalani · 2014-10-13T18:48:29Z

It would be nice to know if there are plans to review/work on this in near future?
As some of the previous updates, an option to force extra round-trip to force correct values would be a nice compromise for many users.

lloy0076 · 2015-02-13T00:51:54Z

Is this issue likely to be "fixed" given that it appears facets, en masse, are deprecated? See the warning on http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-facets.html.

clintongormley · 2015-02-13T20:15:03Z

@lloy0076 it definitely won't be fixed in facets. However, the same problem exists in aggregations, due the to nature of distributed systems. We'll leave this issue open as a reminder - at some stage we hope to provide a slower but more accurate version.

jodok · 2015-02-13T21:08:34Z

Yes, if you use the "scatter/gather" pattern you naturally hit the limits of a two-phase approach. When building Crate.IO* (a SQL Database using Elasticsearch, https://github.com/crate/crate) we decided to extend it to support a distributed map/reduce algorithm to provide accurate aggregations.
*(disclaimer: i'm one of the Crate guys)

otisg · 2015-02-14T13:28:41Z

@jodok Because @clintongormley used words "slower" and "more accurate" in "we hope to provide a slower but more accurate version", can you comment on the performance and accuracy of your implementation? Specifically, @clintongormley is implying numbers will be closer to the truth, but not necessarily 100% accurate, and I'm wondering if in case of CrateIO they are always 100% accurate? And, of course, I'm wondering if you've compared performance of CrateIO approach vs. ES approach (at scale)? Thanks.

jodok · 2015-02-14T20:20:13Z

@otisg the performance of "normal" queries that don't need the 2-phase map/reduce approach is the same. but when it comes to exact aggregations - they need one more cluster roundtrip. instead of 1. collect data (all shards), 2. merge result (one node) - an additional step is added: 1. collect (all shards), 2. reshuffle data (to multiple nodes) and do aggregation, 3. merge (one node). this requires one more network roundrip in the cluster, but usually this is even less than a ms. this method scales horizontally, performance heavily depends on cardinality and the size of the result set. happy to help running tests on sampledata on a demo cluster.

otisg · 2015-02-15T03:36:07Z

@clintongormley @kimchy couldn't ES "borrow" the the approach/idea that Crate or Solr(Cloud) use? Both give 100% accurate counts.

ppf2 · 2015-07-14T19:51:39Z

Facets have been deprecated for a while now and have been removed from master/2.0 (#7337) with aggregations being the replacement. Time to close this ticket?

clintongormley · 2015-07-14T20:29:35Z

Agreed

thanodnl · 2015-07-15T12:06:33Z

I think it is worth to keep open this ticket as the aggregations framework suffers from the same issue with terms aggregates on more than 1 shard.

clintongormley · 2015-07-17T12:18:31Z

I've opened #12316 instead

dadoonet mentioned this issue Mar 18, 2013

TermsFacet across index returns wrong results (0.90) #2797

Closed

spinscale mentioned this issue Jul 5, 2013

Inconsistent facet counts #1832

Closed

rashidkpc mentioned this issue Jul 30, 2013

terms panel shows top terms based on *alphabetical* sorting of counts elastic/kibana#301

Closed

jpountz mentioned this issue May 5, 2014

Data mismatch in terms_stats facet #6046

Closed

jpountz mentioned this issue May 27, 2014

Add from support to top_hits aggregator. #6299

Closed

This was referenced Jul 2, 2014

Aggregations: Return an upper bound of the maximum error for terms #6696

Closed

Aggregations: Memory-bound terms #6697

Closed

clintongormley added the discuss label Jul 8, 2014

clintongormley added high hanging fruit and removed discuss labels Jul 18, 2014

rashidkpc mentioned this issue Oct 8, 2014

terms stats facet not accurate when dealing with high cardinality fields elastic/kibana#921

Closed

clintongormley closed this as completed Jul 14, 2015

brettlyman mentioned this issue Jun 6, 2016

Explore option of supporting more flexible search types #12316

Closed

terms facet gives wrong count with n_shards > 1 #1305

terms facet gives wrong count with n_shards > 1 #1305

Comments

jmchambers commented Sep 6, 2011

losomo commented Oct 17, 2011

losomo commented Oct 17, 2011

kimchy commented Oct 18, 2011

losomo commented Oct 18, 2011

karussell commented Oct 19, 2011

agnellvj commented May 24, 2012

piskvorky commented Jun 3, 2012

jmchambers commented Jun 3, 2012

tarunjangra commented Jun 24, 2012

karmi commented Jul 17, 2012

piskvorky commented Jul 17, 2012

ajhalani commented Aug 9, 2012

Downchuck commented Aug 27, 2012

jrydberg commented Sep 1, 2012

giamma commented Oct 29, 2012

tgruben commented Nov 8, 2012

danfairs commented Nov 22, 2012

markwaddle commented Nov 27, 2012

webmusing commented Apr 5, 2013

piskvorky commented Apr 7, 2013

vincentpoon commented May 10, 2013

songday commented Jul 5, 2013

tommymonk commented Jul 23, 2013

HeyBillFinn commented Aug 2, 2013

danfairs commented Aug 2, 2013

peakx commented Aug 8, 2013

SanderDemeester commented Aug 9, 2013

igal-getrailo commented May 3, 2014

megastef commented May 3, 2014

igal-getrailo commented May 3, 2014

igal-getrailo commented May 4, 2014

jpountz commented May 4, 2014

igal-getrailo commented May 4, 2014

speedplane commented Jun 12, 2014

eliasah commented Aug 25, 2014

ajhalani commented Oct 13, 2014

lloy0076 commented Feb 13, 2015

clintongormley commented Feb 13, 2015

jodok commented Feb 13, 2015

otisg commented Feb 14, 2015

jodok commented Feb 14, 2015

otisg commented Feb 15, 2015

ppf2 commented Jul 14, 2015

clintongormley commented Jul 14, 2015

thanodnl commented Jul 15, 2015

clintongormley commented Jul 17, 2015