Reduce synchronization on field data cache #27365

Merged on Nov 14, 2017 (4 commits)

Conversation

tinder-xli
Contributor

During our perf test against Elasticsearch, we noticed two synchronized blocks in the call stack for fetching doc values, which I think are unnecessary and cause a serious GC issue for us, especially when the "size" param is large and docvalue fields are fetched.

There is a synchronized block for getting from the fieldDataCaches map, which is unnecessary when cache != null. The getForField method is called for every hit, so blocking here impacts performance a lot. We suggest changing to double-checked locking and only synchronizing when cache == null.
We see threads waiting on the getForField method in our JFR recording.
[Screenshot: JFR recording showing threads waiting on getForField]
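
For context, a minimal sketch of the double-checked locking pattern being proposed (the field layout and the createCache factory are illustrative, not the exact Elasticsearch source; the comments below debate whether this is safe on a plain HashMap):

// sketch of the proposed change; names are illustrative
private final Map<String, IndexFieldDataCache> fieldDataCaches = new HashMap<>();

IndexFieldDataCache getCache(String fieldName) {
    // fast path: unsynchronized read when the cache already exists
    IndexFieldDataCache cache = fieldDataCaches.get(fieldName);
    if (cache == null) {
        synchronized (this) {
            // re-check under the lock: another thread may have created it meanwhile
            cache = fieldDataCaches.get(fieldName);
            if (cache == null) {
                cache = createCache(fieldName); // hypothetical factory
                fieldDataCaches.put(fieldName, cache);
            }
        }
    }
    return cache;
}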

All Gradle tests passed locally.

For reference, below is a sample query we used for testing:

{
  "stored_fields": "_none_",
  "docvalue_fields": ["user_number"],
  "size": 33000,
  "query": {
    "bool": {
      "must": [
        { "match_all": {} }
      ]
    }
  }
}

This PR links to #27350
@jasontedor @s1monw

@elasticmachine
Collaborator

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

@tinder-fren

+@jasontedor for the double-checked locking solution.

@jasontedor
Member

test this please

@jasontedor
Member

jasontedor left a comment

LGTM.

@jasontedor added the :Search/Search, v5.6.5, v6.0.1, v6.1.0, v7.0.0, and >enhancement labels on Nov 14, 2017
@jasontedor changed the title from "use double checked locking to avoid synchronization when cache is already created" to "Reduce synchronization on field data cache" on Nov 14, 2017
@jasontedor merged commit 2e86357 into elastic:master on Nov 14, 2017
jasontedor pushed a commit that referenced this pull request Nov 14, 2017
The field data cache can come under heavy contention in cases when lots
of search threads are hitting it for doc values. This commit reduces the
amount of contention here by using a double-checked locking strategy to
only lock when the cache needs to be initialized.

Relates #27365
jasontedor pushed a commit that referenced this pull request Nov 14, 2017
The field data cache can come under heavy contention in cases when lots
of search threads are hitting it for doc values. This commit reduces the
amount of contention here by using a double-checked locking strategy to
only lock when the cache needs to be initialized.

Relates #27365
@s1monw
Contributor

s1monw commented Nov 14, 2017

I am not convinced this is a good fix. From my perspective, double-checked locking doesn't apply in this case since it assumes all objects inside the map are safely published, which is by no means guaranteed. You can apply this pattern if you read a volatile variable to prevent synchronization, but not to access a HashMap; the JavaDocs explicitly state:

If multiple threads access a hash map concurrently, and at least one of
the threads modifies the map structurally, it must be
synchronized externally.  (A structural modification is any operation
that adds or deletes one or more mappings; merely changing the value
associated with a key that an instance already contains is not a
structural modification.)  This is typically accomplished by
synchronizing on some object that naturally encapsulates the map.

Either we use a ConcurrentHashMap here or we synchronize the map. As it stands, this change is broken depending on the implementation of the HashMap.
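
A rough sketch of the ConcurrentHashMap alternative (createCache is a hypothetical factory standing in for the real cache construction, not the actual Elasticsearch code):

private final ConcurrentMap<String, IndexFieldDataCache> fieldDataCaches = new ConcurrentHashMap<>();

IndexFieldDataCache getCache(String fieldName) {
    // computeIfAbsent is atomic per key and safely publishes the created value,
    // so the common read path needs no explicit synchronized block
    return fieldDataCaches.computeIfAbsent(fieldName, this::createCache);
}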

@tinder-xli
Contributor Author

@s1monw this fix is synchronizing on some object that naturally encapsulates the map during addition, which is the synchronized (this) line.
Can you help me understand your exact concern by giving some examples which demonstrate the flaw of my fix?

@danielmitterdorfer
Member

I completely agree with @s1monw here. This line

IndexFieldDataCache cache = fieldDataCaches.get(fieldName);

does not establish a happens-before edge with any other access to fieldDataCaches. In simpler terms: calling fieldDataCaches.get() outside of a synchronized block risks reading a stale value for that key.

@s1monw
Contributor

s1monw commented Nov 14, 2017

@s1monw this fix is synchronizing on some object that naturally encapsulates the map during addition, which is the synchronized (this) line.
Can you help me understand your exact concern by giving some examples which demonstrate the flaw of my fix?

While your fix might work with some map implementations, you have no guarantee that it does on all JVMs or spec implementations; it might work on all of them, but then only by accident. You are reading from a data structure that is not thread-safe while it is concurrently modified. You are basically applying the single-writer/multiple-reader principle, but this data structure is only designed for a single writer and a single reader. For instance, if the IndexFieldDataCache instance is not safely published, i.e. has some initialization that happens outside of the constructor, you are reading a stale value. This might all accidentally work, but there are too many ifs in this equation for me to accept this fix as it is. I am OK with using a ConcurrentHashMap for the map instead; then your fix is just fine. @danielmitterdorfer WDYT?

@jimczi
Contributor

jimczi commented Nov 14, 2017

I don't think that the issue is related to synchronization but rather to a misuse of this cache by DocValueFieldsFetchSubPhase. In 5.x we call getForField for each field on every hit, so if you set the size to 33,000 then we'll call getForField that many times. This is not optimized at all for this kind of request (lots of hits to retrieve). In 6.x we changed the code to access this cache only once per segment (and only if needed): #25644.
getForField should not be called very often, only a few times per request to access the values, but the doc_values sub phase called it way too often and the PR above fixed this. Maybe we could backport this enhancement (#25644) to 5.x instead?

@s1monw
Contributor

s1monw commented Nov 14, 2017

@jimczi very good point ++

@jasontedor
Member

Good catch @s1monw.

@tinder-xli The problem is again the clear methods. These can remove a key, yet because the first get in this change does not establish a happens-before relationship with such a removal, this get can still see a key that has already been removed from the map.
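
To illustrate the race being described, a sketch only (class and method names are illustrative; IndexFieldDataCache stands in for the real cache type):

import java.util.HashMap;
import java.util.Map;

class FieldDataServiceSketch {
    private final Map<String, IndexFieldDataCache> fieldDataCaches = new HashMap<>();

    // the clear path removes entries while holding the lock ...
    synchronized void clearField(String fieldName) {
        fieldDataCaches.remove(fieldName);
    }

    // ... but the proposed fast path reads without the lock, so there is no
    // happens-before edge with the removal above: this get() may still observe
    // the removed entry, or an inconsistent intermediate state of the HashMap
    IndexFieldDataCache getCacheFastPath(String fieldName) {
        return fieldDataCaches.get(fieldName);
    }
}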

jasontedor added a commit that referenced this pull request Nov 14, 2017
jasontedor added a commit that referenced this pull request Nov 14, 2017
jasontedor added a commit that referenced this pull request Nov 14, 2017
@tinder-xli
Contributor Author

@jasontedor @s1monw OK, I think I got what you are saying, so it looks like using a ConcurrentHashMap would resolve this issue, right? Since it synchronizes the entry before removing it, get() won't see a key which has already been removed from the map.
I can create another PR to change it to use a ConcurrentHashMap, and then we can backport that to 5.6.x?

@jasontedor
Member

@tinder-xli I think I would rather see the impact of backporting #25644 first.

@tinder-xli
Contributor Author

@jasontedor I don't see a "backport" label on that PR. Is that change going to be in 5.6.x too?

@jasontedor
Member

@tinder-xli It does not have the label because we have not decided to backport it. If we do backport it, it would be in 5.6.x.

@tinder-xli
Contributor Author

Also, I see that getForField still happens for each hit after that change (under a for (SearchHit hit : hits) loop), so how does the change improve the perf? According to @jimczi it changes fetching to once per segment, not per hit.

@jimczi
Contributor

jimczi commented Nov 14, 2017

Also, I see that getForField still happens for each hit after that change (under a for (SearchHit hit : hits) loop), so how does the change improve the perf?

It is called only if the document is on a new segment (different from the previous document), so at most it will be called N times, N being the number of segments in the shard. This is why we ensure that the documents are sorted by doc id just before the loop. In fact we could call getForField only once in your example; that's the pattern expected for this cache, since it returns an instance that can access doc values for the field on a per-segment basis.
@colings86 do you think it's safe to backport #25644 to 5.6.x ?

@jasontedor
Member

@tinder-xli You're missing this guard:

                    // if the reader index has changed we need to get a new doc values reader instance
                    if (subReaderContext == null || hit.docId() >= subReaderContext.docBase + subReaderContext.reader().maxDoc()) {
                        int readerIndex = ReaderUtil.subIndex(hit.docId(), context.searcher().getIndexReader().leaves());
                        subReaderContext = context.searcher().getIndexReader().leaves().get(readerIndex);
                        data = context.getForField(fieldType).load(subReaderContext);
                        values = data.getScriptValues();
                    }

This helps give the semantics that @jimczi is describing.

@jimczi
Contributor

jimczi commented Nov 17, 2017

I had a chat with @colings86 regarding the backport of #25644 to 5.6.
We made this change after the upgrade to Lucene 7 (for ES 6), where doc values moved to become pure iterators. This is not an optimization for pre-Lucene 7 doc values, which have full random access support. This is an important point for your use case because retrieving 30,000 doc values could have a different cost in 6.0 no matter how many times we call getForField.
Also, I don't understand how calling getForField and slow GCs are linked. It seems to me that your issue is more related to the fact that you're retrieving too many documents in a single request.
For these reasons we don't think it is worth backporting to 5.6.x, which should only contain bug fixes at this point. 6.0 is out (https://www.elastic.co/blog/elasticsearch-6-0-0-released), so you can upgrade and test the performance with the new version.

@jimczi added the v7.0.0-beta1 label and removed the v7.0.0 label on Feb 7, 2019
Labels
>non-issue, :Search/Search, v5.6.5, v6.0.1, v6.1.0, v7.0.0-beta1