Reduce synchronization on field data cache #27365

Merged on Nov 14, 2017 (4 commits)

Conversation

tinder-xli
Contributor

During our perf test against Elasticsearch, we noticed two synchronized blocks in the call stack for fetching doc values, which I think are unnecessary and cause a serious GC issue for us, especially when the "size" param is large and docvalue fields are fetched.

There is a synchronized block for getting from the fieldDataCaches map, which is unnecessary when cache != null. The getForField method is called for every hit, so blocking here impacts performance a lot. We suggest changing to double-checked locking and only synchronizing when cache == null.
We see threads waiting on the getForField method in our JFR recording.
[Screenshot: JFR recording showing threads waiting on getForField]
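
For context, a minimal sketch of the double-checked locking pattern being proposed (the field layout and the createCache factory are illustrative, not the exact Elasticsearch source; the comments below debate whether this is safe on a plain HashMap):

// sketch of the proposed change; names are illustrative
private final Map<String, IndexFieldDataCache> fieldDataCaches = new HashMap<>();

IndexFieldDataCache getCache(String fieldName) {
    // fast path: unsynchronized read when the cache already exists
    IndexFieldDataCache cache = fieldDataCaches.get(fieldName);
    if (cache == null) {
        synchronized (this) {
            // re-check under the lock: another thread may have created it meanwhile
            cache = fieldDataCaches.get(fieldName);
            if (cache == null) {
                cache = createCache(fieldName); // hypothetical factory
                fieldDataCaches.put(fieldName, cache);
            }
        }
    }
    return cache;
}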

All Gradle tests passed locally.

For reference, below is a sample query we used for testing:

{
  "stored_fields": "_none_",
  "docvalue_fields": ["user_number"],
  "size": 33000,
  "query": {
    "bool": {
      "must": [
        { "match_all": {} }
      ]
    }
  }
}

This PR links to #27350
@jasontedor @s1monw

@elasticmachine
Collaborator

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

@tinder-fren

+@jasontedor for the double-checked locking solution.

@jasontedor
Member

test this please

@jasontedor
Member

jasontedor left a comment

LGTM.

@jasontedor added the :Search/Search, v5.6.5, v6.0.1, v6.1.0, v7.0.0, and >enhancement labels on Nov 14, 2017
@jasontedor changed the title from "use double checked locking to avoid synchronization when cache is already created" to "Reduce synchronization on field data cache" on Nov 14, 2017
@jasontedor merged commit 2e86357 into elastic:master on Nov 14, 2017
jasontedor pushed a commit that referenced this pull request Nov 14, 2017
The field data cache can come under heavy contention in cases when lots
of search threads are hitting it for doc values. This commit reduces the
amount of contention here by using a double-checked locking strategy to
only lock when the cache needs to be initialized.

Relates #27365
jasontedor pushed a commit that referenced this pull request Nov 14, 2017
The field data cache can come under heavy contention in cases when lots
of search threads are hitting it for doc values. This commit reduces the
amount of contention here by using a double-checked locking strategy to
only lock when the cache needs to be initialized.

Relates #27365
@s1monw
Contributor

s1monw commented Nov 14, 2017

I am not convinced this is a good fix. From my perspective, double-checked locking doesn't apply in this case since it assumes all objects inside the map are safely published, which is by no means guaranteed. You can apply this pattern if you read a volatile variable to prevent synchronization, but not to access a HashMap; the JavaDocs explicitly state:

If multiple threads access a hash map concurrently, and at least one of
the threads modifies the map structurally, it must be
synchronized externally.  (A structural modification is any operation
that adds or deletes one or more mappings; merely changing the value
associated with a key that an instance already contains is not a
structural modification.)  This is typically accomplished by
synchronizing on some object that naturally encapsulates the map.

Either we use a ConcurrentHashMap here or we synchronize the map. As it stands, this change is broken depending on the implementation of the HashMap.
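
A rough sketch of the ConcurrentHashMap alternative (createCache is a hypothetical factory standing in for the real cache construction, not the actual Elasticsearch code):

private final ConcurrentMap<String, IndexFieldDataCache> fieldDataCaches = new ConcurrentHashMap<>();

IndexFieldDataCache getCache(String fieldName) {
    // computeIfAbsent is atomic per key and safely publishes the created value,
    // so the common read path needs no explicit synchronized block
    return fieldDataCaches.computeIfAbsent(fieldName, this::createCache);
}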

@tinder-xli
Contributor Author

@s1monw this fix is synchronizing on some object that naturally encapsulates the map during addition, which is the synchronized (this) line.
Can you help me understand your exact concern by giving some examples which demonstrate the flaw of my fix?

@danielmitterdorfer
Member

I completely agree with @s1monw here. This line

IndexFieldDataCache cache = fieldDataCaches.get(fieldName);

does not establish a happens-before edge with any other access to fieldDataCaches. In simpler terms: calling fieldDataCaches.get() outside of a synchronized block risks reading a stale value for that key.

@s1monw
Contributor

s1monw commented Nov 14, 2017

@s1monw this fix is synchronizing on some object that naturally encapsulates the map during addition, which is the synchronized (this) line.
Can you help me understand your exact concern by giving some examples which demonstrate the flaw of my fix?

While your fix might work with some map implementations, you have no guarantee that it does on all JVMs or spec implementations; it might work on all of them, but then only by accident. You are reading from a data structure that is not thread-safe while it is concurrently modified. You are basically applying the single-writer/multiple-reader principle, but this data structure is only designed for a single writer and a single reader. For instance, if the IndexFieldDataCache instance is not safely published, i.e. has some initialization that happens outside of the constructor, you are reading a stale value. This might all accidentally work, but there are too many ifs in this equation for me to accept this fix as it is. I am OK with using a ConcurrentHashMap for the map instead; then your fix is just fine. @danielmitterdorfer WDYT?

@jimczi
Contributor

jimczi commented Nov 14, 2017

I don't think that the issue is related to synchronization but rather to a misuse of this cache by DocValueFieldsFetchSubPhase. In 5.x we call getForField for each field on every hit, so if you set the size to 33,000 then we'll call getForField that many times. This is not optimized at all for this kind of request (lots of hits to retrieve). In 6.x we changed the code to access this cache only once per segment (and only if needed): #25644.
getForField should not be called very often, only a few times per request to access the values, but the doc_values sub phase called it way too often and the PR above fixed this. Maybe we could backport this enhancement (#25644) to 5.x instead?

@s1monw
Contributor

s1monw commented Nov 14, 2017

@jimczi very good point ++

@jasontedor
Member

Good catch @s1monw.

@tinder-xli The problem is again the clear methods. These can remove a key, yet because the first get in this change does not establish a happens-before relationship with such a removal, this get can still see a key that has already been removed from the map.
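
To illustrate the race being described, a sketch only (class and method names are illustrative; IndexFieldDataCache stands in for the real cache type):

import java.util.HashMap;
import java.util.Map;

class FieldDataServiceSketch {
    private final Map<String, IndexFieldDataCache> fieldDataCaches = new HashMap<>();

    // the clear path removes entries while holding the lock ...
    synchronized void clearField(String fieldName) {
        fieldDataCaches.remove(fieldName);
    }

    // ... but the proposed fast path reads without the lock, so there is no
    // happens-before edge with the removal above: this get() may still observe
    // the removed entry, or an inconsistent intermediate state of the HashMap
    IndexFieldDataCache getCacheFastPath(String fieldName) {
        return fieldDataCaches.get(fieldName);
    }
}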

jasontedor added a commit that referenced this pull request Nov 14, 2017
jasontedor added a commit that referenced this pull request Nov 14, 2017
jasontedor added a commit that referenced this pull request Nov 14, 2017
@tinder-xli
Contributor Author

@jasontedor @s1monw OK, I think I got what you are saying, so it looks like using a ConcurrentHashMap would resolve this issue, right? Since it synchronizes the entry before removing it, get() won't see a key which has already been removed from the map.
I can create another PR to change it to use a ConcurrentHashMap, and then we can backport that to 5.6.x?

@jasontedor
Member

@tinder-xli I think I would rather see the impact of backporting #25644 first.

@tinder-xli
Contributor Author

@jasontedor I don't see a "backport" label on that PR. Is that change going to be in 5.6.x too?

@jasontedor
Member

@tinder-xli It does not have the label because we have not decided to backport it. If we do backport it, it would be in 5.6.x.

@tinder-xli
Contributor Author

Also, I see that getForField still happens for each hit after that change (under a for (SearchHit hit : hits) loop), so how does the change improve the perf? According to @jimczi it changes fetching to once per segment, not per hit.

@jimczi
Contributor

jimczi commented Nov 14, 2017

Also, I see that getForField still happens for each hit after that change (under a for (SearchHit hit : hits) loop), so how does the change improve the perf?

It is called only if the document is on a new segment (different from the previous document), so at most it will be called N times, N being the number of segments in the shard. This is why we ensure that the documents are sorted by doc id just before the loop. In fact we could call getForField only once in your example; that's the pattern expected for this cache, since it returns an instance that can access doc values for the field on a per-segment basis.
@colings86 do you think it's safe to backport #25644 to 5.6.x ?

@jasontedor
Member

@tinder-xli You're missing this guard:

                    // if the reader index has changed we need to get a new doc values reader instance
                    if (subReaderContext == null || hit.docId() >= subReaderContext.docBase + subReaderContext.reader().maxDoc()) {
                        int readerIndex = ReaderUtil.subIndex(hit.docId(), context.searcher().getIndexReader().leaves());
                        subReaderContext = context.searcher().getIndexReader().leaves().get(readerIndex);
                        data = context.getForField(fieldType).load(subReaderContext);
                        values = data.getScriptValues();
                    }

This helps give the semantics that @jimczi is describing.

@jimczi
Contributor

jimczi commented Nov 17, 2017

I had a chat with @colings86 regarding the backport of #25644 to 5.6.
We made this change after the upgrade to Lucene 7 (for ES 6), where doc values moved to become pure iterators. This is not an optimization for pre-Lucene 7 doc values, which have full random access support. This is an important point for your use case because retrieving 30,000 doc values could have a different cost in 6.0 no matter how many times we call getForField.
Also, I don't understand how calling getForField and slow GCs are linked. It seems to me that your issue is more related to the fact that you're retrieving too many documents in a single request.
For these reasons we don't think it is worth backporting to 5.6.x, which should only contain bug fixes at this point. 6.0 is out (https://www.elastic.co/blog/elasticsearch-6-0-0-released), so you can upgrade and test the performance with the new version.

@jimczi added the v7.0.0-beta1 label and removed the v7.0.0 label on Feb 7, 2019
Labels
>non-issue, :Search/Search, v5.6.5, v6.0.1, v6.1.0, v7.0.0-beta1