
Script queries using joda time getMillis() slower than using java time toInstant().toEpochMilli() #35754

Closed
conet opened this issue Nov 20, 2018 · 9 comments

conet commented Nov 20, 2018

Elasticsearch version (bin/elasticsearch --version):
Version: 6.5.0, Build: default/tar/816e6f6/2018-11-09T18:58:36.352602Z, JVM: 1.8.0_191

Plugins installed: []
No extra plugins

JVM version (java -version):
java -version
java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)

Description of the problem including expected versus actual behavior:
I had been using script queries that call getMillis() without issues until I upgraded to 6.5.0, after which I started noticing serious query time deterioration (tens of seconds, up from a few seconds). By reading the changelog I eventually realized that joda time was being deprecated; after switching the getMillis() call to toInstant().toEpochMilli() I got back the original query times (the ones from before the upgrade). See a more detailed description in this PR. Since I've been asked to open an issue with the steps to reproduce, here they are.
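For reference, the two script expressions compute the same epoch-millisecond value; here is a minimal plain-Java sketch of the equivalence (illustrative only, assuming joda-time is on the classpath for the deprecated variant):

import java.time.ZonedDateTime;
import org.joda.time.DateTime;

public class EpochMillisDemo {
    public static void main(String[] args) {
        ZonedDateTime now = ZonedDateTime.now();

        // java.time path, matching the non-deprecated script expression:
        // doc["timestamp"].value.toInstant().toEpochMilli()
        long javaTimeMillis = now.toInstant().toEpochMilli();

        // joda-time path, matching the deprecated script expression:
        // doc["timestamp"].value.getMillis()
        long jodaMillis = new DateTime(javaTimeMillis).getMillis();

        // Both yield the same epoch milliseconds; only the API differs.
        System.out.println(javaTimeMillis + " == " + jodaMillis);
    }
}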

Steps to reproduce:

  1. Download and start Elasticsearch 6.5.0
  2. Download and start Kibana 6.5.0
  3. Load the Sample web logs sample data set
  4. Increase the doc count by copying the sample index (I copied it 40 times, incrementing the destination index suffix each time):
curl -XPOST "http://localhost:9200/_reindex" -H 'Content-Type: application/json' -d'
{
  "source": {
    "index": "kibana_sample_data_logs"
  },
  "dest": {
    "index": "kibana_sample_data_logs_1"
  }
}'
  5. Run the fast query:
curl -XPOST "http://localhost:9200/kibana_sample_data_logs*/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "script": {
            "script": {
              "source": "doc[\"timestamp\"].value.toInstant().toEpochMilli() < params.millis",
              "lang": "painless",
              "params": {
                "millis": 1542747500000
              }
            }
          }
        }
      ]
    }
  }
}'
---
{
  "took" : 38,
  "timed_out" : false,
  "_shards" : {
    "total" : 201,
    "successful" : 201,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 94259,
    "max_score" : 0.0,
    "hits" : [ ]
  }
}
  6. Run the slow query:
curl -XPOST "http://localhost:9200/kibana_sample_data_logs*/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "script": {
            "script": {
              "source": "doc[\"timestamp\"].value.getMillis() < params.millis",
              "lang": "painless",
              "params": {
                "millis": 1542747500000
              }
            }
          }
        }
      ]
    }
  }
}'
---
#! Deprecation: Use of the joda time method [getMillis()] is deprecated. Use [toInstant().toEpochMilli()] instead.
{
  "took" : 1088,
  "timed_out" : false,
  "_shards" : {
    "total" : 201,
    "successful" : 201,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 94259,
    "max_score" : 0.0,
    "hits" : [ ]
  }
}

We're talking about roughly 95K documents and the deprecated query is already about 30 times slower. One could argue that the overhead comes from the deprecation logging itself, but in my experience the problem becomes more noticeable with bigger indices: the slow query appears to slow down linearly with the number of documents it checks.

@rjernst rjernst self-assigned this Nov 20, 2018
@rjernst rjernst added the :Core/Infra/Scripting Scripting abstractions, Painless, and Mustache label Nov 20, 2018

conet commented Nov 20, 2018

@rjernst I updated the description since I figured out a way to increase the number of sample documents hit by the query.


rjernst commented Dec 22, 2018

Thanks for the steps @conet. I am now able to reproduce this and have done a detailed investigation.

Your example has about 200 shards, due to the default of 5 shards per index and the 40 indexes you create. When I first tried your reproduction steps I only made 20 copies, and noticed the query took about half the time you saw (around 400-500ms). When I went up to 40 copies, I saw roughly the same times you did. This led me to hypothesize that the issue scales linearly with the number of shards. To confirm this theory, I reindexed all 40 indexes into a single index and ran the slow query. The response was immediate: no slowness. So the overhead of a single deprecation call is not an issue; it only becomes one when many such calls are made within the same request, across different shards on the same node.

I then ran a profiler while running the slow query a few times. The interesting part of the results were as follows:

--- 14030000000 ns (37.35%), 1403 samples
  [ 0] org.apache.lucene.util.CloseableThreadLocal.set
  [ 1] org.elasticsearch.common.util.concurrent.ThreadContext$ContextThreadLocal.set
  [ 2] org.elasticsearch.common.util.concurrent.ThreadContext.addResponseHeader
  [ 3] org.elasticsearch.common.logging.DeprecationLogger.deprecated
  [ 4] org.elasticsearch.common.logging.DeprecationLogger.deprecatedAndMaybeLog
  [ 5] org.elasticsearch.script.JodaCompatibleZonedDateTime.lambda$logDeprecated$0
  [ 6] org.elasticsearch.script.JodaCompatibleZonedDateTime$$Lambda$3213.1313296263.run
  [ 7] java.security.AccessController.doPrivileged
  [ 8] org.elasticsearch.script.JodaCompatibleZonedDateTime.logDeprecated
  [ 9] org.elasticsearch.script.JodaCompatibleZonedDateTime.logDeprecatedMethod
  [10] org.elasticsearch.script.JodaCompatibleZonedDateTime.getMillis

So the time is spent in the thread local we set to store warning response headers. This actually has nothing to do with scripting per se, but rather with the fact that scripts are run per shard, so the deprecation call happens many times across different ThreadContexts (because each shard's search is run independently).

As far as how to avoid this immediately, there are two options:

  • reduce the number of shards you search against
  • convert your scripts to use the non-deprecated method (which it sounds like you may have already done)

A long-term fix may be more complicated and needs some agreement from other developers here. I'm marking this issue for discussion after the holidays. There are a few options to consider:

  1. Don't do anything: this is only for deprecations, and there is a workaround (and in general there always should be, i.e. whatever replaces the deprecated usage). This option isn't great from a user perspective, as upgrading should generally not cause a slowdown just because you were using something that was previously OK to use.
  2. Create tracking across threads for deprecations: this would most likely be incredibly complex, and I'm unsure how it could work effectively in practice, as the ThreadContext's intention is to avoid cross-thread barriers.
  3. Use the same logging guard for warning headers: we currently guard against logging the same deprecation warning multiple times by using a unique key per deprecation site and keeping track of the last 128 unique deprecations on the node. We made the decision at that time to continue writing all warning headers all the time. If deprecations can be truncated in the log, why not also in warning headers?

My vote is for 3, as I think this is the simplest to implement.
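For illustration only (this is not the actual Elasticsearch implementation): a minimal sketch of the kind of bounded deduplication guard option 3 describes, remembering the last 128 unique deprecation keys on a node and skipping repeat emissions:

import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical guard: tracks the last 128 unique deprecation keys seen on this
// node and reports whether a given key should still emit a warning header.
class DeprecationGuard {
    private static final int MAX_KEYS = 128;

    // Access-ordered LinkedHashMap used as a tiny LRU, wrapped for thread
    // safety since shard searches on the same node run concurrently.
    private final Set<String> seenKeys = Collections.newSetFromMap(
        Collections.synchronizedMap(new LinkedHashMap<String, Boolean>(MAX_KEYS, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > MAX_KEYS;
            }
        }));

    // Returns true only the first time a key is seen (until it ages out).
    boolean shouldEmit(String deprecationKey) {
        return seenKeys.add(deprecationKey);
    }
}

In the real DeprecationLogger the key would come from the deprecation call site; the sketch only shows the bounded "last 128 unique keys" idea.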


conet commented Dec 22, 2018

Thanks for the thorough investigation; I agree that 3 sounds like the best solution.

pcsanwald (Contributor) commented:

@rjernst we discussed this in FixIt Thursday, and there was general agreement that 3 seems like the right approach. Obviously up to you whether you want to leave the discuss label on or remove it.

colings86 (Contributor) commented:

I'm not sure suppressing warnings is a good idea here. When we suppress deprecation warnings in the logs I think it's OK, because it's the same consumer (the log file) that would get all the warnings. If we suppress the warning headers in the same way, then only one in every 128 callers will get any given deprecation warning, meaning some clients may never receive the warnings. This seems like it might turn deprecation warnings into a lottery, making it hard to know that you are not hitting any deprecations when parsing a response, because you may have hit a suppressed deprecation site.

If we are going to suppress deprecation warnings, I think we should only do it within the same request, so every request that hits a deprecation site is guaranteed to emit at least one of those deprecation warning headers. I fear, though, that this will probably not be enough to solve the original performance issue.
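A hedged sketch (again, not Elasticsearch code) of the per-request variant suggested here: a set of deprecation keys scoped to a single request, so the first hit of each key still emits a header while later hits within the same request become cheap no-ops:

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical per-request scope: one instance per search request, shared by
// all shard-level searches belonging to that request.
class RequestDeprecationScope {
    private final Set<String> emittedKeys = ConcurrentHashMap.newKeySet();

    // Emit the warning only the first time this key is hit within the request.
    void warnOnce(String deprecationKey, String warning, Consumer<String> headerSink) {
        if (emittedKeys.add(deprecationKey)) {
            headerSink.accept(warning);
        }
    }
}

Every request would still see each distinct warning at least once, which addresses the lottery concern above, though as noted this may not remove all of the per-shard overhead.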


rjernst commented Jan 8, 2019

@colings86 IMO the deprecation log is the "official" source of truth for deprecations that occurred. Deprecations may not occur on a request: for example, they could happen when reading a setting on startup. In your example, the deprecation log would already have an entry. I really wonder if deprecation warnings should be removed altogether from responses.

@pcsanwald Given @colings86's concerns, I think this should be discussed again.


colings86 commented Jan 9, 2019

I really wonder if deprecation warnings should be removed altogether from responses.

I think the warnings in responses are useful because the caller is often not the administrator and so often will not have access to the deprecation logs. Having an indication in the response means the caller can be made aware they are using a deprecated feature and have the chance to rectify things without the admin needing to identify and alert each user themselves (though they may still need to do this if some users ignore the warnings).

I agree that it would be good to discuss this again, though.

jasontedor (Member) commented:

We discussed this in our team discussion meeting today. Because responses are ephemeral in nature, it seems that the warning headers should stay on each response. What we do in the log messages is okay because the log files are generally durable. Instead of addressing this by circumventing the issue, we want to go after the performance problems. We investigated some of them during our meeting and found room for improvement in a few areas. I opened #37590 and #37597 for these. What remains is half explained by the de-duplication (#37530). Let's attack that now.

ebadyano added a commit that referenced this issue Feb 22, 2019
1. Setting the length for the formatWarning String to avoid AbstractStringBuilder.ensureCapacityInternal calls
2. Adding an extra check for parameter array length == 0 to avoid unnecessarily creating a StringBuilder in LoggerMessageFormat.format

Helps to narrow the performance gap in throughput for the geonames benchmark (#37411) by 3%. For more details: #37530 (comment)

Relates to #37530
Relates to #37411
Relates to #35754
ebadyano added a commit to ebadyano/elasticsearch that referenced this issue Feb 25, 2019 (same change as above)
ebadyano added a commit that referenced this issue Feb 25, 2019 (same change as above)
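For readers unfamiliar with the pattern described in the commit message, here is a generic, hedged sketch (not the actual LoggerMessageFormat code) of both micro-optimizations: returning early when there are no arguments so no StringBuilder is created, and pre-sizing the StringBuilder so its backing array does not need to grow while appending:

// Hypothetical helper illustrating the two optimizations mentioned above.
final class MessageFormatSketch {

    // Early return: nothing to substitute, so skip the StringBuilder entirely.
    static String format(String pattern, Object... args) {
        if (args == null || args.length == 0) {
            return pattern;
        }
        // Pre-size the builder with a rough estimate of the final length so
        // the backing array does not need to be grown during appends.
        StringBuilder sb = new StringBuilder(pattern.length() + 16 * args.length);
        int start = 0;
        for (Object arg : args) {
            int idx = pattern.indexOf("{}", start);
            if (idx < 0) {
                break;
            }
            sb.append(pattern, start, idx).append(arg);
            start = idx + 2;
        }
        sb.append(pattern, start, pattern.length());
        return sb.toString();
    }

    private MessageFormatSketch() {}
}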
stu-elastic (Contributor) commented:

With the changes mentioned by @jasontedor, and as @ebadyano notes in a comment, the performance hit has been reduced to about 4%.

Since Joda removal is planned for version 8 and the performance improvements landed in 7.x, we recommend moving to the latest 7.x release to avoid this performance trap.
