Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cold start increased significantly by enabling powertools-logging #121

Closed
drissamri opened this issue Oct 4, 2020 · 15 comments
Closed

Cold start increased significantly by enabling powertools-logging #121

drissamri opened this issue Oct 4, 2020 · 15 comments
Assignees
Labels
question Further information is requested

Comments

@drissamri
Copy link
Contributor

drissamri commented Oct 4, 2020

What were you trying to accomplish?
While testing in my sample (cold start optimized) application I can measure up to 20%+ increase by enable powertools-logging instead of lambda-logging (SLF4J + Logback)

The tests are executed by leveraging lumigo-cli measure-lambda-cold-starts.

Expected Behavior

Cold starts timings shouldn't become much worse by switching out another logging framework for powertools-logging

Current Behavior

lambda-logging (baseline)

 {
      "functionName": "client-service-dev-create-client",
      "memorySize": 3008,
      "coldStarts": 40,
      "min": 1142.29,
      "p25": 1192.22,
      "median": 1213.9,
      "p75": 1223.73,
      "p95": 1261.69,
      "max": 1276.78,
      "stddev": 29.0004
}

powertools-logging

{
      "functionName": "client-service-dev-create-client",
      "memorySize": 3008,
      "coldStarts": 71,
      "min": 1787.6,
      "p25": 1952.06,
      "median": 1981.9,
      "p75": 2004.21,
      "p95": 2063.38,
      "max": 2149.03,
      "stddev": 53.575
}

Possible Solution

/

Steps to Reproduce (for bugs)

  1. git clone https://github.com/drissamri/powertools-coldstart-issue
  2. mvn clean package
  3. sls deploy --region eu-central-1
  4. npm install -g lumigo-cli
  5. lumigo-cli measure-lambda-cold-starts -p dev -n client-service-dev-create-client -r eu-central-1 -f src/test/resources/create-client-event.json

This gives you the baseline results, now we enable powertools:

  1. Switch to powertools git branch
  2. mvn clean package
  3. sls deploy --region eu-central-1
  4. lumigo-cli measure-lambda-cold-starts -p dev -n client-service-dev-create-client -r eu-central-1 -f

Environment

  • Powertools version used: 0.4.0-beta
  • Packaging format (Layers, Maven/Gradle): maven (zip)
  • AWS Lambda function runtime: java11
@drissamri drissamri added the bug Something isn't working label Oct 4, 2020
@pankajagrawal16
Copy link
Contributor

pankajagrawal16 commented Oct 5, 2020

Hey @drissamri Thanks for opening the issue. I quickly looked at different branches, and its interesting to see this behaviour.

While as far I can see difference between branches, you have not actually used any Powertools logging feature, but rather just changed to use log4j2 instead which you offcourse get as a dependency via logging module from powertools. We actually did a performance test before releasing the module and found log4j2 performing much faster compared to logback/slf4j instead. We in fact specifically tested via lambda-logging itself.

Even if you look at this article on performance benchmark for different logging library we see the same results.

Nevertheless, since you have concrete examples to show the difference, I will be diving a bit deep later this week to check what might be leading to the observation you have in the example above.

@pankajagrawal16 pankajagrawal16 self-assigned this Oct 5, 2020
@drissamri drissamri changed the title Cold start increased by 60%+ by enabling powertools-logging Cold start increased significantly by enabling powertools-logging Oct 5, 2020
@drissamri
Copy link
Contributor Author

After I posted this ticket I tested a few more things and saw a mistake in the implementation too (not too logging itself), if I didn't add the @powertools annotation I must've forgot it on that branch after my tests again.

Going to circle back on the tests results later tonight. There was still a big difference after my last run, but will update the branches so you have a correct view.

@drissamri
Copy link
Contributor Author

There was a bug in the lumigo-cli that made the invocation not complete succesfully, so the invocation timings weren't always correct. Thanks to @theburningmonk that's fixed now.

@pankajagrawal16 I created two branches baseline (lambda-logging) and powertoolslogging and updated the full statistics for both in the README.md.

baseline: p95 of 200 cold start iterations: 1400-1435ms
powertoolslogging: p95 of 200 cold start iterations: 1990ms

In the powertoolslogging you do get to see these messages, so not sure if they cause for an extra delay:

2020-10-06T00:18:26.585+02:00 | SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
-- | --
  | 2020-10-06T00:18:26.585+02:00 | SLF4J: Defaulting to no-operation (NOP) logger implementation
  | 2020-10-06T00:18:26.585+02:00 | SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

If I'm still doing anything wrong or if I can help in any way, let me know!

@pankajagrawal16
Copy link
Contributor

pankajagrawal16 commented Oct 6, 2020

There was a bug in the lumigo-cli that made the invocation not complete succesfully, so the invocation timings weren't always correct.

Aha, This makes sense, coz I tried replicating yesterday and found the same issue that function was failing in logs. I spent most of my time figuring out, what i might be doing wrong :D. I just ended up changing lambda function contract to work with String and just measure it based on that in both implementation.

The findings for me were bit similar as well what you have posted above. I still have to dive a bit deep to see why this might be happening.

When it come to warning, I too noticed that and also fixed it by adding:

           <dependency>
                <groupId>org.apache.logging.log4j</groupId>
                <artifactId>log4j-slf4j-impl</artifactId>
                <version>${log4j.version}</version>
            </dependency>

I will look at your branches and dive a bit more into details some time this week.

@drissamri
Copy link
Contributor Author

drissamri commented Oct 6, 2020

After a discussion with @theburningmonk today, I realized the lumigo-cli only prints out initDuration (Initialization in X-Ray) and not (execution) Duration. This will be changed by him later on.

I don't think it matters for the statistics I posted before since the clear distinction is there in both in the Initialization alone already, but end to end should be measured to make sure one of both is not losing more time during execution.

@pankajagrawal16
Copy link
Contributor

pankajagrawal16 commented Oct 6, 2020

After a discussion with @theburningmonk today, I realized the lumigo-cli only prints out initDuration (Initialization in X-Ray) and not (execution) Duration. This will be changed by him later on.

I don't think it matters for the statistics I posted before since the clear distinction is there in both in the Initialization alone already, but end to end should be measured to make sure one of both is not losing more time during execution.

Yeah I have been looking at logic of cloudwatch log insights query that the cli executes and I could see just looks for Init time in lambda logs.

I do agree, regardless it gives some indication anyways.

@drissamri
Copy link
Contributor Author

FYI: I worked together with Yan and now the measure coldstarts CLI measures both initDuration and duration time.

@qtdzz
Copy link

qtdzz commented Jun 9, 2021

Hi @drissamri and @pankajagrawal16, may I ask if there is any update on this issue? Is it still a problem in v1.5.0? The reported version seems to be in a beta version a while ago. Many thanks!

@pankajagrawal16
Copy link
Contributor

hi @qtdzz There was some optimizations done before going GA to the logging module, not specifically targeting this specific issue.

@drissamri
Copy link
Contributor Author

I went ahead and tried it again with the latest version of the powertools (1.7.1) . I had to add jackson-core and jackson-annotations which wasn't the case before? Was it considered to use jackson-jr instead of jackson? (as recommended by docs)

Sadly, the results are worse than before :(
baseline

    {
      "functionName": "client-service-dev-create-client",
      "memorySize": 3008,
      "coldStarts": 78,
      "min": 1482.33,
      "p25": 1579.68,
      "median": 1601.91,
      "p75": 1636.67,
      "p95": 1688.14,
      "max": 1730.54,
      "stddev": 42.1809
    }

powertools:

   {
      "functionName": "client-service-dev-create-client",
      "memorySize": 3008,
      "coldStarts": 85,
      "min": 2329.65,
      "p25": 2399.04,
      "median": 2430.01,
      "p75": 2542.72,
      "p95": 2780.87,
      "max": 3133.66,
      "stddev": 140.8058
    }

@pankajagrawal16
Copy link
Contributor

pankajagrawal16 commented Jul 26, 2021

Hi @drissamri Thanks for following up on this again.

I have taken some time again today to understand what's going on between different implementations.

Starting with your question on jackson-core and annotation missing. This has not changed. Reason why this is failing in your sample project is because jackson-jr-objects is bringing inn older version of jackson core with it and that confilcts with latest version of jackson core and log4j2 which needs JacksonFeature and has been added in 2.12.x. So that is why in order to remove this dependency conflict, you might have needed to add those explicitly.

Now for performance with slf4j and logback, we need to compare similar feature set. Which means powertools-loggging by default provides support for structured logging whereas comparison made above is not doing any structured logging.

Therefore, I enabled json logging with logback with lambda-logging and results showed increased latency with cold start as well.

{
  "note": "The values include both DURATION as well as INIT DURATION",
  "result": [
    {
      "functionName": "powertools-baseline-dev-create-client",
      "memorySize": 3008,
      "coldStarts": 78,
      "min": 1752.49,
      "p25": 1798.27,
      "median": 1835.61,
      "p75": 1871.94,
      "p95": 1950.44,
      "max": 2083.6,
      "stddev": 55.9915
    }
  ]
}

{
  "note": "The values include both DURATION and INIT DURATION",
  "result": [
    {
      "functionName": "powertools-baseline-dev-create-client",
      "memorySize": 3008,
      "coldStarts": 72,
      "min": 1722.84,
      "p25": 1798.01,
      "median": 1831.7,
      "p75": 1865.05,
      "p95": 1941.07,
      "max": 2031.49,
      "stddev": 54.8105
    }
  ]
}

Other aspect is that powertools-logging provides several best practices on top of logging and that will definitely contribute to bit more processing and init time and thus contributing a bit more latency during init. Using jackson-jr wont be helpful since log4j2 internally using jackson anyways to perform structured logging.

I performed another test on latest powertools and results: (Also note difference in number of cold starts for metrics between powertools and lambda-logging)

{
  "note": "The values include both DURATION as well as INIT DURATION",
  "result": [
    {
      "functionName": "powertools-issue-dev-create-client",
      "memorySize": 3008,
      "coldStarts": 92,
      "min": 2122.98,
      "p25": 2379.61,
      "median": 2411.51,
      "p75": 2450.12,
      "p95": 2515,
      "max": 2659.88,
      "stddev": 64.3841
    }
  ]
}

{
  "functionName": "powertools-issue-dev-create-client",
  "note": "The values include both DURATION as well as INIT DURATION",
  "result": [
    {
      "functionName": "powertools-issue-dev-create-client",
      "memorySize": 5120,
      "coldStarts": 91,
      "min": 2178.07,
      "p25": 2264.82,
      "median": 2298.19,
      "p75": 2396.69,
      "p95": 2487.52,
      "max": 3056.99,
      "stddev": 111.0053
    }
  ]
}

That being said, I will spend some more time in this week to find the possibility of optimising jackson object mapper initialisation for structured logs

@pankajagrawal16
Copy link
Contributor

Done minor tweaks to logging utility as well #484. Based on comparison and also difference in number of cold starts in metrics, I feel comfortable not doing anything more.

I have had a look at default log4j2 config and it is optimal for the lambda workload already. Users always have ability to configure log4j2 the way they need like using asynchronous logging, etc.

cc @drissamri I will be interested in hearing your thoughts on observations above.

@pankajagrawal16 pankajagrawal16 added the question Further information is requested label Aug 2, 2021
@pankajagrawal16
Copy link
Contributor

Some improvements shipped in https://github.com/awslabs/aws-lambda-powertools-java/releases/tag/v1.7.2 as well. I will wait for @drissamri thoughts before closing it.

@pankajagrawal16 pankajagrawal16 removed the bug Something isn't working label Aug 28, 2021
@pankajagrawal16
Copy link
Contributor

@drissamri Let me know if you have any thoughts on above findings ? Or else i will close this issue for now.

@pankajagrawal16
Copy link
Contributor

Closing this issue for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

4 participants