
Missing spans from lambda #201

Closed
Ancient-Dragon opened this issue Jan 10, 2022 · 8 comments

Comments

@Ancient-Dragon

Hi there,

We're facing an issue when using the Lambda layer for Node.js (this can be replicated across multiple versions). What we're seeing is that for roughly every 1000 spans created, we're losing about 20-50 of them. The missing spans always seem to be grouped by Lambda invocation and therefore all share the same trace ID. Our team has noticed that when these spans are created they have isRecording set to false instead of true (which we assume is why they're not being sent to the collector). The other thing is that traceFlags is set to 0 instead of 1 (although changing this has no impact). Has anyone seen this before / know how to resolve this issue?
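Here is roughly how we are inspecting the flags inside the handler (a minimal sketch using only the public @opentelemetry/api surface; the tracer and span names are placeholders, not our real code):

// Sketch: inspect the flags on a freshly started span.
const { trace, TraceFlags } = require('@opentelemetry/api');

const tracer = trace.getTracer('debug-tracer');
const span = tracer.startSpan('debug-span');

console.log('isRecording:', span.isRecording()); // false on the affected invocations
console.log('sampled:', (span.spanContext().traceFlags & TraceFlags.SAMPLED) !== 0); // traceFlags is 0 instead of 1

span.end();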

@NathanielRN
Contributor

Hi @Ancient-Dragon, thanks for raising this issue.

I would start by asking what your system's environment looks like before and after Lambda. For example, do you have AWS API Gateway -> AWS Lambda -> AWS S3?

Next, when you use the NodeJS layer, do you use any custom Collector configuration, or configure OpenTelemetry in any way other than the auto instrumentation method we provide with the layer? OpenTelemetry JS should be using the parentbased_alwayson sampler by default; is that the one you are using to sample your traces, as far as you know?

Finally, does your Lambda function have active tracing enabled or disabled? This decides whether the ;Sampled= part of the X-Ray trace header is 0 or 1, which in turn determines the parent span context and therefore whether the span is sampled. (You don't have to enable active tracing; OTel JS in particular allows you to disable it and extract the context your own custom way.)

I'm curious if this is related to open-telemetry/opentelemetry-python-contrib#649 (comment)? In that issue a user found their traces were not being propagated. This was because the trace header (which AWS Lambda either gets from upstream services like API Gateway or produces itself) did not specifically state whether the trace should be ;Sampled=1. With no value present, the AWS Propagator in OTel JS assumes ;Sampled=0, which means traceFlags is set to 0 (though you said changing this had no impact) and isRecording is set to false (which may have more of an impact).
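To illustrate (just a sketch, not the layer's actual wiring; it assumes the @opentelemetry/propagator-aws-xray package that the layer bundles), extracting a trace header whose ;Sampled= part is 0 yields a span context with traceFlags = 0:

// Sketch: what the AWS propagator produces for an unsampled trace header.
const { ROOT_CONTEXT, defaultTextMapGetter, trace } = require('@opentelemetry/api');
const { AWSXRayPropagator } = require('@opentelemetry/propagator-aws-xray');

const propagator = new AWSXRayPropagator();

// Example header as it might arrive when the trace was not sampled upstream.
const carrier = {
  'x-amzn-trace-id':
    'Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=0',
};

const ctx = propagator.extract(ROOT_CONTEXT, carrier, defaultTextMapGetter);
const spanContext = trace.getSpanContext(ctx);

console.log(spanContext.traceFlags); // 0, so downstream spans are not sampled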

Please let me know if this makes sense and if you know the answer to my questions above!

@Ancient-Dragon
Author

Ancient-Dragon commented Jan 14, 2022

Hi @NathanielRN. Before the lambda we have an AWS SQS queue and nothing after it (the point of this lambda is to read the messages and create a span with an event in OpenTelemetry). The full flow is DynamoDB -> Lambda -> SQS -> Lambda (this is the one failing).

I linked the issue we posted in the AWS repository, which has a zip of our failing code; it also contains a collector.yaml file which looks like:

receivers:
  otlp:
    protocols:
      http:
      grpc:

exporters:
  logging:
    loglevel: debug
  otlp:
    endpoint: "${OTLP_EXPORT_ENDPOINT}"
    timeout: 10s
    tls:
      insecure: true
    sending_queue:
      enabled: false
    retry_on_failure:
      enabled: true
      initial_interval: 10s
      max_interval: 60s
      max_elapsed_time: 120s

service:
  pipelines:
    traces:
      receivers: [ otlp ]
      exporters: [ logging, otlp ]

As far as I know we are using the parentbased_alwayson sampler. We haven't done anything specific to change it ourselves, and we haven't done any configuration to the Lambda layer ourselves (it is the one provided by AWS). The only config is what we have above, so I would assume that active tracing is enabled, especially given that roughly 950 out of 1000 spans have it; it's just the odd Lambda invocation that doesn't.

Currently none of our traces are joined up (we're not entirely sure why, but they don't propagate through SNS/SQS, so we are in the process of implementing a workaround; we haven't got to either this function or the function that puts messages into the SQS queue before it). So I wouldn't expect it to be sending any headers that might cause something like the issue you posted.

Unfortunately I can't find a way to check what the current sampler is. Could you point me in the right direction if you still think this is the cause of the issue?

@NathanielRN
Contributor

Thanks for your reply! I reviewed the PDF you provided and it's super clear about what the problem is.

Thought Process

The only config is what we have above, so I would assume that active tracing is enabled.

It would be a good idea to check this is set up in your Lambda function. Enabling active tracing is one of the steps we call out specifically in the ADOT Documentation for the NodeJS Layer in Step 5.

[screenshot: Step 5 of the ADOT NodeJS Layer documentation, enabling active tracing on the Lambda function]

This is a very important step, because by default active tracing is turned off. With it off, the ;Sampled= flag is 0, as I mentioned above.
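One quick way to check what your function is actually getting (this assumes the standard _X_AMZN_TRACE_ID environment variable that Lambda populates at invocation time):

// Sketch: log the X-Ray trace header Lambda injects into the runtime environment.
// With active tracing ON you should see ";Sampled=1" here; otherwise ";Sampled=0".
exports.handler = async (event) => {
  console.log('_X_AMZN_TRACE_ID:', process.env._X_AMZN_TRACE_ID);
  // ... existing handler logic ...
};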

I recognize that you said

Especially given the 950/1000 have it enabled it is just the odd one that invocation of the lambda that doesn't have it enabled.

So if they are arriving at Honeycomb (as per the linked issue in aws-otel-lambda), it is reasonable to say that for those ~950 spans the exporter is concluding that, yes, the traces are sampled and should be exported.

However, I know you said the isRecording flag is set to false. In the OTel Specification for SDK Sampling they make a clear distinction between the two which I think is relevant here.

isRecording decides whether the span makes it to the processor (which has an exporter)

isSampled decides whether the span makes it to the exporter

If isRecording is false, then isSampled MUST be false, which is why we wouldn't expect it to arrive at the exporter like you said.

However, even if isRecording is true, if isSampled is false, the span won't arrive at the exporter. This means some of the missing ~50 spans could in theory have isRecording set to true. But based on your discovery we can focus on the isRecording=false spans first, because they definitely won't arrive at Honeycomb; I just wanted to point that out.
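To make that concrete, here is a sketch of how the three possible sampling decisions map onto those two flags in OTel JS (assuming the SamplingDecision enum exported by @opentelemetry/sdk-trace-base; the real mapping happens inside the SDK's Tracer when it starts a span):

// Sketch: how each sampling decision maps onto isRecording / isSampled.
const { SamplingDecision } = require('@opentelemetry/sdk-trace-base');

function flagsFor(decision) {
  switch (decision) {
    case SamplingDecision.NOT_RECORD: // e.g. AlwaysOffSampler
      return { isRecording: false, isSampled: false }; // span is dropped entirely
    case SamplingDecision.RECORD:
      return { isRecording: true, isSampled: false }; // reaches the processor, not the exporter
    case SamplingDecision.RECORD_AND_SAMPLED: // e.g. AlwaysOnSampler
      return { isRecording: true, isSampled: true }; // reaches the exporter
  }
}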

It would be good to ask whether there is any chance the Lambda is being destroyed and recreated through Terraform or CloudFormation templates, which could cause active tracing to be turned off. But I assume not.

As far as I know we are using the parentbased_alwayson sampler. ... could you point me in the right direction if you still think this is the cause of the issue?

If you aren't trying to change it, then you probably have the parentbased_alwayson sampler. But to check for yourself, you should be able to do

const api = require('@opentelemetry/api'); // needed for api.trace below
const tracer = api.trace.getTracer('example-basic-tracer-node');
console.log(`Tracer sampler information: ${tracer['_sampler'].toString()}`); // _sampler is an internal SDK field

in your Lambda function which outputs

ParentBased{
    root=AlwaysOnSampler,
    remoteParentSampled=AlwaysOnSampler,
    remoteParentNotSampled=AlwaysOffSampler,
    localParentSampled=AlwaysOnSampler,
    localParentNotSampled=AlwaysOffSampler
}

When I did this I got some collector errors, but you should be able to ignore them because they are probably trying to serialize this message here.

Actually, this was very helpful for my own investigation: I noticed that the output above defaults some cases to the AlwaysOffSampler. That will set your isRecording flag to false and your isSampled flag to false, as discussed above, which will prevent those traces from making it all the way through.

There are two samplers that the ParentBased sampler uses which would trigger an AlwaysOff response. I want to focus on the remoteParentNotSampled case:

if (parentContext.isRemote) {
  if (parentContext.traceFlags & TraceFlags.SAMPLED) { // FOCUS ON THIS
    return this._remoteParentSampled.shouldSample(
      context,
      traceId,
      spanName,
      spanKind,
      attributes,
      links
    );
  }
  return this._remoteParentNotSampled.shouldSample(
    context,
    traceId,
    spanName,
    spanKind,
    attributes,
    links
  );
}

Based on your described system, you are using our Lambda Layers. Our Lambda Layers have the AWS Propagator. The AWS Propagator follows the OpenTelemetry Specification on determining the parent span context in an AWS Lambda environment.

If you look at the AWS Propagator code it will always create a new context with isRemote=true.

Based on the code path above, since we know the extracted context has isRemote=true, and since (as @Ancient-Dragon pointed out) the trace header is sometimes ;Sampled=0 (which I confirmed from the span context images in the PDF @simonmarshall posted), this will cause the span to have isRecording=false and isSampled=false.
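Putting those pieces together, here is a sketch (the IDs and span name are placeholders) that reproduces the decision for a remote, unsampled parent with the default parentbased_always_on sampler:

// Sketch: the default ParentBased(AlwaysOn) sampler with a remote, unsampled parent.
const { ROOT_CONTEXT, trace, SpanKind, TraceFlags } = require('@opentelemetry/api');
const {
  ParentBasedSampler,
  AlwaysOnSampler,
  SamplingDecision,
} = require('@opentelemetry/sdk-trace-base');

const sampler = new ParentBasedSampler({ root: new AlwaysOnSampler() });

// Parent span context roughly as the AWS propagator would build it from a ;Sampled=0 header.
const parentCtx = trace.setSpanContext(ROOT_CONTEXT, {
  traceId: '5759e988bd862e3fe1be46a994272793',
  spanId: '53995c3f42cd8ad8',
  traceFlags: TraceFlags.NONE, // 0, i.e. not sampled
  isRemote: true,
});

const result = sampler.shouldSample(
  parentCtx,
  '5759e988bd862e3fe1be46a994272793',
  'sqs-handler',
  SpanKind.SERVER,
  {},
  []
);

console.log(result.decision === SamplingDecision.NOT_RECORD); // true -> isRecording=false, isSampled=false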

So, if you are confident your Lambda's active tracing setting is ON, then the other solution is to change from the ParentBased sampler to simply an AlwaysOn sampler.

All you need to do is set your environment variable OTEL_TRACES_SAMPLER=always_on.
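With the layer you don't need any code change for this; the environment variable alone is enough. For reference, it is roughly equivalent to configuring the SDK with an AlwaysOnSampler yourself (sketch only; the layer normally builds the tracer provider for you):

// Sketch of what OTEL_TRACES_SAMPLER=always_on resolves to under the hood.
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { AlwaysOnSampler } = require('@opentelemetry/sdk-trace-base');

const provider = new NodeTracerProvider({
  sampler: new AlwaysOnSampler(), // every span: isRecording=true, sampled=true
});
provider.register();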

we're not entirely sure why, but they don't propagate through SNS/SQS, so we are in the process of implementing a workaround; we haven't got to either this function or the function that puts messages into the SQS queue before it

Regarding the SQS -> Lambda propagation issue, you might be describing aws/aws-xray-sdk-node#208. Unfortunately this is a very complicated problem in the tracing community which doesn't have a concrete answer due to scalability questions; it's difficult to match several SQS spans to one Lambda span. OpenTelemetry is actively trying to solve this problem as a community, and its progress is tracked in the OpenTelemetry Specification for messaging.

So I wouldn't expect it to be sending any headers that might cause it like the issue you posted.

Just to confirm, can you or @simonmarshall post a copy of what the Lambda event looks like? The one that it receives when it polls from SQS. I'm looking for any information there that could change the tracing header.
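Something as simple as this at the top of the handler would capture what I'm after (sketch only; if present, record.attributes.AWSTraceHeader is the interesting part):

// Sketch: dump the raw SQS event so we can see any tracing information it carries.
exports.handler = async (event) => {
  console.log('SQS event:', JSON.stringify(event, null, 2));
  for (const record of event.Records || []) {
    // AWSTraceHeader only shows up when the sender or queue attaches it.
    console.log('AWSTraceHeader:', record.attributes && record.attributes.AWSTraceHeader);
  }
  // ... existing handler logic ...
};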

From the other issue...

One thing I noticed today (based off what you said in my github issue) is that sometimes the amazon x-ray trace header is coming through with sampled 0.

This is precisely what I want to check by asking you to double-check Lambda's active tracing setting. As I discovered above, in this Lambda layer, having ;Sampled=0 means we will use the AlwaysOff sampler, which sets BOTH isRecording=false and isSampled=false.

Conclusion

TL;DR Can you try setting OTEL_TRACES_SAMPLER=always_on in your Lambda environment variable configuration?

I believe the problem is your Lambda having ;Sampled=0 (or not having ;Sampled= set at all, which defaults to 0). But if you are sure you have enabled active tracing, then the above solution should provide a workaround, and I will discuss internally to see if we can reproduce your case and file a bug report if it's unexpected behavior.

@simonmarshall

Hi @NathanielRN, fantastic, thanks! It looks like OTEL_TRACES_SAMPLER=always_on is the key for us. I've enabled this and will continue to monitor it over the coming days with our real feed triggering the Lambda. Thanks again!

@NathanielRN
Contributor

@Ancient-Dragon @simonmarshall I just brought your findings up to my team today! They pointed me to this paragraph in the 'Using X-Ray with AWS Lambda' documentation:

X-Ray applies a sampling algorithm to ensure that tracing is efficient, while still providing a representative sample of the requests that your application serves. The default sampling rule is 1 request per second and 5 percent of additional requests. This sampling rate cannot be configured for Lambda functions.

Given that you were sending 1000 requests, I feel like you could easily have hit that limit and then been sampled at 5 percent of additional requests. But hopefully the OTEL_TRACES_SAMPLER=always_on workaround will let you take advantage of OpenTelemetry's configurability 🙂

@NathanielRN
Contributor

Hey @simonmarshall, anything left to follow up on for this issue or aws-observability/aws-otel-lambda#203?

@simonmarshall

Hi @NathanielRN, all is good, so I think both can be closed. Thanks!
