[exporter/datadog] source provider loading takes too long and timeouts lambda initialization #16442

Closed
RangelReale opened this issue Nov 22, 2022 · 15 comments
Labels: bug, exporter/datadog, priority:p2, Stale

RangelReale commented Nov 22, 2022

Component(s)

exporter/datadog

What happened?

Description

When using the Datadog exporter in a Lambda environment, I get these messages during both trace and metrics initialization:

2022-11-22T20:57:53.513Z	debug	provider/provider.go:43	Unavailable source provider	
{
    "kind": "exporter",
    "data_type": "traces",
    "name": "datadog",
    "provider": "config",
    "error": "empty configuration hostname"
}
2022-11-22T20:57:54.573Z	debug	provider/provider.go:43	Unavailable source provider	
{
    "kind": "exporter",
    "data_type": "traces",
    "name": "datadog",
    "provider": "azure",
    "error": "failed to query Azure IMDS: Get \"http://169.254.169.254/metadata/instance/compute?api-version=2020-09-01&format=json\": dial tcp 169.254.169.254:80: connect: connection refused"
}
2022-11-22T20:57:54.573Z	debug	provider/provider.go:43	Unavailable source provider	
{
    "kind": "exporter",
    "data_type": "traces",
    "name": "datadog",
    "provider": "ecs",
    "error": "not running on ECS Fargate"
}

Then the Lambda initialization times out and never starts successfully.

If I disable one of the exporter pipelines, for example traces, leaving only metrics enabled, initialization takes less time and succeeds.

I already use a resource detector, so why does this exporter need to detect things by itself?

Collector version

v0.64.1

Environment information

Environment

OS: Debian Bullseye
Compiler: go 1.17

OpenTelemetry Collector configuration

No response

Log output

2022-11-22T20:57:53.513Z	debug	provider/provider.go:43	Unavailable source provider	
{
    "kind": "exporter",
    "data_type": "traces",
    "name": "datadog",
    "provider": "config",
    "error": "empty configuration hostname"
}
2022-11-22T20:57:54.573Z	debug	provider/provider.go:43	Unavailable source provider	
{
    "kind": "exporter",
    "data_type": "traces",
    "name": "datadog",
    "provider": "azure",
    "error": "failed to query Azure IMDS: Get \"http://169.254.169.254/metadata/instance/compute?api-version=2020-09-01&format=json\": dial tcp 169.254.169.254:80: connect: connection refused"
}
2022-11-22T20:57:54.573Z	debug	provider/provider.go:43	Unavailable source provider	
{
    "kind": "exporter",
    "data_type": "traces",
    "name": "datadog",
    "provider": "ecs",
    "error": "not running on ECS Fargate"
}
2022-11-22T20:30:46.545Z	debug	ec2/ec2.go:67	EC2 Metadata not available	
{
    "kind": "exporter",
    "data_type": "traces",
    "name": "datadog"
}
2022-11-22T20:30:46.545Z	debug	provider/provider.go:43	Unavailable source provider	
{
    "kind": "exporter",
    "data_type": "traces",
    "name": "datadog",
    "provider": "ec2",
    "error": "instance ID is unavailable"
}

Additional context

No response

mx-psi commented Nov 23, 2022

Thanks for reporting @RangelReale! A workaround is to set the `hostname` setting, which hardcodes the fallback hostname to a custom value and prevents the source providers from running. We acknowledge that the current running time can be too long in some setups. Could I ask which Lambda environment you are using?

mx-psi commented Nov 23, 2022

> I already use a resource detector, why does this exporter need to detect things by itself?

To expand on this: in the general case we don't know what your pipeline looks like and how your data is transformed before reaching the Datadog exporter. Since a missing source identifier (be it a hostname or a task id) results in incomplete data and a bad experience we want to have a fallback value so that if there is no resource processor or it is misconfigured, we can still add some id to your metrics/traces/logs. There is some work we can do here to improve speed, but we almost always need to run it at some point, no matter what your pipeline looks like.
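
The fallback described here can be pictured as a small lookup: prefer a host name that is already on the resource, and use the exporter-level fallback only when none is present. The sketch below is an illustration under stated assumptions, not the exporter's code. `sourceFor` and the plain `map[string]string` are hypothetical simplifications; `host.name` is the OpenTelemetry semantic-convention attribute key.

```go
package main

// hostNameKey is the OpenTelemetry semantic-convention key for the host name.
const hostNameKey = "host.name"

// sourceFor returns the host name carried by the resource attributes when one
// is present and non-empty, and the exporter-level fallback otherwise.
// Hypothetical helper: the real exporter works on richer resource types and
// also considers other identifiers such as task IDs.
func sourceFor(resourceAttrs map[string]string, fallback string) string {
	if host, ok := resourceAttrs[hostNameKey]; ok && host != "" {
		return host
	}
	return fallback
}
```

Calling `sourceFor(attrs, fallback)` per resource shows why the fallback still has to exist somewhere: a pipeline with no resource processor, or a misconfigured one, produces resources without `host.name`, and the second branch is the only thing that gives them an identifier.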

RangelReale commented

> Could I ask which Lambda environment you are using?

I'm running Go and Python services in Docker images, using provided.al2.

RangelReale commented

> > I already use a resource detector, why does this exporter need to detect things by itself?
>
> To expand on this: in the general case we don't know what your pipeline looks like and how your data is transformed before reaching the Datadog exporter. Since a missing source identifier (be it a hostname or a task id) results in incomplete data and a bad experience we want to have a fallback value so that if there is no resource processor or it is misconfigured, we can still add some id to your metrics/traces/logs. There is some work we can do here to improve speed, but we almost always need to run it at some point, no matter what your pipeline looks like.

There should probably be a configuration option for which source providers to run, or a client-side resource detector that adds these fields, with the exporter checking whether they are available.

mx-psi commented Nov 23, 2022

> There should probably be a configuration option for which source providers to run, or a client-side resource detector that adds these fields, with the exporter checking whether they are available.

That is a reasonable suggestion, but we first need to figure out how to provide this flexibility in a way that does not have lots of papercuts. For now, setting the hostname option will skip the source resolution process entirely, so it should be an acceptable workaround.
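
To make the workaround concrete, here is a minimal sketch of why a configured hostname avoids the slow startup: resolution returns early and no provider is ever queried. This is not the exporter's implementation. The `provider` type, `resolveSource`, and `errUnavailable` are hypothetical names used only for illustration.

```go
package main

import "errors"

// errUnavailable is a hypothetical sentinel for "no provider could answer".
var errUnavailable = errors.New("no source provider available")

// provider is a hypothetical stand-in for one source provider
// (config, azure, ecs, ec2, ...). Each call may involve a slow network probe.
type provider func() (string, error)

// resolveSource returns the configured hostname as-is when it is set, without
// calling any provider. Only when it is empty are the providers tried in order.
func resolveSource(configuredHostname string, providers []provider) (string, error) {
	if configuredHostname != "" {
		return configuredHostname, nil // short-circuit: no metadata endpoints are queried
	}
	for _, p := range providers {
		if src, err := p(); err == nil {
			return src, nil
		}
	}
	return "", errUnavailable
}
```

With a non-empty first argument the loop is never reached, which matches the behavior described above: setting the hostname skips source resolution entirely.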

kedare commented Jan 10, 2023

Does it make sense to log those at debug level?
It has quite an impact and causes the liveness/readiness probes to fail on the OpenTelemetry Helm chart, for example (with no clue as to why unless debug logs are forced).

kedare commented Jan 10, 2023

It may also be useful to be able to force a specific way of getting the host metadata (I would like to force it to use the Kubernetes API, for example); it doesn't look like this is possible right now.

codeboten pushed a commit that referenced this issue Jul 12, 2023
Make Datadog exporter source providers run in parallel to reduce start
times. With the new `Chain` implementation, we start checking all
sources in parallel instead of waiting for the previous one to fail.
This makes the Datadog exporter call all cloud provider endpoints in all
cloud providers, so it may increase spurious logs such as those reported
in #24072.

**Link to tracking Issue:** Updates #16442 (at least it should
substantially improve start time in some environments)

---------

Co-authored-by: Yang Song
Co-authored-by: Alex Boten
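
The idea in this commit message can be sketched as follows: start every provider at once, then read the answers in priority order, so the wait is bounded by the slowest provider rather than by the sum of all of them. This is an illustrative sketch only and not the actual `Chain` code. The `provider` type, `firstAvailable`, and `errUnavailable` are hypothetical names.

```go
package main

import "errors"

// errUnavailable is a hypothetical sentinel returned when every provider fails.
var errUnavailable = errors.New("no source provider available")

// provider is a hypothetical stand-in for one source provider.
type provider func() (string, error)

// result carries one provider's answer back to the caller.
type result struct {
	src string
	err error
}

// firstAvailable launches all providers concurrently, then inspects their
// results in the order given, so earlier (higher-priority) providers still win
// even if a later one answers first.
func firstAvailable(providers []provider) (string, error) {
	results := make([]chan result, len(providers))
	for i, p := range providers {
		results[i] = make(chan result, 1) // buffered so a goroutine never blocks on send
		go func(p provider, out chan<- result) {
			src, err := p()
			out <- result{src, err}
		}(p, results[i])
	}
	for _, ch := range results {
		if r := <-ch; r.err == nil {
			return r.src, nil
		}
	}
	return "", errUnavailable
}
```

The trade-off mentioned in the commit message is visible here: every provider is called on every start, including those for cloud platforms the collector is not running on, which is why more "unavailable" log lines can appear.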

mackjmr commented Jul 17, 2023

It looks like PR #24234 addresses this, so updating to 0.82.0 should reduce the start time. @mx-psi, is this correct?

mx-psi commented Jul 17, 2023

Thanks @mackjmr, this PR should indeed improve the situation. We will keep an eye on whether the start time is reasonable in all environments after the change.

mx-psi commented Sep 18, 2023

We have had user reports stating that #24234 significantly improved start times. I will therefore go ahead and close this; if you run into this issue again, feel free to comment so that we can reopen it.

mx-psi closed this as completed Sep 18, 2023