Blog: Blog post for scaling OpenTelemetry Collectors using Ansible #4140

Closed

wants to merge 57 commits into from

Commits
5f481b6
Create scaling-opentelemetry-collectors.md
ishanjainn Mar 12, 2024
87e7f5a
update for lint failure
ishanjainn Mar 12, 2024
c40a304
Update content/en/blog/2024/scaling-opentelemetry-collectors.md
ishanjainn Mar 13, 2024
983f5da
Update content/en/blog/2024/scaling-opentelemetry-collectors.md
ishanjainn Mar 13, 2024
714cb43
Update content/en/blog/2024/scaling-opentelemetry-collectors.md
ishanjainn Mar 13, 2024
270ff5e
Update content/en/blog/2024/scaling-opentelemetry-collectors.md
ishanjainn Mar 13, 2024
a640df7
Update content/en/blog/2024/scaling-opentelemetry-collectors.md
ishanjainn Mar 13, 2024
6561ace
Update content/en/blog/2024/scaling-opentelemetry-collectors.md
ishanjainn Mar 13, 2024
0faa4c0
Update content/en/blog/2024/scaling-opentelemetry-collectors.md
ishanjainn Mar 13, 2024
41e7613
Update content/en/blog/2024/scaling-opentelemetry-collectors.md
ishanjainn Mar 13, 2024
c38add7
Update content/en/blog/2024/scaling-opentelemetry-collectors.md
ishanjainn Mar 13, 2024
c172a11
Update content/en/blog/2024/scaling-opentelemetry-collectors.md
ishanjainn Mar 13, 2024
2482bbd
Update content/en/blog/2024/scaling-opentelemetry-collectors.md
ishanjainn Mar 13, 2024
d3e6b35
Update content/en/blog/2024/scaling-opentelemetry-collectors.md
ishanjainn Mar 13, 2024
7d65f4a
indent
ishanjainn Mar 13, 2024
91c3d6d
Merge branch 'main' into patch-4
ishanjainn Mar 13, 2024
9a5360b
Update content/en/blog/2024/scaling-opentelemetry-collectors.md
ishanjainn Mar 13, 2024
6c037f0
Update content/en/blog/2024/scaling-opentelemetry-collectors.md
ishanjainn Mar 13, 2024
ecb5e26
Update content/en/blog/2024/scaling-opentelemetry-collectors.md
ishanjainn Mar 13, 2024
9e8c6f2
Update content/en/blog/2024/scaling-opentelemetry-collectors.md
ishanjainn Mar 13, 2024
1c2c85a
Update content/en/blog/2024/scaling-opentelemetry-collectors.md
ishanjainn Mar 13, 2024
bcb7d03
Update content/en/blog/2024/scaling-opentelemetry-collectors.md
ishanjainn Mar 13, 2024
174e230
Update content/en/blog/2024/scaling-opentelemetry-collectors.md
ishanjainn Mar 13, 2024
c3de73c
Update content/en/blog/2024/scaling-opentelemetry-collectors.md
ishanjainn Mar 13, 2024
eb75dd3
Update content/en/blog/2024/scaling-opentelemetry-collectors.md
ishanjainn Mar 13, 2024
4df80d8
ansible prerequisite
ishanjainn Mar 13, 2024
72be345
partial fix, Config yet to be tested
ishanjainn Mar 13, 2024
631be4e
fix config
ishanjainn Mar 13, 2024
80ccad5
update Inventory
ishanjainn Mar 13, 2024
12a622f
ansible-config updates
ishanjainn Mar 13, 2024
77c776e
update doc with grafana steps
ishanjainn Mar 13, 2024
e44d39b
cspell fixes
ishanjainn Mar 13, 2024
bad97df
linter
ishanjainn Mar 13, 2024
ff5947a
`npm run fix:format`
ishanjainn Mar 13, 2024
b840872
npm run fix:format again
ishanjainn Mar 13, 2024
936eb0f
Merge branch 'main' into patch-4
ishanjainn Mar 13, 2024
9a209f4
Merge branch 'main' into patch-4
ishanjainn Mar 15, 2024
b77c266
remove canonical URL
patcher9 Mar 19, 2024
db991fe
Auto-update registry versions (90129530fc2097d9835f54da9df1ad09ff26ee…
opentelemetrybot Mar 15, 2024
89fb41e
Add Causely to vendors.yaml (#4158)
esara Mar 16, 2024
8ed906d
Auto-update registry versions (0ca8a0a58be7073c4a72f032f44b5b249615d7…
opentelemetrybot Mar 16, 2024
3fd8461
spring starter can now use all sdk autoconfig properties (#4167)
zeitlinger Mar 16, 2024
7fd62b6
Update opentelemetry-java-instrumentation version to v2.2.0 (#4164)
opentelemetrybot Mar 16, 2024
051e607
Add paymentServiceFailure & paymentServiceUnreachable featureflags to…
EislM0203 Mar 16, 2024
b3191c5
Update opentelemetry-specification version to v1.31.0 (#4157)
opentelemetrybot Mar 16, 2024
04d5cad
Troubleshooting added to check available components in the collector …
nerudadhich Mar 16, 2024
faf8784
Demo Docs: Update the name of recommendation cache feature flag (#4159)
cthain Mar 16, 2024
2b3708d
Update kubernetes-deployment.md (#4160)
julianocosta89 Mar 16, 2024
4a8e80c
Update architecture.md (#4162)
julianocosta89 Mar 16, 2024
a89d5be
spring boot build info resource detector (#3999)
zeitlinger Mar 16, 2024
8aeb024
how to enable Resource Providers that are disabled by default (#4138)
zeitlinger Mar 16, 2024
12dcb6d
Bump @opentelemetry/auto-instrumentations-web from 0.36.0 to 0.37.0 (…
dependabot[bot] Mar 16, 2024
ccc2dd3
Create opentelemetry-announced-support-for-profiling (#4173)
cartermp Mar 18, 2024
ac41348
Update author (it was austin, I am just his sockpuppet) (#4175)
cartermp Mar 18, 2024
33995b8
Adds Chronosphere to vendors list (#4179)
subvocal Mar 19, 2024
c24a755
Auto-update registry versions (2f7aab49799161a841fae7f82b536e6df0759a…
opentelemetrybot Mar 19, 2024
db6ea32
Remove mention of spanmetrics processor (#4178)
tiffany76 Mar 19, 2024
2 changes: 1 addition & 1 deletion .gitmodules
@@ -6,7 +6,7 @@
[submodule "content-modules/opentelemetry-specification"]
path = content-modules/opentelemetry-specification
url = https://github.com/open-telemetry/opentelemetry-specification.git
-spec-pin = v1.30.0
+spec-pin = v1.31.0
[submodule "content-modules/community"]
path = content-modules/community
url = https://github.com/open-telemetry/community
@@ -0,0 +1,41 @@
---
title: OpenTelemetry announces support for profiling
linkTitle: OpenTelemetry announces support for profiling
date: 2024-03-19
# prettier-ignore
cSpell:ignore: stuff probably
author: '[Austin Parker](https://github.com/austinlparker) (Honeycomb)'
---

In 2023, OpenTelemetry announced that it achieved stability for [logs, metrics, and traces](https://www.cncf.io/blog/2023/11/07/opentelemetry-at-kubecon-cloudnativecon-north-america-2023-update/). While this was our initial goal at the formation of the project, fulfilling our vision of enabling built-in observability for cloud native applications requires us to continue evolving with the community. This year, we’re proud to announce that exactly two years after the Profiling SIG was created at KubeCon + CloudNativeCon Europe 2022 in Valencia, we’re taking a big step towards this goal by merging a profiling data model into our specification and working towards a stable implementation this year!

## What is profiling?

Profiling is a method to dynamically inspect the behavior and performance of application code at run-time. Continuous profiling gives insights into resource utilization at a code-level and allows for this profiling data to be stored, queried, and analyzed over time and across different attributes. It’s an important technique for developers and performance engineers to understand exactly what’s happening in their code. OpenTelemetry’s [profiling signal](https://github.com/open-telemetry/oteps/blob/main/text/profiles/0239-profiles-data-model.md) expands upon the work that has been done in this space and, as a first for the industry, connects profiles with other telemetry signals from applications and infrastructure. This allows developers and operators to correlate resource exhaustion or poor user experience across their services with not just the specific service or pod being impacted, but the function or line of code most responsible for it.

We’re thrilled to see the industry embrace this vision, with many organizations coming together to help define the profiling signal. More specifically, two donations are in play:

- Elastic has [pledged to donate](https://github.com/open-telemetry/community/issues/1918) their proprietary eBPF-based profiling agent <sup>1</sup>
- Splunk has begun the process of [donating their .NET based profiler](https://github.com/open-telemetry/opentelemetry-dotnet-instrumentation/pull/3196)

These are being donated to the project in order to accelerate the delivery and implementation of OpenTelemetry profiling.

## What does this mean for users?

Profiles will support bi-directional links between themselves and other signals, such as logs, metrics, and traces. You’ll be able to easily jump from resource telemetry to a corresponding profile. For example:

- Metrics to profiles: You will be able to go from a spike in CPU or memory usage to the specific pieces of code consuming that resource
- Traces to profiles: You will be able to understand not just where latency occurs across your services but, when it's caused by your code, see that reflected in a profile attached to the trace or span
- Logs to profiles: Logs often provide the context that something is wrong, but profiling lets you go from merely tracking a symptom (for example, Out Of Memory errors) to seeing exactly which parts of the code are consuming memory

These are just a few examples, and these links work in the opposite direction as well. More generally, profiling helps deliver on the promise of observability by making it easier for users to query and understand an entirely new dimension of their applications with minimal additional code or effort.

## A community in motion

This work would not be possible without the dedicated contributors who work on OpenTelemetry each day. We’ve recently passed a new milestone, with over 1000 unique developers contributing to the project each month, representing over 180 companies. Across our most popular repositories, OpenTelemetry sees over 30 million downloads a month<sup>2</sup>, and new open source projects are adopting our standards at a regular pace, including [Apache Kafka](https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability), and [dozens more](https://opentelemetry.io/ecosystem/integrations/). We’re also deepening our integrations with other open source projects in CNCF and out, such as [OpenFeature](https://openfeature.dev) and [OpenSearch](https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/23611), in addition to our existing integrations with Kubernetes, Thanos, Knative, and many more.

2024 promises to be another big year for OpenTelemetry as we continue to implement and stabilize our existing tracing, metrics, and log signals while adding support for profiling, client-side RUM, and more. It’s a great time to get involved – check out our [website](https://opentelemetry.io) to learn more!

<sup>1</sup> Pending due diligence and review by the OpenTelemetry maintainers.
<sup>2</sup> According to public download statistics of our .NET, Java, and Python APIs

229 changes: 229 additions & 0 deletions content/en/blog/2024/scaling-opentelemetry-collectors.md
@@ -0,0 +1,229 @@
---
title: Manage OpenTelemetry Collectors at Scale with Ansible
linkTitle: OTel Collector with Ansible
date: 2024-03-12
author: '[Ishan Jain](https://github.com/ishanjainn) (Grafana)'
cSpell:ignore: ansible Ishan Jain
---

This guide shows how to scale the
[OpenTelemetry Collector deployment](/docs/collector/deployment/) across
various Linux hosts using [Ansible](https://www.ansible.com/), so that
Collectors function both as [gateways](/docs/collector/deployment/gateway/) and
[agents](/docs/collector/deployment/agent/) within your observability
architecture. Using the OpenTelemetry Collector in this dual capacity enables
robust collection and forwarding of metrics, traces, and logs to analysis and
visualization platforms.

Here, we outline a strategy for deploying and managing the OpenTelemetry
Collector's scalable instances throughout your infrastructure with Ansible,
enhancing your overall monitoring strategy and data visualization capabilities
in [Grafana](https://grafana.com/).

## Before you begin

To follow this guide, ensure you have the following (a quick sanity check is
sketched after this list):

- Ansible installed on your system
- Linux hosts, along with SSH access to each of them
- Prometheus set up for gathering metrics
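
A minimal sketch of that sanity check, using a placeholder host address from
the inventory created later in this guide:

```shell
# Confirm Ansible is installed and report its version
ansible --version

# Confirm SSH access to one of your hosts (replace the address with your own)
ssh root@10.0.0.1 'echo ok'
```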

## Install the Grafana Ansible collection

The
[OpenTelemetry Collector role](https://github.com/grafana/grafana-ansible-collection/tree/main/roles/opentelemetry_collector)
is provided through the
[Grafana Ansible collection](https://docs.ansible.com/ansible/latest/collections/grafana/grafana/)
as of the 3.0.0 release.

To install the Grafana Ansible collection, run this command:

```shell
ansible-galaxy collection install grafana.grafana
```
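
To confirm the collection is now visible to Ansible, you can list it (the
version shown will depend on when you install):

```shell
ansible-galaxy collection list grafana.grafana
```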

## Create an Ansible inventory file

Next, you will set up your hosts and create an inventory file.

1. Create your hosts and add public SSH keys to them.

This example uses eight Linux hosts: two Ubuntu hosts, two CentOS hosts, two
Fedora hosts, and two Debian hosts.

2. Create an Ansible inventory file.

The Ansible inventory, which resides in a file named `inventory`, looks
similar to this:

```ini
10.0.0.1 # hostname = ubuntu-01
10.0.0.2 # hostname = ubuntu-02
10.0.0.3 # hostname = centos-01
10.0.0.4 # hostname = centos-02
10.0.0.5 # hostname = debian-01
10.0.0.6 # hostname = debian-02
10.0.0.7 # hostname = fedora-01
10.0.0.8 # hostname = fedora-02
```

> **Note**: If you are copying the above file, remove the comments (#).

3. Create an `ansible.cfg` file within the same directory as `inventory`, with
the following values:

```cfg
[defaults]
# Path to the inventory file
inventory = inventory
# Path to your private SSH key
private_key_file = ~/.ssh/id_rsa
remote_user = root
```
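
With the inventory and `ansible.cfg` in place, it's worth checking that Ansible
can reach every host before deploying anything:

```shell
# Each host should respond with "pong"
ansible all -m ansible.builtin.ping
```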

## Use the OpenTelemetry Collector Ansible role

Next, you'll define an Ansible playbook that applies the OpenTelemetry
Collector role across your hosts.

Create a file named `deploy-opentelemetry.yml` in the same directory as your
`ansible.cfg` and `inventory`.

```yaml
- name: Install OpenTelemetry Collector
  hosts: all
  become: true

  tasks:
    - name: Install OpenTelemetry Collector
      ansible.builtin.include_role:
        name: opentelemetry_collector
      vars:
        otel_collector_receivers:
          hostmetrics:
            collection_interval: 60s
            scrapers:
              cpu: {}
              disk: {}
              load: {}
              filesystem: {}
              memory: {}
              network: {}
              paging: {}
              process:
                mute_process_name_error: true
                mute_process_exe_error: true
                mute_process_io_error: true
              processes: {}

        otel_collector_processors:
          batch:
          resourcedetection:
            detectors: [env, system]
            timeout: 2s
            system:
              hostname_sources: [os]
          transform/add_resource_attributes_as_metric_attributes:
            error_mode: ignore
            metric_statements:
              - context: datapoint
                statements:
                  - set(attributes["deployment.environment"], resource.attributes["deployment.environment"])
                  - set(attributes["service.version"], resource.attributes["service.version"])

        otel_collector_exporters:
          prometheusremotewrite:
            endpoint: https://<prometheus-url>/api/prom/push
            headers:
              Authorization: 'Basic <base64-encoded-username:password>'

        otel_collector_service:
          pipelines:
            metrics:
              receivers: [hostmetrics]
              processors:
                [
                  resourcedetection,
                  transform/add_resource_attributes_as_metric_attributes,
                  batch,
                ]
              exporters: [prometheusremotewrite]
```

{{% alert title="Note" %}}

You'll need to adjust the configuration to match the specific telemetry you
intend to collect and where you plan to forward it. This configuration snippet
is a basic example designed to collect host metrics and forward them to
Prometheus.

{{% /alert %}}

The previous configuration provisions the OpenTelemetry Collector to collect
host metrics from each Linux host.

## Running the Ansible playbook

Deploy the OpenTelemetry Collector across your hosts by executing:

```sh
ansible-playbook deploy-opentelemetry.yml
```
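
Before touching all eight hosts at once, Ansible's built-in flags let you
validate and stage the rollout; the last command is a sketch that assumes the
role installs a systemd unit named `otelcol` (check the role's documentation if
your unit name differs):

```sh
# Validate the playbook syntax without executing anything
ansible-playbook deploy-opentelemetry.yml --syntax-check

# Dry-run against a single host first
ansible-playbook deploy-opentelemetry.yml --check --limit 10.0.0.1

# After the real run, confirm the Collector service is active everywhere
ansible all -m ansible.builtin.shell -a "systemctl is-active otelcol"
```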

## Visualizing metrics in Grafana

Once your OpenTelemetry Collectors start sending metrics to Prometheus, follow
these quick steps to visualize them in [Grafana](https://grafana.com/):

### Set up Grafana

1. **Install Docker**: Make sure Docker is installed on your system. If it's
not, you can find the installation guide at the
[official Docker website](https://docs.docker.com/get-docker/).

2. **Run Grafana Docker Container**: Start a Grafana server with this Docker
command, which fetches the latest Grafana image:

```sh
docker run -d -p 3000:3000 --name=grafana grafana/grafana
```

3. **Access Grafana**: Navigate to `http://localhost:3000` in your web browser.
The default login details are `admin` for both the username and password.

4. **Change Password**: Upon your first login, you will be prompted to set a new
password. Make sure to pick a secure one.

For other installation methods and more detailed instructions, refer to the
[official Grafana documentation](https://grafana.com/docs/grafana/latest/installation/).
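
By default, the container above keeps Grafana's data inside the container. If
you want dashboards and settings to survive a container recreation, a named
volume is one option (a sketch using Docker's standard volume flag and
Grafana's default data path):

```sh
docker run -d -p 3000:3000 --name=grafana \
  -v grafana-storage:/var/lib/grafana \
  grafana/grafana
```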

### Add Prometheus as a data source

1. **Login to Grafana** and navigate to **Connections** > **Data Sources**.
2. Click **Add data source**, and choose **Prometheus**.
3. In the settings, enter your Prometheus URL, for example,
`http://<your_prometheus_host>`, along with any other necessary details, and
then click **Save & Test**.
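
If you prefer configuration as code over clicking through the UI, Grafana can
also load the data source from a provisioning file placed under
`/etc/grafana/provisioning/datasources/`. A sketch, with a hypothetical file
name and the same placeholder URL:

```yaml
# prometheus.yaml (hypothetical file name)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://<your_prometheus_host>
    isDefault: true
```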

### Explore metrics

1. Go to the **Explore** page.
2. In the query editor, select your Prometheus data source and enter the
   following query:
```PromQL
100 - (avg by (cpu) (irate(system_cpu_time{state="idle"}[5m])) * 100)
```

This query calculates the average percentage of CPU time not spent in the
"idle" state, across each CPU core, over the last 5 minutes.

3. Play around with different metrics, such as the memory example below, and
   start putting together your dashboards to gain insights into your system's
   performance.
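
For example, a companion query for memory usage (a sketch; the exact metric
name can vary with the exporter's unit-suffix settings) looks like this:

```PromQL
100 * sum by (instance) (system_memory_usage_bytes{state="used"})
  / sum by (instance) (system_memory_usage_bytes)
```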

This guide showed how to set up the OpenTelemetry Collector across several
Linux hosts with Ansible and how to visualize the data it collects in Grafana.
You can adjust the Collector configuration and design your Grafana dashboards
to meet your own monitoring needs, so you get exactly the insights you want
from your systems.
3 changes: 1 addition & 2 deletions content/en/docs/collector/architecture.md
@@ -211,8 +211,7 @@ processor.
Processors can transform the data before forwarding it, such as adding or
removing attributes from spans. They can also drop the data by deciding not to
forward it (for example, the `probabilisticsampler` processor). Or they can
-generate new data, as the `spanmetrics` processor does by producing metrics for
-spans processed by the pipeline.
+generate new data.

The same name of the processor can be referenced in the `processors` key of
multiple pipelines. In this case, the same configuration is used for each of
99 changes: 99 additions & 0 deletions content/en/docs/collector/troubleshooting.md
@@ -15,6 +15,105 @@ network issues, it can be helpful to send a small amount of data to a collector
configured to output to local logs. For details, see
[Local exporters](https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/troubleshooting.md#local-exporters).

## Check available components in the Collector

Use the following subcommand to list the available components in a Collector
distribution, including their stability levels. Note that the output format may
change across versions.

```sh
otelcol components
```

Sample output:

```yaml
buildinfo:
command: otelcol
description: OpenTelemetry Collector
version: 0.96.0
receivers:
- name: opencensus
stability:
logs: Undefined
metrics: Beta
traces: Beta
- name: prometheus
stability:
logs: Undefined
metrics: Beta
traces: Undefined
- name: zipkin
stability:
logs: Undefined
metrics: Undefined
traces: Beta
- name: otlp
stability:
logs: Beta
metrics: Stable
traces: Stable
processors:
- name: resource
stability:
logs: Beta
metrics: Beta
traces: Beta
- name: span
stability:
logs: Undefined
metrics: Undefined
traces: Alpha
- name: probabilistic_sampler
stability:
logs: Alpha
metrics: Undefined
traces: Beta
exporters:
- name: otlp
stability:
logs: Beta
metrics: Stable
traces: Stable
- name: otlphttp
stability:
logs: Beta
metrics: Stable
traces: Stable
- name: debug
stability:
logs: Development
metrics: Development
traces: Development
- name: prometheus
stability:
logs: Undefined
metrics: Beta
traces: Undefined
connectors:
- name: forward
stability:
logs-to-logs: Beta
logs-to-metrics: Undefined
logs-to-traces: Undefined
metrics-to-logs: Undefined
metrics-to-metrics: Beta
traces-to-traces: Beta
extensions:
- name: zpages
stability:
extension: Beta
- name: memory_ballast
stability:
extension: Deprecated
- name: health_check
stability:
extension: Beta
- name: pprof
stability:
extension: Beta
```
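
Since the output is YAML, standard text tools can narrow it down. For example,
to check whether a particular exporter ships with your distribution:

```sh
otelcol components | grep -A 4 'name: otlphttp'
```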

## Checklist for debugging complex pipelines

It can be difficult to isolate problems when telemetry flows through multiple