add eks/fargate distribution with 2-replica StatefulSet #346

Merged · 14 commits · Feb 1, 2022
2 changes: 2 additions & 0 deletions .gitignore
@@ -0,0 +1,2 @@
.idea
*.iml
12 changes: 12 additions & 0 deletions Makefile
@@ -49,3 +49,15 @@ render:
default helm-charts/splunk-otel-collector; \
mv "$$dir"/splunk-otel-collector/templates/* "$$dir"; \
rm -rf "$$dir"/splunk-otel-collector

# eks/fargate deployment (with recommended gateway)
dir=rendered/manifests/eks-fargate; \
mkdir -p "$$dir"; \
helm template \
--namespace default \
--values rendered/values.yaml \
--output-dir "$$dir" \
--set distribution=eks/fargate,gateway.enabled=true,cloudProvider=aws \
default helm-charts/splunk-otel-collector; \
mv "$$dir"/splunk-otel-collector/templates/* "$$dir"; \
rm -rf "$$dir"/splunk-otel-collector
1 change: 1 addition & 0 deletions README.md
@@ -92,6 +92,7 @@ Kubernetes distributions:

- [Vanilla (unmodified version) Kubernetes](https://kubernetes.io)
- [Amazon Elastic Kubernetes Service](https://aws.amazon.com/eks)
including [with Fargate profiles](docs/advanced-configuration.md#eks-fargate-support)
- [Azure Kubernetes Service](https://docs.microsoft.com/en-us/azure/aks)
- [Google Kubernetes Engine](https://cloud.google.com/kubernetes-engine)
including [GKE Autopilot](docs/advanced-configuration.md#gke-autopilot-support)
36 changes: 35 additions & 1 deletion docs/advanced-configuration.md
@@ -43,10 +43,11 @@ Use the `distribution` parameter to provide information about the underlying
Kubernetes deployment. This parameter allows the collector to automatically
scrape additional metadata. The supported options are:

- `aks` - Azure AKS
- `eks` - Amazon EKS
- `eks/fargate` - Amazon EKS with Fargate profiles
- `gke` - Google GKE / Standard mode
- `gke/autopilot` - Google GKE / Autopilot mode
- `openshift` - Red Hat OpenShift

This value can be omitted if none of the values apply.
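
For example, to declare a standard EKS cluster in your custom values.yaml:

```yaml
distribution: eks
```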
@@ -157,6 +158,39 @@ the following line to your custom values.yaml:
priorityClassName: splunk-otel-agent-priority
```

## EKS Fargate support

If you want to run the Splunk OpenTelemetry Collector in [Amazon Elastic Kubernetes Service
with Fargate profiles](https://docs.aws.amazon.com/eks/latest/userguide/fargate.html),
make sure to set the required `distribution` value to `eks/fargate`:

```yaml
distribution: eks/fargate
```

**NOTE:** Fluentd and native OTel logs collection are not yet automatically configured in EKS with Fargate profiles.
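
A minimal install sketch, mirroring this repo's Makefile render target (release name `default`, chart at
`helm-charts/splunk-otel-collector`); `clusterName` and the `splunkObservability.*` parameters are illustrative
placeholders, not values required by this section:

```bash
helm install default helm-charts/splunk-otel-collector \
  --set cloudProvider=aws \
  --set distribution=eks/fargate \
  --set gateway.enabled=true \
  --set clusterName=my-fargate-cluster \
  --set splunkObservability.accessToken=<access-token> \
  --set splunkObservability.realm=us0
```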

This distribution operates similarly to the `eks` distribution, with the following distinctions:

1. The Collector agent daemonset is not applied, since Fargate doesn't support daemonsets. Any Collector instances meant to
run as agents must instead be configured manually as sidecar containers in your custom deployments, including any application
logging services like Fluentd. If no agent instances are used in your cluster, we recommend setting `gateway.enabled` to `true`
and configuring your instrumented applications to report metrics, traces, and logs to the gateway's
`<installed-chart-name>-splunk-otel-collector` service address (see the sketch after this list).

2. Since Fargate nodes use a VM boundary that prevents access to host-based resources used by other pods, pods cannot reach their
own kubelet. The cluster receiver for the Fargate distribution has two primary differences from the regular `eks` distribution
to work around this limitation:
* The configured cluster receiver is deployed as a 2-replica StatefulSet instead of a Deployment. It uses a
[Kubernetes Observer extension](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/extension/observer/k8sobserver/README.md)
that discovers the cluster's nodes and, on the second replica, its pods (for user-configurable receiver creator additions), and uses this observer to dynamically create
[Kubelet Stats receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/kubeletstatsreceiver/README.md)
instances that report kubelet metrics for all observed Fargate nodes. The first replica monitors the cluster with a `k8s_cluster` receiver,
and the second monitors all kubelets except its own (due to an EKS/Fargate networking restriction).

* The first replica's collector monitors the second replica's kubelet. This is made possible by a Fargate-specific `splunk-otel-eks-fargate-kubeletstats-receiver-node`
node label. The Collector's ClusterRole for `eks/fargate` allows the `patch` verb on `nodes` resources in the default API group so that the cluster
receiver's init container can add this label to its own node, designating it for self-monitoring.
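
As a sketch of the gateway-only setup recommended in item 1 above, assuming a release named `default` (so the gateway
service is `default-splunk-otel-collector` in the same namespace) and the collector's standard OTLP gRPC port 4317, an
instrumented application could export its telemetry like this; the deployment, image, and app names are hypothetical:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-instrumented-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-instrumented-app
  template:
    metadata:
      labels:
        app: my-instrumented-app
    spec:
      containers:
        - name: app
          image: my-app:latest  # placeholder image
          env:
            # Standard OpenTelemetry SDK environment variable; the endpoint
            # assumes a Helm release named "default" in the same namespace.
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://default-splunk-otel-collector:4317"
```

To check which Fargate node the init container has labeled for monitoring by the first replica (an optional
verification step, not required by the chart):

```bash
kubectl get nodes -l splunk-otel-eks-fargate-kubeletstats-receiver-node=true
```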

## Logs collection

The helm chart currently utilizes [fluentd](https://docs.fluentd.org/) for Kubernetes logs
44 changes: 44 additions & 0 deletions scripts/init-eks-fargate-cluster-receiver.sh
@@ -0,0 +1,44 @@
#! /usr/bin/bash
set -ex

echo "Downloading yq"
curl -L -o yq https://github.com/mikefarah/yq/releases/download/v4.16.2/yq_linux_amd64
ACTUAL=$(sha256sum yq | awk '{print $1}')
if [ "${ACTUAL}" != "5c911c4da418ae64af5527b7ee36e77effb85de20c2ce732ed14c7f72743084d" ]; then
echo "will not attempt to use yq with unexpected sha256 (${ACTUAL} != 5c911c4da418ae64af5527b7ee36e77effb85de20c2ce732ed14c7f72743084d)"
exit 1
fi
chmod a+x yq

# If we are the first pod (cluster receiver), set the kubelet stats node filter to only follow labelled nodes.
# This node label will be set by the second pod.
if [[ "${K8S_POD_NAME}" == *-0 ]]; then
echo "will configure kubelet stats receiver to follow other StatefulSet replica's node, as well as use cluster receiver."
./yq e '.receivers.receiver_creator.receivers.kubeletstats.rule = .receivers.receiver_creator.receivers.kubeletstats.rule + " && labels[\"splunk-otel-eks-fargate-kubeletstats-receiver-node\"] == \"true\""' /conf/relay.yaml >/splunk-messages/config.yaml
./yq e -i '.extensions.k8s_observer.observe_pods = false' /splunk-messages/config.yaml
exit 0
fi

# Else we are the second pod (wide kubelet stats): label our node so it is monitored by the first pod,
# disable the k8s_cluster receiver, and update our config to not monitor our own kubelet.
echo "Labelling our fargate node to denote it hosts the cluster receiver"

# download kubectl (verifying checksum)
curl -o kubectl https://amazon-eks.s3.us-west-2.amazonaws.com/1.20.4/2021-04-12/bin/linux/amd64/kubectl
ACTUAL=$(sha256sum kubectl | awk '{print $1}')
if [ "${ACTUAL}" != "e84ff8c607b2a10f635c312403f9ede40a045404957e55adcf3d663f9e32c630" ]; then
echo "will not attempt to use kubectl with unexpected sha256 (${ACTUAL} != e84ff8c607b2a10f635c312403f9ede40a045404957e55adcf3d663f9e32c630)"
exit 1
fi
chmod a+x kubectl
# label node
./kubectl label nodes "$K8S_NODE_NAME" splunk-otel-eks-fargate-kubeletstats-receiver-node=true

echo "Disabling k8s_cluster receiver for this instance"
# strip k8s_cluster and its pipeline
./yq e 'del(.service.pipelines.metrics)' /conf/relay.yaml >/splunk-messages/config.yaml
./yq e -i 'del(.receivers.k8s_cluster)' /splunk-messages/config.yaml

# set kubelet stats to not monitor ourselves (all other kubelets)
echo "Ensuring k8s_observer-based kubeletstats receivers won't monitor own node to avoid Fargate network limitation."
./yq e -i '.receivers.receiver_creator.receivers.kubeletstats.rule = .receivers.receiver_creator.receivers.kubeletstats.rule + " && not ( name contains \"${K8S_NODE_NAME}\" )"' /splunk-messages/config.yaml
14 changes: 14 additions & 0 deletions helm-charts/splunk-otel-collector/templates/_helpers.tpl
@@ -308,3 +308,17 @@ compatibility with the old config group name: "otelK8sClusterReceiver".
{{- deepCopy .Values.otelK8sClusterReceiver | mustMergeOverwrite (deepCopy .Values.clusterReceiver) | toYaml }}
{{- end }}
{{- end -}}

{{/*
"clusterReceiverServiceName" for the eks/fargate cluster receiver statefulSet
*/}}
{{- define "splunk-otel-collector.clusterReceiverServiceName" -}}
{{ printf "%s-k8s-cluster-receiver" ( include "splunk-otel-collector.fullname" . ) | trunc 63 | trimSuffix "-" }}
{{- end -}}

{{/*
"clusterReceiverNodeDiscovererScript" for the eks/fargate cluster receiver statefulSet initContainer
*/}}
{{- define "splunk-otel-collector.clusterReceiverNodeDiscovererScript" -}}
{{ printf "%s-cr-node-discoverer-script" ( include "splunk-otel-collector.fullname" . ) | trunc 63 | trimSuffix "-" }}
{{- end -}}
@@ -86,6 +86,14 @@ rules:
- get
- list
- watch
{{- if eq (include "splunk-otel-collector.distribution" .) "eks/fargate" }}
- apiGroups:
- ""
resources:
- nodes
verbs:
- patch
{{- end }}
{{- with .Values.rbac.customRules }}
{{ toYaml . }}
{{- end }}
@@ -69,7 +69,7 @@ resourcedetection:
- env
{{- if hasPrefix "gke" (include "splunk-otel-collector.distribution" .) }}
- gke
{{- else if hasPrefix "eks" (include "splunk-otel-collector.distribution" .) }}
- eks
{{- else if eq (include "splunk-otel-collector.distribution" .) "aks" }}
- aks
@@ -11,6 +11,14 @@ extensions:
memory_ballast:
size_mib: ${SPLUNK_BALLAST_SIZE_MIB}

{{- if eq (include "splunk-otel-collector.distribution" .) "eks/fargate" }}
# k8s_observer w/ pod and node detection for eks/fargate deployment
k8s_observer:
auth_type: serviceAccount
observe_pods: true
observe_nodes: true
{{- end }}

receivers:
# Prometheus receiver scraping metrics from the pod itself, both otel and fluentd
prometheus/k8s_cluster_receiver:
@@ -42,6 +50,26 @@ receivers:
- reason: FailedCreate
involvedObjectKind: Job
{{- end }}
{{- if eq (include "splunk-otel-collector.distribution" .) "eks/fargate" }}
# dynamically created kubeletstats receivers reporting kubelet stats for all Fargate "nodes",
# with the exception of the collector "node's" own, since Fargate forbids that connection.
receiver_creator:
receivers:
kubeletstats:
rule: type == "k8s.node" && name contains "fargate"
config:
auth_type: serviceAccount
collection_interval: 10s
endpoint: "`endpoint`:`kubelet_endpoint_port`"
extra_metadata_labels:
- container.id
metric_groups:
- container
- pod
- node
watch_observers:
- k8s_observer
{{- end }}

processors:
{{- include "splunk-otel-collector.otelMemoryLimiterConfig" . | nindent 2 }}
@@ -80,12 +108,6 @@ processors:
- action: insert
key: metric_source
value: kubernetes
- action: upsert
key: k8s.cluster.name
value: {{ .Values.clusterName }}
@@ -95,6 +117,15 @@
value: {{ .value }}
{{- end }}

resource/k8s_cluster:
attributes:
# XXX: Added so that Smart Agent metrics and OTel metrics don't map to the same MTS identity
# (same metric and dimension names and values) after mappings are applied. This would be
# the case if somebody uses the same cluster name from Smart Agent and OTel in the same org.
- action: insert
key: receiver
value: k8scluster

exporters:
{{- if eq (include "splunk-otel-collector.o11yMetricsEnabled" $) "true" }}
signalfx:
@@ -125,11 +156,27 @@ service:
telemetry:
metrics:
address: 0.0.0.0:8889
{{- if eq (include "splunk-otel-collector.distribution" .) "eks/fargate" }}
extensions: [health_check, memory_ballast, k8s_observer]
{{- else }}
extensions: [health_check, memory_ballast]
{{- end }}
pipelines:
# k8s metrics pipeline
metrics:
receivers: [k8s_cluster]
processors: [memory_limiter, batch, resource, resource/k8s_cluster]
exporters:
{{- if (eq (include "splunk-otel-collector.o11yMetricsEnabled" .) "true") }}
- signalfx
{{- end }}
{{- if (eq (include "splunk-otel-collector.platformMetricsEnabled" $) "true") }}
- splunk_hec/platform_metrics
{{- end }}

{{- if eq (include "splunk-otel-collector.distribution" .) "eks/fargate" }}
metrics/eks:
receivers: [receiver_creator]
processors: [memory_limiter, batch, resource]
exporters:
{{- if (eq (include "splunk-otel-collector.o11yMetricsEnabled" .) "true") }}
@@ -138,6 +185,7 @@
{{- if (eq (include "splunk-otel-collector.platformMetricsEnabled" $) "true") }}
- splunk_hec/platform_metrics
{{- end }}
{{- end }}

{{- if or (eq (include "splunk-otel-collector.splunkO11yEnabled" $) "true") (eq (include "splunk-otel-collector.platformMetricsEnabled" $) "true") }}
# Pipeline for metrics collected about the collector pod itself.
@@ -174,3 +222,18 @@
{{- end }}
{{- end }}
{{- end }}

{{/*
Pod anti-affinity to prevent eks/fargate replicas from being on same node
*/}}
{{- define "splunk-otel-collector.clusterReceiverPodAntiAffinity" -}}
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: component
operator: In
values:
- otel-k8s-cluster-receiver
topologyKey: "kubernetes.io/hostname"
{{- end }}
@@ -1,5 +1,8 @@
{{ $agent := fromYaml (include "splunk-otel-collector.agent" .) }}
{{/*
Fargate doesn't support daemonsets, so never use one for that platform
*/}}
{{- if and $agent.enabled (ne (include "splunk-otel-collector.distribution" .) "eks/fargate") }}
apiVersion: v1
kind: ConfigMap
metadata:
@@ -0,0 +1,16 @@
{{ $clusterReceiver := fromYaml (include "splunk-otel-collector.clusterReceiver" .) }}
{{ if and $clusterReceiver.enabled (eq (include "splunk-otel-collector.metricsEnabled" .) "true") (eq (include "splunk-otel-collector.distribution" .) "eks/fargate") }}
apiVersion: v1
kind: ConfigMap
metadata:
name: {{ template "splunk-otel-collector.clusterReceiverNodeDiscovererScript" . }}
labels:
{{- include "splunk-otel-collector.commonLabels" . | nindent 4 }}
app: {{ template "splunk-otel-collector.name" . }}
chart: {{ template "splunk-otel-collector.chart" . }}
release: {{ .Release.Name }}
heritage: {{ .Release.Service }}
data:
script: |
{{- (.Files.Get "scripts/init-eks-fargate-cluster-receiver.sh") | nindent 4 }}
{{- end }}
@@ -1,5 +1,5 @@
{{ $gateway := fromYaml (include "splunk-otel-collector.gateway" .) }}
{{ if or $gateway.enabled (eq (include "splunk-otel-collector.distribution" .) "eks/fargate") }}
apiVersion: v1
kind: ConfigMap
metadata:
@@ -1,5 +1,8 @@
{{ $agent := fromYaml (include "splunk-otel-collector.agent" .) }}
{{/*
Fargate doesn't support daemonsets, so never use one for that platform
*/}}
{{- if and $agent.enabled (ne (include "splunk-otel-collector.distribution" .) "eks/fargate") }}
apiVersion: apps/v1
kind: DaemonSet
metadata: