add eks/fargate StatefulSet distribution
Ryan Fitzpatrick committed Jan 20, 2022
1 parent 8058aa2 commit f30c258
Showing 33 changed files with 1,201 additions and 9 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -0,0 +1,2 @@
.idea
*.iml
12 changes: 12 additions & 0 deletions Makefile
@@ -49,3 +49,15 @@ render:
default helm-charts/splunk-otel-collector; \
mv "$$dir"/splunk-otel-collector/templates/* "$$dir"; \
rm -rf "$$dir"/splunk-otel-collector

# eks/fargate deployment (with recommended gateway)
dir=rendered/manifests/eks-fargate; \
mkdir -p "$$dir"; \
helm template \
--namespace default \
--values rendered/values.yaml \
--output-dir "$$dir" \
--set distribution=eks/fargate,gateway.enabled=true,cloudProvider=aws \
default helm-charts/splunk-otel-collector; \
mv "$$dir"/splunk-otel-collector/templates/* "$$dir"; \
rm -rf "$$dir"/splunk-otel-collector
1 change: 1 addition & 0 deletions README.md
@@ -92,6 +92,7 @@ Kubernetes distributions:

- [Vanilla (unmodified version) Kubernetes](https://kubernetes.io)
- [Amazon Elastic Kubernetes Service](https://aws.amazon.com/eks)
including [with Fargate profiles](docs/advanced-configuration.md#eks-fargate-support)
- [Azure Kubernetes Service](https://docs.microsoft.com/en-us/azure/aks)
- [Google Kubernetes Engine](https://cloud.google.com/kubernetes-engine)
including [GKE Autopilot](docs/advanced-configuration.md#gke-autopilot-support)
33 changes: 32 additions & 1 deletion docs/advanced-configuration.md
@@ -43,10 +43,11 @@ Use the `distribution` parameter to provide information about the underlying
Kubernetes deployment. This parameter allows the collector to automatically
scrape additional metadata. The supported options are:

- `aks` - Azure AKS
- `eks` - Amazon EKS
- `eks/fargate` - Amazon EKS with Fargate profiles
- `gke` - Google GKE / Standard mode
- `gke/autopilot` - Google GKE / Autopilot mode
- `openshift` - Red Hat OpenShift

This value can be omitted if none of the values apply.
Expand Down Expand Up @@ -121,6 +122,36 @@ the following line to your custom values.yaml:
priorityClassName: splunk-otel-agent-priority
```

## EKS Fargate support

If you want to run the Splunk OpenTelemetry Collector in [Amazon Elastic Kubernetes Service
with Fargate profiles](https://docs.aws.amazon.com/eks/latest/userguide/fargate.html),
make sure to set the required `distribution` value to `eks/fargate`:

```yaml
distribution: eks/fargate
```
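The chart's Makefile renders this distribution together with the recommended gateway (`--set distribution=eks/fargate,gateway.enabled=true,cloudProvider=aws`); the equivalent custom values.yaml sketch would be:

```yaml
# Sketch of a values.yaml for eks/fargate with the recommended gateway,
# mirroring the Makefile's render target for this distribution.
distribution: eks/fargate
cloudProvider: aws
gateway:
  enabled: true
```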

**NOTE:** Fluentd and native OTel logs collection are not yet automatically configured in EKS with Fargate profiles.

This distribution operates similarly to the `eks` distribution, with the following distinctions:

1. The Collector agent daemonset is not applied, since Fargate doesn't support daemonsets. Any desired Collector instances
running as agents must be configured manually as sidecar containers in your custom deployments. This includes any application
logging services like Fluentd. If no agent instances are used in your cluster, we recommend setting `gateway.enabled` to `true` and
configuring your instrumented applications to report metrics, traces, and logs to the gateway's
`<installed-chart-name>-splunk-otel-collector` service address.
2. The Collector's ClusterRole for `eks/fargate` allows the `patch` verb on `nodes` resources in the core API group. This lets
the cluster receiver's init container add a `splunk-otel-is-eks-fargate-cluster-receiver-node` label to its node for self monitoring. This label is currently
required for the cluster receiver StatefulSet described below to report kubelet and pod metrics.
3. The configured cluster receiver is deployed as a 2-replica StatefulSet configured with the
[Kubernetes Observer extension](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/extension/observer/k8sobserver/README.md)
that discovers the cluster's nodes and pods. The observer is used to dynamically create
[Kubelet Stats receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/kubeletstatsreceiver/README.md)
instances that report kubelet metrics for all observed Fargate nodes, distributed across the replicas. The first replica monitors all kubelets
except its own (due to an EKS/Fargate networking restriction), and the second monitors the first replica's kubelet. This is made possible by the Fargate-specific
node label mentioned above. The second replica also runs a `k8s_cluster` receiver instance.
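The replica roles above are derived from the StatefulSet pod ordinal by the node-discoverer init script; a minimal sketch of that branch (the pod name is an assumed example):

```shell
# Sketch: deriving a replica's role from the StatefulSet pod ordinal (-0 vs -1).
# The pod name below is illustrative, not a value the chart guarantees.
K8S_POD_NAME="default-splunk-otel-collector-k8s-cluster-receiver-1"

if [[ "${K8S_POD_NAME}" == *-0 ]]; then
  # First replica: follows only the specially labelled node.
  ROLE="cluster-receiver"
else
  # Other replica: labels its own node and monitors all other Fargate kubelets.
  ROLE="wide-kubelet-stats"
fi
echo "${ROLE}"
```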

## Logs collection

The helm chart currently utilizes [fluentd](https://docs.fluentd.org/) for Kubernetes logs
@@ -0,0 +1,5 @@
set -ex
# Source any environment exported by the init container, if present.
if [ -f /splunk-messages/environ ]; then
  . /splunk-messages/environ
fi
# Launch the collector with any arguments passed to this wrapper.
/otelcol "$@"
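This wrapper exists so the init container can hand settings to the collector through a shared volume: anything written to `/splunk-messages/environ` is sourced before `/otelcol` starts. A self-contained sketch of that handoff (a temporary directory stands in for the shared volume, and the filter value is illustrative):

```shell
# Sketch of the environ-file handoff; a temp dir stands in for /splunk-messages.
msgdir="$(mktemp -d)"

# What the init container writes:
echo "export CR_KUBELET_STATS_NODE_FILTER='&& name contains \"fargate\"'" > "${msgdir}/environ"

# What the wrapper does before launching the collector:
if [ -f "${msgdir}/environ" ]; then
  . "${msgdir}/environ"
fi
echo "filter: ${CR_KUBELET_STATS_NODE_FILTER}"
```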
@@ -0,0 +1,47 @@
#! /usr/bin/bash
set -ex

# If we are the first pod (cluster receiver), set the kubelet stats node filter to only follow labelled nodes.
if [[ "${K8S_POD_NAME}" == *-0 ]]; then
  echo "will configure kubelet stats receiver to follow node ${FIRST_CR_REPLICA_NODE_NAME}, as well as use cluster receiver."
  echo "export CR_KUBELET_STATS_NODE_FILTER='&& labels[\"splunk-otel-is-eks-fargate-cluster-receiver-node\"] == \"true\"'" >/splunk-messages/environ
  cat /splunk-messages/environ

  # copy config to meet container command args
  cp /conf/relay.yaml /splunk-messages/config.yaml
  exit 0
fi

# Else we are the second pod (wide kubelet stats): label our node so it is monitored by the first pod,
# disable the k8s_cluster receiver, and update our config to not monitor our own kubelet.
echo "Labelling our fargate node to denote it hosts the cluster receiver"

# download kubectl (verifying checksum)
curl -o kubectl https://amazon-eks.s3.us-west-2.amazonaws.com/1.16.15/2020-11-02/bin/linux/amd64/kubectl
curl -o kubectl.sha256 https://amazon-eks.s3.us-west-2.amazonaws.com/1.16.15/2020-11-02/bin/linux/amd64/kubectl.sha256
ACTUAL=$(sha256sum kubectl | awk '{print $1}')
EXPECTED=$(awk '{print $1}' kubectl.sha256)
if [ "${ACTUAL}" != "${EXPECTED}" ]; then
echo "will not attempt to use kubectl with unexpected sha256 (${ACTUAL} != ${EXPECTED})"
exit 1
fi
chmod a+x kubectl
# label node
./kubectl label nodes "$K8S_NODE_NAME" splunk-otel-is-eks-fargate-cluster-receiver-node=true

echo "Disabling k8s_cluster receiver for this instance"
# download yq to strip k8s_cluster receiver
curl -L -o yq https://github.com/mikefarah/yq/releases/download/v4.16.2/yq_linux_amd64
ACTUAL=$(sha256sum yq | awk '{print $1}')
if [ "${ACTUAL}" != "5c911c4da418ae64af5527b7ee36e77effb85de20c2ce732ed14c7f72743084d" ]; then
echo "will not attempt to use yq with unexpected sha256 (${ACTUAL} != 5c911c4da418ae64af5527b7ee36e77effb85de20c2ce732ed14c7f72743084d)"
exit 1
fi
chmod a+x yq
# strip k8s_cluster
./yq e 'del(.service.pipelines.metrics.receivers[0])' /conf/relay.yaml >/splunk-messages/config.yaml
./yq e -i 'del(.receivers.k8s_cluster)' /splunk-messages/config.yaml

# set kubelet stats to not monitor ourselves (all other kubelets)
echo "EKS kubelet stats receiver node lookup not applicable for $K8S_POD_NAME. Ensuring it won't monitor itself to avoid Fargate network limitation."
echo "export CR_KUBELET_STATS_NODE_FILTER='&& not ( name contains \"${K8S_NODE_NAME}\" )'" >/splunk-messages/environ
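Combined with the `receiver_creator` rule in the cluster receiver config, the two filter variants written above yield rules along these lines (the node name is an assumed example):

```shell
# Sketch: the effective receiver_creator rules after environ substitution.
BASE_RULE='type == "k8s.node" && name contains "fargate"'

# First pod (*-0): follow only the node labelled by the other replica.
FILTER_0='&& labels["splunk-otel-is-eks-fargate-cluster-receiver-node"] == "true"'
echo "pod-0 rule: ${BASE_RULE} ${FILTER_0}"

# Other pod: follow every Fargate node except its own (illustrative node name).
K8S_NODE_NAME="fargate-ip-192-168-1-2.ec2.internal"
FILTER_1="&& not ( name contains \"${K8S_NODE_NAME}\" )"
echo "pod-1 rule: ${BASE_RULE} ${FILTER_1}"
```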
30 changes: 30 additions & 0 deletions helm-charts/splunk-otel-collector/templates/_helpers.tpl
@@ -308,3 +308,33 @@ compatibility with the old config group name: "otelK8sClusterReceiver".
{{- deepCopy .Values.otelK8sClusterReceiver | mustMergeOverwrite (deepCopy .Values.clusterReceiver) | toYaml }}
{{- end }}
{{- end -}}

{{/*
"clusterReceiverServiceName" for the eks/fargate cluster receiver statefulSet
*/}}
{{- define "splunk-otel-collector.clusterReceiverServiceName" -}}
{{ printf "%s-k8s-cluster-receiver" ( include "splunk-otel-collector.fullname" . ) | trunc 63 | trimSuffix "-" }}
{{- end -}}

{{/*
"clusterReceiverNodeDiscovererScript" for the eks/fargate cluster receiver statefulSet initContainer
*/}}
{{- define "splunk-otel-collector.clusterReceiverNodeDiscovererScript" -}}
{{ printf "%s-cr-node-discoverer-script" ( include "splunk-otel-collector.fullname" . ) | trunc 63 | trimSuffix "-" }}
{{- end -}}

{{/*
"eksFargateClusterReceiverScript" for the eks/fargate cluster receiver statefulSet run command
*/}}
{{- define "splunk-otel-collector.eksFargateClusterReceiverScript" -}}
{{ printf "%s-fargate-cr-script" ( include "splunk-otel-collector.fullname" . ) | trunc 63 | trimSuffix "-" }}
{{- end -}}

{{/*
"clusterReceiverNodeDiscovererInitContainerEnabled" is true when the cluster receiver and o11y metrics are enabled for the eks/fargate distribution
*/}}
{{- define "splunk-otel-collector.clusterReceiverNodeDiscovererInitContainerEnabled" -}}
{{- $clusterReceiver := fromYaml (include "splunk-otel-collector.clusterReceiver" .) }}
{{- $o11yMetricsEnabled := (include "splunk-otel-collector.o11yMetricsEnabled" .) }}
{{- and (eq (toString $clusterReceiver.enabled) "true") (eq (toString $o11yMetricsEnabled) "true") (eq (include "splunk-otel-collector.distribution" .) "eks/fargate") -}}
{{- end -}}
8 changes: 8 additions & 0 deletions helm-charts/splunk-otel-collector/templates/clusterRole.yaml
@@ -86,6 +86,14 @@ rules:
- get
- list
- watch
{{- if eq (include "splunk-otel-collector.clusterReceiverNodeDiscovererInitContainerEnabled" .) "true" }}
- apiGroups:
- ""
resources:
- nodes
verbs:
- patch
{{- end }}
{{- with .Values.rbac.customRules }}
{{ toYaml . }}
{{- end }}
@@ -69,7 +69,7 @@ resourcedetection:
- env
{{- if hasPrefix "gke" (include "splunk-otel-collector.distribution" .) }}
- gke
{{- else if eq (include "splunk-otel-collector.distribution" .) "eks" }}
{{- else if hasPrefix "eks" (include "splunk-otel-collector.distribution" .) }}
- eks
{{- else if eq (include "splunk-otel-collector.distribution" .) "aks" }}
- aks
@@ -11,6 +11,14 @@ extensions:
memory_ballast:
size_mib: ${SPLUNK_BALLAST_SIZE_MIB}

{{- if eq (include "splunk-otel-collector.distribution" .) "eks/fargate" }}
# k8s_observer w/ pod and node detection for eks/fargate deployment
k8s_observer:
auth_type: serviceAccount
observe_pods: true
observe_nodes: true
{{- end }}

receivers:
# Prometheus receiver scraping metrics from the pod itself, both otel and fluentd
prometheus/k8s_cluster_receiver:
@@ -42,6 +50,26 @@ receivers:
- reason: FailedCreate
involvedObjectKind: Job
{{- end }}
{{- if eq (include "splunk-otel-collector.distribution" .) "eks/fargate" }}
  # dynamically created kubeletstats receivers that report kubelet stats for all Fargate "nodes",
  # with the exception of the collector "node's" own, since Fargate forbids that connection.
receiver_creator:
receivers:
kubeletstats:
rule: type == "k8s.node" && name contains "fargate" ${CR_KUBELET_STATS_NODE_FILTER}
config:
auth_type: serviceAccount
collection_interval: 10s
endpoint: "`endpoint`:`kubelet_endpoint_port`"
extra_metadata_labels:
- container.id
metric_groups:
- container
- pod
- node
watch_observers:
- k8s_observer
{{- end }}

processors:
{{- include "splunk-otel-collector.otelMemoryLimiterConfig" . | nindent 2 }}
@@ -122,11 +150,20 @@ exporters:
{{- end }}

service:
{{- if eq (include "splunk-otel-collector.distribution" .) "eks/fargate" }}
extensions: [health_check, memory_ballast, k8s_observer]
{{- else }}
extensions: [health_check, memory_ballast]
{{- end }}
pipelines:
# k8s metrics pipeline
metrics:
{{- if eq (include "splunk-otel-collector.distribution" .) "eks/fargate" }}
receivers: [k8s_cluster, receiver_creator]
{{- else }}
receivers: [k8s_cluster]
{{- end }}

processors: [memory_limiter, batch, resource]
exporters:
{{- if (eq (include "splunk-otel-collector.o11yMetricsEnabled" .) "true") }}
@@ -171,3 +208,30 @@ service:
{{- end }}
{{- end }}
{{- end }}

{{- define "splunk-otel-collector.clusterReceiverInitContainers" -}}
{{- if eq (include "splunk-otel-collector.clusterReceiverNodeDiscovererInitContainerEnabled" .) "true" }}
- name: cluster-receiver-node-discoverer
image: public.ecr.aws/amazonlinux/amazonlinux:latest
imagePullPolicy: IfNotPresent
command: [ "bash", "-c", "/splunk-scripts/lookup-eks-fargate-receiver-node.sh"]
securityContext:
runAsUser: 0
env:
- name: K8S_POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: K8S_NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
volumeMounts:
- name: {{ template "splunk-otel-collector.clusterReceiverNodeDiscovererScript" . }}
mountPath: /splunk-scripts
- name: messages
mountPath: /splunk-messages
- mountPath: /conf
name: collector-configmap
{{- end -}}
{{- end -}}
@@ -1,5 +1,8 @@
{{ $agent := fromYaml (include "splunk-otel-collector.agent" .) }}
{{ if $agent.enabled }}
{{/*
Fargate doesn't support daemonsets, so never render the agent config for that distribution
*/}}
{{- if and $agent.enabled (ne (include "splunk-otel-collector.distribution" .) "eks/fargate") }}
apiVersion: v1
kind: ConfigMap
metadata:
@@ -0,0 +1,16 @@
{{ $clusterReceiver := fromYaml (include "splunk-otel-collector.clusterReceiver" .) }}
{{ if and $clusterReceiver.enabled (eq (include "splunk-otel-collector.metricsEnabled" .) "true") (eq (include "splunk-otel-collector.distribution" .) "eks/fargate") }}
apiVersion: v1
kind: ConfigMap
metadata:
name: {{ template "splunk-otel-collector.clusterReceiverNodeDiscovererScript" . }}
labels:
{{- include "splunk-otel-collector.commonLabels" . | nindent 4 }}
app: {{ template "splunk-otel-collector.name" . }}
chart: {{ template "splunk-otel-collector.chart" . }}
release: {{ .Release.Name }}
heritage: {{ .Release.Service }}
data:
script: |
{{- (.Files.Get "scripts/lookup-eks-fargate-receiver-node.sh") | nindent 4 }}
{{- end }}
@@ -0,0 +1,16 @@
{{ $clusterReceiver := fromYaml (include "splunk-otel-collector.clusterReceiver" .) }}
{{ if and $clusterReceiver.enabled (eq (include "splunk-otel-collector.metricsEnabled" .) "true") (eq (include "splunk-otel-collector.distribution" .) "eks/fargate") }}
apiVersion: v1
kind: ConfigMap
metadata:
name: {{ template "splunk-otel-collector.eksFargateClusterReceiverScript" . }}
labels:
{{- include "splunk-otel-collector.commonLabels" . | nindent 4 }}
app: {{ template "splunk-otel-collector.name" . }}
chart: {{ template "splunk-otel-collector.chart" . }}
release: {{ .Release.Name }}
heritage: {{ .Release.Service }}
data:
script: |
{{- (.Files.Get "scripts/eks-fargate-otelcol-with-env.sh") | nindent 4 }}
{{- end }}
@@ -1,5 +1,5 @@
{{ $gateway := fromYaml (include "splunk-otel-collector.gateway" .) }}
{{ if $gateway.enabled }}
{{ if or $gateway.enabled (eq (include "splunk-otel-collector.distribution" .) "eks/fargate") }}
apiVersion: v1
kind: ConfigMap
metadata:
5 changes: 4 additions & 1 deletion helm-charts/splunk-otel-collector/templates/daemonset.yaml
Original file line number Diff line number Diff line change
@@ -1,5 +1,8 @@
{{ $agent := fromYaml (include "splunk-otel-collector.agent" .) }}
{{- if $agent.enabled }}
{{/*
Fargate doesn't support daemonsets, so never render one for that distribution
*/}}
{{- if and $agent.enabled (ne (include "splunk-otel-collector.distribution" .) "eks/fargate") }}
apiVersion: apps/v1
kind: DaemonSet
metadata: