When running operator versions >v1.15.1 under heavy logging load, deleted log files are held open. Previous versions do not have this issue.
This results in rising disk consumption until the node is full. Cycling the log-router pods releases all the deleted files and the space is reclaimed.
Reproduction steps
When using the default configuration (kubernetes.conf) on a cluster with 4 worker nodes (m5.4xlarge):
<source>
  @type tail
  @id in_tail_container_logs
  path /var/log/containers/*.log
  pos_file /var/log/log-router-fluentd-containers.log.pos
  pos_file_compaction_interval 1h
  tag kubernetes.*
  read_from_head true
  read_bytes_limit_per_second 8192
  <parse>
    @type multiline
    # cri-o
    format1 /^(?<partials>([^\n]+ (stdout|stderr) P [^\n]+\n)*)/
    format2 /(?<time>[^\n]+) (?<stream>stdout|stderr) F (?<log>[^\n]*)/
    # docker
    format3 /|(?<json>{.*})/
    time_format %Y-%m-%dT%H:%M:%S.%N%:z
  </parse>
</source>
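To illustrate what the CRI-O line pattern in format2 above captures, here is the same regex translated to Python's (?P<name>...) named-group syntax; the sample log line is made up for demonstration:

```python
import re

# The format2 regex from the fluentd config above, in Python named-group form.
CRI_LINE = re.compile(
    r"(?P<time>[^\n]+) (?P<stream>stdout|stderr) F (?P<log>[^\n]*)"
)

# Hypothetical CRI-O log line: timestamp, stream, "F" (full line), message.
sample = "2023-10-05T12:00:00.000000000+00:00 stdout F hello world"
m = CRI_LINE.match(sample)
print(m.group("time"), m.group("stream"), m.group("log"))
```

The "F" flag marks a complete log line; partial ("P") lines are stitched together by format1 before this pattern applies.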
Using a workload like:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: logging
  labels:
    app: logging
spec:
  replicas: 15
  selector:
    matchLabels:
      app: logging
  template:
    metadata:
      annotations:
        seccomp.security.alpha.kubernetes.io/pod: runtime/default
      labels:
        app: logging
      name: logging-pod
      namespace: log-router
    spec:
      containers:
      - command:
        - /app/talk.sh
        name: logging
        image: busybox
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            cpu: 100m
            memory: 128Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /app/talk.sh
          name: logging
          subPath: talk.sh
          readOnly: true
      restartPolicy: Always
      securityContext:
        runAsGroup: 1000
        runAsNonRoot: true
        runAsUser: 1000
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists
        tolerationSeconds: 300
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 300
      volumes:
      - name: logging
        configMap:
          defaultMode: 0755
          name: logging
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: logging
data:
  talk.sh: |
    #!/bin/bash
    while true
    do echo "Bread-and-butter pickles are a marinated variety of pickled cucumber in a solution of vinegar, sugar, and spices. They may simply be chilled as refrigerator pickles or canned. Their name and their broad popularity in the United States are attributed to Omar and Cora Fanning, who were Illinois cucumber farmers that started selling sweet and sour pickles in the 1920s. They filed for the trademark Fannings Bread and Butter Pickles in 1923 (though the recipe and similar recipes are probably much older).[4] The story to the name is that the Fannings survived rough years by making the pickles with their surplus of undersized cucumbers and bartering them with their grocer for staples such as bread and butter.[5] Their taste is often much sweeter than other types of pickle, due to the sweeter brine they are marinated in, but they differ from sweet pickles in that they are spiced with cilantro and other spices"
    done
---
This workload, which simply echoes a large text statement repeatedly, causes escalating disk pressure.
Thanks for opening this issue @josephmcasey. We have actually seen this one ourselves; it is an interesting issue, and changing how it operates today has compliance and security implications. I'm not sure what the previous behavior was, but "open_on_every_update" was not added until fluentd 0.14.12, and prior to operator v1.15 the fluentd version was 1.12, so that may explain that part.
If this is enabled by default and you are running a default containerd config, then I only need to generate a little over 50Mi of logs in a pod to cause fluentd to start dropping logs rather than shipping them. An attacker could easily use this to hide their tracks if it is enabled by default, so it should not be turned on without caution and an understanding that it will drop logs rather quickly under load.
The ideal solution would be to drop logs only when the volume is almost out of space, with configurable thresholds for when this feature kicks in, instead of having it on all the time. If there are short bursts of logs, the local disk should have space to buffer them until they can be shipped, and then this isn't a problem. The issue really only occurs when there is an extreme volume of logs for an extended period, which is arguably a separate issue for the application itself, but KFO should have some safeguards to prevent it from causing problems on the host.
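The threshold idea above could look something like the following. This is a hypothetical sketch, not KFO code: the mount point and the 10% free-space cutoff are illustrative assumptions.

```python
import shutil

def should_drop_logs(path="/var/log", min_free_ratio=0.10):
    """Hypothetical guard: only allow log-dropping behavior (e.g. enabling
    open_on_every_update) once free space on the log volume falls below a
    configurable threshold. Path and threshold are illustrative."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total < min_free_ratio

print(should_drop_logs("/"))
```

Under this scheme, short bursts buffer to disk as today, and the aggressive mode engages only when the node is genuinely at risk of filling up.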
Running lsof +L1 on a node shows the open deleted files, as do the logs of the fluentd container.
Expected behavior
The deleted files should be released upon rotation.
Additional context
No response
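The same information lsof +L1 reports can be gathered by reading /proc directly. A minimal sketch (Linux-only; it scans this process's own descriptors for illustration, but pointing it at the fluentd PID on the node shows the files the log-router is holding):

```python
import os

def deleted_open_files(pid="self"):
    """Return link targets of open fds that point at deleted files."""
    fd_dir = f"/proc/{pid}/fd"  # Linux-only; other PIDs need permission
    held = []
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # fd was closed while we were scanning
        if target.endswith(" (deleted)"):
            held.append(target)
    return held

print(deleted_open_files())
```

Summing the sizes of these targets (via os.stat on the fd links) gives the space that would be reclaimed by cycling the pod.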