Kubernetes uses the underlying container runtime logging, which does not persist logs for stopped and destroyed containers. This makes it difficult to investigate issues in the very common case of containers that are no longer running. Gardener solves this problem for the managed cluster components by introducing its own logging stack, which consists of:
- A Fluent-bit DaemonSet that acts as the log collector, together with a custom Golang plugin that distributes log messages to their Vali instances.
- One Vali StatefulSet in the garden namespace, which contains the logs for the seed cluster, and one per shoot namespace, which contains the logs for the shoot's control plane.
- One Plutono Deployment in the garden namespace and two Deployments per shoot namespace (one exposed to the end users and one for the operators). Plutono is the UI component used in the logging stack.
Container log rotation in Kubernetes describes a subtle but important implementation detail that depends on the type of high-level container runtime in use. When the container runtime is not CRI compliant (such as dockershim), the kubelet does not provide any rotation or retention implementation, leaving those aspects to the downstream components. When the container runtime is CRI compliant (such as containerd), the kubelet provides the necessary implementation with two configuration options:
- ContainerLogMaxSize for rotation
- ContainerLogMaxFiles for retention
In this case, it is possible to configure the containerLogMaxSize and containerLogMaxFiles fields in the Shoot specification. Both fields are optional; if nothing is specified, the kubelet rotates on the size 100M. These fields are part of the provider's workers definition. Here is an example:
spec:
  provider:
    workers:
      - cri:
          name: containerd
        kubernetes:
          kubelet:
            # accepted values are of resource.Quantity
            containerLogMaxSize: 150Mi
            containerLogMaxFiles: 10
The values of the containerLogMaxSize and containerLogMaxFiles fields need to be chosen with care, since container log files claim disk space from the host. On the other hand, rotating at too small a size may result in frequent rotations, some of which can be missed by other components (log shippers) observing these rotations.
In most cases, the defaults are sufficient. Custom configuration might be useful under rare conditions.
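To get a feel for how much host disk space a given combination can claim, a rough upper bound per container is containerLogMaxSize multiplied by containerLogMaxFiles. The following Python sketch illustrates that arithmetic for the example values above. The helper names are hypothetical, only a handful of resource.Quantity suffixes are handled, and any compression of rotated files is ignored, so treat the result as an estimate rather than an exact figure.

```python
# Subset of the suffixes accepted by Kubernetes resource.Quantity (assumption:
# only these are needed for log size values).
_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "k": 10**3, "M": 10**6, "G": 10**9}


def parse_quantity(value: str) -> int:
    """Convert a quantity such as '150Mi' to bytes (hypothetical helper)."""
    # Check two-character suffixes first so that 'Mi' is not mistaken for 'M'.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * _SUFFIXES[suffix]
    return int(value)  # plain number of bytes


def max_log_bytes_per_container(max_size: str, max_files: int) -> int:
    """Rough upper bound: every kept log file grows to the rotation size."""
    return parse_quantity(max_size) * max_files


budget = max_log_bytes_per_container("150Mi", 10)
print(budget // 2**20, "Mi")  # 1500 Mi
```

Multiplying this per-container bound by the number of containers expected on a node gives an idea of how much disk the log files may occupy in the worst case.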
The logging stack is extended to scrape logs from the systemd services of each shoot's nodes and from all Gardener components in the shoot kube-system namespace. These logs are exposed only to the Gardener operators.
In addition, an event-logger pod is deployed in the shoot control plane. It scrapes events from the shoot kube-system namespace and from the shoot control-plane namespace in the seed, and logs them to standard output. Fluent-bit then picks up these events as container logs and sends them to the Vali in the shoot control plane (similar to how it works for any other control plane component).
The logs are accessible via Plutono. To access them:
- Authenticate via basic auth to gain access to Plutono. The secret containing the credentials is stored in the project namespace following the naming pattern <shoot-name>.monitoring. In this secret you can also find the Plutono URL in the plutono-url annotation. For Gardener operators, the credentials are also stored in the control-plane (shoot--<project-name>--<shoot-name>) namespace in the observability-ingress-users-<hash> secret in the seed.
- Plutono contains several dashboards that aim to facilitate the work of operators and users. From the Explore tab, users and operators have unlimited abilities to extract and manipulate logs.
Note: Gardener operators are members of the Gardener team with operator permissions, not operators of the end-user cluster!
If you click on the Log browser > button, you will see all of the available labels.
Clicking on a label shows all of its available values for the time period you have specified.
If you are searching for logs from the past hour, do not expect to see labels or values for which there were no logs in that period.
When you click on a value, Plutono automatically eliminates all other labels and values with which no valid log stream can be formed.
After choosing the right labels and their values, click on the Show logs button.
This builds a log query and executes it.
This approach is convenient when you don't know the label names or their values.
Once you feel comfortable, you can start to use the LogQL language to search for logs.
Next to the Log browser > button is the field where you can type log queries.
Examples:
- If you want to get logs for the calico-node-<hash> pod in the kube-system namespace of the cluster: The name of the node on which calico-node was running is known, but not the hash suffix of the calico-node pod. Also, we want to search for errors in the logs.
  {pod_name=~"calico-node-.+", nodename="ip-10-222-31-182.eu-central-1.compute.internal"} |~ "error"
  Here, Plutono helps as much as possible by offering suggestions and auto-completion.
- If you want to get the logs from the kubelet systemd service of a given node and search for a pod name in the logs:
  {unit="kubelet.service", nodename="ip-10-222-31-182.eu-central-1.compute.internal"} |~ "pod name"
  Note: Under the unit label there are only the docker, containerd, kubelet, and kernel logs.
- If you want to get the logs from the gardener-node-agent systemd service of a given node and search for a string in the logs:
  {job="systemd-combine-journal",nodename="ip-10-222-31-182.eu-central-1.compute.internal"} | unpack | unit="gardener-node-agent.service"
  Note: The {job="systemd-combine-journal",nodename="<node name>"} stream packs all logs from systemd services except docker, containerd, kubelet, and kernel. To filter those logs by unit, you have to unpack them first.
- Retrieving events:
  - If you want to get the events from the shoot kube-system namespace generated by kubelet and related to the node-problem-detector:
    {job="event-logging"} | unpack | origin_extracted="shoot",source="kubelet",object=~".*node-problem-detector.*"
  - If you want to get the events generated by MCM in the shoot control plane in the seed:
    {job="event-logging"} | unpack | origin_extracted="seed",source=~".*machine-controller-manager.*"
Note: In order to group events by origin, one has to specify origin_extracted. The origin label is reserved for all of the logs from the seed, and because the event-logger resides in the seed, all of its logs are labeled as coming from the seed. The actual origin is embedded in the packed event. When unpacked, the embedded origin becomes origin_extracted.
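The queries above all follow the same shape: a stream selector in curly braces with exact (=) or regular-expression (=~) label matchers, optionally followed by a |~ line filter. As a minimal sketch of that structure, the following Python helper assembles such a query string. build_query is a hypothetical name used only for illustration and is not part of Vali or Plutono; it also does not escape quotes inside values.

```python
def build_query(equals=None, regex=None, line_filter=None):
    """Assemble a LogQL query string from label matchers (illustrative only)."""
    # Exact-match label matchers, e.g. nodename="ip-10-...".
    matchers = [f'{name}="{value}"' for name, value in (equals or {}).items()]
    # Regular-expression label matchers, e.g. pod_name=~"calico-node-.+".
    matchers += [f'{name}=~"{value}"' for name, value in (regex or {}).items()]
    query = "{" + ", ".join(matchers) + "}"
    if line_filter:
        # Keep only log lines matching the given regular expression.
        query += f' |~ "{line_filter}"'
    return query


print(build_query(
    equals={"nodename": "ip-10-222-31-182.eu-central-1.compute.internal"},
    regex={"pod_name": "calico-node-.+"},
    line_filter="error",
))
```

The order of the matchers inside the braces does not affect which streams are selected, so the generated string is equivalent to the calico-node example even though the labels appear in a different order.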
Exposing logs for a new component to the User's Plutono is described in the How to Expose Logs to the Users section.
The Fluent-bit configurations can be found in pkg/component/observability/logging/fluentoperator/customresources
There are six different specifications:
- FluentBit: Defines the fluent-bit DaemonSet specifications
- ClusterFluentBitConfig: Defines the label selectors of the resources which fluent-bit will match
- ClusterInput: Defines the location of the input stream of the logs
- ClusterOutput: Defines the destination of the output stream (Vali, for example)
- ClusterFilter: Defines filters which match specific keys
- ClusterParser: Defines parsers which are used by the filters
The Vali configurations can be found in charts/seed-bootstrap/charts/vali/templates/vali-configmap.yaml
The main specifications there are:
- Index configuration: Currently the following one is used:
schema_config:
  configs:
  - from: 2018-04-15
    store: boltdb
    object_store: filesystem
    schema: v11
    index:
      prefix: index_
      period: 24h
  - from: The date from which log collection starts. Using a date in the past is okay.
  - store: The DB used for storing the index.
  - object_store: Where the data is stored.
  - schema: Schema version which should be used (v11 is currently recommended).
  - index.prefix: The prefix for the index.
  - index.period: The period for updating the indices.
A new index is added by defining a new config block. Its from field should be set to the current day plus the previous index.period, so that it does not overlap with the current index. The prefix should also be different.
schema_config:
  configs:
  - from: 2018-04-15
    store: boltdb
    object_store: filesystem
    schema: v11
    index:
      prefix: index_
      period: 24h
  - from: 2020-06-18
    store: boltdb
    object_store: filesystem
    schema: v11
    index:
      prefix: index_new_
      period: 24h
- chunk_store_config Configuration
chunk_store_config:
  max_look_back_period: 336h
chunk_store_config.max_look_back_period should be the same as the retention_period.
- table_manager Configuration
table_manager:
  retention_deletes_enabled: true
  retention_period: 336h
table_manager.retention_period is the lifetime of each log message. Because of the way retention is implemented in Vali, messages are actually kept for (table_manager.retention_period - index.period).
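The two time calculations described in this configuration (the from date of a new index block and the effective retention) can be sketched in a few lines of Python. The function names are hypothetical and the date passed to next_index_from is an arbitrary illustration; only the values 336h and 24h are taken from the configuration shown here.

```python
from datetime import date, timedelta


def effective_retention(retention_period: timedelta, index_period: timedelta) -> timedelta:
    """Time for which messages are kept: retention_period minus index.period."""
    return retention_period - index_period


def next_index_from(today: date, previous_index_period: timedelta) -> date:
    """from date for a new index block: current day plus the previous index.period."""
    return today + previous_index_period


# retention_period: 336h and index.period: 24h, as configured above.
print(effective_retention(timedelta(hours=336), timedelta(hours=24)))  # 13 days, 0:00:00

# Illustrative date only; in practice this would be the day the block is added.
print(next_index_from(date(2020, 6, 17), timedelta(hours=24)))  # 2020-06-18
```

With the configured values, messages are therefore kept for 13 days rather than the full 14 days that 336h might suggest.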
This is the Vali configuration that Plutono uses:
- name: vali
  type: vali
  access: proxy
  url: http://logging.{{ .Release.Namespace }}.svc:3100
  jsonData:
    maxLines: 5000
- name: The name of the datasource.
- type: The type of the datasource.
- access: Should be set to proxy.
- url: Vali's URL.
- svc: Vali's port.
- jsonData.maxLines: The maximum number of log messages which Plutono will show to the users.
Decrease this value if the browser becomes slow!