Elastic Agent writes data into its own container #4260

Closed
pebrc opened this issue Feb 19, 2021 · 8 comments

@pebrc
Collaborator

pebrc commented Feb 19, 2021

We mount /usr/lib/$K8S_NS/$NAME/agent-data to /usr/share/data, but Elastic Agent uses /usr/share/elastic-agent/data.

DataMountPath = "/usr/share/data"

This means that Elastic Agent is writing data into its own container. That includes all binaries it installs as part of any configured packages (Metricbeat and Filebeat, for example). It also means that it will lose its identity and runtime state on container restarts.

The intention behind using the hostPath volume was to create a persistent store for Agent identity and runtime state. We are doing the same thing for Beats.

@pebrc pebrc added >bug Something isn't working v1.4.1 labels Feb 19, 2021
@pebrc pebrc self-assigned this Feb 19, 2021
@pebrc
Collaborator Author

pebrc commented Feb 19, 2021

This turns out to be non-trivial to fix. hostPath volumes are mounted noexec, but Elastic Agent installs and executes the managed programs (typically Filebeat and Metricbeat) from its data directory.

Persistent Volumes (I tried the GKE ones) are mounted without the noexec flag, but we don't have a good way of using them from a DaemonSet. Also, I am not sure if we can rely on the mount flags being the same across all persistent volume plugins.

cc @david-kow
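
For anyone who wants to check the mount flags on their own nodes or volume plugins, here is a minimal sketch (not part of ECK or Elastic Agent). It assumes Linux, where Python exposes the glibc `ST_NOEXEC` flag, and the paths in the demo loop are just the ones mentioned in this thread.

```python
import os


def is_noexec(path: str) -> bool:
    """Return True if the filesystem containing `path` is mounted noexec.

    Relies on os.ST_NOEXEC, a glibc extension that Python exposes on Linux.
    """
    flags = os.statvfs(path).f_flag
    return bool(flags & os.ST_NOEXEC)


if __name__ == "__main__":
    # Paths quoted in this thread; adjust them to the mounts you care about.
    for candidate in ("/usr/share/elastic-agent/data", "/usr/share/data"):
        if os.path.exists(candidate):
            state = "noexec" if is_noexec(candidate) else "exec allowed"
            print(candidate, state)
```

Running this inside the Agent container against the hostPath mount and against a Persistent Volume mount would show whether a given volume plugin allows execution.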

@pebrc
Collaborator Author

pebrc commented Feb 22, 2021

The filesystem structure is:
/usr/share/elastic-agent => home (--path.home)
/usr/share/elastic-agent/data => data directory, derived from home
/usr/share/elastic-agent/data/elastic-agent-${hash}/run => runtime state, derived from home
/usr/share/elastic-agent/data/elastic-agent-${hash}/install => install location for the programs to execute (agent.download.install_path)
/usr/share/elastic-agent/data/elastic-agent-${hash}/download => download location for packages to install (agent.download.target_directory)

The problem is that the install path by default is a subdirectory of the data directory which we want to have on the host via hostPath. But all we need to persist Agent identity across restarts is the runtime state in run.

@david-kow suggested overriding the latter two download settings so that Elastic Agent installs into a directory inside the container, where we can execute, while the data directory stays on the hostPath volume. Unfortunately, Agent still creates a tmp folder below the data directory and then tries to move it into the install directory via rename. The install directory is now on a different filesystem, so the rename fails:

2021-02-22T15:00:30.935Z    ERROR    log/reporter.go:36    2021-02-22T15:00:30Z: type: 'ERROR': sub_type: 'FAILED' message: Application: metricbeat--7.11.0[1ecf5613-accd-4c8c-930e-e432c726a43d]: State changed to FAILED: rename /usr/share/elastic-agent/var/lib/data/tmp/elastic-agent-install004569386/metricbeat-7.11.0-linux-x86_64 /usr/share/elastic-agent/data/install/metricbeat-7.11.0-linux-x86_64: invalid cross-device link              
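
To illustrate the error: rename cannot move a file between filesystems and fails with EXDEV ("invalid cross-device link"). The sketch below shows the generic copy-then-delete fallback for that case. It is not how Elastic Agent is implemented, `move_across_devices` is a hypothetical helper name, and the `rename` parameter exists only so the cross-device failure can be simulated without two real filesystems.

```python
import errno
import os
import shutil


def move_across_devices(src: str, dst: str, rename=os.rename) -> str:
    """Move src to dst, falling back to copy + delete on EXDEV.

    Returns "renamed" or "copied" to show which path was taken.
    """
    try:
        rename(src, dst)
        return "renamed"
    except OSError as err:
        # Only the cross-device case is handled; re-raise everything else.
        if err.errno != errno.EXDEV:
            raise
    # A rename cannot cross filesystems, so copy the data and remove the source.
    if os.path.isdir(src):
        shutil.copytree(src, dst, symlinks=True)
        shutil.rmtree(src)
    else:
        shutil.copy2(src, dst)
        os.remove(src)
    return "copied"
```

Unlike a rename, the fallback is not atomic, which is one reason a tool may prefer to keep its temporary directory on the same filesystem as the final install location.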

The only approach I got working was to bind mount on top of the run directory, which unfortunately requires the operator to know the git hash from which Elastic Agent was built, because it is part of the directory name:
/usr/share/elastic-agent/data/elastic-agent-84c4d4/run
This has the massive disadvantage that we need to figure out a way to find the hash of the Elastic Agent build ahead of deploy time.

@pebrc
Collaborator Author

pebrc commented Feb 23, 2021

I raised an issue against the Beats/Elastic Agent repository. A fix in Elastic Agent could take a few releases.

Until then, we can do two things:

  1. Document the limitation and wait for new Elastic Agent releases
  2. Implement a workaround. My only idea for that so far is as follows:

We bind mount the hostPath volume to /usr/share/elastic-agent/data/elastic-agent-${hash}/run.
We figure out the hash by running a k8s Job/Pod to inspect the Elastic Agent Docker container:

apiVersion: batch/v1
kind: Job
metadata:
  name: agent-inspect
spec:
  template:
    spec:
      containers:
      - name: agent
        image: docker.elastic.co/beats/elastic-agent:7.11.1
        command: ["ls",  "/usr/share/elastic-agent/data"]
      restartPolicy: Never

We keep that information around in memory or even in a ConfigMap keyed by Elastic Agent version and use it to construct the volume mount.

It's not great but should do the trick. Open to alternative suggestions.
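
A rough sketch of the lookup step, assuming the Job's log output is the plain `ls` listing of the data directory and that the directory follows the `elastic-agent-<hash>` naming seen in the paths above. The function name and the hex-only hash pattern are assumptions for illustration, not operator code.

```python
import re
from typing import Optional

# Assumed naming scheme, taken from the paths quoted in this thread:
# <data dir>/elastic-agent-<hash>/run. The hex-only pattern is an assumption.
_AGENT_DIR = re.compile(r"^elastic-agent-([0-9a-f]+)$")
DATA_DIR = "/usr/share/elastic-agent/data"


def run_dir_from_listing(listing: str) -> Optional[str]:
    """Return the run directory for the first elastic-agent-<hash> entry.

    `listing` is the whitespace-separated output of `ls` on the data directory.
    Returns None when no matching entry is present.
    """
    for entry in listing.split():
        if _AGENT_DIR.match(entry):
            return f"{DATA_DIR}/{entry}/run"
    return None


if __name__ == "__main__":
    # Example listing using the hash mentioned earlier in this thread.
    print(run_dir_from_listing("elastic-agent-84c4d4\n"))
```

The resulting path could then be stored per Elastic Agent version and used as the mount path for the hostPath volume.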

@idanmo
Collaborator

idanmo commented Feb 23, 2021

Can we somehow avoid storing the hash information somewhere by running a fancy one-liner script in an init container that would create a symlink from /usr/share/elastic-agent/data/elastic-agent-${hash}/run to somewhere in the mounted volume?

@pebrc
Collaborator Author

pebrc commented Feb 24, 2021

Can we somehow avoid storing the hash information somewhere by running a fancy one-liner script in an init container that would create a symlink from /usr/share/elastic-agent/data/elastic-agent-${hash}/run to somewhere in the mounted volume?

Cross device/filesystem symlinks are not possible afaik

@pebrc
Collaborator Author

pebrc commented Feb 25, 2021

I prototyped a workaround here. There is another idea that uses a custom entrypoint to symlink the run directory to the hostPath volume. It has the advantage that it does not need another Pod to be spun up temporarily, but otherwise it shares the same drawbacks as the Pod-based approach:

  • potentially makes the operator incompatible with future versions of Agent
  • we don't have a good way of retracting the workaround when the issue is fixed in Agent
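
The entrypoint idea could look roughly like the sketch below: find the single `elastic-agent-*` directory at startup and replace its `run` directory with a symbolic link into the mounted volume, so the hash never has to be known ahead of time. A symbolic link only stores a target path, so it may point at a different filesystem. `link_run_dir` and its arguments are hypothetical names, and this is not the prototype referenced above.

```python
import glob
import os
import shutil


def link_run_dir(data_dir: str, volume_dir: str) -> str:
    """Replace data_dir/elastic-agent-*/run with a symlink into volume_dir.

    Returns the path of the symlink that was created.
    """
    pattern = os.path.join(data_dir, "elastic-agent-*")
    matches = [p for p in glob.glob(pattern) if os.path.isdir(p)]
    if len(matches) != 1:
        raise RuntimeError(
            f"expected one elastic-agent-* directory, found {len(matches)}"
        )
    run_dir = os.path.join(matches[0], "run")
    target = os.path.join(volume_dir, "run")
    os.makedirs(target, exist_ok=True)
    if os.path.islink(run_dir):
        os.unlink(run_dir)  # left over from a previous start
    elif os.path.isdir(run_dir):
        shutil.rmtree(run_dir)  # discards any state baked into the image
    os.symlink(target, run_dir)
    return run_dir
```

A custom entrypoint would call something like this before starting Elastic Agent. Whether Agent tolerates a symlinked run directory is an assumption that would need testing per version.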

Alternatively we could just document the limitation and explain how users can mount a hostPath volume themselves to the right path for the time being.

@david-kow
Contributor

Alternatively we could just document the limitation and explain how users can mount a hostPath volume themselves to the right path for the time being.

I think I'm leaning towards this. We can even keep this documentation updated with right paths for any future Elastic Agent versions released without the fix.

@pebrc
Collaborator Author

pebrc commented Apr 19, 2021

Closing this: we added documentation and adjusted the recipes. A fix is merged in Elastic Agent and will ship with 7.13. I have raised #4260 to integrate the Agent fix into ECK's Agent controller.
