[Heartbeat] Enable Heartbeat to run when elastic-agent container is executed as root #30869

Merged
merged 4 commits into from
Apr 25, 2022

Conversation

emilioalvap
Collaborator

@emilioalvap emilioalvap commented Mar 16, 2022

Since this is a bug fix, I'm including the elastic-agent changes here so they can be backported to 8.2, and I will create a separate PR in the elastic-agent repo for 8.3.

What does this PR do?

This PR:

  • Includes group write permission in runtime directories,
  • adapts the umask in Docker containers,
  • restricts heartbeat setuid to containerized instances.

This PR works in tandem with elastic/elastic-agent#202.

Why is it important?

Heartbeat comes with setuid functionality so that it can run npm as a non-root user during synthetics checks inside an elastic-agent container. Without it, npm complains when executed as root.

When elastic-agent is executed as root, all beats (filebeat, metricbeat, ...) run as root, while heartbeat runs as the user specified in the BEAT_SETUID_AS env variable (elastic-agent by default). This user needs permission to write to local directories, which we grant by making the user a member of the root group. However, some of the directories the user needs access to are created at runtime without group write permission, so heartbeat won't be able to start and the agent will eventually report as degraded:

2022-02-02T22:14:03.608Z        INFO    log/reporter.go:40      2022-02-02T22:14:03Z - message: Application: heartbeat--7.16.3[34930eab-508d-45c6-b839-5787ec1ca7eb]: State changed to DEGRADED: Missed last check-in - type: 'STATE' - sub_type: 'RUNNING'
2022-02-02T22:14:03.608Z        WARN    status/reporter.go:236  Elastic Agent status changed to: 'degraded'

These are the directories:

$ ls -al state/data/{logs,tmp,run} | grep default
drwxr-xr-x 2 root root   4096 Mar 10 18:00 default
drwxr-x--- 5 root root 4096 Mar 10 13:29 default
drwxr-xr-x 3 root root 4096 Mar 10 13:28 default
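A rough sketch (not code from this PR) of why the listing above lacks group write: a new directory gets the requested mode with every umask bit cleared. Assuming a umask of 0022 (a common container default; an assumption, not something stated in this PR), the 0750 and 0775 modes that appear elsewhere in this thread both come out without the group write bit. `effectiveMode` and `groupWritable` are hypothetical helpers:

```go
package main

import (
	"fmt"
	"io/fs"
)

// effectiveMode returns the permission bits a new directory ends up with:
// every bit that is set in the umask is cleared from the requested mode.
func effectiveMode(requested, umask fs.FileMode) fs.FileMode {
	return (requested &^ umask) & fs.ModePerm
}

// groupWritable reports whether the group write bit (0020) is set.
func groupWritable(mode fs.FileMode) bool {
	return mode&0o020 != 0
}

func main() {
	const umask fs.FileMode = 0o022 // assumed umask, for illustration only
	for _, requested := range []fs.FileMode{0o750, 0o775} {
		mode := effectiveMode(requested, umask)
		fmt.Printf("requested %04o, umask %04o -> %04o (group writable: %t)\n",
			uint32(requested), uint32(umask), uint32(mode), groupWritable(mode))
	}
}
```

Under a umask of 0007 instead, 0775 would come out as 0770, while 0750 would still lack group write, which is presumably why the data path mode also changes to 0770.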

Most of our own documented examples for running heartbeat and elastic-agent in k8s are configured to run as root; see here and here for reference.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

  • Build beats and elastic-agent locally:
x-pack/heartbeat $  env PLATFORMS="+all linux/amd64" mage package
...
x-pack/filebeat $  env PLATFORMS="+all linux/amd64" mage package
...
x-pack/metricbeat $  env PLATFORMS="+all linux/amd64" mage package
...
x-pack/elastic-agent $  env PLATFORMS="+all linux/amd64" mage dev:package
...
  • Run the container locally as root:
docker run -u root --rm --name agent --env FLEET_ENROLL=1 --env FLEET_URL=<fleet url> --env FLEET_ENROLLMENT_TOKEN=<token> -it docker.elastic.co/beats/elastic-agent:8.2.0
  • Attach to the container separately and verify that required directories have group write permissions:
$ docker exec -u root -it agent /bin/bash
root@f9a201aef696:/usr/share/elastic-agent# ls -al state/data/{tmp,logs,run} | grep default
drwxrwx--- 2 root root 4096 Mar 16 15:02 default
drwxrwx--- 6 root root 4096 Mar 16 15:02 default
drwxrwx--- 5 root root 4096 Mar 16 15:02 default

Related issues

Screenshots

@emilioalvap emilioalvap added bug Team:obs-ds-hosted-services Label for the Observability Hosted Services team release-note:fix The content should be included as a fix backport-v8.1.0 Automated backport with mergify labels Mar 16, 2022
@emilioalvap emilioalvap requested review from a team as code owners March 16, 2022 16:19
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Mar 16, 2022
@mergify
Contributor

mergify bot commented Mar 16, 2022

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b heartbeat-30171 upstream/heartbeat-30171
git merge upstream/main
git push upstream heartbeat-30171

@elasticmachine
Collaborator

Pinging @elastic/uptime (Team:Uptime)

@cmacknz cmacknz requested a review from a team March 16, 2022 17:50
@@ -85,7 +85,7 @@ func (paths *Path) InitPaths(cfg *Path) error {
}

// make sure the data path exists
-err = os.MkdirAll(paths.Data, 0750)
+err = os.MkdirAll(paths.Data, 0770)
Member

@cmacknz cmacknz Mar 16, 2022

I am not familiar with the directory/paths/permissions code yet, but my only thought is whether it is possible to be more conservative with the permission changes, since this only affects heartbeat (as far as I know).

Should we make this change (and the umask change in the Dockerfile above) apply only to heartbeat, instead of every beat as it does now?

Contributor

Yes, I would also prefer a more localized change for heartbeat.

Collaborator Author

To clarify, the issue fixed here applies to heartbeat running inside an elastic-agent container.

Path initialization is performed by the first beat that starts inside elastic-agent. AFAIK, there's no way to specify an order. In that case, the alternative would be to move path initialization into elastic-agent, before the beats are spawned.

As for the entrypoint, I've followed the guideline here: #29708. As it stands, all umask operations are performed outside libbeat code. This lets us restrict the change to Docker containers without affecting other types of installs (systemd and others). Those install methods still have the umask specified in the service configuration template (0027) and will continue to create these paths with the same permission level as today (0750).

It's also worth noting that some of the directories elastic-agent creates today are requested with a more permissive mode than the default umask allows. Here are a few examples:

$ grep mkdirat /tmp/agent.* | grep "state/data" | grep default\"
/tmp/agent.379:mkdirat(AT_FDCWD, "/usr/share/elastic-agent/state/data/logs/default", 0775) = 0
/tmp/agent.379:mkdirat(AT_FDCWD, "/usr/share/elastic-agent/state/data/tmp/default", 0775) = 0

If this permission level can be considered a security concern, I'm guessing we shouldn't be relying on the environment's umask to filter it out.

Member

Thanks for the explanation! I don't have major concerns about this change.

@rdner thoughts on this? You were the last person to touch anything to do with umask.

Member

I can confirm that this particular line should not affect other kinds of distributions, since they all set their umask in their respective scripts, which would override the permissions requested for this directory.

However, I have some concerns that we now have a different umask for all beats running under elastic-agent because of the change in the Docker template. If that's okay for everyone I'm okay with that too.

The whole umask story started with this security issue, #14005, which was primarily about access by all users; I guess giving more access to the group is fine.

@ph ph requested a review from blakerouse March 16, 2022 17:55
@elasticmachine
Collaborator

elasticmachine commented Mar 16, 2022

💚 Build Succeeded

Build stats

  • Start Time: 2022-04-25T15:40:12.945+0000

  • Duration: 46 min 27 sec

Test stats 🧪

Test Results
Failed 0
Passed 3978
Skipped 915
Total 4893

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@cmacknz cmacknz added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Mar 21, 2022
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@cmacknz
Member

cmacknz commented Mar 21, 2022

LGTM. Someone from the control plane team (@blakerouse?) should approve this, though, as I think it affects them the most.

@andrewvc
Contributor

@Mergifyio update

@mergify
Contributor

mergify bot commented Mar 21, 2022

update

✅ Branch has been successfully updated

Contributor

@andrewvc andrewvc left a comment

LGTM, tested locally. I think we have some unrelated E2E test failures, however.

if localUserName := os.Getenv("BEAT_SETUID_AS"); localUserName != "" && syscall.Geteuid() == 0 {
sysInfo, err := sysinfo.Host()
isContainer := false
if err == nil && sysInfo.Info().Containerized != nil {
Contributor

I'm just thinking about the distinction between this running in a container and running in our container. I think this check is fine, given that we also check BEAT_SETUID_AS, which most likely wouldn't be set in a custom container.

Collaborator Author

I concur; this check does not completely eliminate the risk of misusing this feature.
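The condition in the reviewed snippet can be isolated into a small predicate that is easy to unit test. This is a hypothetical sketch mirroring only the visible lines: `shouldSetuid` is not a function in the codebase, and it assumes that a nil `Containerized` value (or an error from the host lookup, which the snippet also treats as not containerized) means heartbeat should not setuid.

```go
package main

import (
	"fmt"
	"os"
)

// shouldSetuid reports whether heartbeat should switch to the user named in
// BEAT_SETUID_AS. It requires a non-empty user name, an effective UID of 0
// (running as root) and a host that is positively known to be containerized.
// A nil containerized value means "unknown" and is treated as false.
func shouldSetuid(setuidAs string, euid int, containerized *bool) bool {
	if setuidAs == "" || euid != 0 {
		return false
	}
	return containerized != nil && *containerized
}

func main() {
	// Hypothetical input: pretend the host lookup reported a container.
	inContainer := true
	fmt.Println(shouldSetuid(os.Getenv("BEAT_SETUID_AS"), os.Geteuid(), &inContainer))
}
```

Passing the values in as parameters, rather than reading the environment and host info inside, is what makes the predicate testable; as noted in the thread, it still cannot tell our container apart from any other container that happens to set BEAT_SETUID_AS.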

@andrewvc
Contributor

@Mergifyio update

@mergify
Contributor

mergify bot commented Mar 23, 2022

update

✅ Branch has been successfully updated

Contributor

@ph ph left a comment

Looking at the change, I am OK with it. @blakerouse, any concerns on your side?

@andrewvc
Contributor

@emilioalvap looks like the new linter we installed is now complaining about some stuff:

... res (e.g. ..." (godox)

[2022-03-23T17:08:49.991Z] // TODO: Support other architectures (e.g. arm)

[2022-03-23T17:08:49.991Z] heartbeat/security/security.go:61:17: Error return value is not checked (errcheck)

[2022-03-23T17:08:49.991Z] 	setCapabilities()

[2022-03-23T17:08:49.991Z] 	               ^

[2022-03-23T17:08:49.991Z] heartbeat/security/security.go:77:2: ST1003: var localUserUid should be localUserUID (stylecheck)

[2022-03-23T17:08:49.991Z] 	localUserUid, err := strconv.Atoi(localUser.Uid)

[2022-03-23T17:08:49.991Z] 	^

[2022-03-23T17:08:49.991Z] heartbeat/security/security.go:81:2: ST1003: var localUserGid should be localUserGID (stylecheck)

[2022-03-23T17:08:49.991Z] 	localUserGid, err := strconv.Atoi(localUser.Gid)

[2022-03-23T17:08:49.991Z] 	^

[2022-03-23T17:08:49.991Z] heartbeat/security/security.go:106:44: `hygeine` is a misspelling of `hygiene` (misspell)

[2022-03-23T17:08:49.991Z] 	// This may not be necessary, but is good hygeine, we do some shelling out to node/npm etc.

[2022-03-23T17:08:49.991Z] 	                                          ^

[2022-03-23T17:08:49.991Z] libbeat/paths/paths.go:90:70: non-wrapping format verb for fmt.Errorf. Use `%w` to format errors (errorlint)

[2022-03-23T17:08:49.991Z] 		return fmt.Errorf("Failed to create data path %s: %v", paths.Data, err)

[2022-03-23T17:08:49.991Z] 		                                                                   ^

@mergify
Contributor

mergify bot commented Mar 28, 2022

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b heartbeat-30171 upstream/heartbeat-30171
git merge upstream/main
git push upstream heartbeat-30171

@mergify
Contributor

mergify bot commented Apr 6, 2022

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b heartbeat-30171 upstream/heartbeat-30171
git merge upstream/main
git push upstream heartbeat-30171

@emilioalvap emilioalvap removed the backport-v8.1.0 Automated backport with mergify label Apr 12, 2022
@emilioalvap
Collaborator Author

/package

@emilioalvap
Collaborator Author

/test

@emilioalvap
Collaborator Author

@Mergifyio update

@mergify
Contributor

mergify bot commented Apr 13, 2022

update

✅ Branch has been successfully updated

@mergify
Contributor

mergify bot commented Apr 19, 2022

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b heartbeat-30171 upstream/heartbeat-30171
git merge upstream/main
git push upstream heartbeat-30171

@emilioalvap
Collaborator Author

/test

@emilioalvap
Collaborator Author

@Mergifyio update

@mergify
Contributor

mergify bot commented Apr 25, 2022

update

☑️ Nothing to do

  • -closed [:pushpin: update requirement]
  • #commits-behind>0 [:pushpin: update requirement]

@emilioalvap
Collaborator Author

/package

@emilioalvap
Collaborator Author

/test

@emilioalvap emilioalvap merged commit 8906734 into elastic:main Apr 25, 2022
mergify bot pushed a commit that referenced this pull request Apr 25, 2022
…xecuted as root (#30869)

* Include group write permission in runtime directories, adapt umask on docker containers and restrict heartbeat setuid to containerized instances

* Add CHANGELOG entries

* Fix linter issues

(cherry picked from commit 8906734)
emilioalvap added a commit that referenced this pull request Apr 25, 2022
…xecuted as root (#30869) (#31406)

* Include group write permission in runtime directories, adapt umask on docker containers and restrict heartbeat setuid to containerized instances

* Add CHANGELOG entries

* Fix linter issues

(cherry picked from commit 8906734)

Co-authored-by: Emilio Alvarez Piñeiro <[email protected]>
kush-elastic pushed a commit to kush-elastic/beats that referenced this pull request May 2, 2022
…xecuted as root (elastic#30869)

* Include group write permission in runtime directories, adapt umask on docker containers and restrict heartbeat setuid to containerized instances

* Add CHANGELOG entries

* Fix linter issues
chrisberkhout pushed a commit that referenced this pull request Jun 1, 2023
…xecuted as root (#30869)

* Include group write permission in runtime directories, adapt umask on docker containers and restrict heartbeat setuid to containerized instances

* Add CHANGELOG entries

* Fix linter issues
Labels
backport-v8.2.0 Automated backport with mergify bug release-note:fix The content should be included as a fix Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team Team:obs-ds-hosted-services Label for the Observability Hosted Services team
7 participants