Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Elastic Agent] Fix invalid log level sent to endpoint #25854

Merged
merged 6 commits into from
May 28, 2021

Conversation

michalpristas
Copy link
Contributor

@michalpristas michalpristas commented May 25, 2021

What does this PR do?

The issue is described here #25583 and happens only on low spec environments.
Difficult to reproduce locally. QA is usually successful with this.

I believe there's a race in read-write files in between enroll, run where FS is a bit slower than our process.
Similar issue was fixed here: #24504
Solving this by syncing config files after each write so changes are present on Filesystem immediately.

Why is it important?

Without this agent reports empty log level as desired to endpoint

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

@michalpristas michalpristas added bug needs_backport PR is waiting to be backported to other branches. v7.13.0 Team:Elastic-Agent Label for the Agent team backport-v7.13.0 Automated backport with mergify labels May 25, 2021
@michalpristas michalpristas self-assigned this May 25, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/agent (Team:Agent)

@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels May 25, 2021
@elasticmachine
Copy link
Collaborator

elasticmachine commented May 25, 2021

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

  • Build Cause: Pull request #25854 updated

  • Start Time: 2021-05-26T09:56:55.822+0000

  • Duration: 88 min 9 sec

  • Commit: cdc79e6

Test stats 🧪

Test Results
Failed 0
Passed 6908
Skipped 16
Total 6924

Trends 🧪

Image of Build Times

Image of Tests

💚 Flaky test report

Tests succeeded.

Expand to view the summary

Test stats 🧪

Test Results
Failed 0
Passed 6908
Skipped 16
Total 6924

return nil
}

f, err := os.OpenFile(s.syncPath, os.O_RDWR, 0777)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to be more strict add O_DIRECT.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i like that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like not available everywhere so i'll skip direct

@urso
Copy link

urso commented May 25, 2021

It looks like the attempted fix is to add a fsync at the end of writing the file. In general we should always use fsync for files that might be persisted between restart, but I'm not sure this will fix a fundamental race condition... If a file is not synced yet, it would still be present in the systems caches, which would allow other processes to read the contents. Have you managed to validate that extra fsync fix potential races in the windows 7 filesystem?

func (d *DiskStore) Save(in io.Reader) error {
tmpFile := d.target + ".tmp"

fd, err := os.OpenFile(tmpFile, os.O_CREATE|os.O_TRUNC|os.O_WRONLY, perms)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want to open your file with O_DIRECT.

errors.M(errors.MetaKeyPath, r.target))
}

fd, err := os.OpenFile(r.target, os.O_CREATE|os.O_WRONLY, perms)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when do you close this fd ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you have a bright eye, i found one missing i failed copy as well

// low spec windows environment where agent is faster
// than filesystem and what we read is different(stale)
// than what we just wrote.
type SyncOnSaveStore struct {
Copy link

@urso urso May 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure we really need this type/functionality. The DiskStore did not fsync the file before closing it. Important files that are supposed to be valid even in case of crashes should always be synced before closing.

Not requiring additonal Wrappers/Decorators like this in order to function correctly, also reduces the chance of errors in the future (the less you need to know for using a type/interface correctly, the better).

@ruflin
Copy link
Contributor

ruflin commented May 26, 2021

@michalpristas Can you update the PR description on how your PR solves the issue?

@urso
Copy link

urso commented May 28, 2021

Also the PR/issue states that the log level was not passed correctly, with all the new fsyncs it looks like there might have been other problems that are fixed by that change (due to broken/invalid configuration file being passed). If so, please update the changelog entry and the PR description.

@michalpristas michalpristas merged commit a1c768e into elastic:master May 28, 2021
michalpristas added a commit to michalpristas/beats that referenced this pull request May 28, 2021
[Elastic Agent] Fix invalid log level sent to endpoint (elastic#25854)
mergify bot pushed a commit that referenced this pull request May 28, 2021
[Elastic Agent] Fix invalid log level sent to endpoint (#25854)

(cherry picked from commit a1c768e)
michalpristas added a commit to michalpristas/beats that referenced this pull request May 28, 2021
[Elastic Agent] Fix invalid log level sent to endpoint (elastic#25854)
michalpristas added a commit that referenced this pull request May 31, 2021
[Elastic Agent] Fix invalid log level sent to endpoint (#25854)
michalpristas added a commit that referenced this pull request May 31, 2021
…5993)

Cherry-pick #25854 to 7.x: Fix invalid log level sent to endpoint  (#25993)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-v7.13.0 Automated backport with mergify bug needs_backport PR is waiting to be backported to other branches. Team:Elastic-Agent Label for the Agent team v7.13.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants