[Fleet] Task to publish Agent metrics #168435

Merged on Oct 18, 2023 (27 commits)

juliaElastic (Contributor) commented Oct 10, 2023

Summary

Closes https://github.com/elastic/ingest-dev/issues/2396

Added a new Kibana task that publishes Agent metrics every minute to the data streams installed by the fleet_server package.

Opened the PR for review. There are a few things to finalize, but the core logic won't change much.

To test locally:

  • Install fleet_server package 1.4.0 from this PR to get the mappings.
  • Start Kibana locally and wait a few minutes for the metrics task to run (it runs every minute).
  • Go to Discover, select the metrics-* index pattern, and filter on data_stream.dataset: fleet_server.*
  • Expect data to be populated in the fleet_server.agent_status and fleet_server.agent_versions datasets.
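
To make the expected output of the task more concrete, here is a minimal sketch of how per-status and per-version counts could be turned into documents for the two datasets named above. The document fields (agent_status_counts, agent_version, agent_count), the AgentSummary shape, and the buildMetricDocs helper are assumptions made for illustration; the conversation does not show the real mappings or the task's implementation.

```typescript
// Hypothetical input shape: one entry per enrolled agent.
interface AgentSummary {
  status: string; // e.g. 'online', 'offline'
  version: string; // e.g. '8.11.0'
}

// Hypothetical output shape: only the data_stream fields mirror names from the PR.
interface MetricDoc {
  '@timestamp': string;
  data_stream: { type: 'metrics'; dataset: string; namespace: string };
  [field: string]: unknown;
}

// Count items by an arbitrary string key.
function countBy<T>(items: T[], key: (item: T) => string): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const item of items) {
    const k = key(item);
    counts[k] = (counts[k] ?? 0) + 1;
  }
  return counts;
}

// Build one status document plus one document per agent version.
function buildMetricDocs(agents: AgentSummary[], now: Date): MetricDoc[] {
  const timestamp = now.toISOString();
  const statusDoc: MetricDoc = {
    '@timestamp': timestamp,
    data_stream: { type: 'metrics', dataset: 'fleet_server.agent_status', namespace: 'default' },
    agent_status_counts: countBy(agents, (a) => a.status), // assumed field name
  };
  const versionCounts = countBy(agents, (a) => a.version);
  const versionDocs = Object.keys(versionCounts).map(
    (version): MetricDoc => ({
      '@timestamp': timestamp,
      data_stream: { type: 'metrics', dataset: 'fleet_server.agent_versions', namespace: 'default' },
      agent_version: version, // assumed field name
      agent_count: versionCounts[version], // assumed field name
    })
  );
  return [statusDoc, ...versionDocs];
}
```

A task that runs every minute could call buildMetricDocs once per run and index the result, so each dataset receives a fresh snapshot per interval.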

juliaElastic added the release_note:feature label (Makes this part of the condensed release notes) Oct 10, 2023
juliaElastic self-assigned this Oct 10, 2023
juliaElastic marked this pull request as ready for review October 11, 2023 12:41
juliaElastic requested review from a team as code owners October 11, 2023 12:41
kpollich self-requested a review October 11, 2023 12:43
juliaElastic (Contributor Author) commented Oct 11, 2023

Some FTR errors are occurring because kibana_system does not have permission to delete the new fleet_server data streams. I might have to add the delete privilege as well.


└- ✖ fail: Dev Tools Search Profiler Editor No indices "before all" hook for "returns error if profile is executed with no valid indices"
│      ResponseError: illegal_argument_exception
│      Root causes:
│        illegal_argument_exception: index [.ds-metrics-fleet_server.agent_status-default-2023.10.11-000001] is the write index for data stream [metrics-fleet_server.agent_status-default] and cannot be deleted


Raised a PR to add the delete privilege (and read, in case we want to add tests that read back the metrics): elastic/elasticsearch#100684

EDIT: Even after adding the delete privilege, it doesn't seem to work. I will have to debug why.

It seems I left out the delete_index privilege, which is needed because the FTR tests try to delete the index. Opened a PR to fix it and grant the all privilege: elastic/elasticsearch#100764
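
As background on the illegal_argument_exception above: Elasticsearch refuses to delete the current write index of a data stream directly, so cleanup code has to remove the data stream itself (or roll it over first). The sketch below is not what this PR did (the fix here was to adjust privileges); it only illustrates how a test cleanup helper could separate backing indices from regular indices by name. The helper names are hypothetical, and the pattern assumes the default backing index naming of .ds-<data-stream>-<yyyy.MM.dd>-<generation>.

```typescript
// Default backing index naming: .ds-<data-stream>-<yyyy.MM.dd>-<generation>
const BACKING_INDEX_PATTERN = /^\.ds-(.+)-\d{4}\.\d{2}\.\d{2}-\d+$/;

// Returns the owning data stream name, or undefined for a regular index.
function dataStreamForBackingIndex(indexName: string): string | undefined {
  const match = BACKING_INDEX_PATTERN.exec(indexName);
  return match ? match[1] : undefined;
}

// Hypothetical cleanup planner: data streams are deleted as a whole,
// regular indices are deleted directly.
function planCleanup(indexNames: string[]): { dataStreams: string[]; indices: string[] } {
  const dataStreams: string[] = [];
  const indices: string[] = [];
  for (const name of indexNames) {
    const dataStream = dataStreamForBackingIndex(name);
    if (dataStream === undefined) {
      indices.push(name);
    } else if (dataStreams.indexOf(dataStream) === -1) {
      dataStreams.push(dataStream);
    }
  }
  return { dataStreams, indices };
}
```

For the index in the error message, dataStreamForBackingIndex returns metrics-fleet_server.agent_status-default, which matches the data stream name reported by Elasticsearch.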

botelastic (bot) added the Team:Fleet label (Team label for Observability Data Collection Fleet team) Oct 11, 2023
elasticmachine (Contributor)

Pinging @elastic/fleet (Team:Fleet)

kpollich (Member)

> Let me know if this is good enough for now until we have more general support for "asset only" packages.

This looks good to me for now. Thanks for wiring that up!

kpollich (Member) left a comment

Looks great to me - thanks for addressing previous review comments 🚀

  }));
} catch (error) {
  if (error.statusCode === 404) {
    appContextService.getLogger().debug('Index .fleet-agents does not exist yet.');
Member

Is this worth logging at another level so we can see it in serverless dashboards, etc? Not sure how common this is or if it's helpful in production debugging to know when we're swallowing these errors.

Contributor Author

I want to avoid logging too much, as this task runs every minute. I think we can leave it at debug for now and change it later if needed. We would probably notice anyway if there were no agents.
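
The pattern discussed in this thread (treat a 404 as "no data yet", log it at debug level, and let every other error propagate) can be sketched as a small helper. This is an illustration under assumptions, not the PR's code: DebugLogger, isNotFound, and fallbackOnMissingIndex are hypothetical names, and only the statusCode check and the log message come from the quoted snippet.

```typescript
// Minimal logger contract for the sketch (hypothetical, not Kibana's Logger type).
interface DebugLogger {
  debug(message: string): void;
}

// True when the error carries an HTTP 404 status code.
function isNotFound(error: unknown): boolean {
  return (
    typeof error === 'object' &&
    error !== null &&
    (error as { statusCode?: unknown }).statusCode === 404
  );
}

// Call from a catch block: returns the fallback for a missing index,
// rethrows anything else so real failures are not swallowed.
function fallbackOnMissingIndex<T>(error: unknown, fallback: T, logger: DebugLogger): T {
  if (isNotFound(error)) {
    logger.debug('Index .fleet-agents does not exist yet.');
    return fallback;
  }
  throw error;
}
```

Keeping the level at debug means a task that runs every minute does not add a log line per run on clusters without agents, which is the trade-off the author describes.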

nchaulet (Member) commented Oct 12, 2023

@juliaElastic I think even if we allow installing fleet_server, we should probably not create an agent policy containing Fleet Server in serverless. Edit: actually, I'm not sure it will really cause a problem, as it's not possible to add Fleet Server hosts.

nchaulet (Member) left a comment

LGTM 🚀

jloleysens (Contributor) left a comment

Config changes LGTM

juliaElastic (Contributor Author)

@elastic/response-ops Hey team, sorry for the direct ping. Could you review this PR?

juliaElastic requested a review from a team as a code owner October 17, 2023 12:41
juliaElastic requested a review from kobelb October 18, 2023 08:53
Comment on lines 140 to 141
// disable fleet task that writes to metrics.fleet_server.* data streams, impacting functional tests
`--xpack.task_manager.unsafe.exclude_task_types=${JSON.stringify(['Fleet-Metrics-Task'])}`,
Member

Can you please explain the impact on functional tests? What happens without this setting?

Asking because we don't want to add this kind of configuration to serverless tests outside the feature flag testing. The reason is that when we create a real serverless project in MKI, this setting would not be applied, but the tests would still need to pass there (the config files only control the local setup).
So if serverless tests are failing without this setting, they would most probably still fail when run as part of our release gates on an MKI project, and we would need a different solution here.
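
For readers unfamiliar with the quoted config line, this small sketch shows what the template string expands to. The excludeTaskTypesArg helper is a hypothetical wrapper introduced only for illustration; the flag name and task type are taken from the quoted lines.

```typescript
// Hypothetical wrapper around the template string quoted in the review.
function excludeTaskTypesArg(taskTypes: string[]): string {
  // JSON.stringify turns the array into a JSON list that the server parses back.
  return `--xpack.task_manager.unsafe.exclude_task_types=${JSON.stringify(taskTypes)}`;
}

console.log(excludeTaskTypesArg(['Fleet-Metrics-Task']));
// --xpack.task_manager.unsafe.exclude_task_types=["Fleet-Metrics-Task"]
```

As the reviewer points out, an argument like this only affects the locally started test server, so it cannot hide a failure that would also occur on a real MKI project.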

Contributor Author

It seems that the same test still fails even with this task disabled, so the task is not the root cause of the issue. I can revert the config change.

It seems that my changes are not related to the test failure. Should I skip the test to pass the build? It already has an open issue: #166592

Contributor Author

Reverted the config change and skipped the failing test.

Member

It's a different failure this time than the one reported in #166592.
Also, I don't see this test failing on the main branch recently, so there's a good chance the failure is really related to the changes in this PR.

Contributor Author

However, I found this tag, which says the test fails on MKI:

describe('Cases persistable attachments', function () {
  // security_exception: action [indices:data/write/delete/byquery] is unauthorized for user [elastic] with effective roles [superuser] on restricted indices [.kibana_alerting_cases], this action is granted by the index privileges [delete,write,all]
  this.tags(['failsOnMKI']);

cnasikas (Member) commented Oct 18, 2023

@pheyos The failures on this PR seem unrelated to the issue. FWIW, I am working on this PR #168924 to fix some issues around the Cases Serverless tests.

Contributor Author

Also found the same (?) test skipped under security:

// Failing
// Issue: https://github.com/elastic/kibana/issues/165135
describe.skip('Cases persistable attachments', () => {

Member

Yes, this is skipped for different reasons. I am not in favor of skipping tests outside of the standard process but given that I am working on them I am okay with this being skipped.

Member

Alright, in that case we can move on here.

juliaElastic requested a review from pheyos October 18, 2023 09:23
pheyos (Member) left a comment

Test config changes LGTM

kibana-ci (Collaborator)

💛 Build succeeded, but was flaky

Failed CI Steps

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @juliaElastic

cnasikas (Member) left a comment

As we discussed, skipping the test is fine! Let's wait for @elastic/response-ops-execution to take a look at the PR.

cnasikas self-requested a review October 18, 2023 11:15
mikecote (Contributor) left a comment

New task LGTM from @elastic/response-ops-execution's perspective 👍

juliaElastic merged commit 0350f17 into elastic:main Oct 18, 2023
kibanamachine added the v8.12.0 and backport:skip (This commit does not require backporting) labels Oct 18, 2023