Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Synthetics] Monitor History heatmap will not fill up if monitor has > 10k tests in given span #180076

Closed
justinkambic opened this issue Apr 4, 2024 · 5 comments · Fixed by #184177
Assignees
Labels
bug Fixes for quality problems that affect the customer experience Synthetics Team:obs-ux-management Observability Management User Experience Team

Comments

@justinkambic
Copy link
Contributor

Kibana version:

8.8.0

Elasticsearch version:

8.8.0

Server OS version:

Elastic Cloud

Browser version:

Chrome

Browser OS version:

macOS

Original install method (e.g. download page, yum, from source, etc.):

Elastic Cloud

Describe the bug:

As a user with a monitor containing > 20k checks in the past 7 days, I expect to see a heatmap indicating the success/fail ratio for all buckets in the heatmap viz. Instead, I only see it partially filling the heatmap because the query that builds the data for the chart caps at 10k documents.

Steps to reproduce:

  1. Create a synthetics monitor with a high frequency that will result in > 10k checks for a timespan
  2. Set the monitor's history page span to a wide enough range that over 10k docs would be logged
  3. View that the heatmap is not completely filled

Expected behavior:

The heatmap should fill

Screenshots (if relevant):

image

Errors in browser console (if relevant):

Provide logs and/or server output (if relevant):

Any additional context:

The route responsible for this is ping_statuses.

@justinkambic justinkambic added bug Fixes for quality problems that affect the customer experience Team:obs-ux-infra_services Observability Infrastructure & Services User Experience Team labels Apr 4, 2024
@elasticmachine
Copy link
Contributor

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

@yomduf
Copy link

yomduf commented Apr 11, 2024

Hello @justinkambic, is this issue addressed and a fix planned in a future version ?

@gjelenc
Copy link

gjelenc commented Apr 11, 2024

Same problem in Kibana 8.12.2....

@justinkambic
Copy link
Contributor Author

@gjelenc yes this needs to be fixed generally, it will be an issue even in the latest versions of the stack.

@justinkambic justinkambic self-assigned this May 23, 2024
@smith smith added Team:obs-ux-management Observability Management User Experience Team and removed Team:obs-ux-infra_services Observability Infrastructure & Services User Experience Team labels Jul 8, 2024
@elasticmachine
Copy link
Contributor

Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)

shahzad31 pushed a commit to shahzad31/kibana that referenced this issue Sep 6, 2024
…e doc counts (elastic#184177)

## Summary

Resolves elastic#180076.

### User-facing updates

#### We can now display the heatmap for "very large" document counts

##### This PR

_note: The first bucket in this image is empty because that's when my
monitor actually started._

<img width="2083" alt="image"
src="https://github.com/elastic/kibana/assets/18429259/72daf378-586e-4c7d-a1b0-a05bf95fb09a">

##### Previously

<img width="1420" alt="image"
src="https://github.com/elastic/kibana/assets/18429259/94f46913-debc-4bce-9794-600241fc4d6f">

#### We display appropriate bucketed data

This is a weird behavior I only noticed once I had made the new query.
For some reason, the chart didn't display all the data it should. This
may be related to the initial issue, but even in ranges less than 10k
docs I was seeing this behavior. I manually checked this by querying for
the bucket spans and ensuring the right number of docs were showing up,
which is what I'd expect, because it's highly unlikely there would be a
bug like this in the date histogram agg.

##### This PR

<img width="1420" alt="image"
src="https://github.com/elastic/kibana/assets/18429259/8620f709-dd4f-4a52-b39d-a3ff778ff7b4">

##### Previously

<img width="1420" alt="image"
src="https://github.com/elastic/kibana/assets/18429259/745fa36d-db35-46a4-be1d-330bed799884">

### Why are you doing this?

We have noticed in a few SDH that in cases where users are querying very
large document tallies, like a month's worth of data for a monitor that
runs every minute, they're not seeing their status heatmap fill up.

~This patch would introduce a multi-stage query mechanism to the pings
query endpoint. The PR is in draft state, so the procedure performs the
query in this manner all the time, but we may want to add a new param to
the endpoint that allows the client to opt into this behavior, or simply
cap at the default 10k documents like it does today.~

~I intend to do a good amount of benchmark testing on this fix locally
and in Cloud before I try to merge it. We may also need to explore
imposing a hard upper limit on the number of stages we are willing to
execute. Right now the query is unbounded and a user could try to query
datasets that would result in Kibana retrieving and trying to respond
with hundreds of MB of data.~

### What changed

This PR modifies the backend query to rely on a Date Histogram agg,
rather than querying the documents directly for this purpose. This agg
uses the same interval that the client calculates based on the width of
the page. I have kept the existing time bin logic on the client side,
and modified the function that builds the data structure to pour the
buckets from the histogram into the bins that we serve the chart that
the user sees.

One drawback is because the buckets are computed by Elasticsearch, we
need to make API calls to retrieve more data, whereas on resizes
previously, all the data was already present and would get re-bucketed
on the client. I think this is acceptable because of the ability to
handle larger datasets and given that the query should be very fast
still.

Additionally:

- I have made additional modifications to the frontend logic to prevent
unnecessary API calls. Before, the component would always make at least
two calls on an initial load. I have also switched from `lodash`
`throttle` to `useDebounce` from `react-use`, as this seems to more
properly drop intermediate resize events where we would want to skip
hitting the API.
- I have also changed the render logic. We used to use a default value
to build out an initial set of time bins. Unfortunately, with all these
changes in place, this results in a weird behavior where we show an
empty scale while waiting for data, and the bins then snap to the proper
size once the element's width has been determined and the bins get
scaled out properly. Instead of this, I skip rendering the chart until
the initial width is available via a `ref` applied to the wrapper
element. At that point, we send the init width to the hook and use that
value for our initial query and bin calculations, so we only ever show
bins on the charts that will correspond to the bucketed data we
eventually display.
- I added a progress indicator to the wrapper element so the user
receives an indication that the data is being refreshed.
- I added a quiet fetch similar to how the overview page works, so that
after an initial render the chart will not have its data dropped until
we have new data to replace it with. Along those lines, the hook now
watches the page location and if it changes (i.e. to from History to
Overview), state management will destroy the heatmap data so we don't
have a flicker of other data during a new fetch.

##  Testing this PR

Unfortunately, to truly test this PR you'd need a monitor with over 10k
documents, as that's the criteria specified in the initial issue. I did
this by running a monitor over about 10 days on a 1-run-per-minute
schedule. You could create a cloud deployment from this PR, or create
dummy documents in some way (this can be hard to verify though). I am
pretty confident this works and fixes the issue based on my real-world
testing.

---------

Co-authored-by: shahzad31 <[email protected]>
(cherry picked from commit 50f8bd6)

# Conflicts:
#	x-pack/plugins/observability_solution/synthetics/common/constants/synthetics/rest_api.ts
#	x-pack/plugins/observability_solution/synthetics/public/apps/synthetics/state/root_effect.ts
#	x-pack/plugins/observability_solution/synthetics/server/routes/index.ts
#	x-pack/plugins/observability_solution/synthetics/server/routes/pings/get_ping_statuses.ts
crespocarlos pushed a commit to crespocarlos/kibana that referenced this issue Sep 11, 2024
…e doc counts (elastic#184177)

## Summary

Resolves elastic#180076.

### User-facing updates

#### We can now display the heatmap for "very large" document counts

##### This PR

_note: The first bucket in this image is empty because that's when my
monitor actually started._

<img width="2083" alt="image"
src="https://github.com/elastic/kibana/assets/18429259/72daf378-586e-4c7d-a1b0-a05bf95fb09a">

##### Previously

<img width="1420" alt="image"
src="https://github.com/elastic/kibana/assets/18429259/94f46913-debc-4bce-9794-600241fc4d6f">

#### We display appropriate bucketed data

This is a weird behavior I only noticed once I had made the new query.
For some reason, the chart didn't display all the data it should. This
may be related to the initial issue, but even in ranges less than 10k
docs I was seeing this behavior. I manually checked this by querying for
the bucket spans and ensuring the right number of docs were showing up,
which is what I'd expect, because it's highly unlikely there would be a
bug like this in the date histogram agg.

##### This PR

<img width="1420" alt="image"
src="https://github.com/elastic/kibana/assets/18429259/8620f709-dd4f-4a52-b39d-a3ff778ff7b4">


##### Previously

<img width="1420" alt="image"
src="https://github.com/elastic/kibana/assets/18429259/745fa36d-db35-46a4-be1d-330bed799884">


### Why are you doing this?

We have noticed in a few SDH that in cases where users are querying very
large document tallies, like a month's worth of data for a monitor that
runs every minute, they're not seeing their status heatmap fill up.

~This patch would introduce a multi-stage query mechanism to the pings
query endpoint. The PR is in draft state, so the procedure performs the
query in this manner all the time, but we may want to add a new param to
the endpoint that allows the client to opt into this behavior, or simply
cap at the default 10k documents like it does today.~

~I intend to do a good amount of benchmark testing on this fix locally
and in Cloud before I try to merge it. We may also need to explore
imposing a hard upper limit on the number of stages we are willing to
execute. Right now the query is unbounded and a user could try to query
datasets that would result in Kibana retrieving and trying to respond
with hundreds of MB of data.~

### What changed

This PR modifies the backend query to rely on a Date Histogram agg,
rather than querying the documents directly for this purpose. This agg
uses the same interval that the client calculates based on the width of
the page. I have kept the existing time bin logic on the client side,
and modified the function that builds the data structure to pour the
buckets from the histogram into the bins that we serve the chart that
the user sees.

One drawback is because the buckets are computed by Elasticsearch, we
need to make API calls to retrieve more data, whereas on resizes
previously, all the data was already present and would get re-bucketed
on the client. I think this is acceptable because of the ability to
handle larger datasets and given that the query should be very fast
still.

Additionally:

- I have made additional modifications to the frontend logic to prevent
unnecessary API calls. Before, the component would always make at least
two calls on an initial load. I have also switched from `lodash`
`throttle` to `useDebounce` from `react-use`, as this seems to more
properly drop intermediate resize events where we would want to skip
hitting the API.
- I have also changed the render logic. We used to use a default value
to build out an initial set of time bins. Unfortunately, with all these
changes in place, this results in a weird behavior where we show an
empty scale while waiting for data, and the bins then snap to the proper
size once the element's width has been determined and the bins get
scaled out properly. Instead of this, I skip rendering the chart until
the initial width is available via a `ref` applied to the wrapper
element. At that point, we send the init width to the hook and use that
value for our initial query and bin calculations, so we only ever show
bins on the charts that will correspond to the bucketed data we
eventually display.
- I added a progress indicator to the wrapper element so the user
receives an indication that the data is being refreshed.
- I added a quiet fetch similar to how the overview page works, so that
after an initial render the chart will not have its data dropped until
we have new data to replace it with. Along those lines, the hook now
watches the page location and if it changes (i.e. to from History to
Overview), state management will destroy the heatmap data so we don't
have a flicker of other data during a new fetch.

##  Testing this PR

Unfortunately, to truly test this PR you'd need a monitor with over 10k
documents, as that's the criteria specified in the initial issue. I did
this by running a monitor over about 10 days on a 1-run-per-minute
schedule. You could create a cloud deployment from this PR, or create
dummy documents in some way (this can be hard to verify though). I am
pretty confident this works and fixes the issue based on my real-world
testing.

---------

Co-authored-by: shahzad31 <[email protected]>
justinkambic added a commit that referenced this issue Sep 17, 2024
…m data (#192508)

## Summary

Recently, [we fixed](#184177) an
[issue](#180076) where the
heatmap data on the detail and monitor history pages would not fill up.

A side effect of this fix was a new regression that caused certain rarer
cases to see the page crash because of an unhandled case of calling a
function on a potentially-null object; our histogram data used to
populate this heatmap can be `undefined` in certain cases.

This patch introduces a change that will handle this case, and adds unit
tests for the module in question.


### Checklist

Delete any items that are not applicable to this PR.

- [ ] Any text added follows [EUI's writing
guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses
sentence case text and includes [i18n
support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)
- [ ]
[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)
was added for features that require explanation or tutorials
- [ ] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
- [ ] [Flaky Test
Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was
used on any tests changed
- [ ] Any UI touched in this PR is usable by keyboard only (learn more
about [keyboard accessibility](https://webaim.org/techniques/keyboard/))
- [ ] Any UI touched in this PR does not create any new axe failures
(run axe in browser:
[FF](https://addons.mozilla.org/en-US/firefox/addon/axe-devtools/),
[Chrome](https://chrome.google.com/webstore/detail/axe-web-accessibility-tes/lhdoppojpmngadmnindnejefpokejbdd?hl=en-US))
- [ ] If a plugin configuration key changed, check if it needs to be
allowlisted in the cloud and added to the [docker
list](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker)
- [ ] This renders correctly on smaller devices using a responsive
layout. (You can test this [in your
browser](https://www.browserstack.com/guide/responsive-testing-on-local-server))
- [ ] This was checked for [cross-browser
compatibility](https://www.elastic.co/support/matrix#matrix_browsers)


### Risk Matrix

Delete this section if it is not applicable to this PR.

Before closing this PR, invite QA, stakeholders, and other developers to
identify risks that should be tested prior to the change/feature
release.

When forming the risk matrix, consider some of the following examples
and how they may potentially impact the change:

| Risk | Probability | Severity | Mitigation/Notes |

|---------------------------|-------------|----------|-------------------------|
| Multiple Spaces&mdash;unexpected behavior in non-default Kibana Space.
| Low | High | Integration tests will verify that all features are still
supported in non-default Kibana Space and when user switches between
spaces. |
| Multiple nodes&mdash;Elasticsearch polling might have race conditions
when multiple Kibana nodes are polling for the same tasks. | High | Low
| Tasks are idempotent, so executing them multiple times will not result
in logical error, but will degrade performance. To test for this case we
add plenty of unit tests around this logic and document manual testing
procedure. |
| Code should gracefully handle cases when feature X or plugin Y are
disabled. | Medium | High | Unit tests will verify that any feature flag
or plugin combination still results in our service operational. |
| [See more potential risk
examples](https://github.com/elastic/kibana/blob/main/RISK_MATRIX.mdx) |


### For maintainers

- [ ] This was checked for breaking API changes and was [labeled
appropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)

---------

Co-authored-by: Shahzad <[email protected]>
kibanamachine pushed a commit to kibanamachine/kibana that referenced this issue Sep 17, 2024
…m data (elastic#192508)

## Summary

Recently, [we fixed](elastic#184177) an
[issue](elastic#180076) where the
heatmap data on the detail and monitor history pages would not fill up.

A side effect of this fix was a new regression that caused certain rarer
cases to see the page crash because of an unhandled case of calling a
function on a potentially-null object; our histogram data used to
populate this heatmap can be `undefined` in certain cases.

This patch introduces a change that will handle this case, and adds unit
tests for the module in question.

### Checklist

Delete any items that are not applicable to this PR.

- [ ] Any text added follows [EUI's writing
guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses
sentence case text and includes [i18n
support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)
- [ ]
[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)
was added for features that require explanation or tutorials
- [ ] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
- [ ] [Flaky Test
Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was
used on any tests changed
- [ ] Any UI touched in this PR is usable by keyboard only (learn more
about [keyboard accessibility](https://webaim.org/techniques/keyboard/))
- [ ] Any UI touched in this PR does not create any new axe failures
(run axe in browser:
[FF](https://addons.mozilla.org/en-US/firefox/addon/axe-devtools/),
[Chrome](https://chrome.google.com/webstore/detail/axe-web-accessibility-tes/lhdoppojpmngadmnindnejefpokejbdd?hl=en-US))
- [ ] If a plugin configuration key changed, check if it needs to be
allowlisted in the cloud and added to the [docker
list](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker)
- [ ] This renders correctly on smaller devices using a responsive
layout. (You can test this [in your
browser](https://www.browserstack.com/guide/responsive-testing-on-local-server))
- [ ] This was checked for [cross-browser
compatibility](https://www.elastic.co/support/matrix#matrix_browsers)

### Risk Matrix

Delete this section if it is not applicable to this PR.

Before closing this PR, invite QA, stakeholders, and other developers to
identify risks that should be tested prior to the change/feature
release.

When forming the risk matrix, consider some of the following examples
and how they may potentially impact the change:

| Risk | Probability | Severity | Mitigation/Notes |

|---------------------------|-------------|----------|-------------------------|
| Multiple Spaces&mdash;unexpected behavior in non-default Kibana Space.
| Low | High | Integration tests will verify that all features are still
supported in non-default Kibana Space and when user switches between
spaces. |
| Multiple nodes&mdash;Elasticsearch polling might have race conditions
when multiple Kibana nodes are polling for the same tasks. | High | Low
| Tasks are idempotent, so executing them multiple times will not result
in logical error, but will degrade performance. To test for this case we
add plenty of unit tests around this logic and document manual testing
procedure. |
| Code should gracefully handle cases when feature X or plugin Y are
disabled. | Medium | High | Unit tests will verify that any feature flag
or plugin combination still results in our service operational. |
| [See more potential risk
examples](https://github.com/elastic/kibana/blob/main/RISK_MATRIX.mdx) |

### For maintainers

- [ ] This was checked for breaking API changes and was [labeled
appropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)

---------

Co-authored-by: Shahzad <[email protected]>
(cherry picked from commit da2f7f6)
kibanamachine added a commit that referenced this issue Sep 17, 2024
…stogram data (#192508) (#193154)

# Backport

This will backport the following commits from `main` to `8.x`:
- [[Synthetics] Fix issue where heatmap UI crashes on undefined
histogram data (#192508)](#192508)

<!--- Backport version: 9.4.3 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Justin
Kambic","email":"[email protected]"},"sourceCommit":{"committedDate":"2024-09-17T12:08:51Z","message":"[Synthetics]
Fix issue where heatmap UI crashes on undefined histogram data
(#192508)\n\n## Summary\r\n\r\nRecently, [we
fixed](#184177)
an\r\n[issue](#180076) where
the\r\nheatmap data on the detail and monitor history pages would not
fill up.\r\n\r\nA side effect of this fix was a new regression that
caused certain rarer\r\ncases to see the page crash because of an
unhandled case of calling a\r\nfunction on a potentially-null object;
our histogram data used to\r\npopulate this heatmap can be `undefined`
in certain cases.\r\n\r\nThis patch introduces a change that will handle
this case, and adds unit\r\ntests for the module in
question.\r\n\r\n\r\n### Checklist\r\n\r\nDelete any items that are not
applicable to this PR.\r\n\r\n- [ ] Any text added follows [EUI's
writing\r\nguidelines](https://elastic.github.io/eui/#/guidelines/writing),
uses\r\nsentence case text and includes
[i18n\r\nsupport](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)\r\n-
[
]\r\n[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)\r\nwas
added for features that require explanation or tutorials\r\n- [ ] [Unit
or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common scenarios\r\n- [ ] [Flaky
Test\r\nRunner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1)
was\r\nused on any tests changed\r\n- [ ] Any UI touched in this PR is
usable by keyboard only (learn more\r\nabout [keyboard
accessibility](https://webaim.org/techniques/keyboard/))\r\n- [ ] Any UI
touched in this PR does not create any new axe failures\r\n(run axe in
browser:\r\n[FF](https://addons.mozilla.org/en-US/firefox/addon/axe-devtools/),\r\n[Chrome](https://chrome.google.com/webstore/detail/axe-web-accessibility-tes/lhdoppojpmngadmnindnejefpokejbdd?hl=en-US))\r\n-
[ ] If a plugin configuration key changed, check if it needs to
be\r\nallowlisted in the cloud and added to the
[docker\r\nlist](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker)\r\n-
[ ] This renders correctly on smaller devices using a
responsive\r\nlayout. (You can test this [in
your\r\nbrowser](https://www.browserstack.com/guide/responsive-testing-on-local-server))\r\n-
[ ] This was checked for
[cross-browser\r\ncompatibility](https://www.elastic.co/support/matrix#matrix_browsers)\r\n\r\n\r\n###
Risk Matrix\r\n\r\nDelete this section if it is not applicable to this
PR.\r\n\r\nBefore closing this PR, invite QA, stakeholders, and other
developers to\r\nidentify risks that should be tested prior to the
change/feature\r\nrelease.\r\n\r\nWhen forming the risk matrix, consider
some of the following examples\r\nand how they may potentially impact
the change:\r\n\r\n| Risk | Probability | Severity | Mitigation/Notes
|\r\n\r\n|---------------------------|-------------|----------|-------------------------|\r\n|
Multiple Spaces&mdash;unexpected behavior in non-default Kibana
Space.\r\n| Low | High | Integration tests will verify that all features
are still\r\nsupported in non-default Kibana Space and when user
switches between\r\nspaces. |\r\n| Multiple nodes&mdash;Elasticsearch
polling might have race conditions\r\nwhen multiple Kibana nodes are
polling for the same tasks. | High | Low\r\n| Tasks are idempotent, so
executing them multiple times will not result\r\nin logical error, but
will degrade performance. To test for this case we\r\nadd plenty of unit
tests around this logic and document manual testing\r\nprocedure. |\r\n|
Code should gracefully handle cases when feature X or plugin Y
are\r\ndisabled. | Medium | High | Unit tests will verify that any
feature flag\r\nor plugin combination still results in our service
operational. |\r\n| [See more potential
risk\r\nexamples](https://github.com/elastic/kibana/blob/main/RISK_MATRIX.mdx)
|\r\n\r\n\r\n### For maintainers\r\n\r\n- [ ] This was checked for
breaking API changes and was
[labeled\r\nappropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)\r\n\r\n---------\r\n\r\nCo-authored-by:
Shahzad
<[email protected]>","sha":"da2f7f6dae265e29713e5c6586c542499b03bcd6","branchLabelMapping":{"^v9.0.0$":"main","^v8.16.0$":"8.x","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["bug","release_note:fix","v9.0.0","ci:project-deploy-observability","Team:obs-ux-management","v8.16.0"],"title":"[Synthetics]
Fix issue where heatmap UI crashes on undefined histogram
data","number":192508,"url":"https://github.com/elastic/kibana/pull/192508","mergeCommit":{"message":"[Synthetics]
Fix issue where heatmap UI crashes on undefined histogram data
(#192508)\n\n## Summary\r\n\r\nRecently, [we
fixed](#184177)
an\r\n[issue](#180076) where
the\r\nheatmap data on the detail and monitor history pages would not
fill up.\r\n\r\nA side effect of this fix was a new regression that
caused certain rarer\r\ncases to see the page crash because of an
unhandled case of calling a\r\nfunction on a potentially-null object;
our histogram data used to\r\npopulate this heatmap can be `undefined`
in certain cases.\r\n\r\nThis patch introduces a change that will handle
this case, and adds unit\r\ntests for the module in
question.\r\n\r\n\r\n### Checklist\r\n\r\nDelete any items that are not
applicable to this PR.\r\n\r\n- [ ] Any text added follows [EUI's
writing\r\nguidelines](https://elastic.github.io/eui/#/guidelines/writing),
uses\r\nsentence case text and includes
[i18n\r\nsupport](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)\r\n-
[
]\r\n[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)\r\nwas
added for features that require explanation or tutorials\r\n- [ ] [Unit
or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common scenarios\r\n- [ ] [Flaky
Test\r\nRunner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1)
was\r\nused on any tests changed\r\n- [ ] Any UI touched in this PR is
usable by keyboard only (learn more\r\nabout [keyboard
accessibility](https://webaim.org/techniques/keyboard/))\r\n- [ ] Any UI
touched in this PR does not create any new axe failures\r\n(run axe in
browser:\r\n[FF](https://addons.mozilla.org/en-US/firefox/addon/axe-devtools/),\r\n[Chrome](https://chrome.google.com/webstore/detail/axe-web-accessibility-tes/lhdoppojpmngadmnindnejefpokejbdd?hl=en-US))\r\n-
[ ] If a plugin configuration key changed, check if it needs to
be\r\nallowlisted in the cloud and added to the
[docker\r\nlist](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker)\r\n-
[ ] This renders correctly on smaller devices using a
responsive\r\nlayout. (You can test this [in
your\r\nbrowser](https://www.browserstack.com/guide/responsive-testing-on-local-server))\r\n-
[ ] This was checked for
[cross-browser\r\ncompatibility](https://www.elastic.co/support/matrix#matrix_browsers)\r\n\r\n\r\n###
Risk Matrix\r\n\r\nDelete this section if it is not applicable to this
PR.\r\n\r\nBefore closing this PR, invite QA, stakeholders, and other
developers to\r\nidentify risks that should be tested prior to the
change/feature\r\nrelease.\r\n\r\nWhen forming the risk matrix, consider
some of the following examples\r\nand how they may potentially impact
the change:\r\n\r\n| Risk | Probability | Severity | Mitigation/Notes
|\r\n\r\n|---------------------------|-------------|----------|-------------------------|\r\n|
Multiple Spaces&mdash;unexpected behavior in non-default Kibana
Space.\r\n| Low | High | Integration tests will verify that all features
are still\r\nsupported in non-default Kibana Space and when user
switches between\r\nspaces. |\r\n| Multiple nodes&mdash;Elasticsearch
polling might have race conditions\r\nwhen multiple Kibana nodes are
polling for the same tasks. | High | Low\r\n| Tasks are idempotent, so
executing them multiple times will not result\r\nin logical error, but
will degrade performance. To test for this case we\r\nadd plenty of unit
tests around this logic and document manual testing\r\nprocedure. |\r\n|
Code should gracefully handle cases when feature X or plugin Y
are\r\ndisabled. | Medium | High | Unit tests will verify that any
feature flag\r\nor plugin combination still results in our service
operational. |\r\n| [See more potential
risk\r\nexamples](https://github.com/elastic/kibana/blob/main/RISK_MATRIX.mdx)
|\r\n\r\n\r\n### For maintainers\r\n\r\n- [ ] This was checked for
breaking API changes and was
[labeled\r\nappropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)\r\n\r\n---------\r\n\r\nCo-authored-by:
Shahzad
<[email protected]>","sha":"da2f7f6dae265e29713e5c6586c542499b03bcd6"}},"sourceBranch":"main","suggestedTargetBranches":["8.x"],"targetPullRequestStates":[{"branch":"main","label":"v9.0.0","branchLabelMappingKey":"^v9.0.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/192508","number":192508,"mergeCommit":{"message":"[Synthetics]
Fix issue where heatmap UI crashes on undefined histogram data
(#192508)\n\n## Summary\r\n\r\nRecently, [we
fixed](#184177)
an\r\n[issue](#180076) where
the\r\nheatmap data on the detail and monitor history pages would not
fill up.\r\n\r\nA side effect of this fix was a new regression that
caused certain rarer\r\ncases to see the page crash because of an
unhandled case of calling a\r\nfunction on a potentially-null object;
our histogram data used to\r\npopulate this heatmap can be `undefined`
in certain cases.\r\n\r\nThis patch introduces a change that will handle
this case, and adds unit\r\ntests for the module in
question.\r\n\r\n\r\n### Checklist\r\n\r\nDelete any items that are not
applicable to this PR.\r\n\r\n- [ ] Any text added follows [EUI's
writing\r\nguidelines](https://elastic.github.io/eui/#/guidelines/writing),
uses\r\nsentence case text and includes
[i18n\r\nsupport](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)\r\n-
[
]\r\n[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)\r\nwas
added for features that require explanation or tutorials\r\n- [ ] [Unit
or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common scenarios\r\n- [ ] [Flaky
Test\r\nRunner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1)
was\r\nused on any tests changed\r\n- [ ] Any UI touched in this PR is
usable by keyboard only (learn more\r\nabout [keyboard
accessibility](https://webaim.org/techniques/keyboard/))\r\n- [ ] Any UI
touched in this PR does not create any new axe failures\r\n(run axe in
browser:\r\n[FF](https://addons.mozilla.org/en-US/firefox/addon/axe-devtools/),\r\n[Chrome](https://chrome.google.com/webstore/detail/axe-web-accessibility-tes/lhdoppojpmngadmnindnejefpokejbdd?hl=en-US))\r\n-
[ ] If a plugin configuration key changed, check if it needs to
be\r\nallowlisted in the cloud and added to the
[docker\r\nlist](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker)\r\n-
[ ] This renders correctly on smaller devices using a
responsive\r\nlayout. (You can test this [in
your\r\nbrowser](https://www.browserstack.com/guide/responsive-testing-on-local-server))\r\n-
[ ] This was checked for
[cross-browser\r\ncompatibility](https://www.elastic.co/support/matrix#matrix_browsers)\r\n\r\n\r\n###
Risk Matrix\r\n\r\nDelete this section if it is not applicable to this
PR.\r\n\r\nBefore closing this PR, invite QA, stakeholders, and other
developers to\r\nidentify risks that should be tested prior to the
change/feature\r\nrelease.\r\n\r\nWhen forming the risk matrix, consider
some of the following examples\r\nand how they may potentially impact
the change:\r\n\r\n| Risk | Probability | Severity | Mitigation/Notes
|\r\n\r\n|---------------------------|-------------|----------|-------------------------|\r\n|
Multiple Spaces&mdash;unexpected behavior in non-default Kibana
Space.\r\n| Low | High | Integration tests will verify that all features
are still\r\nsupported in non-default Kibana Space and when user
switches between\r\nspaces. |\r\n| Multiple nodes&mdash;Elasticsearch
polling might have race conditions\r\nwhen multiple Kibana nodes are
polling for the same tasks. | High | Low\r\n| Tasks are idempotent, so
executing them multiple times will not result\r\nin logical error, but
will degrade performance. To test for this case we\r\nadd plenty of unit
tests around this logic and document manual testing\r\nprocedure. |\r\n|
Code should gracefully handle cases when feature X or plugin Y
are\r\ndisabled. | Medium | High | Unit tests will verify that any
feature flag\r\nor plugin combination still results in our service
operational. |\r\n| [See more potential
risk\r\nexamples](https://github.com/elastic/kibana/blob/main/RISK_MATRIX.mdx)
|\r\n\r\n\r\n### For maintainers\r\n\r\n- [ ] This was checked for
breaking API changes and was
[labeled\r\nappropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)\r\n\r\n---------\r\n\r\nCo-authored-by:
Shahzad
<[email protected]>","sha":"da2f7f6dae265e29713e5c6586c542499b03bcd6"}},{"branch":"8.x","label":"v8.16.0","branchLabelMappingKey":"^v8.16.0$","isSourceBranch":false,"state":"NOT_CREATED"}]}]
BACKPORT-->

Co-authored-by: Justin Kambic <[email protected]>
markov00 pushed a commit to markov00/kibana that referenced this issue Sep 18, 2024
…m data (elastic#192508)

## Summary

Recently, [we fixed](elastic#184177) an
[issue](elastic#180076) where the
heatmap data on the detail and monitor history pages would not fill up.

A side effect of this fix was a new regression that caused certain rarer
cases to see the page crash because of an unhandled case of calling a
function on a potentially-null object; our histogram data used to
populate this heatmap can be `undefined` in certain cases.

This patch introduces a change that will handle this case, and adds unit
tests for the module in question.


### Checklist

Delete any items that are not applicable to this PR.

- [ ] Any text added follows [EUI's writing
guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses
sentence case text and includes [i18n
support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)
- [ ]
[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)
was added for features that require explanation or tutorials
- [ ] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
- [ ] [Flaky Test
Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was
used on any tests changed
- [ ] Any UI touched in this PR is usable by keyboard only (learn more
about [keyboard accessibility](https://webaim.org/techniques/keyboard/))
- [ ] Any UI touched in this PR does not create any new axe failures
(run axe in browser:
[FF](https://addons.mozilla.org/en-US/firefox/addon/axe-devtools/),
[Chrome](https://chrome.google.com/webstore/detail/axe-web-accessibility-tes/lhdoppojpmngadmnindnejefpokejbdd?hl=en-US))
- [ ] If a plugin configuration key changed, check if it needs to be
allowlisted in the cloud and added to the [docker
list](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker)
- [ ] This renders correctly on smaller devices using a responsive
layout. (You can test this [in your
browser](https://www.browserstack.com/guide/responsive-testing-on-local-server))
- [ ] This was checked for [cross-browser
compatibility](https://www.elastic.co/support/matrix#matrix_browsers)


### Risk Matrix

Delete this section if it is not applicable to this PR.

Before closing this PR, invite QA, stakeholders, and other developers to
identify risks that should be tested prior to the change/feature
release.

When forming the risk matrix, consider some of the following examples
and how they may potentially impact the change:

| Risk | Probability | Severity | Mitigation/Notes |

|---------------------------|-------------|----------|-------------------------|
| Multiple Spaces&mdash;unexpected behavior in non-default Kibana Space.
| Low | High | Integration tests will verify that all features are still
supported in non-default Kibana Space and when user switches between
spaces. |
| Multiple nodes&mdash;Elasticsearch polling might have race conditions
when multiple Kibana nodes are polling for the same tasks. | High | Low
| Tasks are idempotent, so executing them multiple times will not result
in logical error, but will degrade performance. To test for this case we
add plenty of unit tests around this logic and document manual testing
procedure. |
| Code should gracefully handle cases when feature X or plugin Y are
disabled. | Medium | High | Unit tests will verify that any feature flag
or plugin combination still results in our service operational. |
| [See more potential risk
examples](https://github.com/elastic/kibana/blob/main/RISK_MATRIX.mdx) |


### For maintainers

- [ ] This was checked for breaking API changes and was [labeled
appropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)

---------

Co-authored-by: Shahzad <[email protected]>
justinkambic added a commit that referenced this issue Oct 8, 2024
## Summary

We addressed #180076 recently
with these two PRs:

- #184177
- #192508

We were seeing a strange error that was difficult to repro, so we put in
a best-effort patch that was still ineffective. The reason this issue
happens is because in the code it's possible to divide by 0, which
yields a value of `Infinity`, which at some point causes our interval
value supplied to the server route to be an empty string.

This patch will make it so that we never pass a value of 0 to be used in
the calculation of bucket sizes in this hook.
kibanamachine pushed a commit to kibanamachine/kibana that referenced this issue Oct 8, 2024
## Summary

We addressed elastic#180076 recently
with these two PRs:

- elastic#184177
- elastic#192508

We were seeing a strange error that was difficult to repro, so we put in
a best-effort patch that was still ineffective. The reason this issue
happens is because in the code it's possible to divide by 0, which
yields a value of `Infinity`, which at some point causes our interval
value supplied to the server route to be an empty string.

This patch will make it so that we never pass a value of 0 to be used in
the calculation of bucket sizes in this hook.

(cherry picked from commit 560d561)
kibanamachine pushed a commit to kibanamachine/kibana that referenced this issue Oct 8, 2024
## Summary

We addressed elastic#180076 recently
with these two PRs:

- elastic#184177
- elastic#192508

We were seeing a strange error that was difficult to repro, so we put in
a best-effort patch that was still ineffective. The reason this issue
happens is because in the code it's possible to divide by 0, which
yields a value of `Infinity`, which at some point causes our interval
value supplied to the server route to be an empty string.

This patch will make it so that we never pass a value of 0 to be used in
the calculation of bucket sizes in this hook.

(cherry picked from commit 560d561)
kibanamachine added a commit that referenced this issue Oct 8, 2024
# Backport

This will backport the following commits from `main` to `8.x`:
- [[Synthetics] Fix ping heatmap payload
(#195107)](#195107)

<!--- Backport version: 9.4.3 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Justin
Kambic","email":"[email protected]"},"sourceCommit":{"committedDate":"2024-10-08T14:46:30Z","message":"[Synthetics]
Fix ping heatmap payload (#195107)\n\n## Summary\r\n\r\nWe addressed
#180076 recently\r\nwith these
two PRs:\r\n\r\n- https://github.com/elastic/kibana/pull/184177\r\n-
https://github.com/elastic/kibana/pull/192508\r\n\r\nWe were seeing a
strange error that was difficult to repro, so we put in\r\na best-effort
patch that was still ineffective. The reason this issue\r\nhappens is
because in the code it's possible to divide by 0, which\r\nyields a
value of `Infinity`, which at some point causes our interval\r\nvalue
supplied to the server route to be an empty string.\r\n\r\nThis patch
will make it so that we never pass a value of 0 to be used in\r\nthe
calculation of bucket sizes in this
hook.","sha":"560d561e21fe51020954bd9e3246f238ffa026ba","branchLabelMapping":{"^v9.0.0$":"main","^v8.16.0$":"8.x","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["bug","release_note:fix","v9.0.0","backport:prev-major","ci:project-deploy-observability","Team:obs-ux-management","v8.16.0","v8.15.3"],"title":"[Synthetics]
Fix ping heatmap
payload","number":195107,"url":"https://github.com/elastic/kibana/pull/195107","mergeCommit":{"message":"[Synthetics]
Fix ping heatmap payload (#195107)\n\n## Summary\r\n\r\nWe addressed
#180076 recently\r\nwith these
two PRs:\r\n\r\n- https://github.com/elastic/kibana/pull/184177\r\n-
https://github.com/elastic/kibana/pull/192508\r\n\r\nWe were seeing a
strange error that was difficult to repro, so we put in\r\na best-effort
patch that was still ineffective. The reason this issue\r\nhappens is
because in the code it's possible to divide by 0, which\r\nyields a
value of `Infinity`, which at some point causes our interval\r\nvalue
supplied to the server route to be an empty string.\r\n\r\nThis patch
will make it so that we never pass a value of 0 to be used in\r\nthe
calculation of bucket sizes in this
hook.","sha":"560d561e21fe51020954bd9e3246f238ffa026ba"}},"sourceBranch":"main","suggestedTargetBranches":["8.x","8.15"],"targetPullRequestStates":[{"branch":"main","label":"v9.0.0","branchLabelMappingKey":"^v9.0.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/195107","number":195107,"mergeCommit":{"message":"[Synthetics]
Fix ping heatmap payload (#195107)\n\n## Summary\r\n\r\nWe addressed
#180076 recently\r\nwith these
two PRs:\r\n\r\n- https://github.com/elastic/kibana/pull/184177\r\n-
https://github.com/elastic/kibana/pull/192508\r\n\r\nWe were seeing a
strange error that was difficult to repro, so we put in\r\na best-effort
patch that was still ineffective. The reason this issue\r\nhappens is
because in the code it's possible to divide by 0, which\r\nyields a
value of `Infinity`, which at some point causes our interval\r\nvalue
supplied to the server route to be an empty string.\r\n\r\nThis patch
will make it so that we never pass a value of 0 to be used in\r\nthe
calculation of bucket sizes in this
hook.","sha":"560d561e21fe51020954bd9e3246f238ffa026ba"}},{"branch":"8.x","label":"v8.16.0","branchLabelMappingKey":"^v8.16.0$","isSourceBranch":false,"state":"NOT_CREATED"},{"branch":"8.15","label":"v8.15.3","branchLabelMappingKey":"^v(\\d+).(\\d+).\\d+$","isSourceBranch":false,"state":"NOT_CREATED"}]}]
BACKPORT-->

Co-authored-by: Justin Kambic <[email protected]>
justinkambic added a commit to justinkambic/kibana that referenced this issue Oct 14, 2024
…e doc counts (elastic#184177)

## Summary

Resolves elastic#180076.

### User-facing updates

#### We can now display the heatmap for "very large" document counts

##### This PR

_note: The first bucket in this image is empty because that's when my
monitor actually started._

<img width="2083" alt="image"
src="https://github.com/elastic/kibana/assets/18429259/72daf378-586e-4c7d-a1b0-a05bf95fb09a">

##### Previously

<img width="1420" alt="image"
src="https://github.com/elastic/kibana/assets/18429259/94f46913-debc-4bce-9794-600241fc4d6f">

#### We display appropriate bucketed data

This is a weird behavior I only noticed once I had made the new query.
For some reason, the chart didn't display all the data it should. This
may be related to the initial issue, but even in ranges less than 10k
docs I was seeing this behavior. I manually checked this by querying for
the bucket spans and ensuring the right number of docs were showing up,
which is what I'd expect, because it's highly unlikely there would be a
bug like this in the date histogram agg.

##### This PR

<img width="1420" alt="image"
src="https://github.com/elastic/kibana/assets/18429259/8620f709-dd4f-4a52-b39d-a3ff778ff7b4">

##### Previously

<img width="1420" alt="image"
src="https://github.com/elastic/kibana/assets/18429259/745fa36d-db35-46a4-be1d-330bed799884">

### Why are you doing this?

We have noticed in a few SDH that in cases where users are querying very
large document tallies, like a month's worth of data for a monitor that
runs every minute, they're not seeing their status heatmap fill up.

~This patch would introduce a multi-stage query mechanism to the pings
query endpoint. The PR is in draft state, so the procedure performs the
query in this manner all the time, but we may want to add a new param to
the endpoint that allows the client to opt into this behavior, or simply
cap at the default 10k documents like it does today.~

~I intend to do a good amount of benchmark testing on this fix locally
and in Cloud before I try to merge it. We may also need to explore
imposing a hard upper limit on the number of stages we are willing to
execute. Right now the query is unbounded and a user could try to query
datasets that would result in Kibana retrieving and trying to respond
with hundreds of MB of data.~

### What changed

This PR modifies the backend query to rely on a Date Histogram agg,
rather than querying the documents directly for this purpose. This agg
uses the same interval that the client calculates based on the width of
the page. I have kept the existing time bin logic on the client side,
and modified the function that builds the data structure to pour the
buckets from the histogram into the bins that we serve the chart that
the user sees.

One drawback is because the buckets are computed by Elasticsearch, we
need to make API calls to retrieve more data, whereas on resizes
previously, all the data was already present and would get re-bucketed
on the client. I think this is acceptable because of the ability to
handle larger datasets and given that the query should be very fast
still.

Additionally:

- I have made additional modifications to the frontend logic to prevent
unnecessary API calls. Before, the component would always make at least
two calls on an initial load. I have also switched from `lodash`
`throttle` to `useDebounce` from `react-use`, as this seems to more
properly drop intermediate resize events where we would want to skip
hitting the API.
- I have also changed the render logic. We used to use a default value
to build out an initial set of time bins. Unfortunately, with all these
changes in place, this results in a weird behavior where we show an
empty scale while waiting for data, and the bins then snap to the proper
size once the element's width has been determined and the bins get
scaled out properly. Instead of this, I skip rendering the chart until
the initial width is available via a `ref` applied to the wrapper
element. At that point, we send the init width to the hook and use that
value for our initial query and bin calculations, so we only ever show
bins on the charts that will correspond to the bucketed data we
eventually display.
- I added a progress indicator to the wrapper element so the user
receives an indication that the data is being refreshed.
- I added a quiet fetch similar to how the overview page works, so that
after an initial render the chart will not have its data dropped until
we have new data to replace it with. Along those lines, the hook now
watches the page location and if it changes (i.e. to from History to
Overview), state management will destroy the heatmap data so we don't
have a flicker of other data during a new fetch.

##  Testing this PR

Unfortunately, to truly test this PR you'd need a monitor with over 10k
documents, as that's the criteria specified in the initial issue. I did
this by running a monitor over about 10 days on a 1-run-per-minute
schedule. You could create a cloud deployment from this PR, or create
dummy documents in some way (this can be hard to verify though). I am
pretty confident this works and fixes the issue based on my real-world
testing.

---------

Co-authored-by: shahzad31 <[email protected]>
(cherry picked from commit 50f8bd6)

# Conflicts:
#	x-pack/plugins/observability_solution/synthetics/common/constants/synthetics/rest_api.ts
#	x-pack/plugins/observability_solution/synthetics/public/apps/synthetics/state/root_effect.ts
#	x-pack/plugins/observability_solution/synthetics/server/routes/index.ts
#	x-pack/plugins/observability_solution/synthetics/server/routes/pings/get_ping_statuses.ts
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Fixes for quality problems that affect the customer experience Synthetics Team:obs-ux-management Observability Management User Experience Team
Projects
None yet
5 participants