[APM] MVP for new service landing page experience #300

alex-fedotyev · 2020-07-21T22:22:45Z

Summary of the problem (If there are multiple problems or use cases, prioritize them)
This MVP would be the first step in improving the service landing page.
The main goal is to introduce more actionable troubleshooting workflows and leverage more data points about service performance.

User stories

As a service owner, I would like to see key details about my service, such as version, framework, runtime, platform, or cloud, to validate the current state of the service.
As App Ops, I need visibility into a timeline of anomalies, alerts, and change events correlated to the service KPIs to better identify and isolate issues.
As App Ops, I need to understand the impact of downstream services and backends on the service in question in order to isolate the root cause of a problem and reduce MTTR.
As App Ops, I need visibility into how the service is performing across its infrastructure in order to isolate issues related to specific instances, geos, deployments, etc.

List known (technical) restrictions and requirements

Able to scale to different time-frames from 15 minutes to multiple weeks.
Scalable to list multiple dependencies (10+).
Scalable to list multiple service instances (10 to 100s).

If in doubt, don’t hesitate to reach out to the #observability-design Slack channel.

Design issue: https://www.figma.com/proto/WkQsIVDmiYuHkvcXbzYBtg/268-%2F-Service-landing-page?node-id=513%3A2599&viewport=2040%2C-1389%2C0.5&scaling=min-zoom

elasticmachine · 2020-07-21T22:22:46Z

Pinging @elastic/observability-design (design)

alex-fedotyev · 2020-07-21T22:25:59Z

Quick mock of the service page:

Dependencies view:

Service instances view:

felixbarny · 2020-07-30T09:22:41Z

Related but probably out of scope:
Another nice addition could be adding dots on the transaction duration chart, representing exemplars of transactions. Clicking on such a dot would take the user to the transaction details for that particular instance of a transaction. This makes the process of selecting a representative trace, given a timestamp and duration, much simpler compared to selecting a distribution bucket and toggling through the 10 examples of that bucket.
Another use case this would help with is a dev workflow where a dev would like to find the request they just made.

graphaelli · 2020-07-31T15:37:03Z

@felixbarny I agree using the transaction duration chart to zoom in on the interesting transactions would be an improvement. Currently, it's simple to zoom in on a particular spike (x-axis/time filter) and drill into the interesting ones but y-axis filtering (on transaction duration) must be done manually with the kuery bar.

I'm interested to hear @formgeist's thoughts on the dots approach as part of the overall design for this issue.
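
For reference, the manual y-axis filtering mentioned above could look roughly like the sketch below. It only builds the query string a user would otherwise type into the kuery bar. The helper name `durationRangeKuery` is hypothetical, and the sketch assumes the duration is stored in the `transaction.duration.us` field in microseconds.

```typescript
// Hypothetical helper (not part of Kibana): builds a KQL string that keeps
// only transactions whose duration falls inside [minMs, maxMs].
// Assumption: durations live in `transaction.duration.us` (microseconds).
function durationRangeKuery(minMs: number, maxMs: number): string {
  if (minMs < 0 || maxMs < minMs) {
    throw new RangeError("expected 0 <= minMs <= maxMs");
  }
  // Convert milliseconds (what the chart's y-axis shows) to microseconds.
  const minUs = Math.round(minMs * 1000);
  const maxUs = Math.round(maxMs * 1000);
  return `transaction.duration.us >= ${minUs} and transaction.duration.us <= ${maxUs}`;
}

// Example: keep transactions that took between 500 ms and 2 s.
console.log(durationRangeKuery(500, 2000));
// transaction.duration.us >= 500000 and transaction.duration.us <= 2000000
```

A brush selection on the y-axis of the chart could produce the same two bounds, which is what would remove the manual step.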

formgeist · 2020-09-10T14:12:39Z

Design update Sep 10, 2020

These first mocks are primarily focused around delivering an MVP overview experience for the service. I've made a walk-through of the concepts and variations, so please have a watch 🍿

→ Loom video walk-through
→ Figma prototypes [click the Overview link in the tab navigation to switch between the two variations]

cc @cyrille-leclerc @nehaduggal @alex-fedotyev

nehaduggal · 2020-09-10T17:06:24Z

Love this iteration. The overview page now lets you view the event timeline to identify whether there's an issue, and then directly lets a user figure out when and where the potential issue could be (with app KPIs, slowest transactions, errors, and the dependencies view). The only thing I am missing from this view is infrastructure. If we could add infrastructure/instance-based KPIs, that would complete this story and give users a good reference point on where to look for potential issues.

graphaelli · 2020-09-10T17:16:00Z

I also really like the direction this is taking, particularly the time overlays for hour over hour, week over week - understanding what's normal, even if not anomalous is great.

I miss the time spent by span type. Knowing that my application as a whole is spending most of its time in the DB or Application code is extremely valuable from the first glance.

Also, is transaction duration represented twice here, in the timeline and in its own chart?
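
As a rough illustration of the hour-over-hour and week-over-week overlays, the comparison window can be derived by shifting the selected time range back by a fixed offset. This is a minimal sketch only: `TimeRange`, `comparisonRange`, and the offset constants are hypothetical names, not taken from the design or from Kibana.

```typescript
// Hypothetical types and helper for illustration only.
interface TimeRange {
  start: Date;
  end: Date;
}

const HOUR_MS = 60 * 60 * 1000;
const WEEK_MS = 7 * 24 * HOUR_MS;

// Shift the selected range back by `offsetMs` to get the comparison window.
function comparisonRange(range: TimeRange, offsetMs: number): TimeRange {
  return {
    start: new Date(range.start.getTime() - offsetMs),
    end: new Date(range.end.getTime() - offsetMs),
  };
}

// Example: a one-hour selection compared against the same hour a week earlier.
const selected: TimeRange = {
  start: new Date("2020-09-10T16:00:00Z"),
  end: new Date("2020-09-10T17:00:00Z"),
};
const weekAgo = comparisonRange(selected, WEEK_MS);
console.log(weekAgo.start.toISOString()); // 2020-09-03T16:00:00.000Z
```

Data fetched for the shifted window would then be drawn on the same x-axis as the selected range, so the two series line up point for point.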

formgeist · 2020-09-17T14:49:05Z

Design update Sep 17, 2020

Apologies for the late response, but first of all thank you for the feedback! I've been working on some enhancements and changes based on these suggestions and other feedback that I have received from the team.

→ Figma prototype

There are a lot of changes, probably too many to mention since there are a lot of little tweaks, but here are some highlights:

Added the Time spent by span type breakdown as a chart alongside the dependencies table.
Updated a lot of the charts to reduce oversaturation of information and colors.
- We have a new Traffic chart that no longer groups our transactions per min. by HTTP status code, but instead is a pure average chart.
- The Timeline (new name pending) has been aligned with the duration (now called Latency) chart underneath for consistency.
We have some new names that in general should be made consistent throughout the rest of the app. Looking forward to hearing what you think of those.

There are plenty of remaining tasks, but I'm eager to hear any feedback on this.

axw · 2020-09-18T02:29:21Z

I think this is looking great -- I really like the timeline chart. ++ on having a time scrubber eventually.

Do I understand correctly that the shaded area in the timeline is throughput? Or is it the comparison ("A week ago") latency? This isn't intuitive to me, but maybe I'm just dense. If it's throughput, perhaps it would be helpful to use the same style in the "Traffic" chart?

(BTW, this is the kind of view I had in mind for CPU/heap profiling too. Timeline in the top with interesting events, and details below focused on a selected time range.)

A small detail I noticed in the Figma prototype:

Inside the Cloud Provider details at the top, there's Machine Type and Availability Zone. I would expect it to be very common to have multiple Availability Zones, and perhaps multiple Machine/Instance Types. In theory multiple cloud providers, but that's going to be less common. How will we capture this mix of details at the service level?

alex-fedotyev · 2020-09-18T02:33:15Z

@axw - thanks for the feedback!

I think that almost any information displayed under the top service info icons could have multiple values, such as a service running on different JVM versions (canary deployment), two agent versions monitoring separate instances of the same service, or multiple cloud providers. Anything!

formgeist · 2020-10-19T12:01:30Z

Design update Oct 19, 2020

It's been a little while since we've provided an update, but we've been iterating on the layout and design of the overview page quite a bit since receiving more feedback and having discussions around the components in the view.

We're planning on moving to implementation for the layout very soon, so we're focusing on the parts of the layout that we already have, and we should be able to build out the majority of the design in a first iteration. Then come the new components, such as the Dependencies table and the updated Span type breakdown chart, along with the comparison data, which is considered a feature in itself.

We've previously mentioned a History component, which was a container for all the relevant service events (anomalies, annotations, and deployments) combined with a separate visualization of the latency metrics. We've decided to only visualize the existing latency metric chart and use that for hosting the events. The feedback we received was that the History component was confusing to most of the users we presented it to. The concept of having two separate time ranges to control was too difficult to grok. So we've dropped it for the MVP and allowed the latency metric chart to host those events by giving it the full width at the top of the view.

As we reviewed the initial draft, we decided we needed to give the tables some more space and re-arranged the charts and tables so the related ones are displayed in the same row. We imagine that this layout is a good template for adding more in future iterations. Secondly, there are many new controls that allow the user to show/hide comparison time range data and change the latency metric aggregation (which in turn changes it in the tables as well).

We're working on completing the outstanding design tasks in order to finalize it for implementation, since it should start in the coming weeks. Let me know if you have any feedback or questions.

cc @alex-fedotyev

formgeist · 2020-10-22T07:30:13Z

Closing this design issue as all of the requirements have been addressed in this first iteration. I've created implementation issues for the UI dev team to proceed with the building of the view.

elastic/kibana#81147
elastic/kibana#81135
elastic/kibana#81120

If there's a need for additional design, we'll open new issues to handle those requirements.

alex-fedotyev added design Team:apm labels Jul 21, 2020

katrin-freihofner added the [zube]: Backlog label Jul 27, 2020

katrin-freihofner assigned formgeist Jul 29, 2020

felixbarny mentioned this issue Jul 31, 2020

Mitigate outliers in transaction duration histogram visualization #305

Open

formgeist added [zube]: Ready and removed [zube]: Backlog labels Aug 5, 2020

formgeist added [zube]: In Progress and removed [zube]: Ready labels Aug 12, 2020

formgeist changed the title ~~[APM] Design MVP for new service landing page experience~~ [APM] MVP for new service landing page experience Aug 19, 2020

formgeist mentioned this issue Sep 22, 2020

[APM] Service overview: MVP interactive timeline component to contain event annotations and metric aggregation visualization #343

Closed

This was referenced Oct 20, 2020

[APM] Service overview: Research Dependencies table implementation elastic/kibana#81120

Closed

[APM] Service overview [META] elastic/kibana#81135

Closed

Meta issue: [APM] Service overview: Introduce time-series comparison elastic/kibana#81147

Closed

formgeist closed this as completed Oct 22, 2020

zube bot added [zube]: Done and removed [zube]: In Progress labels Oct 22, 2020

alex-fedotyev mentioned this issue Oct 28, 2020

[Metrics UI] Enhanced host details - Processes elastic/kibana#80307

Closed

bmorelli25 mentioned this issue Oct 29, 2020

[APM] docs: Update documentation to reflect new Service Overview elastic/kibana#82055

Closed

alex-fedotyev mentioned this issue Nov 3, 2020

[APM] Service overview: Research service instances visualization using scatter plot chart elastic/kibana#82397

Closed

dgieselaar mentioned this issue Nov 11, 2020

[APM] Service overview: dependencies table elastic/kibana#83152

Closed

katrin-freihofner removed the [zube]: Done label Nov 23, 2020
