This repository has been archived by the owner on Nov 1, 2022. It is now read-only.

Telemetry probe to measure frequency of reloading active tabs from killed content processes #9366

Closed
dblohm7 opened this issue Jan 8, 2021 · 12 comments
Labels
<engine-gecko> Component: browser-engine-gecko

Comments

@dblohm7

dblohm7 commented Jan 8, 2021

In GV we have some telemetry to measure lifetime for content processes and such; however, in this particular case I believe that AC is the right layer in the stack for this probe.

As I scale up the number of content processes permitted by GV, we want to answer the following question:

How often is a tab's content process killed by the OS, only for the tab's session to be reloaded shortly thereafter?

I'm thinking that a timer might be the best option, since then we get both timings and frequency and can just Analyze All The Things afterward.


@dblohm7
Author

dblohm7 commented Jan 8, 2021

I'm also thinking that we would want to cancel any such timers as soon as the entire app is backgrounded. Thoughts?

@csadilek
Contributor

This seems great to have, yes.

Talked to @agi today about handling low memory conditions, and his advice for the future (multi-e10s) was to just rely on letting the OS kill processes instead of reacting to onTrimMemory and closing sessions ourselves. I think if we go down that path this telemetry would also be super useful.

> How often is a tab's content process killed by the OS, only for the tab's session to be reloaded shortly thereafter?

Can you share some details or a doc on how content process allocation works? I am wondering how this would be implemented. Would it "simply" be a case of recording a timestamp when onKill was called on the gecko session content delegate and another one whenever the session renders a page again?

Wondering what @pocmo thinks too.
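
The timestamp idea above, combined with the earlier suggestion to cancel timers when the app is backgrounded, could be sketched roughly as follows. This is a language-neutral simulation in Python, not AC or GeckoView code: `KillReloadProbe`, its method names, and the injected `clock` are all hypothetical, and the real hook points (`onKill` on the content delegate, the first render after a reload) are only assumed here.

```python
import time
from typing import Callable, Dict, List, Optional


class KillReloadProbe:
    """Hypothetical probe: time from an active tab's kill to its reload."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock                      # injectable for tests
        self._killed_at: Dict[str, float] = {}   # tab id -> kill timestamp
        self.samples: List[float] = []           # recorded kill-to-reload times

    def on_kill(self, tab_id: str, is_active: bool) -> None:
        # Only the active tab is of interest; idle tabs are restored lazily.
        if is_active:
            self._killed_at[tab_id] = self._clock()

    def on_reload(self, tab_id: str) -> Optional[float]:
        # Returns the elapsed time, or None if no timer was running.
        started = self._killed_at.pop(tab_id, None)
        if started is None:
            return None
        elapsed = self._clock() - started
        self.samples.append(elapsed)
        return elapsed

    def on_app_backgrounded(self) -> None:
        # Cancel all pending timers once the whole app is backgrounded.
        self._killed_at.clear()


# Usage with a fake clock so the behaviour is deterministic.
now = [100.0]
probe = KillReloadProbe(clock=lambda: now[0])
probe.on_kill("tab-a", is_active=True)
now[0] += 2.5
first = probe.on_reload("tab-a")      # timer was running

probe.on_kill("tab-b", is_active=True)
probe.on_app_backgrounded()           # timer cancelled
second = probe.on_reload("tab-b")
print(first, second, probe.samples)
```

Recording a duration rather than a bare counter gives both the frequency (number of samples) and the timing from the same probe.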

@dblohm7
Author

dblohm7 commented Jan 22, 2021

> Can you share some details or a doc on how content process allocation works?

Sure.

Allocation

Allocation is governed by the dom.ipc.processCount Gecko pref. This is currently set to 3 on Nightly but 1 on Beta and Release. For the remainder of this discussion I'm going to assume the Nightly case.

The process count is then used as the number of buckets for slotting GeckoSessions (i.e., tabs) into content processes. The heuristic for selecting a bucket for a new tab is to always choose the bucket with the minimum number of existing tabs. When all buckets are of equal size, this behaves like round-robin.

For example, suppose I am running Nightly and have not opened any sessions yet. Opening 3 tabs (let's assume simple, monotonically increasing integer session IDs) will result in 3 content processes:

| Process Name | Set of Tab IDs hosted in each process |
| --- | --- |
| :tab0 | {0} |
| :tab1 | {1} |
| :tab2 | {2} |

When opening a fourth tab, we must now re-use a content process:

| Process Name | Set of Tab IDs hosted in each process |
| --- | --- |
| :tab0 | {0, 3} |
| :tab1 | {1} |
| :tab2 | {2} |

If I were to just continue opening new sessions, it would appear to be round-robin.

Now suppose that we closed the GeckoSession that was being hosted in :tab2:

| Process Name | Set of Tab IDs hosted in each process |
| --- | --- |
| :tab0 | {0, 3} |
| :tab1 | {1} |
| :tab2 | {} |

When the next GeckoSession is opened, it will be slotted into :tab2 because that process is now the one with the minimum count:

| Process Name | Set of Tab IDs hosted in each process |
| --- | --- |
| :tab0 | {0, 3} |
| :tab1 | {1} |
| :tab2 | {4} |
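
The walkthrough above can be reproduced with a small toy model. This is only a sketch of the "fewest tabs wins" heuristic as described in this comment, written in Python for brevity; `ProcessBuckets` is a hypothetical name, and the tie-breaking rule (lowest-numbered process first) is an assumption chosen to match the tables, not something confirmed from Gecko's source.

```python
from typing import Dict, Set


class ProcessBuckets:
    """Toy model of slotting tabs into a fixed number of content processes."""

    def __init__(self, process_count: int) -> None:
        # One bucket per permitted content process (cf. dom.ipc.processCount).
        self.buckets: Dict[str, Set[int]] = {
            f":tab{i}": set() for i in range(process_count)
        }

    def open_tab(self, tab_id: int) -> str:
        # Choose the bucket hosting the fewest tabs. min() returns the first
        # minimum it sees, so ties go to the lowest-numbered process
        # (an assumption made for this sketch).
        name = min(self.buckets, key=lambda n: len(self.buckets[n]))
        self.buckets[name].add(tab_id)
        return name

    def close_tab(self, tab_id: int) -> None:
        # Remove the tab from whichever bucket hosts it.
        for tabs in self.buckets.values():
            tabs.discard(tab_id)


model = ProcessBuckets(process_count=3)
placements = [model.open_tab(tab) for tab in range(4)]  # tabs 0..3
model.close_tab(2)                                      # empties :tab2
slot_for_tab4 = model.open_tab(4)
print(placements, slot_for_tab4, model.buckets)
```

With equal-sized buckets the model hands out processes in order, which is why the behaviour looks like round-robin until a tab is closed.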

Prioritization

The process hosting the currently active GeckoSession is prioritized differently depending on whether Fenix itself is in the foreground. If Fenix is foregrounded, that content process also receives FOREGROUND priority. If Fenix is backgrounded, the active session's content process receives BACKGROUND priority. The other content processes, which are hosting inactive sessions, receive IDLE priority.

GeckoView specifies these priority levels in a way that causes Android to prefer killing lower-priority processes before higher-priority ones.
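
As a rough illustration of the rules above, the mapping from session state to priority could be modelled like this. The `Priority` enum, its numeric values, and both function names are hypothetical; Android's real kill decisions depend on much more than this ordering, so the sort below only captures the stated preference.

```python
from enum import IntEnum
from typing import Dict, List


class Priority(IntEnum):
    # Numeric values are illustrative; only their relative order matters here.
    IDLE = 0
    BACKGROUND = 1
    FOREGROUND = 2


def content_process_priority(hosts_active_session: bool,
                             app_in_foreground: bool) -> Priority:
    """Priority of a content process under the rules described above."""
    if not hosts_active_session:
        return Priority.IDLE
    return Priority.FOREGROUND if app_in_foreground else Priority.BACKGROUND


def preferred_kill_order(processes: Dict[str, Priority]) -> List[str]:
    """Process names sorted so lower-priority processes come first."""
    return sorted(processes, key=lambda name: processes[name])


# Fenix in the background; :tab1 hosts the active session.
state = {
    ":tab0": content_process_priority(False, app_in_foreground=False),
    ":tab1": content_process_priority(True, app_in_foreground=False),
}
order = preferred_kill_order(state)
print(order)
```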

Tying this in with AC and the telemetry probe

Last summer, after some discussions with @pocmo, he opened #7820 to ensure that killed sessions were lazily restored, which was great. Generally speaking, we're not particularly interested in data about inactive sessions whose IDLE content processes were killed.

What we are interested in, however, is the rate at which content processes are killed while they are hosting the active session, forcing AC to reload that session in short order. This data would essentially track bug 1682319 for us.

@dblohm7 dblohm7 changed the title Telemetry probe to measure frequency of reloading tabs from killed content processes Telemetry probe to measure frequency of reloading active tabs from killed content processes Jan 22, 2021
@dblohm7
Author

dblohm7 commented Jan 22, 2021

(And we also care about this in the single content process case, because I believe that one effective mitigation to the bug 1682319 problem may be multi-e10s itself!)

@pocmo
Contributor

pocmo commented Feb 2, 2021

@dblohm7 Trying to map this to the events we are seeing at the AC level, ... you want to track in telemetry how often we see onKill() getting invoked for the currently selected tab? Is this the right approach? Also happy to chat/meet about this, if that makes things simpler. :)

@agi
Contributor

agi commented Feb 3, 2021

I think something that would be interesting to track is:

  • number of times the user switches to a tab and
    • the tab is not loaded
    • the tab is loaded
  • number of times the user taps on the app and
    • fenix cold starts
    • fenix warm starts but tab is not loaded
    • fenix warm starts and tab is loaded
  • number of times we receive onTrimMemory and
    • we ignore it
    • we unload tabs (and how many)
    • what level
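  • number of times we get killed in the background using
    • onKill (for content process) and
    • getHistoricalProcessExitReasons for main process https://developer.android.com/reference/kotlin/android/app/ActivityManager#gethistoricalprocessexitreasons (API30 only I think)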
  • number of alive tabs when
    • going to background
    • going back to foreground and how long we've been backgrounded (so we can plot tab retention with 1,5,10,30 minutes)

These metrics could help us track how memory/performance affects retaining tabs/getting killed and how often our users get their tabs reloaded. They could also help us know how many tabs are actually alive on users' devices (e.g. when thinking about memory/performance tradeoffs based on number of tabs).
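
For the last item in the list, one possible way to bucket background durations at the 1, 5, 10 and 30 minute marks and compute tab retention is sketched below. The bucket labels and the function names are made up for this example; no such metric definitions are confirmed anywhere in this thread.

```python
from bisect import bisect_left
from typing import Optional

# Upper bounds (in minutes) taken from the 1/5/10/30 minute suggestion above.
BUCKET_LIMITS_MIN = [1, 5, 10, 30]
BUCKET_LABELS = ["<=1m", "<=5m", "<=10m", "<=30m", ">30m"]


def background_bucket(seconds_backgrounded: float) -> str:
    """Label for how long the app stayed in the background."""
    minutes = seconds_backgrounded / 60
    # bisect_left finds the first limit that is >= minutes.
    return BUCKET_LABELS[bisect_left(BUCKET_LIMITS_MIN, minutes)]


def tab_retention(alive_at_background: int,
                  alive_at_foreground: int) -> Optional[float]:
    """Fraction of alive tabs that survived; None if there were none."""
    if alive_at_background == 0:
        return None
    return alive_at_foreground / alive_at_background


# Backgrounded for 7 minutes with 4 alive tabs, 1 still alive on return.
bucket = background_bucket(7 * 60)
retained = tab_retention(4, 1)
print(bucket, retained)
```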

@agi
Contributor

agi commented Feb 3, 2021

FWIW we should also track the onKill x process priority, but we can do that in GeckoView itself I think, we don't need AC for that.

@dblohm7
Author

dblohm7 commented Feb 3, 2021

> @dblohm7 Trying to map this to the events we are seeing at the AC level, ... you want to track in telemetry how often we see onKill() getting invoked for the currently selected tab? Is this the right approach? Also happy to chat/meet about this, if that makes things simpler. :)

@pocmo Yes, that sounds correct to me.

@pocmo pocmo self-assigned this Feb 4, 2021
@fluffyemily

I think something that would be interesting to track is:

* number of times the user switches to a tab and
  
  * the tab is not loaded
  * the tab is loaded


* number of times the user taps on the app and
  
  * fenix cold starts
  * fenix warm starts but tab is not loaded
  * fenix warm starts and tab is loaded

* number of times we receive onTrimMemory and
  
  * we ignore it
  * we unload tabs (and how many)
  * what level

* number of times we get killed in the background using
  
  * `onKill` (for content process) and
  * `getHistoricalProcessExitReasons` for main process https://developer.android.com/reference/kotlin/android/app/ActivityManager#gethistoricalprocessexitreasons (API30 only I think)

* number of alive tabs when
  
  * going to background
  * going back to foreground and how long we've been backgrounded (so we can plot tab retention with 1,5,10,30 minutes)

These metrics could help us track how memory/performance affects retaining tabs/getting killed and how often our users get their tabs reloaded. They could also help us know how many tabs are actually alive on users' devices (e.g. when thinking about memory/performance tradeoffs based on number of tabs).

@agi could you raise a separate issue for these other metrics so we don't block on them?

@agi
Contributor

agi commented Feb 4, 2021

Sure: #9624

pocmo added a commit to pocmo/android-components that referenced this issue Feb 5, 2021
… session killed" and track engine session lifetime.

* Once we link an `EngineSession` to a `Session` we track the time.
* The separate `BrowserAction` allows us to write a Middleware for this event.
* I was unhappy with SystemClock requiring the Android stdlib and therefore making mocking a pain, or
  requiring the slow Robolectric test runner. I ended up with this wrapper class, that seems to work
  well in Fenix when writing unit tests.

The next step is to write a Middleware in Fenix that looks at those events and records metrics in Glean.
I will open a PR for that soon.
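
The clock-wrapper idea mentioned in the commit message can be illustrated in isolation. This is a Python analogue written for this thread, not the Kotlin class from the commit: `Clock`, `EngineSessionLink`, and `age_at_kill` are hypothetical names, and `time.monotonic` merely stands in for Android's `SystemClock`.

```python
import time
from typing import Callable


class Clock:
    """Thin wrapper over a monotonic time source so tests can replace it."""

    def __init__(self, source: Callable[[], float] = time.monotonic) -> None:
        self._source = source

    def elapsed_realtime(self) -> float:
        return self._source()


class EngineSessionLink:
    """Remembers when an engine session was linked to a tab's session."""

    def __init__(self, clock: Clock) -> None:
        self._clock = clock
        self._linked_at = clock.elapsed_realtime()

    def age_at_kill(self) -> float:
        # Lifetime of the engine session at the moment it is killed.
        return self._clock.elapsed_realtime() - self._linked_at


# In a unit test the clock is driven by hand, with no platform dependency.
fake_now = [50.0]
link = EngineSessionLink(Clock(source=lambda: fake_now[0]))
fake_now[0] += 120.0
age = link.age_at_kill()
print(age)
```

Passing the time source in keeps the lifetime logic testable on a plain runtime, which is the motivation the commit message gives for avoiding a direct dependency on the platform clock.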
@pocmo pocmo added the <engine-gecko> Component: browser-engine-gecko label Feb 5, 2021
mergify bot pushed a commit that referenced this issue Feb 9, 2021
@pocmo
Contributor

pocmo commented Feb 19, 2021

The required patches have landed in AC and Fenix, and I just triggered a new Fenix Nightly build. By next Monday we should start to see some data for Nightly.

@AndiAJ

AndiAJ commented Mar 3, 2021

Hi, verified as fixed on latest master using a Pixel 2 API 28 (Android 9) Emulator.

Had 2 open tabs (one in foreground, one in background)
Killed both processes using this method

Properly generated metrics ping 2206d3eb-f974-4899-a6d9-d3c28f9a1552

"labeled_counter": {
  "engine_tab.kills": {
    "background": 1,
    "foreground": 1
  }
}

"timing_distribution": {
  "engine_tab.kill_background_age": {
    "sum": 109716000000,
    "values": {
      "105979920938": 1,
      "115571923290": 0
    }
  },
  "engine_tab.kill_foreground_age": {
    "sum": 105926000000,
    "values": {
      "105979920938": 0,
      "97184015999": 1
    }
  }
}
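
For anyone reading such a ping by hand, the excerpt above can be summarized with a few lines of Python. This sketch assumes the `sum` fields are in nanoseconds, which the magnitudes suggest but which is not stated in this thread; `PING_EXCERPT` simply wraps the two fragments in one JSON object, and `summarize` is a hypothetical helper.

```python
import json

# The two fragments from the ping above, wrapped in a single JSON object.
PING_EXCERPT = """
{
  "labeled_counter": {
    "engine_tab.kills": {"background": 1, "foreground": 1}
  },
  "timing_distribution": {
    "engine_tab.kill_background_age": {
      "sum": 109716000000,
      "values": {"105979920938": 1, "115571923290": 0}
    },
    "engine_tab.kill_foreground_age": {
      "sum": 105926000000,
      "values": {"105979920938": 0, "97184015999": 1}
    }
  }
}
"""

NANOS_PER_SECOND = 1_000_000_000  # assumption: sums are in nanoseconds


def summarize(ping: dict) -> dict:
    """Map each timing metric to (sample count, mean age in seconds)."""
    summary = {}
    for name, dist in ping["timing_distribution"].items():
        samples = sum(dist["values"].values())
        mean_s = dist["sum"] / samples / NANOS_PER_SECOND if samples else None
        summary[name] = (samples, mean_s)
    return summary


ping = json.loads(PING_EXCERPT)
summary = summarize(ping)
for metric, (count, mean_seconds) in summary.items():
    print(metric, count, mean_seconds)
```

Under that assumption, each killed tab was a little under two minutes old, and the sample counts line up with the one background and one foreground kill in the labeled counter.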


@AndiAJ AndiAJ closed this as completed Mar 16, 2021
@gabrielluong gabrielluong added this to the 74.0.0 milestone Mar 18, 2021