App not responding #740

Closed
Tracked by #750
vanessa-chang opened this issue Aug 9, 2023 · 14 comments

@vanessa-chang

The app freezes easily when testing app version 6.31-379.

It freezes easily in some scenarios, e.g.:
After accessing an external link that opens in the default browser.
When downloading an individual content item that takes a long time (#739).

Tested with the latest version, 6.32-380.
Steps to reproduce:

  1. Install the app and download the Artist pack
  2. Go to the Wikihow channel
  3. Open a content item
  4. Scroll down to the end of the page and open an external link
  5. After the link opens in the default browser, go back to the app
  6. Check if the app is still working

Result:
The app is not responding.
This issue cannot be reproduced in the Windows app.

Log:
app freezes.txt

@vanessa-chang vanessa-chang added the bug Something isn't working label Aug 9, 2023
@erikos
Contributor

erikos commented Aug 15, 2023

I have reproduced this as well. I was testing the individual download of content items and the app froze.

Image

I restarted the app - and it froze right away again.

Image

@starnight
Contributor

starnight commented Aug 16, 2023

I can reproduce this on ASUS CM3000DVA Chromebook (arm).

I checked the log while following the reproduction steps.

It looks like the Endless Key's Kolibri server process is killed when the user opens the external link in a browser!?

08-16 15:23:20.289  3122  3122 I org.endlessos.Key: root: onActivityPaused
...
08-16 15:23:34.125  3122  3122 I org.endlessos.Key: root: onActivityResumed
08-16 15:23:34.132  3122  3122 I PythonActivity: displayLoadingWebView
08-16 15:23:34.137  3122  3122 I org.endlessos.Key: kolibri.utils.server: Bus state: STOP
08-16 15:23:34.149  3122  3259 D org.endlessos.Key: urllib3.connectionpool: https://studio.learningequality.org:443 "GET /content/storage/4/1/41d78f4dcc74ab22f8dc477b9669b414.jpeg HTTP/1.1" 206 165886
08-16 15:23:34.310  3122  3253 D org.endlessos.Key: urllib3.connectionpool: https://studio.learningequality.org:443 "HEAD /content/storage/2/8/28e83f6e5bd60f5a84cf2500e91738ad.jpg HTTP/1.1" 200 0
08-16 15:23:34.359  3122  3274 D org.endlessos.Key: urllib3.connectionpool: https://studio.learningequality.org:443 "HEAD /content/storage/b/1/b19decdd50e31df3d5457c059b4237f8.png HTTP/1.1" 200 0
08-16 15:23:34.577  3122  3122 I org.endlessos.Key: kolibri.utils.server: HTTP Server kolibri.utils.server.Server(('0.0.0.0', 50885)) shut down
08-16 15:23:34.875  3122  3258 D org.endlessos.Key: urllib3.connectionpool: https://studio.learningequality.org:443 "GET /content/storage/8/3/83ff82423c86560e86fbbf14b6f63902.png HTTP/1.1" 206 9623
08-16 15:23:35.032  3122  3122 I org.endlessos.Key: kolibri.utils.server: HTTP Server kolibri.utils.server.Server(('0.0.0.0', 60819)) shut down
08-16 15:23:35.038  3122  3122 I org.endlessos.Key: kolibri.core.tasks.worker: Asking job schedulers to shut down.
08-16 15:23:35.090  3122  3122 I org.endlessos.Key: kolibri.core.tasks.worker: Canceling job id f97db1226ef543a7b36f111d078a4f73.
08-16 15:23:35.092  3122  3122 I org.endlessos.Key: kolibri.core.tasks.worker: Canceling job id 754d3a5052b74a00bef8a72fc368bad5.
08-16 15:23:35.094  3122  3231 D org.endlessos.Key: kolibri: JOBCHECKER shut down event received; closing.
08-16 15:23:35.111  3122  3122 I org.endlessos.Key: kolibri.core.tasks.worker: Canceling job id fc8f33c7ea9f44cd93be7a692c23fd33.
08-16 15:23:35.122  3122  3122 I org.endlessos.Key: kolibri.core.tasks.worker: Canceling job id ad298fbb3c8d45c59cca0ed72a835f06.

Then, when I go back to the Endless Key app, it cannot find the Kolibri server process because it has been killed.

08-16 15:23:40.991  3122  3127 I g.endlessos.Ke: Thread[6,tid=3127,WaitingInMainSignalCatcherLoop,Thread*=0xb40000730c4ce800,peer=0x136000b0,"Signal Catcher"]: reacting to signal 3
08-16 15:23:40.991  3122  3127 I g.endlessos.Ke: 
08-16 15:23:41.286  3122  3127 W g.endlessos.Ke: sched_getscheduler(3231): No such process
08-16 15:23:41.286  3122  3127 W g.endlessos.Ke: sched_getparam(3231, &sp): No such process
08-16 15:23:41.294  3122  3127 W g.endlessos.Ke: sched_getscheduler(3236): No such process
08-16 15:23:41.294  3122  3127 W g.endlessos.Ke: sched_getparam(3236, &sp): No such process
08-16 15:23:41.297  3122  3127 W g.endlessos.Ke: sched_getscheduler(3235): No such process
08-16 15:23:41.297  3122  3127 W g.endlessos.Ke: sched_getparam(3235, &sp): No such process

log.txt

@erikos
Contributor

erikos commented Aug 16, 2023

Very likely related to the thumbnail download going on in the background - so moving this to the in progress column.

@dbnicholson
Member

Actually, this appears to be more that the server is being stopped by the activity lifecycle hooks, but they don't block. Then the activity gets resumed before the server has finished stopping and the whole thing blows up. This goes back to https://phabricator.endlessm.com/T34138. What makes this particularly bad is that Kolibri tries to cancel running tasks but it's really slow about it.
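One way to picture the race is with a minimal Python sketch. Everything here is hypothetical (the `ServerLifecycle` class, its method names, and the simulated slow shutdown are not the app's actual code). It only illustrates the idea of queueing start and stop requests on a single worker thread, so a resume that arrives during a slow stop runs after the stop instead of overlapping with it.

```python
from concurrent.futures import ThreadPoolExecutor
import time


class ServerLifecycle:
    """Hypothetical stand-in for the code that starts and stops the server."""

    def __init__(self, stop_seconds=0.1):
        # A single worker thread runs queued transitions one at a time, in order.
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._stop_seconds = stop_seconds
        self.running = False
        self.history = []

    def _start(self):
        self.running = True
        self.history.append("started")

    def _stop(self):
        time.sleep(self._stop_seconds)  # simulates a slow shutdown
        self.running = False
        self.history.append("stopped")

    def on_activity_resumed(self):
        # Returns immediately; the start is queued behind any pending stop.
        return self._executor.submit(self._start)

    def on_activity_stopped(self):
        # Returns immediately; the hook itself never blocks.
        return self._executor.submit(self._stop)

    def close(self):
        self._executor.shutdown(wait=True)


lifecycle = ServerLifecycle()
lifecycle.on_activity_resumed()
lifecycle.on_activity_stopped()  # stop is still in progress when this returns
lifecycle.on_activity_resumed()  # queued, so it cannot overlap with the stop
lifecycle.close()
print(lifecycle.history)
```

This does not make a slow shutdown any faster; it only keeps the transitions from interleaving.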

I honestly don't know what to do here. I actually ran into this same thing when playing with chaquopy. From what I could tell, only onCreate and onDestroy actually block other lifecycle events. But according to the docs, you'll only ever get to stopped before you're killed in the background. And part of having the server stop when the activity is stopped is so that it properly closes the databases without corrupting them.

In my chaquopy experiment I put the server in a bound service and it worked great. The activity could cycle quickly through states or be killed completely and the server would be unaffected until the activity was unbound either actively or by being killed. So, it would be great to have the server as a bound service, which is the opposite of what I suggested almost exactly a year to the day. Unfortunately, python-for-android makes it very hard to do any of that.

The other part that would be good would be to put the task workers in another service, possibly using WorkManager for scheduling the tasks. Learning Equality is actually doing that now, but talking with Richard it took quite a bit of hacking in python-for-android to get it working.

All that to say that this is going to be hard to fix correctly with python-for-android. With chaquopy, where you actually control the whole app, it would be much easier.

@dbnicholson
Member

I think in the very short term what to investigate is why it takes kolibri so long to cancel the download tasks. I think if the server can be stopped quickly this will be much less of an issue.

However, the issue with the app freezing outside of a lifecycle change is likely different.

@starnight
Contributor

Given the Android activity lifecycle issue, moving the Kolibri server into a bound service is the correct solution.

@dbnicholson
Member

Apparently, you can't cancel running tasks in Kolibri, so the server (really the ServicesPlugin) will not stop until any running tasks have completed.

Alrighty then. What we ultimately want is to not have the tasks running from the server bus. Ideally that would be a separate Android bound service that could run and stop independently of the server. As mentioned above, getting that going in p4a would be rough. I did it for the WorkManager stuff, but then Richard had to make a bunch more nontrivial changes to get it to work right.

Maybe what we can do is start 2 process buses from the python code. One runs just the server and the other runs just the workers. It's all still in the same process with the main activity, but then we can stop just the server bus in the onActivityStop callback and only stop the worker bus in the onActivityDestroyed callback. Sometimes that won't be called and Android will just kill it, but that's the same as happens now. But at least the server bus can be handled gracefully most of the time. That's the most important thing.
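The two-bus idea above can be sketched roughly as follows. The `Bus` class is a hypothetical stand-in and not the Kolibri or magicbus API, and the `on_activity_*` method names are illustrative. When each bus gets started is also an assumption; the comment above only specifies when each one stops.

```python
class Bus:
    """Hypothetical stand-in for a process bus (not the real Kolibri API)."""

    def __init__(self, name):
        self.name = name
        self.state = "idle"

    def start(self):
        self.state = "started"

    def stop(self):
        self.state = "stopped"


class AppLifecycle:
    """Maps activity callbacks onto two independent buses in one process."""

    def __init__(self):
        self.server_bus = Bus("server")
        self.worker_bus = Bus("workers")

    def on_activity_created(self):
        # Assumption: workers start once and live as long as the activity.
        self.worker_bus.start()

    def on_activity_started(self):
        self.server_bus.start()

    def on_activity_stopped(self):
        # Only the server stops here, so this hook never waits on tasks.
        self.server_bus.stop()

    def on_activity_destroyed(self):
        # May never be called if Android kills the process outright.
        self.server_bus.stop()
        self.worker_bus.stop()


app = AppLifecycle()
app.on_activity_created()
app.on_activity_started()
app.on_activity_stopped()
states_in_background = (app.server_bus.state, app.worker_bus.state)
app.on_activity_destroyed()
states_after_destroy = (app.server_bus.state, app.worker_bus.state)
print(states_in_background, states_after_destroy)
```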

@rtibbles

rtibbles commented Aug 17, 2023

Some tasks can be cancelled, and generally for long-running tasks it is preferable that they be marked as such. Tasks that implement cancellation should (and in the case of Kolibri's built-in content import tasks, do) check for cancellation in their runtime code to ensure interruptibility.
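As a rough illustration of a task that checks for cancellation while it runs, here is a minimal sketch using only the Python standard library. The `run_task` function and the `threading.Event` flag are hypothetical and are not Kolibri's actual task API.

```python
import threading


def run_task(chunks, cancel_requested, process_chunk):
    """Hypothetical long-running task that checks for cancellation per chunk."""
    for chunk in chunks:
        if cancel_requested.is_set():
            return "cancelled"
        process_chunk(chunk)
    return "completed"


cancel = threading.Event()
processed = []


def process(chunk):
    processed.append(chunk)
    if chunk == 2:
        # Simulates a cancel request arriving while the task is mid-run.
        cancel.set()


status = run_task(range(5), cancel, process)
print(status, processed)
```

The task stops at the next check after the request arrives, so how quickly it reacts depends on how often it checks.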

There is currently a bug (as linked by @dbnicholson) because these tasks are incorrectly canceled only via their futures objects during worker shutdown, and not leveraging the more extended Kolibri mechanism for this.
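For context on why cancelling only through futures falls short, the standard library's `concurrent.futures.Future.cancel()` can only cancel work that has not started yet. The sketch below shows that behaviour in isolation; that Kolibri's worker shutdown behaves the same way is an assumption here, and `long_task` is a made-up placeholder.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

started = threading.Event()
release = threading.Event()


def long_task():
    started.set()
    release.wait(timeout=5)  # stands in for a long download
    return "finished"


with ThreadPoolExecutor(max_workers=1) as pool:
    running = pool.submit(long_task)
    queued = pool.submit(long_task)
    started.wait(timeout=5)

    cancel_running = running.cancel()  # False: the task is already executing
    cancel_queued = queued.cancel()    # True: the task had not started yet

    release.set()
    result = running.result(timeout=5)

print(cancel_running, cancel_queued, result)
```

The running task completes normally despite the cancel call, which is why a task also needs its own cancellation check.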

On the whole though - it is precisely this sort of behaviour that prompted us to do the refactor to use WorkManager as the task executor - it completely avoids having to cancel and restart tasks based on app teardown, because you delegate execution of the task to Android.

Previously, the only alternative was to force a persistent foreground notification, which meant that the server and task runner never stopped - this could be ameliorated by adding an exit button to the persistent notification, but is still not great for battery life - and I have no idea how it interacts with Chrome OS.

@dbnicholson
Member

> On the whole though - it is precisely this sort of behaviour that prompted us to do the refactor to use WorkManager as the task executor - it completely avoids having to cancel and restart tasks based on app teardown, because you delegate execution of the task to Android.

+1000. I didn't get why you wanted WorkManager at the time, but I see it now. Even without WorkManager, having the workers run persistently in a separate bound service would be much more robust. I'm trying to avoid rearchitecting the app like you did at the moment because we're about to change focus away from Android for a bit. What I really want to do is stop applying duct tape to python-for-android and just port the whole thing to chaquopy. This would all be achievable there in a sane way.

dbnicholson added a commit to endlessm/kolibri-installer-android that referenced this issue Aug 17, 2023
Running Kolibri tasks cannot be cancelled[1] and shutting down the
process bus running them will block until the running tasks complete.
This means the Kolibri server can't be stopped quickly or reliably
during the activity's `onActivityStopped` hook.

To work around that, move the task workers spawned from the
`ServicesPlugin` into a separate process bus. This bus will only be
stopped when the activity is destroyed (or killed). Besides making the
server lifecycle much more reliable, it allows tasks to continue running
and starting in the background as long as the activity process is alive.

Ideally the server and workers would be separate services outside of the
main webview activity, but python-for-android makes that really hard to
accomplish.

1. learningequality/kolibri#10249

Helps: endlessm/kolibri-explore-plugin#740
dylanmccall pushed a commit to endlessm/kolibri-installer-android that referenced this issue Aug 17, 2023
@dbnicholson
Member

I usually test with the Android emulator and I found there are a couple knobs to reduce the hardware resources with QEMU:

    -memory <size>                                                      physical RAM size in MBs
    -cores <number>                                                     Set number of CPU cores to emulator

That should help trigger ANRs more easily.
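As an example of putting those two options together, here is a small Python helper that builds the emulator command line. The AVD name `test_avd` and the values 1024 MB and 1 core are placeholders and not values from this issue; `-avd` is the emulator's usual option for selecting a virtual device.

```python
import shlex


def emulator_command(avd_name, memory_mb, cores):
    """Builds an emulator argv with reduced RAM and CPU cores."""
    return [
        "emulator",
        "-avd", avd_name,
        "-memory", str(memory_mb),
        "-cores", str(cores),
    ]


# Placeholder values; adjust to whatever makes ANRs reproducible locally.
command = emulator_command("test_avd", 1024, 1)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command)` on a machine where the `emulator` binary is on the `PATH`.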

@erikos
Contributor

erikos commented Sep 27, 2023

I am reducing the priority here since it cannot be reproduced as often anymore.

@erikos
Contributor

erikos commented Sep 27, 2023

Moving out of current sprint. This should be re-tested when we switch to Chaquopy. #185

@dbnicholson
Member

With the conversion to chaquopy landed now in endlessm/kolibri-installer-android#185, I think this is testable again. I don't think there should be any ANRs since Kolibri runs in a separate service. The activity is just the webview and all interactions with the Kolibri service are asynchronous.

@vanessa-chang could you QA this again with 420 Narwhal 7.8 from internal testing?

@vanessa-chang
Author

I was not able to reproduce this issue today with 420 Narwhal 7.8, so I will close this issue for now.
