Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Misc fixes for LOD syncing and upgrade Morango #11525

Merged

Conversation

bjester
Copy link
Member

@bjester bjester commented Nov 13, 2023

Summary

  • Updates LOD sync processing to never cancel the task but just enqueue and catch JobRunning
  • Updates LOD sync processing to reset any queue records that are marked as Syncing since there should only ever be one job running that code at a time, and records with that status indicate it was terminated unexpectedly
  • Adds check in cleanup hook that enqueues sync garbage collection to only occur if Kolibri is provisioned
  • Adds defensive check to LOD syncing for unavailable network locations and backs off syncing in that scenario
  • Upgrades morango to 0.6.19, fixing issue with premature cleanup of LOD sync sessions interfering with resumption
  • Adds more unit tests for LOD syncing code

References

Resolves #11524
Includes learningequality/morango#205

Reviewer guidance

Sync


Testing checklist

  • Contributor has fully tested the PR manually
  • If there are any front-end changes, before/after screenshots are included
  • Critical user journeys are covered by Gherkin stories
  • Critical and brittle code paths are covered by unit tests

PR process

  • PR has the correct target branch and milestone
  • PR has 'needs review' or 'work-in-progress' label
  • If PR is ready for review, a reviewer has been added. (Don't use 'Assignees')
  • If this is an important user-facing change, PR or related issue has a 'changelog' label
  • If this includes an internal dependency change, a link to the diff is provided

Reviewer checklist

  • Automated test coverage is satisfactory
  • PR is fully functional
  • PR has been tested for accessibility regressions
  • External dependency files were updated if necessary (yarn and pip)
  • Documentation is updated
  • Contributor is in AUTHORS.md

@github-actions github-actions bot added DEV: backend Python, databases, networking, filesystem... SIZE: small labels Nov 13, 2023
@radinamatic
Copy link
Member

These are just logs after just one LoD import: it synced Ok for about 90 minutes, and then run into the usual wall 🤷🏽‍♀️

Android13-db-and-logs.zip
Windows10-db-and-logs.zip

@pcenov
Copy link
Member

pcenov commented Nov 14, 2023

Hi @bjester adding additional logs for 3 consecutively created Android LODs: https://drive.google.com/drive/folders/1__X6gsdgSh-s6UtdwvSk8qg-NiUG69FJ?usp=sharing

I'm still not able to pinpoint the exact cause, hoping the logs will give you a better picture of what's going on!

@radinamatic
Copy link
Member

radinamatic commented Nov 15, 2023

With the latest assets from yesterday, the first LoD (Android13) was syncing perfectly (user import, initial and new assignments download, added resources, etc.) for about 2 hours. But about 20 minutes after I synced another LoD (Android12, user created on the server's facility this time), the first one stopped, as in previous tests... 😞

On the second LoD (Android12), the cards of class assignments appeared much faster on the Home screen (in less than a minute), but the resources did not download at all (but the status on both the Coach side and on LoD would state Synced): every time I would try to tap on the card to actually open the lesson or a quiz, it would forward me to the Library screen. Eventually, the second device also stopped syncing, less than an hour after the setup, and I never got to see and interact with any of the resources.

Android11-michc-db-logs.zip, (this is actually Android 12, I forgot it went through an upgrade)
Android13-peterb-db-logs.zip
Windows10-db-logs.zip

@radinamatic
Copy link
Member

radinamatic commented Nov 15, 2023

Since yesterday it took us some time with @rtibbles to figure out some subtleties of using adb, I decided to give it a go again today, trying to capture the time frame when a LoD stops syncing.

  1. I cleared the data from the 2 devices that have previously stopped syncing (learners peterb and michc, logs included in the previous post)
  2. I imported the learner neelar on the Android 13 device and played with it for about an hour, with syncing working well
  3. At hour mark I start importing the learner marcg on the Android 12 device, and (surprise, surprise), Android 13 device stops syncing right about then.
  4. I fired the USB debugging immediately, and collected the adb output here: Android13-neelar.log.txt
  5. Android 12 device is currently still syncing, will add adb output logs if/when it stops.

@rtibbles
Copy link
Member

I am seeing this sequence in the logs:

11-15 13:21:53.337 14650 14785 I WM-WorkerWrapper: Worker result SUCCESS for Work [ id=2f64f5bb-e9f6-4ea3-9f1c-b7c3d1c229d0, tags={ kolibri_job_type:kolibri.core.auth.tasks.soud_sync_processing, kolibri_task_id:50, org.learningequality.Kolibri.TaskworkerWorker } ]
11-15 13:21:53.345 14650 14785 I WM-WorkerWrapper: Setting status to enqueued for cfec18da-211c-4944-a715-5ba7b772329f

The first is the current soud_sync_processing task being successfully completed, the next is the enqueueing of the follow up task (now that its prerequisite has completed) - however, I'm never seeing that task get processed. Unless something very strange is happening with WorkManager, the only thing I can think of is if somehow we are setting the delay to a huge increment.

@radinamatic
Copy link
Member

Just to confirm that even the very beginning of the learner import process on another LoD seems to interrupt the syncing of the LoD that was previously syncing correctly.

LoD with the learner marcg (Android 12 device) was syncing fine for a couple of hours, but as soon I started the process of setting up a new LoD, it stopped. I actually did not even finish the user import of the user susanl on the Android 9 tablet, just selected the server device, and the syncing on Android 12 device (with the learner marcg) was interrupted.

IMG_20231115_161135548

adb output here for Android 12: Android12-marcg.log.txt

Android 9 tablet is currently still syncing...

@radinamatic
Copy link
Member

Kolibri logs from the server and the 2 LoDs in this latest round of syncing.

Windows10-server-db-logs.zip
Android12-marcg-db-logs.zip
Android13-neelar-db-logs.zip

@rtibbles
Copy link
Member

This is the correlated log sequence to the logcat logs (noting that the Python logs are in UTC, while the logcat logs are in CET, hence the time discrepancy):

INFO 2023-11-15 12:21:42,155 __main__ Starting Kolibri task worker, for job 50
INFO 2023-11-15 12:21:42,273 kolibri.core.device.soud [user=820a1f6161c90a27fc98527090a07454] [server=58d91f1aff4f6f98b3f5f4def1e6167f] SoUD sync ready
INFO 2023-11-15 12:21:42,424 kolibri.core.discovery.utils.network.client Attempting connections to variations of the URL: http://192.168.1.126:8080/
INFO 2023-11-15 12:21:42,454 kolibri.core.discovery.utils.network.client Attempting connection to: http://192.168.1.126:8080/
INFO 2023-11-15 12:21:42,512 kolibri.core.discovery.utils.network.client Success! We connected to: http://192.168.1.126:8080/api/public/info/?v=3
INFO 2023-11-15 12:21:43,129 kolibri.core.auth.management.utils Remotely preparing data
INFO 2023-11-15 12:21:45,662 kolibri.core.auth.management.utils No records transferred
INFO 2023-11-15 12:21:47,340 kolibri.core.content.kolibri_plugin Processing content requests for synced dataset: bc5dc3336c78409bca3dd06a694ae2de
INFO 2023-11-15 12:21:47,342 kolibri.core.content.utils.content_request Processing new content assignment requests
INFO 2023-11-15 12:21:47,351 kolibri.core.content.utils.content_request Processing new content removal requests
INFO 2023-11-15 12:21:47,360 kolibri.core.content.kolibri_plugin Completed content requests for synced dataset: bc5dc3336c78409bca3dd06a694ae2de
INFO 2023-11-15 12:21:47,617 __main__ Starting Kolibri task worker, for job 175d3c23e4504598b4701afe72408aaf
INFO 2023-11-15 12:21:47,694 __main__ Ending Kolibri task worker, for job 175d3c23e4504598b4701afe72408aaf
INFO 2023-11-15 12:21:48,113 __main__ Starting Kolibri task worker, for job dfeae49f27414d41bbfffe89bb842398
INFO 2023-11-15 12:21:48,475 __main__ Ending Kolibri task worker, for job dfeae49f27414d41bbfffe89bb842398
INFO 2023-11-15 12:21:48,501 kolibri.core.auth.management.utils Locally preparing data to send
INFO 2023-11-15 12:21:50,724 kolibri.core.auth.management.utils No records transferred
INFO 2023-11-15 12:21:53,259 __main__ Ending Kolibri task worker, for job 50

So, this shows the same story of the task successfully completing and the follow up being enqueued, but it then never happens.

@pcenov
Copy link
Member

pcenov commented Nov 15, 2023

Here are my logs from today: https://drive.google.com/drive/folders/1KFFHncnxAsezcemNOxIDr1Ti0qMv2hkZ?usp=sharing
The AndroidLOD1 stopped syncing after several hours of working just fine. Then I created another AndroidLOD2 + Ubuntu and Windows LODs which have been syncing correctly while working together... Quite puzzling!

@radinamatic
Copy link
Member

Promising results with the latest asset on a non-Xiaomi device: leaving the Kolibri app in the background did not interrupt syncing, and it was constant after various intervals of backgrounding. 👏🏽

After that I left both devices for the night (6-7 hours approximately) without closing the app. Upon resume the server reported last sync 7 hours ago, and the app Not recently synced . I re-engaged a resource but the sync was not resuming. Only after the server restart was the sync resumed and the most recent interactions recorded in class activity.

Windows10-db-logs.zip
android12-rayb.log

@pcenov
Copy link
Member

pcenov commented Nov 16, 2023

I first started with testing only with 1 Android LOD - unfortunately it did stop syncing after several hours of working just fine and without an obvious reason to do so.
I then set up another Android LOD with a couple of imported learners, a LOD in an emulator in Android Studio and one Ubuntu LOD all simultaneously syncing without observing any issues - really impressive! Here are all of the logs from today: https://drive.google.com/drive/folders/1VW0ULon0WHPhhR3nUddTx0DNKtTHyYNK?usp=sharing

@bjester bjester marked this pull request as ready for review November 16, 2023 20:19
@bjester bjester added the TODO: needs review Waiting for review label Nov 16, 2023
@bjester bjester requested a review from rtibbles November 16, 2023 20:19
@bjester bjester changed the title Add enqueue logging and ensure we set backoff on unavailable network location Misc fixes for LOD syncing and upgrade Morango Nov 16, 2023
Copy link
Member

@rtibbles rtibbles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All makes sense to me.

I think the only other thing we can do on the Kolibri side here is making our job management more robust to job runner failure (which would be useful for both the Android App, and Cloud installations). Some sort of 'last updated' for a job or somesuch might be helpful to give us a heuristic to be able to reset tasks that report as running but are not really.

# since there should only ever be one processing job running at a time, if we encounter any in
# the queue that are marked as syncing, we should reset their status to pending because it must
# mean that the previous job was terminated unexpectedly
SyncQueue.objects.filter(status=SyncQueueStatus.Syncing,).update(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@bjester bjester merged commit 0ce8716 into learningequality:release-v0.16.x Nov 17, 2023
34 checks passed
@pcenov
Copy link
Member

pcenov commented Nov 17, 2023

@bjester here are the logs for an Android LOD that stopped syncing again after about 2 hours of syncing correctly: https://drive.google.com/drive/folders/1kfOwOcq7wrJhVKtXO49ykUx3Din4sJtL
I see that this PR is already merged but since there is still an issue with the syncing stopping (using the latest assets) this should be further investigated.

@bjester
Copy link
Member Author

bjester commented Nov 17, 2023

@bjester here are the logs for an Android LOD that stopped syncing again after about 2 hours of syncing correctly: https://drive.google.com/drive/folders/1kfOwOcq7wrJhVKtXO49ykUx3Din4sJtL I see that this PR is already merged but since there is still an issue with the syncing stopping (using the latest assets) this should be further investigated.

@pcenov Richard noticed this in the Android logs

NotificationService: Suppressing notification from package org.learningequality.Kolibri by user request.

Did you perhaps silence or disable the Kolibri notifications?

@pcenov
Copy link
Member

pcenov commented Nov 20, 2023

@bjester yes, think I had the notifications disabled on this particular Samsung device. But looking closely at this it can also be an issue on Samsung's end as I don't actually see an option to turn the notifications back on and could be caused by a recent update to Samsung's One UI 5.1 (Android 13).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
DEV: backend Python, databases, networking, filesystem... SIZE: large SIZE: medium SIZE: small TODO: needs review Waiting for review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Beta 8 Android syncing issues
4 participants