Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce CollectionManager for preventing concurrent access #11849

Merged
merged 19 commits into from
Aug 16, 2022

Conversation

dae
Copy link
Contributor

@dae dae commented Jul 14, 2022

Currently there are many instances in the AnkiDroid codebase where
the collection is accessed a) on the UI thread, blocking the UI, and
b) in unsafe ways (eg by checking to see if the collection is open,
and then assuming it will remain open for the rest of a method call.
"Fix full download crashing in new schema case" (159a108) demonstrates
a few of these cases.

This PR is an attempt at addressing those issues. It introduces
a withCol function that is intended to be used instead of
CollectionHelper.getCol(). For example, code that previously looked
like:

    val col = CollectionHelper.getInstance().getCol(this)
    val count = col.decks.count()

Can now be used like this:

    val count = withCol { decks.count() }

The block is run on a background thread, and other withCol calls made
in parallel will be queued up and executed sequentially. Because of
the exclusive access, routines can safely close and reopen the
collection inside the block without fear of affecting other callers.

It's not practical to update all the legacy code to use withCol
immediately - too much work is required, and coroutines are not
accessible from Java code. The intention here is that this new path is
gradually bought into. Legacy code can continue calling CollectionHelper.
getCol(), which internally delegates to CollectionManager.

Two caveats to be aware of:

  • Legacy callers will wait for other pending operations to complete
    before they receive the collection handle, but because they retain it,
    subsequent access is not guaranteed to be exclusive.
  • Because getCol() and colIsOpen() are often used on the main thread,
    they will block the UI if a background operation is already running.
    Logging has been added to help diagnose this, eg messages like:

E/CollectionManager: blocked main thread for 2626ms:
com.ichi2.anki.DeckPicker.onCreateOptionsMenu(DeckPicker.kt:624)

Other changes/notes:

  • simplified CoroutineHelpers
  • added TR function for accessing translations without a col reference
  • onCreateOptionsMenu() needed refactoring to avoid blocking the UI.
  • The blocking colIsOpen() call in onPrepareOptionsMenu() had to be
    removed. I can not reproduce the issue it reports, and the code checks
    for col in onCreateOptionsMenu().
  • The subscribers in ChangeManager need to be cleared out at the start
    of each test, or two tests are flaky when run with the new schema.
  • Various accumulated bugfixes

Closes #11734
Closes #11417
Closes #11981
Closes #11979
Closes #11999
Closes #12007
Closes #12026
Closes #12027

@dae
Copy link
Contributor Author

dae commented Jul 16, 2022

@lukstbit in #10839 (comment) you talked about the need to inject dispatchers to make testing easier. If it is needed in this PR, could I trouble you to explain how it would help/suggest changes?

@lukstbit
Copy link
Member

Sorry for the late response.
Being able to inject dispatchers is a good practice because it allows us to use a test dispatcher in tests to make code more predictable. However, I wouldn't add anything to this PR just yet. I think moving to coroutines will require some changes to test code and at that point we could add the code to change dispatcher(a simple setter) in CollectionManager.

@dae dae force-pushed the queue branch 4 times, most recently from 8aa37d0 to e10448b Compare July 22, 2022 05:14
@dae dae mentioned this pull request Jul 22, 2022
6 tasks
@dae dae marked this pull request as ready for review July 28, 2022 06:45
@dae dae force-pushed the queue branch 2 times, most recently from e93809f to eb9eeae Compare July 28, 2022 07:15
@dae dae marked this pull request as draft July 28, 2022 08:11
@dae
Copy link
Contributor Author

dae commented Jul 28, 2022

Can anyone with a Windows machine see if they can reproduce the Windows issue locally?

@BrayanDSO
Copy link
Member

Can anyone with a Windows machine see if they can reproduce the Windows issue locally?

Running with ./gradlew jacocoUnitTestReport --rerun-tasks (--rerun-tasks to clear the cache), it always get hanged on Executing test com.ichi2.anki.DeckPickerTest

If I run the class individually, the tests pass though 🤷‍♂️, so it's hard to diagnose the issue

@dae
Copy link
Contributor Author

dae commented Jul 28, 2022

Thanks Brayan, that at least implies it's an OS issue, and not machine-specific. I can try reproduce this in a VM, but it might take me a while to get around to it.

@mikehardy
Copy link
Member

I have a couple windows VMs I leave laying around for things like this but don't have a lot of time myself and still won't for a couple weeks. Net couple days at least I have a lot of car transit time where I may have network if there's a particular set of tests you want me to run / focus on to try to move this forward?

Also, it appears to need conflict resolution at the moment

@dae
Copy link
Contributor Author

dae commented Jul 31, 2022

Might be easier for me to dig into this directly, but it'll probably be a few weeks before I can return to this.

@dae dae force-pushed the queue branch 5 times, most recently from 58ba3b6 to a00f797 Compare August 6, 2022 02:40
@dae
Copy link
Contributor Author

dae commented Aug 6, 2022

One of the test failures was my fault, which I've resolved. From what I can tell, the six remaining intermittently-failing tests seem to be triggering a bug in the coroutine/JDK code on Windows. I added a bunch of print statements to trace what was going on, and it seems that invoking a suspend fun sometimes hangs on Windows. The print statements reveal that the suspend fun is run, but the calling code never continues after the coroutine completes. I've worked around this for now by excluding the tests on Windows, but this is of course not ideal, as Windows-bound developers will need to rely on CI to test these routines. The desktop code suffers from a similar issue: the javascript unit tests don't run on Windows, so only get checked via CI.

@dae dae marked this pull request as ready for review August 6, 2022 03:14
dae added 11 commits August 16, 2022 02:27
Fixes a startup crash when the current deck ID is set to an invalid value
This regressed in 9644f57

+ Fix incorrect private find() implementation in CardContentProvider, and
move into shared file.

Closes ankidroid#11999
ankidroid#11849 (comment)

Android's onCreateOptionsMenu does not play well with coroutines, as
it expects the menu to have been fully configured by the time the routine
returns. This results in flicker, as the menu gets blanked out, and then
configured a moment later when the coroutine runs. To work around this,
the current state is stored in the deck picker, so that we can redraw the
menu immediately, and then make any potential changes in the background.

Other changes:
- refactored onCreateOptionsMenu to make it simpler
- instead of the sdCardAvailable checks (which I assume is a proxy for
"col is available"), the entire menu is wrapped in a group, and the
visibility of the group is toggled depending on whether the col is available
or not. This also fixes the error on a full sync.
- there are three sets of unit tests (one for search icon, one for sync icon,
one for entire menu) that have been a pain since I originally introduced
this PR, and and I've sunk a number of hours into trying to get them to work
properly at this point. The issue appears to be that when mixing coroutine
calls and invalidateOptionsMenu(), onCreateOptionsMenu() is not getting called
before trying to await the job, leading to a hang or stale data. I tried
advancing robolectric, but it did not help. Maybe someone more experienced
in this area can figure it out, but for now I've changed these routines
to be more of a unit test and less of an integration test: rather than
checking the menu itself, they directly invoke the function that updates
the menu state, and check the state instead. This takes onCreateOptionsMenu()
out of the loop, and avoids the problems (and probably allows these tests
to be re-enabled on Windows as well). The sync tests I've removed, as the
entire menu is hidden/shown now when the col is closed, so they are redundant.
Theoretically it should be unnecessary, as it's called in refreshState()
@dae
Copy link
Contributor Author

dae commented Aug 15, 2022

(Rebased to fix a conflict)

Copy link
Member

@mikehardy mikehardy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice touch on the delayed popup commit add

I'm queueing up dependency updates and a fresh i18n sync, this has sat a couple more days, I'm going to merge this in for a new alpha shortly to get all these fixes in and maintain what velocity we have here

@mikehardy mikehardy added Pending Merge Things with approval that are waiting future merge (e.g. targets a future release, CI wait, etc) and removed Review High Priority Request for high priority review Needs Review Needs Second Approval Has one approval, one more approval to merge labels Aug 16, 2022
@mikehardy mikehardy merged commit 771056c into ankidroid:main Aug 16, 2022
mikehardy pushed a commit that referenced this pull request Aug 16, 2022
#11849 (comment)

Android's onCreateOptionsMenu does not play well with coroutines, as
it expects the menu to have been fully configured by the time the routine
returns. This results in flicker, as the menu gets blanked out, and then
configured a moment later when the coroutine runs. To work around this,
the current state is stored in the deck picker, so that we can redraw the
menu immediately, and then make any potential changes in the background.

Other changes:
- refactored onCreateOptionsMenu to make it simpler
- instead of the sdCardAvailable checks (which I assume is a proxy for
"col is available"), the entire menu is wrapped in a group, and the
visibility of the group is toggled depending on whether the col is available
or not. This also fixes the error on a full sync.
- there are three sets of unit tests (one for search icon, one for sync icon,
one for entire menu) that have been a pain since I originally introduced
this PR, and and I've sunk a number of hours into trying to get them to work
properly at this point. The issue appears to be that when mixing coroutine
calls and invalidateOptionsMenu(), onCreateOptionsMenu() is not getting called
before trying to await the job, leading to a hang or stale data. I tried
advancing robolectric, but it did not help. Maybe someone more experienced
in this area can figure it out, but for now I've changed these routines
to be more of a unit test and less of an integration test: rather than
checking the menu itself, they directly invoke the function that updates
the menu state, and check the state instead. This takes onCreateOptionsMenu()
out of the loop, and avoids the problems (and probably allows these tests
to be re-enabled on Windows as well). The sync tests I've removed, as the
entire menu is hidden/shown now when the col is closed, so they are redundant.
@github-actions github-actions bot added this to the 2.16 release milestone Aug 16, 2022
@github-actions github-actions bot removed the Pending Merge Things with approval that are waiting future merge (e.g. targets a future release, CI wait, etc) label Aug 16, 2022
dae added a commit to ankitects/Anki-Android that referenced this pull request Aug 20, 2022
46c4f4d changed a single doProgress()
call to multiple ones. I suspect the speed improvements in that PR
came from checking cancellation at the top, and the other change was
not necessary.

Things got worse with the introduction of ankidroid#11849, as the progress listener
is called it a hot loop, and it invokes colIsOpen() each time.

This change puts performance back on par with the legacy path; further
improvements should be possible in the future by switching to a recycler
view and another backend method: ankidroid#11889
dae added a commit to ankitects/Anki-Android that referenced this pull request Aug 20, 2022
46c4f4d changed a single doProgress()
call to multiple ones. I suspect the speed improvements in that PR
came from checking cancellation at the top, and the other change was
not necessary.

Things got worse with the introduction of ankidroid#11849, as the progress listener
is called it a hot loop, and it invokes colIsOpen() each time.

This change puts performance back on par with the legacy path; further
improvements should be possible in the future by switching to a recycler
view and another backend method: ankidroid#11889
mikehardy pushed a commit that referenced this pull request Aug 20, 2022
46c4f4d changed a single doProgress()
call to multiple ones. I suspect the speed improvements in that PR
came from checking cancellation at the top, and the other change was
not necessary.

Things got worse with the introduction of #11849, as the progress listener
is called it a hot loop, and it invokes colIsOpen() each time.

This change puts performance back on par with the legacy path; further
improvements should be possible in the future by switching to a recycler
view and another backend method: #11889
@github-actions
Copy link
Contributor

github-actions bot commented Oct 8, 2022

Hi there @dae! This is the OpenCollective Notice for PRs merged from 2022-08-01 through 2022-08-31

If you are interested in compensation for this work, the process with details is here:

https://github.com/ankidroid/Anki-Android/wiki/OpenCollective-Payment-Process#how-to-get-paid

We only post one comment per person per month to avoid spamming you, regardless of the number of PRs merged, but this note applies to all PRs merged for this month

Please note that GSoC contributions are okay for this process. Our philosophy is that our users have donated to AnkiDroid for all contributions. The only PRs that will not go through the OpenCollective process are ones directly related to am accepted GSoC project from a selected participant, since those receive a stipend from GSoC itself.

Please understand that our monthly budget is never guaranteed to cover all claims - the cap on payments-per-person may be lower, but we try to make our process as fair and transparent as possible, we just need your understanding.

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment