Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Follow-up issues & fixes for Kenya user study research project #4897

Closed
BenHenning opened this issue Mar 10, 2023 · 0 comments · Fixed by #4896
Closed

Follow-up issues & fixes for Kenya user study research project #4897

BenHenning opened this issue Mar 10, 2023 · 0 comments · Fixed by #4896
Assignees

Comments

@BenHenning
Copy link
Member

Some issues were found during the Kenya user study that led to #4896 being created, and some additional existing issues were found during development.

Specifically, this issue is tracking adding:

  • Adding tracking information to the study-only learner & IDs screen that shows per-profile event tracking information (to help with server-side correlation).
  • Sharing functionality for on-device stored events (so that we can work with these events directly in the event that they fail to upload).
  • A force sync button to upload events immediately.

And, it's tracking fixing:

  • The sync status indicator can be wrong sometimes: if there are pending events that are cached but haven't been uploaded yet, and then an event is later logged due to internet connectivity being restored, the indicator will specify that events have been uploaded (even though there are some still not uploaded).
  • The possibility of cached events being lost if an event is logged with internet connectivity upon app open (with cached events) before the app's cached events are read.
  • The possibility of cached events being duplicated in very specific situations (that are very unlikely to actually occur in practice, but are theoretically possible).
@BenHenning BenHenning self-assigned this Mar 10, 2023
BenHenning added a commit that referenced this issue Mar 10, 2023
## Explanation
Fixes #4897

### Overview of changes

This PR follows up from the changes introduced in #4846 by adding new
functionality to help improve event reliability investigations, as well
as telemetry diagnostics for research project facilitators that need to
ensure events have been successfully uploaded for processing. To help
with this, the following features have been introduced:
- A new "force upload" button that immediately uploads any pending
cached events.
- Event counts to show both the number of events waiting to be uploaded
& the events that are successfully uploaded.
- An improvement on the 'share' button to include a Base64 deflated
binary representation of the entire event log store on the device.

Leading up to, and during, development, a number of theoretical issues
were encountered that were also fixed in case they are playing a role in
some of the ongoing concerns around Firebase event delivery reliability:
- The sync status reported on the learner analytics screen can be wrong
in some cases, leading to the assumption that events have been uploaded
when they may not have been.
- Events can, in some cases, be lost from previous app instances if they
haven't been uploaded yet and, in other cases, duplicated. The former is
much more likely to happen in practice, particularly with the
connectivity patterns being used in the research project.

### Technical specifics

- ``FakeSyncStatusManager`` was renamed to ``TestSyncStatusManager``
since it's actually augmenting the production implementation of
``SyncStatusManager``, not replacing it.
- Events are now being cached indefinitely if the learner study feature
is enabled. This allows for some robustness improvements (see below),
and the ability to share the events themselves if the team suspects
events are not arriving to Firebase.
- The issue of sync status being incorrect came from an unreliable means
of reporting sync status during different event scenarios. Specifically,
if the ``LogUploadWorker`` doesn't kick off and there's network
connectivity, a new event being logged could trigger the sync status to
become "data uploaded" even though there are still pending events. The
new solution is to leverage the now-tracked uploaded events to determine
whether there are still events yet waiting to upload. In this way, the
sync status should always be the most up-to-date information with one
caveat: network connectivity changes won't reflect until a new log is
requested (this is a limitation in how the app tracks network
connectivity, and it didn't seem very important to fix it).
- The issue of possible event duplication comes from
``PersistentCacheStore`` using ``mergeFrom`` with the *current* cached
state and the on-disk proto being loaded. I originally thought that all
pathways to loading the proto from disk could only execute once, but
there's at least one pathway where it's possible to chain multiple calls
into the cache store such that the load happens twice (and both results
end up being merged together, hence the event duplication). This has
been fixed by never reloading the cache store if it's already been
loaded which should always be correct.
- The issue if possible events being lost comes from
``AnalyticsController`` previously not priming the event log cache. This
means that if a new event is logged before the cache is read (e.g. early
events like "open profile screen" can race against ``LogUploadWorker``
kicking off on app startup) then the cache will be completely reset and,
once written, will overwrite any events currently cached on disk. This
has been fixed by ensuring that the event log cache has been primed
before any interactions with it (read or write) occur.

Note that all of the above has been thoroughly tested through automated
testing to ensure that these issues have been properly fixed, and stay
that way in the long-term.

### Test notes & exemptions

- In order to simplify testing ``SyncStatusManagerImpl`` and
``TestSyncStatusManager``, they ought to both pass a common test suite
specific to ``SyncStatusManager`` itself. This has been done by
introducing a ``SyncStatusManagerTestBase`` class that's inherited by
both implementation test suites so that they inherit common API tests.
This pattern is used in one other location in the codebase currently:
https://github.com/oppia/oppia-android/blob/0290ef8037f798d8ba2721441612a80dec4a1f1a/testing/src/test/java/org/oppia/android/testing/threading/TestCoroutineDispatcherTestBase.kt#L33
- ``CircularProgressIndicatorAdaptersTestActivity`` has been exempted
for an a11y label for the same reason as all other test activities: it
technically doesn't need one.
- ``TestSyncStatusManagerTest`` has been allow-listed to use
parameterized tests since it's aggressively verifying many different
force/override sync status cases. Making sure that the test utility
works correctly is extremely important to avoid false failures/passes of
dependent tests (which are, in turn, guarding against regressions for
the core issues covered by this PR, among other event system behaviors).
- ``SyncStatusManagerTestBase`` has been exempted from requiring KDocs
since it's technically a test suite, so KDocs aren't required.
- A number of classes have been exempted from having new test suites as
per existing conventions and requirements: ``ControlButtonsViewModel``
(renamed from ``ShareIdsViewModel``),
``CircularProgressIndicatorAdaptersTestActivity`` (simple test
activities don't require their own tests),
``CircularProgressIndicatorAdaptersTestViewModel`` (view models aren't
tested directly), ``SyncStatusManagerTestBase`` (the test base is more
or less a test suite itself, so it doesn't make sense for it to have
tests as well).
- ``TestSyncStatusManagerTest`` has been disabled in Gradle since it
depends on ``SyncStatusManagerTestBase`` which is part of a different
module. To make this work, the test base would have to be moved to
src/main of the testing module which isn't a very clean approach. The
current setup works fine with Bazel, and longer term this will get
cleaner by moving testing utilities like ``TestSyncStatusManager`` to be
near their corresponding production implementations (we just can't do
this today because of Gradle limitations).
- Due to the asynchronous and data race nature of some of the issues
discovered during test writing, changed & added tests were sometimes
found to be flaky (and, in some cases, very rarely, e.g. 1/50). To
ensure that all of the new & changed tests are stable, I made sure to
run all 8 affected test suites 128 times to ensure stability. Here are
the results of that run ([buildbuddy
link](https://app.buildbuddy.io/invocation/5da310e1-3cc9-451b-a1d0-951a3322c8d7)):

![128xruns_followup_mr6_fixes](https://user-images.githubusercontent.com/12983742/224290120-5bb2991d-59c1-421a-8b73-e05838734253.png)


## Essential Checklist
<!-- Please tick the relevant boxes by putting an "x" in them. -->
- [x] The PR title and explanation each start with "Fix #bugnum: " (If
this PR fixes part of an issue, prefix the title with "Fix part of
#bugnum: ...".)
- [x] Any changes to
[scripts/assets](https://github.com/oppia/oppia-android/tree/develop/scripts/assets)
files have their rationale included in the PR explanation.
- [x] The PR follows the [style
guide](https://github.com/oppia/oppia-android/wiki/Coding-style-guide).
- [x] The PR does not contain any unnecessary code changes from Android
Studio
([reference](https://github.com/oppia/oppia-android/wiki/Guidance-on-submitting-a-PR#undo-unnecessary-changes)).
- [x] The PR is made from a branch that's **not** called "develop" and
is up-to-date with "develop".
- [x] The PR is **assigned** to the appropriate reviewers
([reference](https://github.com/oppia/oppia-android/wiki/Guidance-on-submitting-a-PR#clarification-regarding-assignees-and-reviewers-section)).

## For UI-specific PRs only

Following is a video demonstrating the new functionality introduced in
this PR:


https://user-images.githubusercontent.com/12983742/224289397-73c5e3a7-4f7a-4116-8e71-3316e30fa9d7.mp4

Note that affected UI tests that could run on Espresso were **not**
verified in this PR similar for the reasons listed in #4846
(specifically, the ongoing issue of Espresso tests being time consuming
to investigate, often broken, and this PR being high priority).

Localization & accessibility changes aren't relevant for this change,
and any tablet-specific issues aren't important since the affected
screen is only expected to be accessed on handset devices.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant