Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add method to detect if network has changed and reinitialize zeroconf #8231

Merged
merged 3 commits into from
Jul 29, 2021

Conversation

jamalex
Copy link
Member

@jamalex jamalex commented Jul 28, 2021

Summary

We identified that zeroconf doesn't pick up when a new network connection is established, so it doesn't become discoverable on networks that it wasn't connected to at the time Kolibri started up.

This PR adds a method for detecting when the set of bound IP addresses on the system has changed, and reinitializing the zeroconf internals to ensure they're attached to each of those IPs.

Note: no code has been added to actually call this new method. We may want to spawn a thread that calls it every minute (as it should be very lightweight most of the time, since it's checking the list of bound IPs against its cached copy and only doing something if it changes). We'll need to ensure it happens in the same process, as the zeroconf state is process-local (in global vars inside kolibri.core.discovery.utils.network.search), so we may not want to use iceqube.

Reviewer guidance

To test:

  • Have Kolibri running on another machine on the network
  • On the test machine, disconnect from the network (e.g. disable wifi)
  • Start kolibri shell, and run from kolibri.core.discovery.utils.network.search import *; register_zeroconf_service(999)
  • Run [p["device_name"] for p in get_peer_instances()], and you should only see your own machine name
  • Reconnect to the network
  • Run [p["device_name"] for p in get_peer_instances()] again several times, waiting in between, and you should still only see your own machine name
  • Run reinitialize_zeroconf_if_network_has_changed()
  • Start trying [p["device_name"] for p in get_peer_instances()] again. After a short time, it should start showing the name of the other device on your network in the list as well.

Hopefully sorts out #8227

Testing checklist

  • Contributor has fully tested the PR manually
  • If there are any front-end changes, before/after screenshots are included
  • Critical user journeys are covered by Gherkin stories
  • Critical and brittle code paths are covered by unit tests

PR process

  • PR has the correct target branch and milestone
  • PR has 'needs review' or 'work-in-progress' label
  • If PR is ready for review, a reviewer has been added. (Don't use 'Assignees')
  • If this is an important user-facing change, PR or related issue has a 'changelog' label
  • If this includes an internal dependency change, a link to the diff is provided

Reviewer checklist

  • Automated test coverage is satisfactory
  • PR is fully functional
  • PR has been tested for accessibility regressions
  • External dependency files were updated if necessary (yarn and pip)
  • Documentation is updated
  • Contributor is in AUTHORS.md

@jamalex jamalex added this to the 0.15.0 milestone Jul 28, 2021
@jamalex jamalex requested a review from rtibbles July 28, 2021 07:03
@jamalex jamalex added the TODO: needs review Waiting for review label Jul 28, 2021
@pcenov
Copy link
Member

pcenov commented Jul 29, 2021

Hi @jamalex I tested this build on 2 Windows 10 laptops and now it's not possible to import a facility at all.
Here's a screenshot from the Learner's device showing the disabled radio-button. Notice that there's a /?v=2 parameter added to the the end of the URL which for some reason is added automatically:

2021-07-29_10-58-10

Here's a screenshot with the same issue observed when I attempt to import a facility from Device>Facilities:

2021-07-29_12-58-50

I'm also observing unusually high CPU usage with this build:

2021-07-29_11-07-28

Here are the logs from both laptops:
Admin logs.zip
Learner logs.zip

@jredrejo
Copy link
Member

Hi @jamalex I tested this build on 2 Windows 10 laptops and now it's not possible to import a facility at all.
Here's a screenshot from the Learner's device showing the disabled radio-button. Notice that there's a /?v=2 parameter added to the the end of the URL which for some reason is added automatically:

2021-07-29_10-58-10

Here's a screenshot with the same issue observed when I attempt to import a facility from Device>Facilities:

2021-07-29_12-58-50

I'm also observing unusually high CPU usage with this build:

2021-07-29_11-07-28

Here are the logs from both laptops:
Admin logs.zip
Learner logs.zip

Hello @pcenov that's an issue that was solved yesterday. Are you using the build that was created yesterday? https://github.com/learningequality/kolibri/releases/tag/v0.15.0-alpha2 That should have that problem solved.

@pcenov
Copy link
Member

pcenov commented Jul 29, 2021

Hi @jamalex as suggested by @jredrejo you'll need to rebase develop as he says that the issue I'm facing has been fixed in #8232 and merged yesterday in develop.

@rtibbles rtibbles force-pushed the rolodex_refresher branch from 7b1b1ff to 025c096 Compare July 29, 2021 14:49
@rtibbles
Copy link
Member

Hi @pcenov I have rebased this branch now.

Copy link
Member

@rtibbles rtibbles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels a little circular, but I approve this PR.

@rtibbles rtibbles merged commit 5c88490 into learningequality:release-v0.15.x Jul 29, 2021
@pcenov
Copy link
Member

pcenov commented Jul 30, 2021

Hi @rtibbles I did another test run on this using this build and unfortunately the issue described in #8227 is still present.

Here are the logs:
Learner logs.zip
Admin logs.zip

Let me know if you need additional info or anything else.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
TODO: needs review Waiting for review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sync a facility - Hard to reproduce issue where syncing a facility fails
4 participants