
Import content issue fixes #9242

Merged
merged 5 commits from too_many_djs into release-v0.15.x
Apr 6, 2022

Conversation

rtibbles
Member

@rtibbles rtibbles commented Apr 1, 2022

Summary

  • Fixes a regression from Concurrent import file download #7521 that caused cancellation of large downloads to take a very long time, because cancellation waited for each individual file download to be processed by the ThreadExecutor before checking whether the task had been cancelled.
  • Adds options for regular priority workers and high priority workers, giving a canonical location to reference how many workers are in use.
  • Adds additional tolerance in the total number of file descriptors used when selecting the total number of server threads.
  • Only allocates as many download worker threads as will not overwhelm the number of file descriptors available to the process.
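The cancellation fix described in the first bullet can be sketched roughly as follows. This is a minimal illustration with assumed names (`transfer_files`, `download`, `is_cancelled`), not the actual Kolibri implementation: the idea is to check for cancellation between submissions and cancel any still-pending futures, instead of waiting for every queued download to be processed by the executor.

```python
# Sketch only: check cancellation between dispatches and cancel pending
# futures, rather than draining the whole download queue on cancel.
from concurrent.futures import ThreadPoolExecutor


def transfer_files(files, download, is_cancelled, max_workers=10):
    futures = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        for f in files:
            if is_cancelled():
                # Break out immediately rather than submitting more work.
                break
            futures.append(executor.submit(download, f))
        if is_cancelled():
            # Cancel everything still waiting in the queue; downloads that
            # already started will finish, but nothing new is started.
            for future in futures:
                future.cancel()
    # Return only the futures that actually ran.
    return [f for f in futures if not f.cancelled()]
```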

References

Fixes #9229

Reviewer guidance

Import all of a Khan Academy channel, then cancel the task; it should cancel quickly.

Attempt to run imports of at least 5 channels simultaneously with the open file descriptor limit (ulimit -n) set to 256, and confirm that no errors occur.


Testing checklist

  • Contributor has fully tested the PR manually
  • If there are any front-end changes, before/after screenshots are included
  • Critical user journeys are covered by Gherkin stories
  • Critical and brittle code paths are covered by unit tests

PR process

  • PR has the correct target branch and milestone
  • PR has 'needs review' or 'work-in-progress' label
  • If PR is ready for review, a reviewer has been added. (Don't use 'Assignees')
  • If this is an important user-facing change, PR or related issue has a 'changelog' label
  • If this includes an internal dependency change, a link to the diff is provided

Reviewer checklist

  • Automated test coverage is satisfactory
  • PR is fully functional
  • PR has been tested for accessibility regressions
  • External dependency files were updated if necessary (yarn and pip)
  • Documentation is updated
  • Contributor is in AUTHORS.md

Properly break from file transfer loop when canceled.
Cancel all futures when canceled.
@rtibbles rtibbles added the TODO: needs review Waiting for review label Apr 1, 2022
@rtibbles rtibbles requested review from bjester and radinamatic April 1, 2022 15:03
Add upper bound to number of workers.
Update tests for additional cancel checks.
@pcenov
Member

pcenov commented Apr 4, 2022

@radinamatic I tested this on both the Windows and Linux builds, and both channel import cancellation and simultaneous channel import are functioning correctly without any noticeable issues.


@radinamatic radinamatic left a comment


Manual QA passes, good to go! 👍🏽 :shipit:


@jredrejo jredrejo left a comment


From my point of view, the code is fine.
Canceling a job works perfectly, no matter the size of the channel or how many channels are being downloaded.

However, I've not been able to import five channels at the same time. On every occasion where I had four channels being imported, one of them failed when I tried to import a new one. This is the flow I saw:

  • Begin to import up to 4 channels, some of them big
  • Select another channel to import
  • Then I saw errors in the console while this screen was in the browser:
    2022-04-06_18-44_listing
  • One of the channels being imported failed (it happened whenever I tried to import the fifth):
    2022-04-06_18-40_failed
    again:
    2022-04-06_18-48_fail2

The traceback in the console is always the same (a DB lock is occurring):

INFO: 127.0.0.1 - - "GET /api/content/remotechannel/0a3446937e3340fa86e6010ba80e16e1/" 200 0 "http://localhost:8080/en/device/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.82 Safari/537.36"
ERROR    An error occurred during content import: (sqlite3.OperationalError) database is locked
[SQL: UPDATE jobs SET saved_job=? WHERE jobs.id = ?]
[parameters: ('{"job_id": "53fd3a8f5246429dba17e782a4ea5695", "state": "RUNNING", "exception": null, "traceback": null, "track_progress": true, "cancellable": true, ... (633 characters truncated) ... _id": null, "node_ids": ["c9d7f950ab6b5a1199e3d6c10d7f0103"], "exclude_node_ids": []}, "func": "django.core.management.call_command", "result": null}', '53fd3a8f5246429dba17e782a4ea5695')]
(Background on this error at: http://sqlalche.me/e/e3q8)
INFO     Setting availability to True of 220 LocalFile objects based on passed in checksums
INFO     Setting availability of non-topic ContentNode objects based on LocalFile availability in 6 batches of 10000
INFO     Annotating ContentNode objects with children for 6 levels
INFO     Annotating ContentNode objects with children for level 6
INFO     Annotating ContentNode objects with children for level 5
INFO     Annotating ContentNode objects with children for level 4
INFO     Annotating ContentNode objects with children for level 3
INFO     Annotating ContentNode objects with children for level 2
INFO     Annotating ContentNode objects with children for level 1
ERROR    Job 53fd3a8f5246429dba17e782a4ea5695 raised an exception: Traceback (most recent call last):
  File "/datos/le/mio/kolibri/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1283, in _execute_context
    self.dialect.do_execute(
  File "/datos/le/mio/kolibri/venv/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 590, in do_execute
    cursor.execute(statement, parameters)
sqlite3.OperationalError: database is locked

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/datos/le/mio/kolibri/kolibri/core/tasks/worker.py", line 61, in handle_finished_future
    result = future.result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/usr/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/datos/le/mio/kolibri/kolibri/core/tasks/job.py", line 117, in execute_job
    result = func(*args, **kwargs)
  File "/datos/le/mio/kolibri/venv/lib/python3.8/site-packages/django/core/management/__init__.py", line 131, in call_command
    return command.execute(*args, **defaults)
  File "/datos/le/mio/kolibri/venv/lib/python3.8/site-packages/django/core/management/base.py", line 330, in execute
    output = self.handle(*args, **options)
  File "/datos/le/mio/kolibri/kolibri/core/tasks/management/commands/base.py", line 119, in handle
    return self.handle_async(*args, **options)
  File "/datos/le/mio/kolibri/kolibri/core/content/management/commands/importcontent.py", line 532, in handle_async
    self.download_content(
  File "/datos/le/mio/kolibri/kolibri/core/content/management/commands/importcontent.py", line 164, in download_content
    self._transfer(
  File "/datos/le/mio/kolibri/kolibri/core/content/management/commands/importcontent.py", line 476, in _transfer
    raise self.exception
  File "/datos/le/mio/kolibri/kolibri/core/content/management/commands/importcontent.py", line 406, in _transfer
    overall_progress_update(data_transferred)
  File "/datos/le/mio/kolibri/kolibri/core/tasks/management/commands/base.py", line 62, in update_progress
    self.update_callback(p.progress_fraction, p)
  File "/datos/le/mio/kolibri/kolibri/core/tasks/management/commands/base.py", line 115, in _update_all_progress
    self.update_progress(progress_list[0].progress_fraction, 1.0)
  File "/datos/le/mio/kolibri/kolibri/core/tasks/management/commands/base.py", line 123, in update_progress
    self.job.update_progress(progress_fraction, total_progress)
  File "/datos/le/mio/kolibri/kolibri/core/tasks/job.py", line 254, in update_progress
    self.storage.update_job_progress(self.job_id, progress, total_progress)
  File "/datos/le/mio/kolibri/kolibri/core/tasks/storage.py", line 263, in update_job_progress
    self._update_job(job_id, progress=progress, total_progress=total_progress)
  File "/datos/le/mio/kolibri/kolibri/core/tasks/storage.py", line 302, in _update_job
    return job, orm_job
  File "/usr/lib/python3.8/contextlib.py", line 120, in __exit__
    next(self.gen)
  File "/datos/le/mio/kolibri/kolibri/core/tasks/storage.py", line 65, in session_scope
    session.commit()
  File "/datos/le/mio/kolibri/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 1042, in commit
    self.transaction.commit()
  File "/datos/le/mio/kolibri/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 504, in commit
    self._prepare_impl()
  File "/datos/le/mio/kolibri/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 483, in _prepare_impl
    self.session.flush()
  File "/datos/le/mio/kolibri/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2523, in flush
    self._flush(objects)
  File "/datos/le/mio/kolibri/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2664, in _flush
    transaction.rollback(_capture_exception=True)
  File "/datos/le/mio/kolibri/venv/lib/python3.8/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
    compat.raise_(
  File "/datos/le/mio/kolibri/venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
    raise exception
  File "/datos/le/mio/kolibri/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2624, in _flush
    flush_context.execute()
  File "/datos/le/mio/kolibri/venv/lib/python3.8/site-packages/sqlalchemy/orm/unitofwork.py", line 422, in execute
    rec.execute(self)
  File "/datos/le/mio/kolibri/venv/lib/python3.8/site-packages/sqlalchemy/orm/unitofwork.py", line 586, in execute
    persistence.save_obj(
  File "/datos/le/mio/kolibri/venv/lib/python3.8/site-packages/sqlalchemy/orm/persistence.py", line 230, in save_obj
    _emit_update_statements(
  File "/datos/le/mio/kolibri/venv/lib/python3.8/site-packages/sqlalchemy/orm/persistence.py", line 994, in _emit_update_statements
    c = cached_connections[connection].execute(
  File "/datos/le/mio/kolibri/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1020, in execute
    return meth(self, multiparams, params)
  File "/datos/le/mio/kolibri/venv/lib/python3.8/site-packages/sqlalchemy/sql/elements.py", line 298, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/datos/le/mio/kolibri/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1133, in _execute_clauseelement
    ret = self._execute_context(
  File "/datos/le/mio/kolibri/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1323, in _execute_context
    self._handle_dbapi_exception(
  File "/datos/le/mio/kolibri/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1517, in _handle_dbapi_exception
    util.raise_(
  File "/datos/le/mio/kolibri/venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
    raise exception
  File "/datos/le/mio/kolibri/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1283, in _execute_context
    self.dialect.do_execute(
  File "/datos/le/mio/kolibri/venv/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 590, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) database is locked
[SQL: UPDATE jobs SET saved_job=? WHERE jobs.id = ?]
[parameters: ('{"job_id": "53fd3a8f5246429dba17e782a4ea5695", "state": "RUNNING", "exception": null, "traceback": null, "track_progress": true, "cancellable": true, ... (633 characters truncated) ... _id": null, "node_ids": ["c9d7f950ab6b5a1199e3d6c10d7f0103"], "exclude_node_ids": []}, "func": "django.core.management.call_command", "result": null}', '53fd3a8f5246429dba17e782a4ea5695')]
(Background on this error at: http://sqlalche.me/e/e3q8)

I'm approving this PR because it solves the issue it mentions. There are no resource limit problems at all, just SQLite's traditional lack of write concurrency. In any case, it would be good to be aware of this DB problem. I don't think there's much we can do about it, except advising a switch to PostgreSQL when this kind of concurrent work is needed.


if self.is_cancelled():
    break
max_workers = 10
Member


How does this relate to the two new options REGULAR_PRIORITY_WORKERS and HIGH_PRIORITY_WORKERS? I.e., what happens if REGULAR_PRIORITY_WORKERS + HIGH_PRIORITY_WORKERS > 10? It seems these values are ignored by the code below, doesn't it?

Member Author


Yes, those options are only used to estimate how many total content import jobs could be running at any one time - as these additional workers are being run within the context of a single worker.
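The interplay described here could be sketched as follows. All names (`max_download_workers`, the parameter defaults) are hypothetical and not Kolibri's actual option handling: the two priority-worker options bound how many import jobs can run at once, and the per-job download worker count is then capped so the combined file descriptor usage stays within the process limit.

```python
# Hypothetical illustration (not Kolibri's actual code) of deriving the
# per-job download worker cap from the process file descriptor limit.
# Note: the resource module is Unix-only.
import resource


def max_download_workers(regular_workers, high_workers, soft_limit=None,
                         fds_per_download=2, reserved_fds=64, hard_cap=10):
    # REGULAR_PRIORITY_WORKERS + HIGH_PRIORITY_WORKERS estimates how many
    # import jobs could run concurrently, each with its own download pool.
    if soft_limit is None:
        soft_limit, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    concurrent_jobs = max(regular_workers + high_workers, 1)
    # Reserve headroom for server threads, then split what remains per job.
    available = soft_limit - reserved_fds
    per_job = max(available // concurrent_jobs, 1)
    # Clamp between 1 and the fixed upper bound of 10 workers.
    return max(min(per_job // fds_per_download, hard_cap), 1)
```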

@rtibbles
Member Author

rtibbles commented Apr 6, 2022

@jredrejo thanks - it seems we have a pre-existing concurrency issue that the DB write lock machinery we have in place is meant to handle, but it's evidently still not functioning as needed. I think we need a better solution for essentially waiting as long as we need to within async tasks for the SQLite DB to become available for writing.
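One possible shape for "waiting as long as we need" on SQLite writes is retrying a locked write with exponential backoff. This is an illustrative sketch only, not the existing Kolibri write-lock machinery; `execute_with_retry` is a made-up helper name.

```python
# Sketch: retry a write that hits "database is locked" with exponential
# backoff, instead of letting the OperationalError fail the whole job.
import sqlite3
import time


def execute_with_retry(conn, sql, params=(), retries=5, delay=0.1):
    for attempt in range(retries):
        try:
            with conn:  # commits on success, rolls back on error
                return conn.execute(sql, params)
        except sqlite3.OperationalError as e:
            if "locked" not in str(e) or attempt == retries - 1:
                # Not a lock error, or out of retries: re-raise.
                raise
            time.sleep(delay * (2 ** attempt))  # back off and try again
```

SQLite's own `PRAGMA busy_timeout` (or the `timeout` argument to `sqlite3.connect`) achieves something similar at the connection level.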

@rtibbles rtibbles merged commit bc29ae2 into learningequality:release-v0.15.x Apr 6, 2022
@rtibbles rtibbles deleted the too_many_djs branch April 6, 2022 17:04