
Deleting large numbers of sources can fail such that it is thereafter impossible to delete the sources #5233

Closed
rmol opened this issue May 6, 2020 · 3 comments · Fixed by #5257

rmol commented May 6, 2020

Description

In the journalist interface, if you select a large number of sources and delete them, the operation can take longer than the Apache timeout. Some sources' store directories will have been moved to the shredder before the failure. Thereafter, if you try to delete them, a ValueError will be thrown at line 246 of store.py, in move_to_shredder. The unhandled exception prevents the deletion of the source record, and this is how we get the zombie apocalypse.

Steps to Reproduce

In an environment using Apache (staging, prod VMs, QA hardware):

  • Add 500 sources with qa_loader.py --source-count 500.
  • Log in as a journalist.
  • Select all sources and click delete.

Expected Behavior

That the sources would be deleted without error.

Actual Behavior

You get a gateway timeout. Navigating back to the source list and trying again results in an internal server error, with a stacktrace in /var/log/apache2/journalist-error.log.

Comments

The fix might be as simple as checking for the existence of the source's store directory in journalist_app.utils.delete_collection and only calling move_to_shredder if it still exists. While there, the key pair deletion should be checked as well, so that if it's already gone, the source database record is still deleted.
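A minimal sketch of that guard, under stated assumptions: `StubStorage` is a stand-in for SecureDrop's real store object, and the method names here (`path`, `move_to_shredder`) are illustrative rather than the project's actual API. The point is only the existence check before shredding, so a retry after a partial failure is a no-op instead of a `ValueError`:

```python
import os
import tempfile


class StubStorage:
    """Illustrative stand-in for the store object; names are assumptions."""

    def __init__(self, root):
        self.root = root
        self.shredded = []

    def path(self, filesystem_id):
        return os.path.join(self.root, filesystem_id)

    def move_to_shredder(self, source_dir):
        # Mirrors the reported behaviour: shredding a missing directory raises.
        if not os.path.isdir(source_dir):
            raise ValueError("will not shred non-directory: %s" % source_dir)
        os.rmdir(source_dir)
        self.shredded.append(source_dir)


def delete_collection(filesystem_id, storage):
    """Guarded deletion: only shred the store directory if it still exists."""
    source_dir = storage.path(filesystem_id)
    if os.path.exists(source_dir):
        storage.move_to_shredder(source_dir)
    # ...the keypair deletion would get the same existence check, and the
    # source database record would then be deleted unconditionally...
```

With this guard, calling `delete_collection` a second time for an already-shredded source simply skips the shredder step rather than raising.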

@eloquence eloquence added the bug label May 6, 2020
@redshiftzero redshiftzero added this to the 1.3.0 milestone May 6, 2020
@eloquence (Member)

Since it's not a regression, the agreement today was to undertake a timeboxed (~2 hour) attempt to resolve this for 1.3.0; if it ends up being more complex, we will likely bump it to 1.4.0.

@rmol rmol self-assigned this May 6, 2020
@redshiftzero redshiftzero modified the milestones: 1.3.0, 1.4.0 May 6, 2020

rmol commented May 6, 2020

To no one's surprise, it turned out to be more complex. When the original deletion request times out, the mod_wsgi process is still seeing it to completion, because it doesn't care that Apache has given up.

Any workaround that presses on past nonexistent source store directories or keypairs can still fail in the final steps of journalist_app.utils.delete_collection, when the source record is queried but has already been deleted. A journalist could keep backing up to the source list, refreshing, selecting all, and deleting, and it might look like their efforts are making headway as the list shrinks, but in fact they would just be tripping over these failures while the initial request is still churning through the deletions.

We could lengthen the Apache timeout, or introduce a request timeout in the mod_wsgi configuration, but neither is a sure fix, and either could introduce other problems. The right thing to do is ensure we generate a response to this request in a reasonable timeframe.
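For reference, the two knobs mentioned would look roughly like this (the values and the daemon process name `journalist` are illustrative, not our actual configuration):

```apache
# Apache-level: how long the server waits before returning a gateway timeout.
Timeout 120

# mod_wsgi-level: force the daemon process to respond (or restart) if a
# request runs longer than this many seconds.
WSGIDaemonProcess journalist request-timeout=120
```

Either setting just moves the cliff; a 500-source deletion can outrun any reasonable value, which is why a fast response plus background work is preferable.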

To that end, I'm going to look at adding a deleted flag to Source. Updating that should be quick. Another background worker process will periodically scan for deleted sources and do the work that is currently done in delete_collection. For now, we'll just omit deleted sources from the journalist interface.
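A rough sketch of that flag-and-worker split, assuming the shape described above; the `Source` class, helper names, and worker loop here are illustrative, not the eventual implementation. The request handler only flips the flag (fast), the interface filters on it, and a background worker does the slow per-source cleanup:

```python
import threading


class Source:
    """Illustrative model: only the new `deleted` flag matters here."""

    def __init__(self, filesystem_id):
        self.filesystem_id = filesystem_id
        self.deleted = False


def mark_deleted(sources):
    """The request handler's job: a quick flag update, then respond."""
    for source in sources:
        source.deleted = True


def visible_sources(sources):
    """The journalist interface omits sources marked as deleted."""
    return [s for s in sources if not s.deleted]


def shredder_worker(sources, delete_collection, stop_event, interval=1.0):
    """Background worker: periodically does the work currently done
    inline in delete_collection, outside any request timeout."""
    while not stop_event.is_set():
        # Iterate over a snapshot so removal is safe mid-loop.
        for source in [s for s in sources if s.deleted]:
            delete_collection(source)
            sources.remove(source)
        stop_event.wait(interval)
```

The key property is that the HTTP request returns as soon as `mark_deleted` commits, no matter how many sources were selected; the shredding happens on the worker's schedule.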

@eloquence (Member)

For the sake of clarity, per a chat with Jen: we'll keep this on the 5/6-5/20 sprint for John to continue investigating, but it stays on the 1.4.0 (not 1.3.0) milestone, so it is not subject to the release/QA timeline pressure.
