Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve deletion of submissions #4713

Merged
merged 3 commits into from
Aug 29, 2019

Conversation

rmol
Copy link
Contributor

@rmol rmol commented Aug 27, 2019

Status

Ready for review

Description of Changes

Replace srm with shred, which is faster, reducing the chance that deletion of a submission will be interrupted.

Update rq and redis requirements, to eliminate long-standing bugs.

Add manage.py tasks for detecting and correcting submissions that have been disconnected from their files on disk, and vice versa. Update manage.py to explicitly run with the production virtualenv. Also specify the virtualenv in WSGI scripts and the run-test script. In the dev/test Docker container, install requirements in a virtualenv at the same path as production.

Add a supervisor script for requeuing interrupted rq jobs. If the app server is rebooted while an rq job is running, that job has already been deleted from the queue and rq will not automatically resume it on reboot, but it does have a record of it in the queue's started job registry. This script checks that registry for jobs that aren't already queued or being run, and requeues them.

Fixes #4711.
Fixes #4712.

Testing

Testing everything requires production VMs or hardware. These instructions assume VMs; adjust IP addresses as necessary if you want to test on hardware.

  • Check out this branch.
  • Build debs with make build-debs.
  • Copy the packages to the app server: scp build/xenial/*.deb [email protected]:
  • Copy the OSSEC server packages to the mon server: scp build/xenial/*ossec-server*.deb [email protected]:
  • Install the copied debs with dpkg -i --auto-deconfigure (--auto-deconfigure is necessary to upgrade securedrop-app-code as cron-apt would in production). On the app server, remove the two "ossec-server" packages and you can dpkg -i --auto-deconfigure *.deb.

Testing automatic requeuing of interrupted deletions

Establish two SSH connections to the app server. In one, become root with sudo su - and in the other become www-data with sudo -u www-data bash. In the www-data shell:

  • Activate the securedrop-app-code virtualenv: . /opt/venvs/securedrop-app-code/bin/activate
  • cd /var/www/securedrop
  • Create a big file that will take a while to delete: dd if=/dev/zero of=/var/lib/securedrop/store/bigfile bs=1M count=1000
  • Submit a job to delete it:
    • python3
      >>> import rm
      >>> import worker
      >>> q = worker.create_queue()
      >>> q.enqueue(rm.secure_delete, "/var/lib/securedrop/store/bigfile")
      
    • Exit Python.

In the root shell:

  • Reboot, then reconnect.
  • Look at the rqrequeue log: less /var/log/securedrop_worker/rqrequeue.err -- at the end you should see lines like this:
    2019-08-08 17:31:01,118 INFO Running every 60 seconds.
    2019-08-08 17:31:01,141 INFO Requeuing job <Job 1082e71f-7581-448c-b84b-027e55b4ef8e: rm.secure_delete('/var/lib/securedrop/store/bigfile')>
    2019-08-08 17:32:01,192 INFO Skipping job 1082e71f-7581-448c-b84b-027e55b4ef8e, which is already being run by worker rq:worker:6a6b548310f948e291fa954743b8094f
    
    That indicates the interrupted job was found and restarted, but was left alone at the next check because it was already running. The job should run to completion, /var/lib/securedrop/store/bigfile should be deleted, and the rqrequeue log should start saying:
    2019-08-08 17:33:01,253 INFO No interrupted jobs found in started job registry.
    

Testing detection and correction of disconnected submissions

Visit the source interface and send two messages. First we'll test a disconnected database record.
In your www-data shell:

  • cd /var/lib/securedrop/store
  • ls -laR
  • You should see the two message files. Remove one with rm.
  • cd /var/www/securedrop
  • ./manage.py check-disconnected-db-submissions should report There are submissions in the database with no corresponding files. Run "manage.py list-disconnected-db-submissions" for details.
  • ./manage.py list-disconnected-db-submissions should list the ID of the deleted submission, e.g. 2.
  • ./manage.py delete-disconnected-db-submissions should prompt you with Enter 'y' to delete all submissions missing files: -- reply y and you should see Removing submission IDs [2]... (the ID may differ).

Now we'll delete the remaining database record and verify that its disconnected file is detected. Still in your www-data shell:

  • sqlite3 /var/lib/securedrop/db.sqlite
    Delete the submission record for the remaining message (substitute your filename):
    • delete from submissions where filename = '1-exhausted_overmantel-msg.gpg';
  • ./manage.py check-disconnected-fs-submissions should report There are files in the submission area with no corresponding records in the database. Run "manage.py list-disconnected-fs-submissions" for details..
  • ./manage.py list-disconnected-fs-submissions should show a list like:
    /var/lib/securedrop/store/B3A5GPU4OHPQK736R76HKJUP5VONIOMKZLXK77GPTGNW7EJ63AY5YBX27P3DB2X4DZBXPX3LGBBXAJZYG3HQRHE4B6UE5YYBPGDYZOA=/1-exhausted_overmantel-msg.gpg
    
  • ./manage.py delete-disconnected-fs-submissions should prompt you to delete that file. Do so.

Testing OSSEC reporting of disconnects

Create a file under /var/lib/securedrop/store with touch /var/lib/securedrop/store/testfile. If you don't feel like waiting a day for the OSSEC report, you can edit /var/ossec/etc/ossec.conf, look for check-disconnect, and reduce the <frequency>, then service ossec restart.

Checklist

If you made changes to the server application code:

  • Linting (make lint) and tests (make -C securedrop test) pass in the development container

If you made changes to the system configuration:

  • Configuration tests pass

If you made non-trivial code changes:

  • I have written a test plan and validated it for this PR

Replace srm with shred, which is faster, reducing the chance that
deletion of a submission will be interrupted.

Update rq and redis requirements, to eliminate long-standing bugs.

Add manage.py tasks for detecting and correcting submissions that have
been disconnected from their files on disk, and vice versa. Update
manage.py to explicitly run with the production virtualenv. Also
specify the virtualenv in WSGI scripts and the run-test script. In the
dev/test Docker container, install requirements in a virtualenv at the
same path as production.

Add a supervisor script for requeuing interrupted rq jobs. If the app
server is rebooted while an rq job is running, that job has already
been deleted from the queue and rq will not automatically resume it on
reboot, but it does have a record of it in the queue's started job
registry. This script checks that registry for jobs that aren't
already queued or being run, and requeues them.
@rmol rmol force-pushed the 4712-submission-cleanup branch from e90ccf7 to a8ecb98 Compare August 27, 2019 20:43
@redshiftzero
Copy link
Contributor

I'm SSHed in here trying to debug the test failures (CI only) and I think there's some docker layer caching weirdness going on here: I rebuilt the container, reran the tests, and they passed...

@rmol rmol force-pushed the 4712-submission-cleanup branch 2 times, most recently from ad54e9c to f759401 Compare August 29, 2019 04:17
@codecov-io
Copy link

codecov-io commented Aug 29, 2019

Codecov Report

Merging #4713 into develop will decrease coverage by 0.14%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #4713      +/-   ##
===========================================
- Coverage    82.38%   82.23%   -0.15%     
===========================================
  Files           46       48       +2     
  Lines         3162     3350     +188     
  Branches       345      380      +35     
===========================================
+ Hits          2605     2755     +150     
- Misses         470      503      +33     
- Partials        87       92       +5
Impacted Files Coverage Δ
securedrop/securedrop/crypto_util.py 88.61% <0%> (-1.79%) ⬇️
securedrop/securedrop/journalist_app/__init__.py 88.23% <0%> (-0.34%) ⬇️
securedrop/securedrop/source_app/__init__.py 92.39% <0%> (-0.25%) ⬇️
securedrop/securedrop/manage.py 76.89% <0%> (-0.22%) ⬇️
securedrop/securedrop/management/submissions.py 73.59% <0%> (ø)
securedrop/securedrop/management/__init__.py 100% <0%> (ø)
securedrop/securedrop/worker.py 85% <0%> (+3.18%) ⬆️
securedrop/securedrop/rm.py 95.23% <0%> (+45.23%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 00d6fbf...f759401. Read the comment docs.

@rmol rmol force-pushed the 4712-submission-cleanup branch from f759401 to b2457d9 Compare August 29, 2019 12:49
Copy link
Contributor

@redshiftzero redshiftzero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran through the test plan here on a previous version of the diff, and all worked well except for a small issue due to the same underlying problem as #4656 (path to python in manage.py invocation). This is fixed in the latest diff, see discussion in #4656 for more.

Given that only a small change was added since I last tested, I'm going to approve and merge based on visual review of the diff. These PR testing steps are in the draft 1.0.0 test plan.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants