Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[scanner integration] Enable landing page scanner and collect baseline scan data #518

Closed
eloquence opened this issue Jun 1, 2018 · 14 comments

Comments

@eloquence
Copy link
Member

As one of the preparatory steps for #488, we should enable the existing landing page scanner and collect baseline data for all current SecureDrop directory entries. This will also take the scanner through its paces and help us discover operational issues that need to prioritized prior to a full integration.

Provided the scanner is stable, as part of this task, we should generate a CSV of the scan results for all landing pages.

Note that all code that displays information on SecureDrop.org based on scan results must be/remain disabled. Scan results should only be visible in the Wagtail admin interface, and be otherwise without consequence for the site.

@eloquence
Copy link
Member Author

Running manage.py scan in production once would be a good start, but to close this task, we should also generate the aforementioned CSV dump. We don't have to enable a cron-job yet as we'll likely have to make a few fixes to the scanner logic base don the initial data.

@redshiftzero
Copy link
Contributor

Heads up the cron job should not generate a ton of data because the scanning should only add new rows when a scan result changes (see freedomofpress/securethenews#84 (comment)).

@conorsch
Copy link
Contributor

Running now, will report results.

@conorsch
Copy link
Contributor

🔴 Negative:

(securedrop-alpha)gcorn@www-sd:/var/www/django-alpha$ ./manage.py scan
Traceback (most recent call last):
  File "./manage.py", line 12, in <module>
    execute_from_command_line(sys.argv)
  File "/home/gcorn/securedrop-alpha/lib/python3.4/site-packages/django/core/management/__init__.py", line 364, in execute_from_command_line
    utility.execute()
  File "/home/gcorn/securedrop-alpha/lib/python3.4/site-packages/django/core/management/__init__.py", line 356, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/home/gcorn/securedrop-alpha/lib/python3.4/site-packages/django/core/management/base.py", line 283, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/home/gcorn/securedrop-alpha/lib/python3.4/site-packages/django/core/management/base.py", line 330, in execute
    output = self.handle(*args, **options)
  File "/var/www/django-alpha/directory/management/commands/scan.py", line 46, in handle
    bulk_scan(securedrop_pages)
  File "/var/www/django-alpha/scanner/scanner.py", line 114, in bulk_scan
    securedrop = securedrops.get(domain=result_data['Domain'])
  File "/home/gcorn/securedrop-alpha/lib/python3.4/site-packages/django/db/models/query.py", line 384, in get
    (self.model._meta.object_name, num)
directory.models.entry.MultipleObjectsReturned: get() returned more than one DirectoryEntry -- it returned 2!

From that output it's not obvious which entry triggered the additional result. Re-running with --verbosity 3 didn't change the output, so perhaps we can add support for verbosity levels to aid in debugging.

@harrislapiroff
Copy link
Contributor

Just encountered this bug myself. We started allowing directory entries that have the same domain, but didn't update the scanner code to account for it.

I see two plausible solutions here:

  1. Update the scanner to allow multiple scans on the same domain. This will result in some redundant scans, but would be pretty quick to do.
  2. Rewrite the scanner to associate scan results with a domain rather than a directory entry. This would eliminate redundant scans but would take some effort.

My preference is 1, right now. I could imagine scenarios down the road where we might want the scanner to run separately on instances with the same domain. For example, if the scanner someday scans onion services as well as landing pages, we might have entries that share a domain but have different onion services and therefore would want to be scanned separately.

@eloquence
Copy link
Member Author

eloquence commented Jul 10, 2018

I'm confused by your use of "domain" here. We have entries that share the same onion URL ([org example redacted]) but AFAICT we have no entries that share the same landing page URL. So if the scanner currently dies in the [redacted] case, it most definitely should scan both, since they're different landing pages.

@harrislapiroff
Copy link
Contributor

@eloquence You're right—we have shared onion URLs, not shared domains. Now I'm not totally sure again what's going on. I'll investigate more. 🕵🏻‍♀️

@harrislapiroff
Copy link
Contributor

harrislapiroff commented Jul 11, 2018

I've identified the cause of this error: [org example redacted]
This still leaves a few resolutions for a general solution to this problem. We could disallow instances sharing a domain, but I maintain that there may be multiple securedrop instances on the same domain (this issue would crop up even if they had separate landing pages, e.g., https://organization.tld/subgroup1/ and https://organization.tld/subgroup2/) because pshtt scans per domain, not landing page. (We have separate, internal, code for scanning for landing page specific things.)

My favored solution right now, actually, is to tell pshtt to scan a deduped list of domains and then write the results of each domain scan to all entries that have that domain. This resolves both the uniqueness problem and the redundant scan problem. In pseudocode:

domains = set(securedrops.values_list('domain', flat=True))  # set to dedupe
results = inspect_domains(domains, {'timeout': 10})
for result in results:
    securedrops.filter(domain=result['Domain']).update(result=result)  # could update multiple
    # actual code will be more complex in this part but you get the idea

@eloquence
Copy link
Member Author

Thanks, @harrislapiroff . Since we currently don't have the problem you describe (at least as far as we know), now that the offending entry has been fixed, I'm assuming the scan command should run to completion. Flagging to @conorsch that we should attempt another test run here, and if that runs to completion, will file a separate issue for the problem you describe.

@conorsch
Copy link
Contributor

Re-running ./manage.py scan in prod to evaluate results.

@conorsch
Copy link
Contributor

Success!

(securedrop-beta)gcorn@www-sd:/var/www/django-beta$ ./manage.py scan
Scanning complete! Results added to database.

Please inspect results on the live site now, @eloquence!

@eloquence
Copy link
Member Author

eloquence commented Jul 31, 2018

The results on the live site are mostly what I'd expect:

  • I see correctly dated results for all sites I've examined.
  • The grading is completely off, but that's to be expected (it grades extremely harshly on some criteria, e.g., cookies, while ignoring other stuff we think is important) and we don't anticipate using those grades. Ignore them.
  • Redirects are causing a fair number of problems -- it's fairly common for landing pages without a trailing slash to redirect to the version with a trailing slash [example redacted] . [scanner integration] Support landing page redirects on the same domain #494 will fix this; in the meantime, we should probably update landing page URLs to the canonical version.
  • In evaluating the results I discovered a bug in the cache control validation, Landing page scanner validation of cache control headers is broken #524.

It would still be very useful to generate a CSV from the latest results, for more systematically assessing how common certain issues are (e.g., use of subdomains) across the entire set of sites in the directory. @harrislapiroff could you give an initial estimate of how much work that would be?

@eloquence
Copy link
Member Author

eloquence commented Aug 8, 2018

Just making a note that the relevant PR that we're committed to finishing as part of the 8/8-8/22 sprint is #527. Would also be great to then do a prod run of the scancsv command. Once that's done, we can close this issue, though we may intermittently need to re-run the CSV dump to verify improvements to the scanner.

@eloquence
Copy link
Member Author

@conorsch ran a fresh scan and provided baseline scan results (below, CSV). As such this task is complete for the current sprint, closing. In principle we're ready to enable regular runs; filing a separate infra ticket for that. For now, will add observations to relevant tickets.

scan-2018-08-09.txt

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants