[scanner integration] Enable landing page scanner and collect baseline scan data #518

eloquence · 2018-06-01T00:51:52Z

As one of the preparatory steps for #488, we should enable the existing landing page scanner and collect baseline data for all current SecureDrop directory entries. This will also take the scanner through its paces and help us discover operational issues that need to prioritized prior to a full integration.

Provided the scanner is stable, as part of this task, we should generate a CSV of the scan results for all landing pages.

Note that all code that displays information on SecureDrop.org based on scan results must be/remain disabled. Scan results should only be visible in the Wagtail admin interface, and be otherwise without consequence for the site.

eloquence · 2018-06-13T05:21:05Z

Running manage.py scan in production once would be a good start, but to close this task, we should also generate the aforementioned CSV dump. We don't have to enable a cron-job yet as we'll likely have to make a few fixes to the scanner logic base don the initial data.

redshiftzero · 2018-06-13T17:45:22Z

Heads up the cron job should not generate a ton of data because the scanning should only add new rows when a scan result changes (see freedomofpress/securethenews#84 (comment)).

conorsch · 2018-06-13T18:51:07Z

Running now, will report results.

conorsch · 2018-06-13T18:55:38Z

🔴 Negative:

(securedrop-alpha)gcorn@www-sd:/var/www/django-alpha$ ./manage.py scan
Traceback (most recent call last):
  File "./manage.py", line 12, in <module>
    execute_from_command_line(sys.argv)
  File "/home/gcorn/securedrop-alpha/lib/python3.4/site-packages/django/core/management/__init__.py", line 364, in execute_from_command_line
    utility.execute()
  File "/home/gcorn/securedrop-alpha/lib/python3.4/site-packages/django/core/management/__init__.py", line 356, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/home/gcorn/securedrop-alpha/lib/python3.4/site-packages/django/core/management/base.py", line 283, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/home/gcorn/securedrop-alpha/lib/python3.4/site-packages/django/core/management/base.py", line 330, in execute
    output = self.handle(*args, **options)
  File "/var/www/django-alpha/directory/management/commands/scan.py", line 46, in handle
    bulk_scan(securedrop_pages)
  File "/var/www/django-alpha/scanner/scanner.py", line 114, in bulk_scan
    securedrop = securedrops.get(domain=result_data['Domain'])
  File "/home/gcorn/securedrop-alpha/lib/python3.4/site-packages/django/db/models/query.py", line 384, in get
    (self.model._meta.object_name, num)
directory.models.entry.MultipleObjectsReturned: get() returned more than one DirectoryEntry -- it returned 2!

From that output it's not obvious which entry triggered the additional result. Re-running with --verbosity 3 didn't change the output, so perhaps we can add support for verbosity levels to aid in debugging.

harrislapiroff · 2018-07-10T19:23:18Z

Just encountered this bug myself. We started allowing directory entries that have the same domain, but didn't update the scanner code to account for it.

I see two plausible solutions here:

Update the scanner to allow multiple scans on the same domain. This will result in some redundant scans, but would be pretty quick to do.
Rewrite the scanner to associate scan results with a domain rather than a directory entry. This would eliminate redundant scans but would take some effort.

My preference is 1, right now. I could imagine scenarios down the road where we might want the scanner to run separately on instances with the same domain. For example, if the scanner someday scans onion services as well as landing pages, we might have entries that share a domain but have different onion services and therefore would want to be scanned separately.

eloquence · 2018-07-10T22:50:23Z

I'm confused by your use of "domain" here. We have entries that share the same onion URL ([org example redacted]) but AFAICT we have no entries that share the same landing page URL. So if the scanner currently dies in the [redacted] case, it most definitely should scan both, since they're different landing pages.

harrislapiroff · 2018-07-11T14:13:24Z

@eloquence You're right—we have shared onion URLs, not shared domains. Now I'm not totally sure again what's going on. I'll investigate more. 🕵🏻‍♀️

harrislapiroff · 2018-07-11T14:32:26Z

I've identified the cause of this error: [org example redacted]
This still leaves a few resolutions for a general solution to this problem. We could disallow instances sharing a domain, but I maintain that there may be multiple securedrop instances on the same domain (this issue would crop up even if they had separate landing pages, e.g., https://organization.tld/subgroup1/ and https://organization.tld/subgroup2/) because pshtt scans per domain, not landing page. (We have separate, internal, code for scanning for landing page specific things.)

My favored solution right now, actually, is to tell pshtt to scan a deduped list of domains and then write the results of each domain scan to all entries that have that domain. This resolves both the uniqueness problem and the redundant scan problem. In pseudocode:

domains = set(securedrops.values_list('domain', flat=True))  # set to dedupe
results = inspect_domains(domains, {'timeout': 10})
for result in results:
    securedrops.filter(domain=result['Domain']).update(result=result)  # could update multiple
    # actual code will be more complex in this part but you get the idea

eloquence · 2018-07-24T23:37:39Z

Thanks, @harrislapiroff . Since we currently don't have the problem you describe (at least as far as we know), now that the offending entry has been fixed, I'm assuming the scan command should run to completion. Flagging to @conorsch that we should attempt another test run here, and if that runs to completion, will file a separate issue for the problem you describe.

conorsch · 2018-07-25T16:23:03Z

Re-running ./manage.py scan in prod to evaluate results.

conorsch · 2018-07-25T16:25:45Z

Success!

(securedrop-beta)gcorn@www-sd:/var/www/django-beta$ ./manage.py scan
Scanning complete! Results added to database.

Please inspect results on the live site now, @eloquence!

eloquence · 2018-07-31T23:00:07Z

The results on the live site are mostly what I'd expect:

I see correctly dated results for all sites I've examined.
The grading is completely off, but that's to be expected (it grades extremely harshly on some criteria, e.g., cookies, while ignoring other stuff we think is important) and we don't anticipate using those grades. Ignore them.
Redirects are causing a fair number of problems -- it's fairly common for landing pages without a trailing slash to redirect to the version with a trailing slash [example redacted] . [scanner integration] Support landing page redirects on the same domain #494 will fix this; in the meantime, we should probably update landing page URLs to the canonical version.
In evaluating the results I discovered a bug in the cache control validation, Landing page scanner validation of cache control headers is broken #524.

It would still be very useful to generate a CSV from the latest results, for more systematically assessing how common certain issues are (e.g., use of subdomains) across the entire set of sites in the directory. @harrislapiroff could you give an initial estimate of how much work that would be?

eloquence · 2018-08-08T19:06:10Z

Just making a note that the relevant PR that we're committed to finishing as part of the 8/8-8/22 sprint is #527. Would also be great to then do a prod run of the scancsv command. Once that's done, we can close this issue, though we may intermittently need to re-run the CSV dump to verify improvements to the scanner.

eloquence · 2018-08-09T19:33:27Z

@conorsch ran a fresh scan and provided baseline scan results (below, CSV). As such this task is complete for the current sprint, closing. In principle we're ready to enable regular runs; filing a separate infra ticket for that. For now, will add observations to relevant tickets.

scan-2018-08-09.txt

eloquence mentioned this issue Jun 1, 2018

Integrate source interface + landing page scan results in directory #488

Open

21 tasks

eloquence closed this as completed Aug 9, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[scanner integration] Enable landing page scanner and collect baseline scan data #518

[scanner integration] Enable landing page scanner and collect baseline scan data #518

eloquence commented Jun 1, 2018

eloquence commented Jun 13, 2018

redshiftzero commented Jun 13, 2018

conorsch commented Jun 13, 2018

conorsch commented Jun 13, 2018

harrislapiroff commented Jul 10, 2018

eloquence commented Jul 10, 2018 •

edited

Loading

harrislapiroff commented Jul 11, 2018

harrislapiroff commented Jul 11, 2018 •

edited by eloquence

Loading

eloquence commented Jul 24, 2018

conorsch commented Jul 25, 2018

conorsch commented Jul 25, 2018

eloquence commented Jul 31, 2018 •

edited

Loading

eloquence commented Aug 8, 2018 •

edited

Loading

eloquence commented Aug 9, 2018

[scanner integration] Enable landing page scanner and collect baseline scan data #518

[scanner integration] Enable landing page scanner and collect baseline scan data #518

Comments

eloquence commented Jun 1, 2018

eloquence commented Jun 13, 2018

redshiftzero commented Jun 13, 2018

conorsch commented Jun 13, 2018

conorsch commented Jun 13, 2018

harrislapiroff commented Jul 10, 2018

eloquence commented Jul 10, 2018 • edited Loading

harrislapiroff commented Jul 11, 2018

harrislapiroff commented Jul 11, 2018 • edited by eloquence Loading

eloquence commented Jul 24, 2018

conorsch commented Jul 25, 2018

conorsch commented Jul 25, 2018

eloquence commented Jul 31, 2018 • edited Loading

eloquence commented Aug 8, 2018 • edited Loading

eloquence commented Aug 9, 2018

eloquence commented Jul 10, 2018 •

edited

Loading

harrislapiroff commented Jul 11, 2018 •

edited by eloquence

Loading

eloquence commented Jul 31, 2018 •

edited

Loading

eloquence commented Aug 8, 2018 •

edited

Loading