Add flag to proceed only with "secure" SSL connection #510

Open
benoit74 opened this issue Mar 25, 2024 · 2 comments

Comments

@benoit74
Contributor

Currently, the crawler proceeds with all HTTPS websites, no matter how secure the HTTPS connection is; for example, certificates might be invalid.

We (Kiwix) would like to be able to ensure, when requested by the user, that the crawler proceeds only over valid HTTPS connections. We probably need a CLI flag so that the browser does not accept insecure HTTPS connections.

Is this feasible? If it is easy to implement, we could help by at least drafting a PR.

@ikreymer
Member

ikreymer commented Mar 25, 2024

Sure, yes, this is actually fairly easy to add in the 1.x version, since we're not relying on a MITM proxy.
It's just a matter of switching this flag in fact:
https://github.com/webrecorder/browsertrix-crawler/blob/main/src/util/browser.ts#L101
Maybe it should even default to false.
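
To illustrate the idea, here is a minimal sketch of wiring a CLI option to that flag. It assumes the flag at the linked line is Puppeteer's `ignoreHTTPSErrors` launch option; the `CrawlerArgs` type, the `strictHttps` option name, and `buildLaunchOptions` are hypothetical and do not exist in the crawler.

```typescript
// Hypothetical CLI arguments; `strictHttps` is an illustrative name only.
interface CrawlerArgs {
  strictHttps?: boolean;
}

// Only the part of the browser launch options that matters here.
// Assumption: the flag being switched is Puppeteer's `ignoreHTTPSErrors`.
interface LaunchOptionsSubset {
  ignoreHTTPSErrors: boolean;
}

function buildLaunchOptions(args: CrawlerArgs): LaunchOptionsSubset {
  // Without the option, keep today's permissive behavior (errors ignored).
  // With it, the browser refuses pages whose certificate is invalid.
  return { ignoreHTTPSErrors: !args.strictHttps };
}

const permissive = buildLaunchOptions({});
const strict = buildLaunchOptions({ strictHttps: true });
console.log(permissive.ignoreHTTPSErrors, strict.ignoreHTTPSErrors);
```

If the default were flipped to secure, as suggested above, the option would instead be an opt-out and the expression would be inverted.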

However, I wanted to point out that, even without that, we do save the Certificate Transparency info from: https://chromedevtools.github.io/devtools-protocol/tot/Network/#type-SecurityDetails in the WARC records.

For a valid HTTPS site, it should have something like this:
WARC-JSON-Metadata: {"ipType":"Public","cert":{"issuer":"DigiCert Global G2 TLS RSA SHA256 2020 CA1","ctc":"1"}}

with the ctc flag indicating whether the browser considers it a compliant request according to CTC logs, which I think means it's a trusted cert.
https://chromedevtools.github.io/devtools-protocol/tot/Network/#type-CertificateTransparencyCompliance

For a non-compliant, non-valid cert, this data would look like:

(from https://expired.badssl.com/)
{"ipType":"Public","cert":{"issuer":"COMODO RSA Domain Validation Secure Server CA","ctc":"0"}}
or
(from https://self-signed.badssl.com/)
{"ipType":"Public","cert":{"issuer":"*.badssl.com","ctc":"0"}}
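
As a rough sketch of how this metadata could be checked after a crawl, the function below parses a `WARC-JSON-Metadata` value and reports whether `ctc` is `"1"`. The field names come from the examples above; the function name and types are hypothetical, and treating `"1"` as "trusted" follows the tentative interpretation given above rather than a confirmed guarantee.

```typescript
// Shape of the metadata as shown in the examples above (all fields optional).
interface CertInfo {
  issuer?: string;
  ctc?: string;
}

interface WarcJsonMetadata {
  ipType?: string;
  cert?: CertInfo;
}

// Hypothetical helper: true only when the record carries cert.ctc === "1".
function isCtCompliant(headerValue: string): boolean {
  let meta: WarcJsonMetadata | null;
  try {
    meta = JSON.parse(headerValue) as WarcJsonMetadata | null;
  } catch (e) {
    // Malformed metadata is treated as non-compliant.
    return false;
  }
  return meta?.cert?.ctc === "1";
}

const valid =
  '{"ipType":"Public","cert":{"issuer":"DigiCert Global G2 TLS RSA SHA256 2020 CA1","ctc":"1"}}';
const selfSigned =
  '{"ipType":"Public","cert":{"issuer":"*.badssl.com","ctc":"0"}}';

console.log(isCtCompliant(valid), isCtCompliant(selfSigned));
```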

@benoit74
Contributor Author

Great, I will have a look and probably propose a PR then!

What the default value should be (true/false, secure/insecure) has been a long discussion on our side, and I'm not sure we are aligned even now. Details are in the linked issue; if you have some popcorn, it might be fun (or not) ^^

Thank you for the additional details, great to know. They might be useful at some point, depending on how our discussion settles. For now, I think we will prefer to fail the scraper immediately if secure mode is requested and HTTPS errors arise. But I can imagine that some day we might want to run in secure mode while ignoring HTTPS errors on sub-resources ... we'll see.
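
The two behaviors mentioned here could be expressed as a small policy function, sketched below. Everything in it is hypothetical (the policy names, the function, and the idea of matching on Chromium's `net::ERR_CERT_` error prefix); it only illustrates the distinction between failing on any certificate error and failing on main-document errors only.

```typescript
// Hypothetical policies: none of these names exist in the crawler.
type HttpsPolicy = "ignore" | "strict" | "strict-main-only";

// Decide whether a failed request should abort the whole crawl.
// Assumption: certificate failures surface as Chromium error strings
// such as "net::ERR_CERT_DATE_INVALID".
function shouldFailCrawl(
  policy: HttpsPolicy,
  isMainDocument: boolean,
  errorText: string,
): boolean {
  const isCertError = errorText.startsWith("net::ERR_CERT_");
  if (!isCertError || policy === "ignore") {
    return false;
  }
  if (policy === "strict") {
    // Fail immediately, whatever resource triggered the error.
    return true;
  }
  // "strict-main-only": tolerate certificate errors on sub-resources.
  return isMainDocument;
}

console.log(shouldFailCrawl("strict", false, "net::ERR_CERT_DATE_INVALID"));
console.log(shouldFailCrawl("strict-main-only", false, "net::ERR_CERT_DATE_INVALID"));
```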
