integrity for downloads #68

Open
dveditz opened this issue Mar 25, 2017 · 22 comments

Comments

@dveditz
Member

dveditz commented Mar 25, 2017

When we were first discussing sub-resource integrity, verifying downloads was one of the original desires. It got booted from the "MVP" early on (I can't remember why) and didn't get carried over from the old issues space to this one. Now it's time to take it up again.

If part of the concern was about navigations vs. downloads, and/or wanting to know whether we had to check integrity before we started the download, we could restrict it to links that also have the HTML download attribute.

@mikewest
Member

SRI for explicit downloads seems like low-hanging fruit. You're thinking something like <a href='' download integrity="...">?

I vaguely recall @bzbarsky having concerns about content-encoding: gzip, but I think @devd worked them out. Otherwise, the infrastructure should be there.
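
For reference, a minimal sketch of how a site could produce the value for such an `integrity="..."` attribute, assuming downloads would reuse the existing SRI syntax of `<algorithm>-<Base64 digest>`. The function name `sri_value` and the empty payload are illustrative only and are not part of any proposal in this thread.

```python
import base64
import hashlib


def sri_value(data: bytes, algorithm: str = "sha384") -> str:
    """Return an SRI-style string of the form '<algorithm>-<Base64 digest>'."""
    # SRI currently defines sha256, sha384 and sha512.
    if algorithm not in ("sha256", "sha384", "sha512"):
        raise ValueError(f"unsupported SRI algorithm: {algorithm}")
    digest = hashlib.new(algorithm, data).digest()
    return f"{algorithm}-{base64.b64encode(digest).decode('ascii')}"


# Hypothetical payload standing in for the bytes of the downloadable file.
payload = b""
print(sri_value(payload, "sha256"))
```

The digest is computed over the raw bytes of the file as the author published it, which is why the interaction with `content-encoding: gzip` had to be worked out.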

@mikewest
Member

(We just need someone to sign up to do the work... Y'all volunteering? :) )

@annevk
Member

annevk commented Mar 27, 2017

It seems the main concerns that held this back in the past (proposed as https://wiki.whatwg.org/wiki/Link_Hashes and also various alternatives, see https://lists.w3.org/Archives/Public/public-whatwg-archive/2012Oct/0188.html; one of which was once added to the standard: a https+aes scheme) were a lack of implementer interest and the worry that the integrity would get out of sync with the download and the user would just use some other tool to get the resource.

@annevk
Member

annevk commented Mar 27, 2017

Note also that unless we carve out an exception (let's not?) this will require CORS, which is new for downloads. So you end up with <a crossorigin download=... integrity=...> and you'd have to define both crossorigin and integrity for <a>.

@shekyan
Contributor

shekyan commented Mar 27, 2017

Sounds easy and interesting. I can try to write it up, if nobody more qualified signs up for this.

@mikewest
Member

Note also that unless we carve out an exception (let's not?) this will require CORS

That's a good point.

We require CORS for subresource fetches because we'd otherwise be exposing the content of the resource via the hashes. Does the same apply to downloads? As far as I know, <a download> is fire-and-forget in Chrome; we don't expose a success/failure event or give the site access to its downloaded resources. Is the data exposed via one of the performance/timing APIs?

@annevk
Member

annevk commented Mar 27, 2017

We've had requests already, e.g., in whatwg/html#954. I don't think we should try to postpone the need for safety as that will just make it very brittle.

@mikewest
Member

Got it. In that case, I completely agree that the CORS requirement is something we should keep in place.

@devd devd added the SRI-next label Mar 27, 2017
@devd devd added this to the v2 milestone Mar 27, 2017
@riking

riking commented Mar 9, 2018

Looks like this issue has fallen by the wayside?

Content integrity for downloads has resurfaced in the news [1], including cases where an HTTPS page links to a plain-HTTP download. While those cases should be fixed, download integrity looks like low-hanging fruit from my uninformed point of view.

[1]: https://citizenlab.ca/2018/03/bad-traffic-sandvines-packetlogic-devices-deploy-government-spyware-turkey-syria/

tdelmas pushed a commit to tdelmas/webappsec-subresource-integrity that referenced this issue Mar 10, 2018
Add the integrity check for `a` and `area` elements with the download attribute.

It doesn't impact `a` and `area` elements without the download attribute.

Known issues with that proposal:
- It doesn't define the behavior of the `crossorigin` attribute
- It doesn't explain how to handle "open in a new tab/window" actions on links: should the user agent download it in the same tab, or can the integrity check be performed in a new tab/window?
@annevk
Member

annevk commented Mar 11, 2018

Given that the download attribute works in terms of navigation at the moment this actually seems even harder. Perhaps there is some way to decouple it from navigation, but that would be quite a major change to implementations.

@tdelmas

tdelmas commented Mar 30, 2018

I created #78 to try to push the discussion forward, as this feature could really improve the security of the global ecosystem.

@annevk
Member

annevk commented Apr 3, 2018

Unfortunately, I don't think that helps as it doesn't address the issues.

@Marcono1234

Marcono1234 commented Mar 15, 2020

Is there something one (with limited HTML and HTTP knowledge) can do to help with the process of this issue?

Popular software such as GIMP or LibreOffice use mirrors and I would expect that the average computer user does not know how to verify the integrity or that this is important.

Regarding the linked whatwg mail archive thread it would be necessary to clarify what the intention of this issue is:

  • verify integrity of linked sites / documents: ❌
    Therefore it would make sense to call the attribute download-integrity to clarify that it has no effect unless used for downloads (would still require download attribute)
  • verify integrity of downloaded files: ✔️
  • protect against corruption while the download is saved to the file system (due to file system errors): ❓
    Hashing the downloaded data on the fly (instead of re-reading the file) would be more performant

Supporting a length value describing the size of the downloaded content in bytes would allow failing fast, even while downloading if the content is larger than the specified length.
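
A minimal sketch of the two ideas above, hashing on the fly and failing fast on length, assuming the download arrives as an iterable of byte chunks. The names `verify_stream` and `IntegrityError` are hypothetical and not taken from any specification.

```python
import hashlib
from typing import Iterable


class IntegrityError(Exception):
    """Raised when the received bytes do not match the declared metadata."""


def verify_stream(chunks: Iterable[bytes], expected_length: int,
                  algorithm: str, expected_digest: bytes) -> int:
    """Hash chunks as they arrive and abort as soon as the length is exceeded."""
    hasher = hashlib.new(algorithm)
    received = 0
    for chunk in chunks:
        received += len(chunk)
        if received > expected_length:
            # Fail fast: no need to wait for the end of the transfer.
            raise IntegrityError("more bytes received than the declared length")
        hasher.update(chunk)
    if received != expected_length:
        raise IntegrityError("fewer bytes received than the declared length")
    if hasher.digest() != expected_digest:
        raise IntegrityError("digest mismatch")
    return received


# Usage with an in-memory stand-in for a network stream.
body = [b"hello ", b"world"]
digest = hashlib.sha256(b"hello world").digest()
print(verify_stream(body, 11, "sha256", digest))
```

Because the hash is updated per chunk, the file never has to be re-read from disk. The flip side is that this approach cannot detect corruption introduced while writing to the file system.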

The proposed format should also support specifying multiple checksum algorithms in case the user agent does not support all, which will especially become the case in the future when new checksum algorithms emerge.

Therefore the following would in my opinion be a good format:

<a href="..." download download-integrity="INTEGRITY_DATA">

With INTEGRITY_DATA having this format (pseudo grammar):

INTEGRITY_DATA:
     (CHECKSUM,)+
     length:[1-9][0-9]*

CHECKSUM:
    ALG_NAME
    :
    CHECKSUM_VALUE

ALG_NAME:
    [a-zA-Z0-9-_]+

CHECKSUM_VALUE:
     Base64

Algorithm names should be clearly defined (either here or somewhere else) and should be matched case-sensitively to prevent something like "SHA-1", "shA-1", "sHa-1" and because in some programming languages comparing case insensitively can easily go wrong when the system language is used and it has special lowercasing rules (e.g. Turkish).

The checksum bytes are Base64 encoded because even a hex-encoded checksum can be quite long, e.g. for SHA-512 it is 128 chars in hex but only 88 chars in Base64. Base64 padding (trailing =) is required and must not be omitted.
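
The size difference can be checked directly. This is only an illustration of the arithmetic, using the SHA-512 digest of an empty input:

```python
import base64
import hashlib

digest = hashlib.sha512(b"").digest()         # 64 raw bytes
hex_form = digest.hex()                       # 2 characters per byte
b64_form = base64.b64encode(digest).decode()  # 4 characters per 3 bytes, padded

print(len(digest), len(hex_form), len(b64_form))  # prints "64 128 88"
```

Since 64 is not a multiple of 3, the Base64 form ends in two `=` padding characters, which the proposal requires to be kept.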

Example:
<a href="example.com/download" download download-integrity="length:1245667025,md5:1B2M2Y8AsgTpgAmY7PhCfg==,sha256:47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=">
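
A rough parsing sketch for this attribute value, under a few assumptions: the function name `parse_download_integrity` is made up, entries are accepted in any order (the pseudo grammar puts `length` last while the example puts it first), and `length` is treated as optional, as the user agent behavior section suggests.

```python
import base64
import binascii
import re

ALG_NAME = re.compile(r"[A-Za-z0-9_-]+")
LENGTH = re.compile(r"[1-9][0-9]*")


def parse_download_integrity(value: str):
    """Parse 'alg:Base64,...,length:N' into (length or None, {alg: digest})."""
    length = None
    checksums = {}
    for entry in value.split(","):
        name, sep, data = entry.partition(":")
        if not sep or not data or not ALG_NAME.fullmatch(name):
            raise ValueError(f"malformed entry: {entry!r}")
        if name == "length":
            if not LENGTH.fullmatch(data):
                raise ValueError(f"invalid length: {data!r}")
            length = int(data)
            continue
        try:
            # validate=True rejects stray characters; missing padding also fails.
            checksums[name] = base64.b64decode(data, validate=True)
        except binascii.Error as exc:
            raise ValueError(f"invalid Base64 for {name!r}") from exc
    if not checksums:
        raise ValueError("at least one checksum is required")
    return length, checksums


example = ("length:1245667025,md5:1B2M2Y8AsgTpgAmY7PhCfg==,"
           "sha256:47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=")
length, checksums = parse_download_integrity(example)
print(length, sorted(checksums))
```

Algorithm names are used as dictionary keys without any case folding, so `sha256` and `SHA256` are distinct names, in line with the case-sensitive matching argued for above.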

User agent behavior

If length is present, then the user agent must use it to verify the integrity.
If multiple checksums are present, it may pick any; it is advised to pick the strongest one.

If no checksum algorithm is supported it may show a warning to the user, or it may just ignore the checksum information. It may also display the algorithms and checksum values to the user so they have a chance to verify the integrity manually.
Note: It might make sense to add a warn-if-none-supported:true/false value to the download-integrity attribute. The default value is false. If true, the user agent must warn the user. The use case would be mirror sites where failing to verify the integrity could have security implications.
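
The selection step could look roughly like this. The preference order and the name `pick_checksum` are assumptions made for the sketch; the proposal itself does not rank algorithms.

```python
from typing import Mapping, Optional, Sequence, Tuple

# Assumed ranking, strongest first. A real user agent would define its own.
PREFERENCE = ("sha512", "sha384", "sha256")


def pick_checksum(checksums: Mapping[str, bytes],
                  preference: Sequence[str] = PREFERENCE
                  ) -> Optional[Tuple[str, bytes]]:
    """Return the strongest supported (name, digest) pair, or None."""
    for name in preference:
        if name in checksums:  # exact, case-sensitive match
            return name, checksums[name]
    # Nothing supported: the caller decides whether to warn or to ignore.
    return None


print(pick_checksum({"md5": b"\x00" * 16, "sha256": b"\x01" * 32}))
print(pick_checksum({"md5": b"\x00" * 16}))
```

Returning `None` rather than raising leaves room for the optional warn-if-none-supported behavior described in the note.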

If the integrity was successfully verified, the user agent is encouraged to indicate this to the user. However, it should be displayed as informational text (so the user knows they do not have to verify the integrity manually), but must not create a false impression of security, e.g. that the file is not a virus (similar to the previously green lock icon in the URL bar for HTTPS sites).

If the integrity check fails, the user must be informed that the file may be corrupted, modified by an attacker or that the site is incorrectly configured. The user agent is encouraged to advise the user to contact the site administrator. The user agent must offer the user two options: deleting the file (preferred), and keeping the file. Unlike described in the WHATWG wiki, it should not use the term "Quarantine", since for most (if not all) OSes that would be just another folder. User agents are encouraged to place the downloaded content in the "Downloads" folder of the OS only once the user has agreed to keep the file. Otherwise the user might first see the file in the "Downloads" folder and open it before noticing the warning by the user agent.
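
One way to picture the "only move into Downloads once accepted" idea is a staging file outside the Downloads folder. This is a simplified sketch: `stage_and_verify` is a hypothetical helper, it only handles SHA-256, and it leaves the actual user prompt to the caller.

```python
import hashlib
import os
import shutil
import tempfile


def stage_and_verify(data: bytes, expected_sha256_hex: str,
                     downloads_dir: str, filename: str):
    """Write to a staging file; move it into downloads_dir only on a match.

    Returns (verified, path). On a mismatch, path is the staging file, which
    the caller deletes or keeps depending on what the user chooses.
    """
    fd, staged = tempfile.mkstemp(prefix="download-", suffix=".part")
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    if hashlib.sha256(data).hexdigest() != expected_sha256_hex:
        return False, staged
    final_path = os.path.join(downloads_dir, filename)
    shutil.move(staged, final_path)
    return True, final_path


# Usage with a temporary directory standing in for the Downloads folder.
downloads = tempfile.mkdtemp()
payload = b"example file contents"
print(stage_and_verify(payload, hashlib.sha256(payload).hexdigest(),
                       downloads, "example.bin"))
```

With this layout a file that failed verification is never visible in the Downloads folder, so the user cannot open it before seeing the warning.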


Hopefully this comment is useful and not too intrusive. I tried to write down my thoughts as precisely as possible. Any feedback is welcome :)

@tdelmas

tdelmas commented Mar 15, 2020

@annevk What are the blocking points on this issue? Which points need to be discussed to move it forward?

It is an important security issue for all websites using mirrors/CDNs for downloads.

There is no workaround for it (VLC tried to use JS to download the file in memory and do the checksum, but it has a lot of drawbacks: the browser compatibility is terrible, it requires CDNs to add CORS headers and it doesn't work well with large files).

@mozfreddyb
Collaborator

Given that the download attribute works in terms of navigation at the moment this actually seems even harder. Perhaps there is some way to decouple it from navigation, but that would be quite a major change to implementations.

@annevk, Wasn't download respecified as based on fetch?

@khuguenin

We wrote an article (https://serval.unil.ch/resource/serval:BIB_9BD511E5C0D0.P001/REF) on checksum verification recently and suggested extending SRI to handle downloads. We wrote an explainer: https://github.com/checksum-lab/checksum-lab.github.io/blob/master/README.markdown
One issue with the download attribute for <a> elements (mentioned above) is that it is restricted to same-origin links, which is the case that makes the least sense for checksums (https://www.w3schools.com/tags/att_a_download.asp).

@mozfreddyb
Collaborator

mozfreddyb commented Mar 17, 2020

I can answer parts of my own question to annevk from above. Downloading a hyperlink is specified in HTML.

@khuguenin same-origin or cors-same-origin, no? It would suffice if the CDN/Mirror sent a header of `access-control-allow-origin: *`, which many CDNs do and already have to do for SRI with scripts/styles.
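
As a rough illustration of what that header has to satisfy, here is a simplified check for requests made without credentials. The helper `allows_origin` is hypothetical, and the full CORS check in Fetch has more cases than this.

```python
from typing import Mapping


def allows_origin(response_headers: Mapping[str, str], page_origin: str) -> bool:
    """Simplified CORS check for a request made without credentials."""
    allowed = None
    for name, value in response_headers.items():
        # Header names are case-insensitive; origins are compared exactly.
        if name.lower() == "access-control-allow-origin":
            allowed = value.strip()
    return allowed == "*" or allowed == page_origin


print(allows_origin({"access-control-allow-origin": "*"}, "https://example.org"))
print(allows_origin({"Content-Type": "application/zip"}, "https://example.org"))
```

A mirror that answers with the wildcard passes for every linking page, which is why the wildcard is sufficient for public downloads.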

@tdelmas

tdelmas commented Mar 17, 2020

@mozfreddyb I think requiring CORS would reduce the usage of checksums because not all mirrors/CDNs support it. If the download is "fire and forget" and the original page has no way to know whether the download is complete or valid, then I do not see a reason to require CORS. (Also, if the mirrors/CDNs do have CORS, the JavaScript could already do the checksum itself today.)

@mozfreddyb
Collaborator

How do we ensure the download is (and remains) unobservable? I see there's the request's initiator set to download in the spec, but I'm not entirely sure that it can not be forged. I'd like to hear an expert's opinion here (@annevk, probably :))

@devd
Contributor

devd commented Mar 17, 2020 via email

@annevk
Member

annevk commented Mar 20, 2020

What HTML says about downloads isn't entirely in line with implementations. Basically, navigation can result in a download (Content-Disposition) so it's all handled there. The download attribute is an additional input to the navigation algorithm to force downloads. I don't remember the crucial differences unfortunately, but any change here would be rather involved I'm afraid.

@jb-wisemo

This feature should not be postponed or redefined for things other than specifying the uncorrupted hash of a download.

Accordingly, this reduces to the following simple changes to the SRI specification:

  1. The integrity attribute (as already specified) is valid for any HTML element that specifies a URI with any protocol.
  2. The CORS requirement in the SRI specification shall not apply to any resource that would not otherwise be checked by the rules in the CORS specification. Downloads and alternative URLs are just examples of this, but so are subresources downloaded with other protocols such as FTP and TFTP.

Note that nothing in the SRI specification and concept depends on whether the user agent uses the "fetch" specification or not.

As a logical consequence, the following would all apply:

Specifying integrity for an ordinary page link, shall cause the loading of the linked page to fail with an appropriate error (not warning) if the page doesn't match. CORS does not (by default) apply to these links. This is useful for having a trusted document delivered in an off-web secure way (such as S/MIME e-mail) to refer to stable documents online. This link hashing can be chained to unlimited depth as long as the author avoids dependency loops (a.html specifies the hash of b.html which specifies the hash of c.html which specifies the hash of a.html).
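
An author-side check for such dependency loops could be sketched as a depth-first search over the "document links to document with integrity" graph. The function `find_cycle` and the dictionary layout are assumptions made for this illustration.

```python
from typing import Dict, List, Optional, Sequence


def find_cycle(links: Dict[str, Sequence[str]]) -> Optional[List[str]]:
    """Return one loop of documents as a list, or None if there is none."""
    visiting, done = set(), set()
    stack: List[str] = []

    def visit(doc: str) -> Optional[List[str]]:
        visiting.add(doc)
        stack.append(doc)
        for target in links.get(doc, ()):
            if target in visiting:
                # Back edge: the loop runs from target to the top of the stack.
                return stack[stack.index(target):] + [target]
            if target not in done:
                found = visit(target)
                if found:
                    return found
        stack.pop()
        visiting.discard(doc)
        done.add(doc)
        return None

    for doc in list(links):
        if doc not in done:
            found = visit(doc)
            if found:
                return found
    return None


# The loop from the example above: a.html -> b.html -> c.html -> a.html.
loop = {"a.html": ["b.html"], "b.html": ["c.html"], "c.html": ["a.html"]}
print(find_cycle(loop))  # prints ['a.html', 'b.html', 'c.html', 'a.html']
```

A loop cannot be satisfied in practice, because each document in it would have to embed a hash that depends on its own content.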

Specifying integrity for a download link (with or without download attribute) shall cause the download to fail with an appropriate error (not warning), if the file doesn't match. This is useful for any download provided via a CDN or other 3rd party server. CORS does not (by default) apply to these links.

Specifying integrity for an image, sound, video, applet, script or font that doesn't match shall result in a failed subresource download (broken image symbol etc.). CORS does (by default) apply to these.

Alternative URIs in IMG tags etc. are not subject to the generic integrity attribute (it wouldn't match), but new attributes could be introduced to specify their hash values. For many of these, CORS does (by default) apply, but conceivably, new extensions to HTML could introduce alternative URIs for things to which CORS does not (by default) apply.

Alternative documents available via HTTP or other content negotiation mechanism will need their own enhancement of the SRI specification, perhaps by providing the hash of a list/tree of resource hashes where that list/tree is provided in the negotiation server response. However the basic specification for URIs that return a stable byte stream should not wait for such enhancements.
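
To make the "hash of a list of resource hashes" idea a little more concrete, here is one possible shape, assuming a flat list of SHA-256 digests hashed in order. The names `manifest_hash` and `verify_variant` are hypothetical, and the comment above does not define a wire format.

```python
import hashlib
from typing import Sequence


def manifest_hash(resource_hashes: Sequence[bytes]) -> bytes:
    """Hash the concatenation of per-variant SHA-256 digests, in list order."""
    outer = hashlib.sha256()
    for digest in resource_hashes:
        if len(digest) != 32:
            raise ValueError("each entry must be a 32-byte SHA-256 digest")
        outer.update(digest)
    return outer.digest()


def verify_variant(body: bytes, listed_hashes: Sequence[bytes],
                   expected_manifest_hash: bytes) -> bool:
    """Check the server-provided list first, then the received variant."""
    if manifest_hash(listed_hashes) != expected_manifest_hash:
        return False
    return hashlib.sha256(body).digest() in listed_hashes


# Two hypothetical negotiated variants of the same resource.
variants = [b"<p>Hello</p>", b"<p>Bonjour</p>"]
hashes = [hashlib.sha256(v).digest() for v in variants]
pinned = manifest_hash(hashes)  # the value the linking page would carry
print(verify_variant(variants[1], hashes, pinned))
```

The linking page then pins a single value, while the server can still choose among the listed variants.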
