Express returns a non-compliant HTTP/206 response when gzip is enabled #185

Open
mscbpi opened this issue Oct 20, 2023 · 9 comments

mscbpi commented Oct 20, 2023

CDNs (most notably Azure Front Door) use HTTP range requests to retrieve data from the origin when caching is enabled.

Express should ignore the Range: request header when gzip compression is in place, since it is unable to produce a compliant HTTP/206 response whose Content-Range header takes the compressed data length into account.

A fair, compliant workaround is, in this specific case where compression makes computing Content-Range values too complex, to ignore the client's Range: header and answer with the whole compressed content in an HTTP/200 response.

Handling Range: headers is optional, so answering with an HTTP/200 is OK.

Answering with an HTTP/206 that carries wrong Content-Range values is not OK.

Meanwhile, another workaround is to disable compression and let the CDN handle it, or to disable CDN caching; however, it would be fair to expect Express to return a compliant HTTP response in any case.
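
As a minimal sketch of the workaround described above (a hypothetical user-land middleware, not something Express or compression ships today): drop the incoming Range header whenever the client accepts gzip, so express.static falls back to serving the full body with a 200.

```js
const express = require('express');
const compression = require('compression');

const app = express();

// Hypothetical workaround sketch: if the response is likely to be compressed
// (the client accepts gzip), drop the Range request header so downstream
// middleware serves the whole body with a 200 instead of a mislabelled 206.
app.use((req, res, next) => {
  const acceptsGzip = /\bgzip\b/.test(req.headers['accept-encoding'] || '');
  if (acceptsGzip && req.headers.range) {
    delete req.headers.range;
  }
  next();
});

app.use(compression());
app.use(express.static('public'));
```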

References:

RFC7233

Details and a demonstration of the Express behavior:

https://github.com/DanielLarsenNZ/nodejs-express-range-headers

IbrahimTanyalcin commented Dec 3, 2023

If the client sends a Range request header and also an Accept request header with the application/gzip MIME type, then it is the client's responsibility to take into account whether the sent byte ranges correspond to the zipped file or not. This is the intuitive behavior.

Off the top of my head, I can think of a couple of cases where this change might break applications that handle huge genomics files, such as https://github.com/igvteam/igv.js/, which makes range requests against .bgz, FASTA or BAM files. These files arrive at the client compressed, are decompressed by the browser, and the client knows that the bytes correspond to the uncompressed payload. You can't just dump a 5 GB file to the client with compression enabled whenever a range request happens to come in.

The decompression is handled by the browser, and the final receiver gets the bytes of the payload that correspond to the uncompressed form. It has always worked this way. For a lot of developers, intuition & usability >>> RFC compliance. Am I misinterpreting something?

PS: Not everyone uses Next.js or related Azure services. Some of us have to serve gigabytes of files, such as genomic data, that compress fairly well on request (in 1-10 MB chunks) and CANNOT compute the total length of the compressed file ahead of time for performance reasons. That means those who do use compression for large files that benefit from this library will have to drop it or filter those files out. How such half-baked requirements make their way into an RFC is really appalling sometimes.

IbrahimTanyalcin commented Dec 3, 2023

Is it possible to add a config object such as:

app.use(compression({ enforceRFC7233: true}))

which would default to false (primum non nocere), thus avoiding breaking apps that rely on the current behavior, and could be turned on by setting it to true, which would drop the Content-Range header and set the status to 200 whenever a Range request is encountered for a response with Transfer-Encoding? This would make life easier for those who want to be strictly RFC compliant (just set a config prop) and spare those who would otherwise have to drop compression because the cost of staying compliant (calculating the gzipped size) for large files is greater than the benefit of being compliant.
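
As an illustration only (a hypothetical wrapper, not an existing compression option), such an opt-in could be approximated in user land with the same Range-stripping idea as the sketch in the opening post, which yields a compliant full-body 200 rather than rewriting the 206 response:

```js
const compression = require('compression');

// Hypothetical wrapper approximating the proposed opt-in: when enabled, drop
// the Range request header for gzip/deflate-capable clients so the response
// becomes a compliant full-body 200 instead of a 206 with a wrong Content-Range.
function compressionCompliant({ enforceRFC7233 = false, ...opts } = {}) {
  const middleware = compression(opts);
  return function (req, res, next) {
    if (enforceRFC7233 &&
        req.headers.range &&
        /\b(gzip|deflate)\b/.test(req.headers['accept-encoding'] || '')) {
      delete req.headers.range;
    }
    middleware(req, res, next);
  };
}

// e.g. app.use('/static', compressionCompliant({ enforceRFC7233: true }), express.static('files'));
```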

@dougwilson (Contributor)

Hi @IbrahimTanyalcin, you can make any rule you like to determine whether the response should be compressed or not using the filter option. You can find more details and an example at https://github.com/expressjs/compression#filter-1
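
For instance, a filter along these lines (the file-type rules below are only an illustration; compression.filter is the module's documented default check) would keep range requests and selected file types uncompressed:

```js
const express = require('express');
const compression = require('compression');

const app = express();

// Illustrative filter: skip compression for range requests and for large
// genomic file types, defer to the default filter for everything else.
app.use(compression({
  filter: (req, res) => {
    if (req.headers.range) return false;                  // avoid compressed 206s
    if (/\.(fasta|fai|bam|bgz)$/i.test(req.path)) return false;
    return compression.filter(req, res);                  // module's default check
  }
}));
```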

IbrahimTanyalcin commented Dec 3, 2023

@dougwilson I understand that I can pass a filter function, but that means dropping compression support for the plain text file types (FASTA etc.) that would actually benefit the most from compression. The benefit of compressing a Next.js bundle of, let's say, 300 kB is nothing compared to the benefit of compressing a 10 MB chunk of a 5 GB genomic sequence file. Am I wrong in my reasoning? It would be so nice if we could devise a solution that doesn't break the apps of people like me and also allows @mscbpi and others to achieve what they want.

dougwilson (Contributor) commented Dec 3, 2023

I'm not sure I understand. If you return true from the filter, then it would compress those files you would like compressed. I'm not sure what the default behavior is for the files you are referring to, however. If you think they should be compressed by default without needing to write a filter function, please let us know how to detect that type of file for compression.

@IbrahimTanyalcin

In genomics we deal with large files, say mysequence.fasta. This is a plain text file that reads like ATTTCCGGTTT... and is around 5 GB:

  • These files have accompanying files called indexes that list the byte range of each line. The index file has a .fai extension (mysequence.fai) and is generally small (10-100 kB).
  • Tools like https://github.com/igvteam/igv.js calculate the portion of the sequence that fits the user's viewport and make a range request based on the index file; let's say they request a 5 MB portion.
  • With your compression library in place, this 5 MB portion is extracted from the static file by Express, compressed, and sent to the browser.
  • The browser decompresses the payload, and IGV knows that the requested byte range corresponds to the uncompressed file, not the compressed one. From a technical standpoint, this is the only cost-effective way to combine range requests with compression without burdening the server (looking up offsets in index files rather than telling the server "calculate the correct Content-Range for me", which is impractical). A sketch of such a request follows below.
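
For example (illustrative only; the file name and byte offsets below are made up, and Node's http module stands in for the browser request the viewer actually makes):

```js
const http = require('http');

// The viewer looks up the byte offsets of the visible region in the .fai
// index, then asks the server for just that slice of the FASTA file.
const start = 1048576;                       // hypothetical offsets from the index
const end = start + 5 * 1024 * 1024 - 1;     // ~5 MB window

http.get({
  host: 'localhost',
  port: 3000,
  path: '/static/mysequence.fasta',
  headers: { Range: `bytes=${start}-${end}` }
}, (res) => {
  console.log(res.statusCode);                // 206 when the server honors the range
  console.log(res.headers['content-range']);  // e.g. bytes 1048576-6291455/5368709120
  res.resume();
});
```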

The RFC requires the server to know the gzipped size beforehand so that the Content-Range can be calculated correctly. In truth this is just a semantic compliance that would force genomics apps using Express to drop compression, because they obviously do not want Express to send a 200 and dump the whole file, nor do they want to gzip the entire file and calculate the correct Content-Range header based on that. There are plenty of other apps that rely on the current behavior.

I think a solution that doesn't break backwards compatibility, doesn't resort to filter (effectively dropping support for that file type), and also allows people like @mscbpi to achieve what they want (because they rely on Next.js and other CDNs' behavior) is more reasonable. At the end of the day, it is your call. Thanks for your time.

IbrahimTanyalcin commented Dec 3, 2023

OK, here is some more info. In my route file for serving static content, I have something like:

app.use('/static', compression(), _static, function (req, res, next) { /* downstream middleware ... */ });

_static is the Express static middleware (express.static).

I also dug into what the request headers are for such large FASTA files and compared them to regular js/html/css files.

Here are the request headers for the casual js/html/css files:
[screenshot: expressjs-compress-ss2]

And here are the request headers I send for a supposedly large genomic file:
[screenshot: expressjs-compress-ss3]

So it turns out the client logic sends an Accept-Encoding: identity header, which means the compression library does NOT compress, and the request is passed on to static, which doesn't know what to do with a .fasta file and so adds Content-Type: application/octet-stream.

This means that in my case the change wouldn't seem to break the behavior for large files, as they are already not compressed; I was wrong. (There might still be other genomics apps that do not send the correct encoding request header and expect it to work, though 🤷)

mscbpi (Author) commented Dec 4, 2023

> If the client sends a Range request header and also an Accept request header with the application/gzip MIME type, then it is the client's responsibility to take into account whether the sent byte ranges correspond to the zipped file or not. This is the intuitive behavior.

Hi @IbrahimTanyalcin thanks a lot,

This is not Microsoft Azure's point of view / interpretation of the standard, though they may themselves be wrong.

https://learn.microsoft.com/en-us/azure/frontdoor/front-door-caching?pivots=front-door-standard-premium#delivery-of-large-files

> Tip: If your origin compresses the response, ensure that the Content-Range header value matches the actual length of the compressed response.

Other resources and SRs lead to the same observation.

@IbrahimTanyalcin

@mscbpi yes, you are correct; I never objected to the RFC. However, from a technical standpoint it is very costly to compute the total zipped size and adjust the Content-Range value.

I wish a provisional header like Content-Range-Origin: transformed | identity existed, so that the ambiguity could be clarified and servers were given a choice. But it is OK; I guess the safest bet is to send Accept-Encoding: identity, or to use the filter option to drop compression support for that file type, and hope for the best.
