HTTP response caching, with TTL and LRU logic #11342

GregoryTravis · 2024-10-16T17:55:11Z

This implements:

- A response cache for HTTP.{fetch,request} and Data.{read,fetch} and other methods that use those. (Data.download is not cached.)
- Per-file request limit
- Total cache size limit
- TTL logic, based on HTTP response headers if available
- LRU logic for removing old entries to make room
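
As an illustration of the TTL item, here is a minimal sketch of deriving a TTL from response headers. It assumes the TTL is taken from `Cache-Control: max-age` minus the `Age` header, with a caller-supplied default when `max-age` is missing, and that header names arrive lower-cased. The class and method names are hypothetical and are not taken from this PR.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper (not the PR's code): derives a TTL in seconds from response headers.
// Assumes the header map uses lower-cased header names.
final class TtlFromHeaders {
  private TtlFromHeaders() {}

  /** Returns max-age minus Age (never negative), or defaultTtlSeconds if max-age is absent or malformed. */
  static long ttlSeconds(Map<String, String> headers, long defaultTtlSeconds) {
    Optional<Long> maxAge = parseMaxAge(headers.get("cache-control"));
    if (maxAge.isEmpty()) {
      return defaultTtlSeconds;
    }
    // A missing or malformed Age header is treated as 0.
    long age = parseNonNegativeLong(headers.get("age")).orElse(0L);
    return Math.max(0L, maxAge.get() - age);
  }

  // Scans the comma-separated Cache-Control directives for max-age=<seconds>.
  private static Optional<Long> parseMaxAge(String cacheControl) {
    if (cacheControl == null) {
      return Optional.empty();
    }
    for (String directive : cacheControl.split(",")) {
      String d = directive.trim().toLowerCase(Locale.ROOT);
      if (d.startsWith("max-age=")) {
        return parseNonNegativeLong(d.substring("max-age=".length()));
      }
    }
    return Optional.empty();
  }

  private static Optional<Long> parseNonNegativeLong(String s) {
    if (s == null) {
      return Optional.empty();
    }
    try {
      long v = Long.parseLong(s.trim());
      return v >= 0 ? Optional.of(v) : Optional.empty();
    } catch (NumberFormatException e) {
      return Optional.empty();
    }
  }
}
```

For example, `TtlFromHeaders.ttlSeconds(Map.of("cache-control", "public, max-age=100", "age", "30"), 86400)` yields 70 under these assumptions.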


distribution/lib/Standard/Base/0.0.0-dev/src/Data.enso

jdunkerley

The overall approach looks great to me.

A few changes though please.
I think we should also put the LRU into its own class and make it thread-safe sooner rather than later (as the next step will be parallel downloads!)
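
To make the suggestion concrete, here is a minimal sketch of an LRU index in its own class, guarded by a single lock. It is an assumption about one possible shape, not the PR's implementation: the class name, the byte-based size limit, and the choice to return evicted keys are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class (not from the PR): a size-bounded LRU index where every
// public method is synchronized, so concurrent downloads can share one instance.
final class SynchronizedLruIndex {
  private final long maxTotalBytes;
  private long totalBytes = 0;
  // accessOrder=true: iteration runs from least to most recently used.
  private final LinkedHashMap<String, Long> sizes = new LinkedHashMap<>(16, 0.75f, true);

  SynchronizedLruIndex(long maxTotalBytes) {
    this.maxTotalBytes = maxTotalBytes;
  }

  /**
   * Records an entry, then evicts least recently used entries until the total fits.
   * Returns the evicted keys so the caller can delete the backing files.
   * An entry larger than the whole limit is evicted immediately.
   */
  synchronized List<String> put(String key, long sizeBytes) {
    Long previous = sizes.put(key, sizeBytes);
    if (previous != null) {
      totalBytes -= previous;
    }
    totalBytes += sizeBytes;
    List<String> evicted = new ArrayList<>();
    Iterator<Map.Entry<String, Long>> it = sizes.entrySet().iterator();
    while (totalBytes > maxTotalBytes && it.hasNext()) {
      Map.Entry<String, Long> eldest = it.next();
      totalBytes -= eldest.getValue();
      evicted.add(eldest.getKey());
      it.remove();
    }
    return evicted;
  }

  /** Marks the entry as recently used; returns false if the key is not present. */
  synchronized boolean touch(String key) {
    return sizes.get(key) != null;
  }

  synchronized long totalBytes() {
    return totalBytes;
  }
}
```

A single coarse lock is the simplest choice here; if lock contention ever became measurable with parallel downloads, a finer-grained design could replace it without changing the callers.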

jdunkerley · 2024-10-17T10:42:16Z

std-bits/base/src/main/java/org/enso/base/enso_cloud/TransientHTTPResponseCache.java

+      MessageDigest messageDigest = MessageDigest.getInstance("SHA-256");
+      messageDigest.update(resolvedURI.toString().getBytes());
+
+      var sortedHeaders = resolvedHeaders.stream().sorted(headerNameComparator).toList();
+      for (Pair<String, String> resolvedHeader : sortedHeaders) {
+        messageDigest.update(resolvedHeader.getLeft().getBytes());
+        messageDigest.update(resolvedHeader.getRight().getBytes());
+      }


Could we just use Java's string hash?
Any reason to use SHA-256 over faster MD5?

Do you mean concatenate the uri and headers and use String.hashCode()? I used this for better security, but can definitely switch to MD5.

SHA-256 is heavier to compute than a simple MD5. I'm not sure it makes much difference to security, as reversing an MD5 is pretty hard anyway, but as this is only hashing the headers and URI it's not a real issue.

My point about the Java string hash was just: couldn't we build one big string and hash it? But I prefer yours.

> used this for better security

Security of what? I mean:

- There is no check whether the first download of the HTTP resource is trustworthy, right?
- E.g. the system downloads everything and just puts a stamp on it.
- If someone can get access to a computer (user account), then there is no security anyway.

So: what kind of attack is this hash shielding the user against?

Synthesizing a URL that would hash to an existing resource in the user's cache that they hadn't intended to load. It's an extremely unlikely case, but I thought of it just as the routine use of non-trivial hashes for resource naming, and the cost is once per HTTP request. I'm fine using hashCode for this.
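
For reference, the approach under discussion can be sketched as a standalone function. This is a minimal sketch, not the PR's code: the class and method names are hypothetical, and it departs from the snippet above in two labeled ways. It pins the charset to UTF-8 instead of the platform default, and it length-prefixes each field so that adjacent fields cannot run together.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not from the PR): builds a cache key from a URI and its headers.
final class CacheKeySketch {
  private CacheKeySketch() {}

  /** Returns a 64-character hex SHA-256 digest that does not depend on header order. */
  static String sha256Key(String uri, List<Map.Entry<String, String>> headers) {
    try {
      MessageDigest digest = MessageDigest.getInstance("SHA-256");
      update(digest, uri);
      // Sort by name, then value, so the same header set always hashes identically.
      Comparator<Map.Entry<String, String>> byName = Map.Entry.comparingByKey();
      Comparator<Map.Entry<String, String>> byValue = Map.Entry.comparingByValue();
      List<Map.Entry<String, String>> sorted = new ArrayList<>(headers);
      sorted.sort(byName.thenComparing(byValue));
      for (Map.Entry<String, String> header : sorted) {
        update(digest, header.getKey());
        update(digest, header.getValue());
      }
      StringBuilder hex = new StringBuilder();
      for (byte b : digest.digest()) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // Every Java platform is required to provide SHA-256.
      throw new IllegalStateException(e);
    }
  }

  // Length-prefix each field so ("ab", "c") and ("a", "bc") feed different bytes to the digest.
  private static void update(MessageDigest digest, String field) {
    byte[] bytes = field.getBytes(StandardCharsets.UTF_8);
    digest.update(ByteBuffer.allocate(4).putInt(bytes.length).array());
    digest.update(bytes);
  }
}
```

By contrast, `String.hashCode()` produces only a 32-bit value, so accidental collisions between distinct requests become plausible in a large cache, independent of any deliberate attack.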

radeusgd · 2024-10-17T14:26:20Z

std-bits/base/src/main/java/org/enso/base/enso_cloud/TransientHTTPResponseCache.java

+  // 1 year.
+  private static final int DEFAULT_TTL_SECONDS = 31536000;


Is this some standardized value? If not, isn't it too large for the purposes of this cache? It feels like something around 24h would be more fitting, no?

This was suggested by @jdunkerley.

GregoryTravis · 2024-10-17T20:38:10Z

Comments resolved except for nio change.

radeusgd

The approach looks very good, and I'm very happy about how readable the code is: even though quite a lot is happening, it is really easy to follow and understand. It's really appreciated! ❤️

I have some reservations about 'security'. While our current design of the secrets system cannot be 100% secure, because the secrets are in the JVM and can always be recovered e.g. using reflection, we still want to make the 'attack surface' as small as possible.

Adding the relatively complex logic of TransientHTTPResponseCache to the enso_cloud package, which deals with secrets, makes the 'sensitive surface' quite large: one has to look at all the methods of the TransientHTTPResponseCache class and make sure that secrets aren't leaked anywhere. And in fact I did find a leak there.

I would suggest making the 'sensitive' surface smaller by adding a bit more encapsulation. We could move TransientHTTPResponseCache to a separate package and make the 'secure' EnsoSecretHelper communicate with it through a much smaller API that can be more easily analyzed for security.

I think the RequestMaker interface is a good starting point. What if we extend it a bit?

I'd suggest

interface RequestMaker {
    /* Executes the HTTP request and returns the response. All secret handling should be encapsulated inside of the `run` function, so that secret values are not exposed to the outside world. */
    EnsoHttpResponse run() throws IOException, InterruptedException;

    /* Returns a hash key that can be used to uniquely identify this request (by hashing its URI and headers). This will be used to decide if the `run` method should be executed, or if a cached response will be returned.
       The hash key should not be reversible - any secrets that were present inside of the URI or headers should not be recoverable from the hash key. */
    String hashKey();

    /* When a cached response is returned, instead of executing `run`, this method is used to construct the response. */
    EnsoHttpResponse reconstructResponseFromCachedStream(InputStream cachedResponseData, ...);
}

By encapsulating all the logic for running the actual query and constructing the response from cache in the RequestMaker, the TransientHTTPResponseCache can be 'clueless' about the details and security of secret handling - the TransientHTTPResponseCache itself is not given any secret information at all (in fact it no longer even needs to know the non-secret rendered URI), so we don't have to worry about its security anymore. We only need to be careful about the RequestMaker implementation to avoid leaking secrets there - but this is a much smaller piece of code than the whole TransientHTTPResponseCache, so it becomes much easier.

The ... in the code snippet above indicates that we still need some more information to be able to reconstruct the response. I think we could parametrize the RequestMaker by a ResponseMetadata type and add a ResponseMetadata dumpMetadataForCache(EnsoHttpResponse) method to the RequestMaker interface. The cache would then call this when storing the cache entry, and later provide that ResponseMetadata instance to reconstructResponseFromCachedStream. This way even the stored metadata stays opaque to the cache.
Alternatively, and perhaps much simpler, we could keep storing the metadata (response headers and status code) as we do now, and pass it as parameters when reconstructing the response - this metadata is not secret anyway, so we can store it in the cache directly. I'd just avoid storing the URI and instead let the RequestMaker reconstruct it.
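
The first variant (opaque metadata) could look roughly like the sketch below. Everything in it is a hypothetical stand-in: SketchResponse replaces the real EnsoHttpResponse, the in-memory map replaces the on-disk cache, and caching only status 200 is an assumption. The point is only that the cache touches nothing but the hash key, the body bytes, and a metadata object it never inspects.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the response type; the real EnsoHttpResponse is not reproduced here.
record SketchResponse(int statusCode, byte[] body) {}

// M is the metadata type; the cache stores it without looking inside.
interface SketchRequestMaker<M> {
  SketchResponse run() throws IOException, InterruptedException;

  String hashKey();

  M dumpMetadataForCache(SketchResponse response);

  SketchResponse reconstructResponseFromCachedStream(InputStream cachedResponseData, M metadata)
      throws IOException;
}

// The cache never sees the URI, headers, or secrets: only the key, bytes, and opaque metadata.
final class OpaqueCacheSketch {
  private record Entry(byte[] body, Object metadata) {}

  private final Map<String, Entry> entries = new HashMap<>();

  <M> SketchResponse makeRequest(SketchRequestMaker<M> maker)
      throws IOException, InterruptedException {
    String key = maker.hashKey();
    Entry hit = entries.get(key);
    if (hit != null) {
      // Assumption: a given hash key is always used with the same metadata type.
      @SuppressWarnings("unchecked")
      M metadata = (M) hit.metadata();
      return maker.reconstructResponseFromCachedStream(
          new ByteArrayInputStream(hit.body()), metadata);
    }
    SketchResponse response = maker.run();
    if (response.statusCode() == 200) { // Assumption: only successful responses are cached.
      entries.put(key, new Entry(response.body().clone(), maker.dumpMetadataForCache(response)));
    }
    return response;
  }
}
```

With this shape, a security review only has to cover the SketchRequestMaker implementation; the cache class can be audited purely for correctness.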

GregoryTravis added 30 commits October 2, 2024 15:26

wip

2aae40a

wip

2be76a5

Merge branch 'develop' into wip/gmt/10640-download-cache

b2372c1

builds

8761b11

set flag

dd48dbb

Merge branch 'develop' into wip/gmt/10640-download-cache

7df84fb

wip

e6ec4c9

merge

7c7b200

max-age and length in test server

c358f69

get maxAge from server

5b6d70f

Merge branch 'develop' into wip/gmt/10640-download-cache

ac81dd2

builds

4081c51

works

0d2e215

other entry points

d6827d0

only 200

ca985f6

fixed cache flag logic

5eb4419

old tests pass

bbc817e

Cache_Policy

c32a3d9

wip

afeec45

one test

fdf0f9b

merge

0fcd36b

test

9b6d93c

test

5e419f6

tests

4873f80

tests

73d6041

tests

543e219

expiry test

0e2c18f

Age header test

66a6a18

Merge branch 'develop' into wip/gmt/10640-download-cache

a9067f1

wip

24d1648

changelog

08035ec

GregoryTravis commented Oct 16, 2024


jdunkerley requested changes Oct 17, 2024

radeusgd reviewed Oct 17, 2024

radeusgd reviewed Oct 17, 2024

GregoryTravis added 5 commits October 17, 2024 10:43

Merge branch 'develop' into wip/gmt/10640-download-cache

47f3f62

review

56a0320

only run expiry when adding cache entries

8506397

wip

649179f

review

374992f

GregoryTravis requested a review from radeusgd October 17, 2024 20:38