HTTP response caching, with TTL and LRU logic #11342

Open
wants to merge 70 commits into
base: develop

Contributor

@GregoryTravis GregoryTravis commented Oct 16, 2024

This implements

  • A response cache for HTTP.{fetch,request} and Data.{read,fetch} and other methods that use those. (Data.download is not cached.)
  • Per-file request limit
  • Total cache size limit
  • TTL logic, based on HTTP response headers if available
  • LRU logic for removing old entries to make room
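
As a rough illustration of the TTL item above, here is a minimal sketch of deriving a TTL from response headers. It assumes the relevant headers are `Cache-Control: max-age` and `Age`, and that header names have already been lower-cased; the PR description does not say which headers are consulted, so both are assumptions, and `TtlSketch` is a hypothetical name rather than a class in this PR.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

final class TtlSketch {
  // Fallback used when no usable header is present (one year, as in this PR's constant).
  static final int DEFAULT_TTL_SECONDS = 31536000;

  /** Returns max-age minus Age (never negative), or the default when max-age is absent. */
  static int ttlSeconds(Map<String, String> lowerCasedHeaders) {
    Optional<Integer> maxAge = parseMaxAge(lowerCasedHeaders.get("cache-control"));
    if (!maxAge.isPresent()) {
      return DEFAULT_TTL_SECONDS;
    }
    int age = parseNonNegativeInt(lowerCasedHeaders.get("age")).orElse(0);
    return Math.max(0, maxAge.get() - age);
  }

  /** Looks for a max-age=N directive in a comma-separated Cache-Control value. */
  static Optional<Integer> parseMaxAge(String cacheControl) {
    if (cacheControl == null) {
      return Optional.empty();
    }
    for (String directive : cacheControl.split(",")) {
      String d = directive.trim().toLowerCase(Locale.ROOT);
      if (d.startsWith("max-age=")) {
        return parseNonNegativeInt(d.substring("max-age=".length()));
      }
    }
    return Optional.empty();
  }

  /** Parses a non-negative integer, treating anything malformed as absent. */
  static Optional<Integer> parseNonNegativeInt(String s) {
    if (s == null) {
      return Optional.empty();
    }
    try {
      int value = Integer.parseInt(s.trim());
      return value >= 0 ? Optional.of(value) : Optional.empty();
    } catch (NumberFormatException e) {
      return Optional.empty();
    }
  }
}
```

Malformed or missing values fall back to the default rather than failing the request, which seems the safer choice for a cache that is only an optimization.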

Checklist

Please ensure that the following checklist has been satisfied before submitting the PR:

  • The documentation has been updated, if necessary.
  • Screenshots/screencasts have been attached, if there are any visual changes. For interactive or animated visual changes, a screencast is preferred.
  • All code follows the
    Scala,
    Java,
    TypeScript,
    and
    Rust
    style guides. In case you are using a language not listed above, follow the Rust style guide.
  • Unit tests have been written where possible.
  • If meaningful changes were made to logic or tests affecting Enso Cloud integration in the libraries,
    or the Snowflake database integration, a run of the Extra Tests has been scheduled.
    • If applicable, it is suggested to paste a link to a successful run of the Extra Tests.

Member

@jdunkerley jdunkerley left a comment

The overall approach looks great to me.

A few changes though please.
I think we should also put the LRU into its own class and make it thread-safe sooner rather than later (as the next step will be parallel downloads!)
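
To make the suggestion concrete, a standalone, thread-safe LRU tracker with a total size limit could look roughly like the sketch below. `LruSizeTracker` and its methods are hypothetical names, not code from this PR, and the sketch only tracks entry sizes; deleting the backing files on eviction is left out.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

final class LruSizeTracker<K> {
  private final long maxTotalBytes;
  private long totalBytes = 0;
  // accessOrder=true keeps iteration order from least to most recently used.
  private final LinkedHashMap<K, Long> sizes = new LinkedHashMap<>(16, 0.75f, true);

  LruSizeTracker(long maxTotalBytes) {
    this.maxTotalBytes = maxTotalBytes;
  }

  /**
   * Records an entry, evicting least-recently-used entries until it fits.
   * Returns false (and stores nothing) if the entry alone exceeds the limit.
   */
  synchronized boolean put(K key, long sizeBytes) {
    if (sizeBytes > maxTotalBytes) {
      return false;
    }
    Long previous = sizes.remove(key);
    if (previous != null) {
      totalBytes -= previous;
    }
    Iterator<Map.Entry<K, Long>> it = sizes.entrySet().iterator();
    while (totalBytes + sizeBytes > maxTotalBytes && it.hasNext()) {
      totalBytes -= it.next().getValue();
      it.remove();
    }
    sizes.put(key, sizeBytes);
    totalBytes += sizeBytes;
    return true;
  }

  /** Marks the entry as recently used; returns false if it is not present. */
  synchronized boolean touch(K key) {
    return sizes.get(key) != null;
  }

  /** Checks presence without changing the recency order. */
  synchronized boolean contains(K key) {
    return sizes.containsKey(key);
  }

  synchronized long totalBytes() {
    return totalBytes;
  }
}
```

Coarse `synchronized` methods are the simplest way to get correctness here; if parallel downloads make this a bottleneck, finer-grained locking could be considered later.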

distribution/lib/Standard/Base/0.0.0-dev/src/Data.enso Outdated Show resolved Hide resolved
Comment on lines +382 to +389
MessageDigest messageDigest = MessageDigest.getInstance("SHA-256");
messageDigest.update(resolvedURI.toString().getBytes());

var sortedHeaders = resolvedHeaders.stream().sorted(headerNameComparator).toList();
for (Pair<String, String> resolvedHeader : sortedHeaders) {
  messageDigest.update(resolvedHeader.getLeft().getBytes());
  messageDigest.update(resolvedHeader.getRight().getBytes());
}
Member

Could we just use Java's string hash?
Any reason to use SHA-256 over the faster MD5?

Contributor Author

Do you mean concatenate the uri and headers and use String.hashCode()? I used this for better security, but can definitely switch to MD5.

Member

SHA-256 is heavier to compute than a simple MD5. I'm not sure it makes much difference for security, as reversing an MD5 is pretty hard anyway, but as this is on the headers and URI, it's not a real issue.

The Java string hash suggestion was just: couldn't we make one big string and hash it? But I prefer yours.

Member

used this for better security

Security of what? I mean:

  • there is no check whether the first download of the HTTP resource is trustworthy, right?
  • e.g. the system downloads everything and just puts a stamp on it
  • if someone can get access to a computer (user account) then there is no security anyway
  • so: what kind of attack is this hash shielding the user against?

Contributor Author

Synthesizing a URL that would hash to an existing resource in the user's cache that they hadn't intended to load. It's an extremely unlikely case, but I thought of it just as the routine use of non-trivial hashes for resource naming, and the cost is once per HTTP request. I'm fine using hashCode for this.
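
For reference in this thread, a self-contained variant of the SHA-256 key computation might look like the sketch below. It is not the code in the PR: `CacheKeySketch` is a hypothetical name, `Map.Entry` stands in for `Pair`, and two details are my own additions, namely an explicit UTF-8 charset (plain `getBytes()` depends on the default charset) and a zero-byte separator so that adjacent fields cannot run together.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

final class CacheKeySketch {
  /** Hashes the URI and the headers (sorted by name, then value) into a hex string. */
  static String hashKey(String uri, List<Map.Entry<String, String>> headers) {
    MessageDigest digest;
    try {
      digest = MessageDigest.getInstance("SHA-256");
    } catch (NoSuchAlgorithmException e) {
      // SHA-256 is required to be available on every Java platform.
      throw new IllegalStateException(e);
    }
    update(digest, uri);

    // Sort a copy so the key does not depend on the order the headers were given in.
    List<Map.Entry<String, String>> sorted = new ArrayList<>(headers);
    sorted.sort(
        Comparator.comparing((Map.Entry<String, String> e) -> e.getKey())
            .thenComparing(e -> e.getValue()));
    for (Map.Entry<String, String> header : sorted) {
      update(digest, header.getKey());
      update(digest, header.getValue());
    }

    StringBuilder hex = new StringBuilder();
    for (byte b : digest.digest()) {
      hex.append(String.format("%02x", b));
    }
    return hex.toString();
  }

  /** Feeds one field into the digest, followed by a separator byte. */
  private static void update(MessageDigest digest, String field) {
    digest.update(field.getBytes(StandardCharsets.UTF_8));
    digest.update((byte) 0);
  }
}
```

Without the separator, the header pairs ("a", "bc") and ("ab", "c") would feed identical bytes into the digest; whether that matters in practice is a judgment call, but it costs one extra byte per field.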

Comment on lines +46 to +47
// 1 year.
private static final int DEFAULT_TTL_SECONDS = 31536000;
Member

Is this some standardized value? If not, isn't it too large for the purposes of this cache? It feels like something around 24h would be more fitting, no?

Contributor Author

This was suggested by @jdunkerley.

@GregoryTravis
Contributor Author

Comments resolved except for nio change.

Member

@radeusgd radeusgd left a comment

The approach looks very good and I'm very happy about how readable the code is, even though quite a lot is happening, it is really easy to follow it and understand it. It's really appreciated! ❤️


I have some reservations about 'security'. While our current design of the secrets system cannot be 100% secure, because the secrets are in the JVM and can always be recovered e.g. using reflection, we still want to make the 'attack surface' as small as possible.

Adding the relatively complex logic of TransientHTTPResponseCache to the enso_cloud package, which deals with secrets, makes the 'sensitive surface' quite large: one has to look at all the methods of the TransientHTTPResponseCache class and make sure that secrets aren't leaked anywhere. And in fact I did find a leak there.

I would suggest making the 'sensitive' surface smaller by adding a bit more encapsulation. We could move TransientHTTPResponseCache to a separate package and make the 'secure' EnsoSecretHelper communicate with it through a much smaller API that can be more easily analyzed for security.

I think the RequestMaker interface is a good starting point. What if we extend it a bit?

I'd suggest

interface RequestMaker {
    /* Executes the HTTP request and returns the response. All secret handling should be encapsulated inside of the `run` function, so that secret values are not exposed to the outside world. */
    EnsoHttpResponse run() throws IOException, InterruptedException;

    /* Returns a hash key that can be used to uniquely identify this request (by hashing its URI and headers). This will be used to decide if the `run` method should be executed, or if a cached response will be returned.
       The hash key should not be reversible - any secrets that were present inside of the URI or headers should not be recoverable from the hash key. */
    String hashKey();

    /* When a cached response is returned, instead of executing `run`, this method is used to construct the response. */
    EnsoHttpResponse reconstructResponseFromCachedStream(InputStream cachedResponseData, ...);
}

By encapsulating all the logic for running the actual query and constructing the response from cache in the RequestMaker, the TransientHTTPResponseCache can be 'clueless' about the details and security of secret handling - the TransientHTTPResponseCache itself is not given any secret information at all (in fact it no longer needs to know even the non-secret rendered URI), so we don't have to worry about its security anymore. We only need to be careful about the RequestMaker implementation to avoid leaking secrets there - but this is a much smaller piece of code than the whole TransientHTTPResponseCache. So it becomes much easier.

The ... in the code snippet above indicates that we still need some more info to be able to reconstruct the response. I think we could parametrize the RequestMaker by a ResponseMetadata type - and add a ResponseMetadata dumpMetadataForCache(EnsoHttpResponse) method to the RequestMaker interface. Then the cache would call this when storing the cache entry and then provide an instance of ResponseMetadata to reconstructResponseFromCachedStream. This way even the stored metadata will stay opaque to the cache.
Alternatively, and perhaps much more simply, we could keep storing the metadata (response headers and status code) as we do now, and add them as parameters for reconstructing the response. This metadata is not secret anyway, so we can store it in the cache directly. I'd just avoid storing the URI and instead let the RequestMaker reconstruct it.
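
To illustrate the first option (metadata that stays opaque to the cache), here is a rough, self-contained sketch. All names are hypothetical stand-ins: `SketchResponse` for EnsoHttpResponse, `SketchRequestMaker` for the extended RequestMaker, and `SketchCache` for TransientHTTPResponseCache. It keeps bodies in memory and omits TTL, LRU and size limits, so it only shows how the cache and the request maker would interact.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for EnsoHttpResponse.
final class SketchResponse {
  final int status;
  final byte[] body;

  SketchResponse(int status, byte[] body) {
    this.status = status;
    this.body = body;
  }
}

// M is the metadata type; only the request maker knows what it contains.
interface SketchRequestMaker<M> {
  SketchResponse run() throws IOException, InterruptedException;

  String hashKey();

  M dumpMetadataForCache(SketchResponse response);

  SketchResponse reconstructResponseFromCachedStream(InputStream cachedResponseData, M metadata)
      throws IOException;
}

final class SketchCache {
  private static final class Entry {
    final byte[] body;
    final Object metadata; // opaque: the cache never inspects it

    Entry(byte[] body, Object metadata) {
      this.body = body;
      this.metadata = metadata;
    }
  }

  private final Map<String, Entry> entries = new HashMap<>();

  /** Returns a cached response if the hash key is known, otherwise runs the request. */
  <M> SketchResponse makeRequest(SketchRequestMaker<M> maker)
      throws IOException, InterruptedException {
    String key = maker.hashKey();
    Entry cached = entries.get(key);
    if (cached != null) {
      // Safe as long as the same hash key always comes from the same kind of request maker.
      @SuppressWarnings("unchecked")
      M metadata = (M) cached.metadata;
      return maker.reconstructResponseFromCachedStream(
          new ByteArrayInputStream(cached.body), metadata);
    }
    SketchResponse response = maker.run();
    entries.put(key, new Entry(response.body, maker.dumpMetadataForCache(response)));
    return response;
  }
}
```

In this shape the cache sees only the hash key, the body bytes and an object it cannot interpret, so only the request maker implementation would need a security review.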
