Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update compileTime used for caching? #1062

Open
MarcelWaldvogel opened this issue Jan 14, 2022 · 10 comments
Open

Update compileTime used for caching? #1062

MarcelWaldvogel opened this issue Jan 14, 2022 · 10 comments

Comments

@MarcelWaldvogel
Copy link
Contributor

There is a hardcoded compileTime in the caching file, which is used in the Last-Modified header of the returned files:

private final long compileTime = 1567581178724L;

As the returned diagram will only depend on the input string and the software version used, this seems like a good approach. However, the date in there is Wed Sep 4 07:12:58 UTC 2019 and probably some pieces of the rendering have changed in these 2+ years, so that this variable should be updated regularly.

What comes to mind right now:

  • Change the source file during the build process (downside: modifies file which is under source code control)
  • Change the source code when releasing a new version
  • Outsource compileTime to a file not under source code control, which is e.g. created by the Makefile before the mvn runs
  • Use the starting time of the server process (should be relatively stable, as the process is probably long-running)

What do you think?

@ggrossetie
Copy link
Member

I think we should find a way to include the diagram version:

The last modification date depends on:

The main issue is that Last-Modified is an HTTP date, not a unique identifier. Using Kroki compile time or start time is inaccurate.

MDN Web Docs states:

The Last-Modified response HTTP header contains a date and time when the origin server believes the resource was last modified. It is used as a validator to determine if the resource is the same as the previously stored one. Less accurate than an ETag header, it is a fallback mechanism.

Since ETag header is fully supported, I guess we could remove this header? https://caniuse.com/?search=etag

However, and as mentioned above, we will need to compute the ETag from the input string, diagram options and the diagram library version.

@MarcelWaldvogel
Copy link
Contributor Author

So, Last-Modified would just be the current date? (I think dropping it would not be helpful/necessary)

I see no ETag handling in the code, just ETag creation. So, the caching and If-Match handling (and If-Modified-Since for Last-Modified above) is entire left to caches (proxy, client)?

Then, why not just calculate the ETag as a strong hash of the response? Switching to input parameters (input string, options, versions) seems more complicated and only makes sense when Kroki itself can forego repeated processing by handling If-Match/If-Modified-Since itself?

@ggrossetie
Copy link
Member

So, Last-Modified would just be the current date? (I think dropping it would not be helpful/necessary)

I don't think we should set Last-Modified if we cannot reliably compute a value.

If we really want to keep it then I guess we should use the release/build date (for instance, using Maven maven.build.timestamp).

I see no ETag handling in the code, just ETag creation. So, the caching and If-Match handling (and If-Modified-Since for Last-Modified above) is entire left to caches (proxy, client)?

Currently yes but I think the gateway server should implement this logic and return a 304 Not Modified.

Then, why not just calculate the ETag as a strong hash of the response?

Because we want to save processing time. If we need to generate an image from the input in order to calculate an ETag that's a bit unfortunate.
Since generating an image is a pure function (i.e., it always return the same image from the same input, given that the diagram library used is the same), we should leverage that and calculate the ETag using the input.

Switching to input parameters (input string, options, versions) seems more complicated and only makes sense when Kroki itself can forego repeated processing by handling If-Match/If-Modified-Since itself?

That's the plan 👍

@MarcelWaldvogel
Copy link
Contributor Author

I'm in favor of outsourcing complexity to someone who already has implemented it instead of implementing it on my own.

So, unless this is a real issue (I do not know the load on kroki.io, but it should be no problem at all for a local installation), I would add a frontend proxy with enough on-disk cache, and the problem is solved, however we create Last-Modified and the ETag. Then, also, the Last-Modified may be set to the current time (which is fine, as it will not be used and touching the file on disk would have the same result) and ETag can just be calculated from the output. (I got the impression that configuration variables can be all over the HTTP request, in all kinds of formats/capitalizations; but this is just from the glance when trying to make header checks work.)

If it becomes a performance problem (or already is?), then looking at Kroki-side caching might make sense, but having more data on the workload would be great (how often is the same file requested, how long does generating take, …).

@ggrossetie
Copy link
Member

I'm in favor of outsourcing complexity to someone who already has implemented it instead of implementing it on my own.

Me too but in this case, NGINX (or really any reverse/frontend proxy) cannot do much with ETag.

So, unless this is a real issue (I do not know the load on kroki.io, but it should be no problem at all for a local installation), I would add a frontend proxy with enough on-disk cache, however we create Last-Modified and the ETag

As far as I know, NGINX won't use ETag but Expires and Cache-Control on GET requests.

Then, also, the Last-Modified may be set to the current time (which is fine, as it will not be used and touching the file on disk would have the same result)

To quote MDN: "The Last-Modified response HTTP header contains a date and time when the origin server believes the resource was last modified."

https://kroki.io/graphviz/svg/eNpLyUwvSizIUHBXqPZIzcnJ17ULzy_KSanlAgB1EAjQ is basically a static image (we could generate it at build/compilation time). So the resource was last modified when Kroki was released/built.

and ETag can just be calculated from the output. (I got the impression that configuration variables can be all over the HTTP request, in all kinds of formats/capitalizations; but this is just from the glance when trying to make header checks work.)

Again, if we do that we cannot save processing time which is a missed opportunity.
We can indeed pass diagram options using query parameters, HTTP headers or in the body of POST requests but they are normalized as a JSON object before calling the pure function:

void convert(String sourceDecoded, String serviceName, FileFormat fileFormat, JsonObject options, Handler<AsyncResult<Buffer>> handler);

This function will return the same output given the same input. So the ETag can be generated just before calling convert if the request contains a If-None-Match header. If the generates ETag matches the If-None-Match value, then we can return a 304 without a body! (and without calling convert).

We should implement this logic here:

convert(routingContext, sourceDecoded, serviceName, fileFormat);

And there:

convert(routingContext, diagramSource, serviceName, fileFormat);

If it becomes a performance problem (or already is?), then looking at Kroki-side caching might make sense, but having more data on the workload would be great (how often is the same file requested, how long does generating take, …).

We don't want to cache images (i.e., save on disk) on the Kroki server but we do want to be as efficient as possible. If we can skip work, respond faster and save bandwidth then we should do it.

@ggrossetie
Copy link
Member

Anyway, this issue was about Last-Modified and compileTime. Let's focus on that.
I think compileTime should be equal to maven.build.timestamp. We can add a reference in: https://github.com/yuzutech/kroki/blob/main/server/src/main/resources/application.properties

app.buildTime=${maven.build.timestamp}

Then, we can (statically) load /application.properties and get the value of app.buildTime.

I think we should create a dedicated issue for ETag.

@MarcelWaldvogel
Copy link
Contributor Author

It seems that the design for Last-Modified is clear, i.e., that it should be the time at which the code was built, as the result will only depend on the request and the code. I'm spinning off the caching discussion to a new issue.

(One open question about the Last-Modified time: When we have external containers, such as Mermaid, will they create their own Last-Modified or will Kroki set its build time? The former would be preferred, in case the Mermaid container was updated independently.)

@ggrossetie
Copy link
Member

It seems that the design for Last-Modified is clear, i.e., that it should be the time at which the code was built, as the result will only depend on the request and the code. I'm spinning off the caching discussion to a new issue.

👍🏻

(One open question about the Last-Modified time: When we have external containers, such as Mermaid, will they create their own Last-Modified or will Kroki set its build time? The former would be preferred, in case the Mermaid container was updated independently.)

Ideally yes but, currently, we do release all containers all together so it's not really an issue 😄

@MarcelWaldvogel
Copy link
Contributor Author

Build time as Last-Modified is now included in #1065, together with tests.

@ggrossetie
Copy link
Member

@MarcelWaldvogel I believe the initial scope of this issue was resolved in #1065. Should we close this issue?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants