
Imageserver stops working on stage when a file can't be found (not always the same file) #2148

Open · andrewjbtw opened this issue May 2, 2024 · 2 comments


@andrewjbtw

Occasionally, the imageserver stops serving images in the stage environment. This generally shows up in sul-embed as network "failed to fetch" errors. When this happens, the failures appear to be across the board rather than affecting only a subset of SDR items.

There are two imageserver nodes in stage and you can view their status at these links:
http://sul-imageserver-stage-a.stanford.edu/health
http://sul-imageserver-stage-b.stanford.edu/health

When the imageserver is having a problem, one or both of those health checks will report a "color" of "RED". There will also be a message like:

“/stacks/dm/057/nt/0476/asawa.jp2 (No such file or directory) (dm/057/nt/0476/asawa.jp2 -> edu.illinois.library.cantaloupe.source.FilesystemSource)”
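For spot-checking, something like this could watch both nodes (a minimal sketch; it assumes the /health endpoints return JSON with "color" and "message" fields, which is what the output quoted above suggests):

```python
#!/usr/bin/env python3
"""Poll both stage imageserver health endpoints and report their status."""
import json
import urllib.error
import urllib.request

HEALTH_URLS = [
    "http://sul-imageserver-stage-a.stanford.edu/health",
    "http://sul-imageserver-stage-b.stanford.edu/health",
]

for url in HEALTH_URLS:
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = json.load(resp)
    except urllib.error.HTTPError as e:
        # The server may answer with a non-200 status when the check fails;
        # the body should still be the JSON health document.
        body = json.load(e)
    except OSError as e:
        print(f"{url}: unreachable ({e})")
        continue
    color = body.get("color", "UNKNOWN")
    message = body.get("message", "")
    print(f"{url}: {color} {message}".rstrip())
```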

In each of the instances where I've investigated this issue, the file that's reported missing in the error message is a file that appeared to have been deleted properly. By "deleted properly" I mean that I've checked Argo and the item history shows that someone intentionally removed the file, which happens when someone changes the file's "shelve" status to "no". These have not been cases where the file was deleted outside of SDR processes, like someone going to the filesystem and just deleting it.

To resolve this error, what I've done is put the file back at the path indicated in the message. After doing that, the health check turns back to green. The imageserver doesn't seem to care whether it's the same file as before, only that a file exists at the indicated path; you could probably just run `touch /path/to/missing/file` and clear up the check.
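The whole workaround, in other words, is roughly this (a sketch using the path from the error message quoted above; an empty placeholder seems to be enough):

```python
from pathlib import Path

# Path taken from the health-check error message quoted above; the check
# only seems to care that *a* file exists at this location.
missing = Path("/stacks/dm/057/nt/0476/asawa.jp2")
missing.parent.mkdir(parents=True, exist_ok=True)  # in case the directories went too
missing.touch(exist_ok=True)  # empty placeholder file
```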

The odd thing is that once the error is cleared, we've found that you can then delete the same file and the error will not come back.

It is not clear what specifically generates this error, but since it apparently takes the whole server down, we would benefit from figuring out what's going on. I should note that I have never seen the issue in production, only stage.

Frequency of occurrence

The first time I remember being aware enough of this issue to monitor it was 2023-09 (see related Slack thread).

This happened again on 2024-05-01. The imageserver reported a file missing from an item that I had made dark. I reaccessioned the item to shelve the file and then the healthcheck turned green. Deleting the file again later did not trigger a recurrence of the error.

I had not been tracking occurrences, so those are the only two that I can identify with certain time frames.

Based on standup discussion this morning (2024-05-02) we decided we should create an issue, if only to have a place to track recurrences.

@andrewjbtw (Author)

Also, I'm starting this issue in embed because it's not clear whether we have a more specific repo for it.

@justinlittman (Contributor)

It looks like Cantaloupe's health check strategy is to check the file of the most recent image that it processed: https://github.com/cantaloupe-project/cantaloupe/blob/develop/src/main/java/edu/illinois/library/cantaloupe/status/HealthChecker.java

If that file is no longer there (which it might not be, e.g., if it was made dark), then this is considered a health check failure.
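In other words (a rough Python paraphrase of how I read that strategy, not Cantaloupe's actual Java):

```python
import os
import threading

# Paraphrase of HealthChecker.java as I read it: the server remembers the
# source file behind the most recently processed image, and the health
# check simply verifies that this file can still be read.
_lock = threading.Lock()
_last_source_file = None  # updated on every successfully processed image


def record_processed_image(path: str) -> None:
    global _last_source_file
    with _lock:
        _last_source_file = path


def health_check() -> tuple[str, str]:
    """Return ("GREEN", "") or ("RED", reason)."""
    with _lock:
        path = _last_source_file
    if path is None:
        return ("GREEN", "")  # nothing processed yet, so nothing to verify
    if not os.path.exists(path):
        # A file shelved away (e.g. made dark) after being served fails here
        # until a *different* image is processed, which moves the remembered
        # path forward.
        return ("RED", f"{path} (No such file or directory)")
    return ("GREEN", "")
```

If that reading is right, it would also explain the oddity above: restoring the file turns the check green, and once any other image has been served the remembered path moves on, so deleting the original file again no longer trips the check.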

An alternative is to use the more basic health check option (https://cantaloupe-project.github.io/manual/5.0/endpoints.html#Health%20Check), which is probably fine, especially for stage.
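If I'm reading the manual right, the relevant toggle is the health_dependency_check key in cantaloupe.properties, though that's an assumption worth verifying against the 5.0 docs linked above:

```properties
# Assumption: this key controls whether /health actively exercises the
# sources/caches (true) or only reports that the server process is up (false).
health_dependency_check = false
```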
