Rescanning the job, without restarting the job #1098

ian-cameron · 2021-02-12T23:33:19Z

Is your feature request related to a problem? Please describe.

This feature would override the current behavior of comparing the lastrun date from _status.json with file creation dates. Instead, it would index or re-index files based on whether they are already in the index.

This is related to resuming a failed job, and to looping a job. Currently the only way to resume a failed job is to use --restart, which re-indexes every file as if starting fresh. This is not ideal for a very large volume. This might help #828.

It would also address a problem that occurs when a file is moved or a parent directory is renamed: the files can be deleted from the index but are not re-indexed on the next job loop. They are skipped because their creation dates don't change when files are moved or their directories are renamed.

Describe the solution you'd like

Add an option --rescan

It does a complete traversal of the folder, but compares each file's properties to the indexed version by querying the file.url in Elasticsearch.

Skips the file if all properties match the index (created, last_modified, filesize).
Re-indexes the file if the properties don't all match the index.
Indexes the file if it's not found in the index.

This will happen for every file instead of using the 'lastscan' value from _status.json to skip files. It won't be as fast as using lastscan, but will be more accurate and will be significantly faster than restarting the job.
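
The three rules above could be sketched roughly as follows. This is only an illustration of the proposed decision logic, not FSCrawler code: `FileProps`, `Action` and `rescan_action` are hypothetical names, and the lookup of the indexed document by file.url is assumed to have happened already, returning `None` when nothing was found.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    SKIP = "skip"
    REINDEX = "reindex"
    INDEX = "index"


@dataclass(frozen=True)
class FileProps:
    """Hypothetical container for the three properties named in the request."""
    created: float        # creation time, e.g. seconds since the epoch
    last_modified: float  # last modification time
    filesize: int         # size in bytes


def rescan_action(on_disk: FileProps, indexed: Optional[FileProps]) -> Action:
    """Decide what a --rescan pass would do with one file.

    `indexed` is the result of looking the file up by its file.url
    (assumed to be done elsewhere); None means it is not in the index.
    """
    if indexed is None:
        return Action.INDEX      # not found in the index
    if on_disk == indexed:
        return Action.SKIP       # created, last_modified and filesize all match
    return Action.REINDEX        # at least one property differs


# Example: a file whose size changed since it was indexed gets re-indexed.
current = FileProps(created=1613000000.0, last_modified=1613170000.0, filesize=2048)
stored = FileProps(created=1613000000.0, last_modified=1613170000.0, filesize=1024)
decision = rescan_action(current, stored)
```

Because the decision depends only on the file and its indexed counterpart, a moved or renamed file would simply show up as not found under its new file.url and be indexed again, regardless of its creation date.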

ian-cameron added the feature_request label Feb 12, 2021
