Rescanning the job, without restarting the job #1098

Open
ian-cameron opened this issue Feb 12, 2021 · 0 comments
Labels
feature_request for feature request

Is your feature request related to a problem? Please describe.

This feature would override the current behavior of comparing the lastrun date from _status.json with file creation dates. Instead, files would be indexed or re-indexed based on whether they are already in the index.

This is related to resuming a failed job and to looping a job. Currently the only way to resume a failed job is --restart, which re-indexes every file as if starting fresh. This is not ideal for a very large volume. This might help #828.

It would also address the problem of moved files and renamed parent directories: the affected files can be deleted from the index but are not re-indexed on the next job loop. They get skipped because their creation dates don't change when files are moved or their directories are renamed.

Describe the solution you'd like

Add an option --rescan

This option does a complete traversal of the folder, but compares each file's properties to the indexed version by querying for its file.url in Elasticsearch.

  • Skips the file if all properties match the index (created, last_modified, filesize).
  • Reindexes the file if any of the properties do not match the index.
  • Indexes the file if it's not found in the index.

This check will happen for every file, instead of using the 'lastscan' value from _status.json to skip files. It won't be as fast as using lastscan, but it will be more accurate and significantly faster than restarting the job.
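
To make the three cases above concrete, here is a minimal sketch of the per-file decision in Python. It is not FSCrawler code: `FileProps`, `Action`, `rescan_action` and the dict standing in for an Elasticsearch lookup by file.url are all hypothetical names used only for illustration, and the paths and timestamps are made-up sample values.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class Action(Enum):
    SKIP = "skip"        # all compared properties match the index
    REINDEX = "reindex"  # found in the index, but a property differs
    INDEX = "index"      # not found in the index at all


@dataclass(frozen=True)
class FileProps:
    """The three properties named in the request (hypothetical container)."""
    created: str        # ISO-8601 timestamp
    last_modified: str  # ISO-8601 timestamp
    filesize: int       # bytes


def rescan_action(on_disk: FileProps, indexed: Optional[FileProps]) -> Action:
    """Decide what a --rescan pass would do for one file.

    `indexed` is whatever a lookup by file.url returned, or None when
    the URL is not in the index.
    """
    if indexed is None:
        return Action.INDEX
    if on_disk == indexed:  # dataclass equality compares all three fields
        return Action.SKIP
    return Action.REINDEX


# Stand-in for the index, keyed by file.url (an assumption: a real
# implementation would query Elasticsearch instead of a dict).
index: Dict[str, FileProps] = {
    "file:///data/a.txt": FileProps("2021-01-01T00:00:00", "2021-01-02T00:00:00", 10),
    "file:///data/b.txt": FileProps("2021-01-01T00:00:00", "2021-01-02T00:00:00", 20),
}

# Files found during the traversal.
on_disk: Dict[str, FileProps] = {
    # unchanged
    "file:///data/a.txt": FileProps("2021-01-01T00:00:00", "2021-01-02T00:00:00", 10),
    # modified since it was indexed
    "file:///data/b.txt": FileProps("2021-01-01T00:00:00", "2021-02-01T00:00:00", 25),
    # moved into a renamed directory: same dates, new URL
    "file:///data/renamed/c.txt": FileProps("2020-06-01T00:00:00", "2020-06-01T00:00:00", 30),
}

decisions = {url: rescan_action(props, index.get(url)) for url, props in on_disk.items()}
```

Because the lookup is keyed by URL rather than by date, the moved file in this sample is treated as missing from the index and gets indexed, even though its creation date predates the last run.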

ian-cameron added the feature_request label Feb 12, 2021