Is your feature request related to a problem? Please describe.
This feature would override the current behavior of comparing the lastrun date from _status.json against file creation dates, and instead index or re-index files based on whether they are already in the index.
This is related to resuming a failed job and looping a job. Currently the only way to resume a failed job is --restart, which re-indexes every file as if starting fresh. That is not ideal for a very large volume. This might help #828
It would also address the problem where, when a file is moved or a parent directory is renamed, files can be deleted from the index but never get re-indexed on the next job loop. They are skipped because their creation dates don't change when files are moved or their directories are renamed.
Describe the solution you'd like
Add an option --rescan
Does a complete traversal of the folder, but compares each file's properties to the indexed version by querying file.url in Elasticsearch.
Skips the file if all properties match the index (created, last_modified, filesize)
Reindexes the file if the properties don't all match the index.
Indexes the file if it's not found in the index.
This will happen for every file instead of using the 'lastscan' value from _status.json to skip files. It won't be as fast as using lastscan, but will be more accurate and will be significantly faster than restarting the job.
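The skip/reindex/index decision described above could be sketched as follows. This is a minimal illustration, not the actual diskover implementation: `index_lookup` is a hypothetical stand-in for the Elasticsearch term query on file.url, and `file_props` mirrors the three properties listed above (created, last_modified, filesize).

```python
import os

def file_props(path):
    """Collect the properties the feature would compare against the index."""
    st = os.stat(path)
    return {
        "created": int(st.st_ctime),
        "last_modified": int(st.st_mtime),
        "filesize": st.st_size,
    }

def rescan_decision(path, index_lookup):
    """Return 'index', 'reindex', or 'skip' for one file.

    index_lookup(path) stands in for an Elasticsearch term query on
    file.url; it returns the stored document for that file, or None
    if the file is not in the index. (Names here are illustrative.)
    """
    doc = index_lookup(path)
    if doc is None:
        return "index"      # not in the index yet -> index it
    current = file_props(path)
    if all(doc.get(k) == current[k] for k in current):
        return "skip"       # created, last_modified, filesize all match
    return "reindex"        # at least one property differs
```

Because the decision depends only on the stored document versus the current stat, moved or renamed files (whose creation dates didn't change) would still be caught: their file.url would no longer be found, so they would be indexed rather than skipped.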