How is _status.json supposed to relate to what work has already been done? #828

budachst · 2019-10-20T15:30:44Z

I am still trying to find a way to have fscrawler index multiple folders at once while writing to the same index on my ES. If anything goes south in any of those jobs, where fscrawler might have crashed or some other error occurred, no _status.json gets written, and that's fine, but…

{ "name" : "352226", "lastrun" : "2019-10-14T16:36:06.748", "indexed" : 8037, "deleted" : 0 }

This one above doesn't really reveal its relation to what has been indexed and what has not, or does it? So my question is: what is the relation, and how does fscrawler determine which files still have to be indexed?

In any case, fscrawler would have to query ES to find out whether the file it is about to index has already been indexed, but I just can't wrap my head around what the lastrun value would have to do with that. Maybe it's there to constrain the search in the index, but that seems to be of only limited benefit.

However, if the _status.json is missing, it seems that ES doesn't even get checked for already indexed files, and all files are indexed again… This is what --restart is supposed to do, I guess, but this behaviour makes it impossible to follow up an unsuccessful run with another run that only indexes the files that didn't make it in the previous one. This is very cumbersome on a 14TB volume…

Hope this makes sense… ;)


dadoonet · 2019-11-26T16:01:34Z

this one above doesn't really reveal its relation to what has been indexed and what not, or does it?

True. It was meant for statistics and debugging purposes. Sadly, it's not really a pointer to the last file scanned, from which we could restart.
It's only generated if the run succeeds. It then indicates the timestamp to compare with file dates on the next run.
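
The timestamp comparison described here can be sketched roughly as follows. This is only an illustration in Python, not FSCrawler's actual (Java) implementation: the function names `read_last_run` and `files_to_index` are hypothetical, and treating `lastrun` as a naive local timestamp is an assumption. It also shows why a missing _status.json leads to everything being indexed again, since there is then no timestamp to compare against.

```python
import json
import os
from datetime import datetime
from pathlib import Path
from typing import Iterator, Optional


def read_last_run(status_file: Path) -> Optional[datetime]:
    """Return the `lastrun` timestamp, or None when no status file exists."""
    if not status_file.exists():
        return None
    status = json.loads(status_file.read_text(encoding="utf-8"))
    # e.g. "2019-10-14T16:36:06.748" (assumed to be local time, no zone)
    return datetime.fromisoformat(status["lastrun"])


def files_to_index(root: Path, last_run: Optional[datetime]) -> Iterator[Path]:
    """Yield files whose modification time is newer than `last_run`.

    With no previous successful run (last_run is None), every file is yielded.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            modified = datetime.fromtimestamp(path.stat().st_mtime)
            if last_run is None or modified > last_run:
                yield path
```

Note that in this sketch nothing records which individual files made it into the index, so a run that fails partway leaves no trace of its progress.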

We should definitely do better. I hope that at some point I'll change the implementation with #399, and that would maybe provide more ways to restart from the last file...

ian-cameron mentioned this issue Feb 12, 2021

Rescanning the job, without restarting the job #1098

Open
