How is _status.json supposed to relate to what work has already been done? #828

Open
budachst opened this issue Oct 20, 2019 · 1 comment

Comments

@budachst

I am still fiddling around to find a way to have fscrawler index multiple folders at once while writing to the same index on my ES. If anything goes south in any of those jobs, where fscrawler might have crashed or some other error occurred, no _status.json gets written, and that's fine, but…

{ "name" : "352226", "lastrun" : "2019-10-14T16:36:06.748", "indexed" : 8037, "deleted" : 0 }

this one above doesn't really reveal its relation to what has been indexed and what has not, or does it? So my question is: what is the relation between the two, and how does fscrawler determine which other files still have to be indexed?

In any case, fscrawler would have to query ES to find out whether the file it is about to index has already been indexed, but I just can't wrap my head around what the lastrun value would have to do with that. Maybe it's there to constrain the search in the index, but this seems to have only a limited benefit.

However, if the _status.json is missing, it seems that ES doesn't even get checked for already indexed files and all files are indexed again… this is what --restart is supposed to do, I guess, but this behaviour makes it impossible to follow an unsuccessful run with another run that indexes only the files that didn't make it the first time. This is very cumbersome on a volume of some 14TB…

Hope this makes sense… ;)

@dadoonet
Owner

this one above doesn't really reveal its relation to what has been indexed and what has not, or does it?

True. It was meant for statistics and debugging purposes. Sadly, it's not really a pointer to the last file scanned from which we could restart.
It's only generated if the run succeeds. It then indicates which timestamp to compare with file dates on the next run.
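
To make the timestamp comparison concrete, here is a rough sketch in Python. It is not fscrawler's actual code (fscrawler is written in Java), and the helper names `load_last_run` and `files_to_index` are made up for illustration. It assumes `lastrun` is a local time without a timezone and that "file dates" means the modification time.

```python
import json
from datetime import datetime
from pathlib import Path
from typing import List, Optional


def load_last_run(status_file: Path) -> Optional[datetime]:
    """Return the lastrun timestamp, or None if no status file exists."""
    if not status_file.exists():
        return None  # no successful previous run was recorded
    status = json.loads(status_file.read_text(encoding="utf-8"))
    # Assumption: lastrun is a naive local timestamp in ISO 8601 format.
    return datetime.fromisoformat(status["lastrun"])


def files_to_index(root: Path, last_run: Optional[datetime]) -> List[Path]:
    """Select files modified after last_run; select all files if it is None."""
    selected = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        # Assumption: the modification time is the file date being compared.
        modified = datetime.fromtimestamp(path.stat().st_mtime)
        if last_run is None or modified > last_run:
            selected.append(path)
    return selected
```

With this logic, a missing _status.json means there is no timestamp to compare against, so every file is selected again. That would match the behaviour described in the question, and it also shows why the timestamp alone cannot tell which files of a failed run already made it into the index.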

We should definitely do better. I hope that at some point I'll change the implementation with #399, and that would maybe provide more ways to restart from the last file...
