OTA engine robustness #1174
Comments
I'm having some trouble triggering the recorder: whatever I do, even for a service that was not previously tracked, it says 'Recorded 0 new versions' and 'Recorded 0 new snapshots and 0 new versions'. Investigating... |
Ah it was because I was rebased on the |
Added mimeType to getLatest interface. Now should get it from https://github.com/OpenTermsArchive/engine/blob/5963f7272ba0429eda75e6146dd18ce4e14ed274/src/archivist/services/sourceDocument.js#L9 |
Before starting the crawl, there is a search:
It makes sense that at this point only the location and the filters are known and the content and mimeType are not. But we could add the mimeType into the declaration, and we can even let it default to
This does change the behaviour of the crawler a bit because it means a crawl would reasonably have to fail if a pdf is encountered where this was not expected. So that's a downside of option 3.
4th option, similar to the 2nd one: we could do an
And I'm now filling in option 3 ('pass the mimeType as an argument') as getting it from the declaration. But we could also imagine prohibiting
So to recap the options (adding a 6 - 8 here mainly for completeness, these are probably not better than 4 or 1):
|
Option 1 cannot be done in a single So options 1, 8 and 2 are not really attractive. Option 3 seems doable, but I'm now thinking maybe option 5 is also not that bad. |
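To make option 3 concrete, here is a minimal sketch of taking the expected mimeType from the declaration and failing when the fetched document does not match. It is not the engine's code: `checkMimeType` and `UnexpectedMimeTypeError` are hypothetical names, the declaration shape (`location`, `mimeType`) is assumed from the discussion above, and the sketch requires the field to be present rather than assuming any default.

```javascript
// Hypothetical error type, used only in this sketch.
class UnexpectedMimeTypeError extends Error {
  constructor(location, expected, actual) {
    super(`Expected ${expected} at ${location} but got ${actual}`);
    this.name = 'UnexpectedMimeTypeError';
    this.expected = expected;
    this.actual = actual;
  }
}

// Drop parameters such as "; charset=utf-8" and ignore case differences.
function normalizeMimeType(value) {
  return value.split(';')[0].trim().toLowerCase();
}

// Hypothetical check: compare the mimeType reported by the fetcher with the
// one stated in the declaration, and fail the crawl of this document on mismatch.
function checkMimeType(declaration, fetchedMimeType) {
  if (typeof declaration.mimeType !== 'string') {
    throw new TypeError('This sketch assumes the declaration states a mimeType');
  }
  const expected = normalizeMimeType(declaration.mimeType);
  const actual = normalizeMimeType(fetchedMimeType);
  if (expected !== actual) {
    throw new UnexpectedMimeTypeError(declaration.location, expected, actual);
  }
  return actual;
}

// Usage with an illustrative declaration.
const declaration = { location: 'https://example.com/terms', mimeType: 'text/html' };
const accepted = checkMimeType(declaration, 'text/html; charset=utf-8');
console.log(accepted);

try {
  checkMimeType(declaration, 'application/pdf');
} catch (error) {
  console.log(error.message);
}
```

The trade-off is the one noted above: a document that unexpectedly turns out to be a PDF makes the crawl of that document fail instead of being recorded under a different type.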
OK, I think I found a solution, in two parts:
|
In fact, we could also make an even bigger step and make the whole snapshot recording optional. |
And probably if we only have a versions repo then the whole wildcards problem goes away anyway! :) |
Even if you do write the snapshots then you could still skip the read-back |
I'll work on OpenTermsArchive/engine#1103 first and then see if OpenTermsArchive/engine#1101 is still needed. |
Option 9: use the filename returned in the call to |
I'm now running with OpenTermsArchive/engine#1103 and crawling 12,000 documents still fails. The logs show that it gets from A to S in 3 hours and then just stops. This is my cronjob script, maybe I should add a
|
I'll chop the task into 10 smaller ones. We could also split it into 100 tasks, but that would incur the overhead of starting and stopping the headless browser 100 times, and it would multiply by 100 the risk that something goes wrong with the git push. That's why I think 10 is a better balance. |
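As an illustration of chopping the crawl into 10 tasks, the sketch below splits a list of service IDs into contiguous, near-equal batches. `splitIntoBatches` and the `service-N` IDs are hypothetical; how each batch is then handed to the crawler is not shown and would depend on the cronjob setup.

```javascript
// Hypothetical helper: split a list into `batchCount` contiguous batches
// whose sizes differ by at most one item.
function splitIntoBatches(items, batchCount) {
  if (!Number.isInteger(batchCount) || batchCount < 1) {
    throw new RangeError('batchCount must be a positive integer');
  }
  const batches = [];
  const baseSize = Math.floor(items.length / batchCount);
  const remainder = items.length % batchCount;
  let start = 0;
  for (let i = 0; i < batchCount; i++) {
    // The first `remainder` batches take one extra item.
    const size = baseSize + (i < remainder ? 1 : 0);
    batches.push(items.slice(start, start + size));
    start += size;
  }
  return batches;
}

// Usage with made-up service IDs.
const serviceIds = Array.from({ length: 25 }, (_, i) => `service-${i}`);
const batches = splitIntoBatches(serviceIds, 10);
console.log(batches.map((batch) => batch.length));
```

Keeping the batches contiguous preserves the alphabetical order of the crawl, so a failed batch can be identified and retried on its own.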
I'll consider this resolved now. |
Initial tests with #1164 show a number of problems:
To reduce the effect of a git repo getting into the wrong state, it would help to shard the snapshots and versions repos -> #1172
Apart from sharding the repos, I also want to rethink the scheduling through a cronjob -> #1173
I think these two measures combined will make the crawler a lot more stable, and also allow us to run it on the same server as the rest of ToS;DR, so it will not only be more reliable, but also cheaper for us to host.
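The issue does not say how the snapshots and versions repos would be sharded. One possible scheme, sketched here purely as an assumption, is to map each service ID to a shard with a stable hash, so that a repo in a bad state only affects the services assigned to it. `shardFor` and the shard count of 10 are hypothetical.

```javascript
// FNV-1a-style 32-bit hash computed over the string's UTF-16 code units.
// It is deterministic across runs, which is what shard assignment needs.
function hash32(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Hypothetical mapping from a service ID to a shard index in [0, shardCount).
function shardFor(serviceId, shardCount) {
  if (!Number.isInteger(shardCount) || shardCount < 1) {
    throw new RangeError('shardCount must be a positive integer');
  }
  return hash32(serviceId) % shardCount;
}

// Usage: group made-up service IDs by shard.
const shardCount = 10;
const services = ['Example Service A', 'Example Service B', 'Example Service C'];
const byShard = new Map();
for (const id of services) {
  const shard = shardFor(id, shardCount);
  if (!byShard.has(shard)) byShard.set(shard, []);
  byShard.get(shard).push(id);
}
console.log(byShard);
```

A hash-based assignment spreads services evenly regardless of their names; sharding by first letter would be simpler to inspect by hand but would produce shards of very different sizes.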