Scraping optimization #3

Open
stucka opened this issue Jan 13, 2018 · 0 comments

stucka commented Jan 13, 2018

This script wastes a lot of time scraping pages just to check whether we already have the files.

The initial page doesn't give us all the information we need to build our filenames, which include a file extension.

But it should give us enough to build most of each filename. If we parse filenames less the file extension, we can compare them and look for conflicts. If there are conflicts, we need to traverse all the subpages. Otherwise, we only need to hit the subpages for which we do not have matching files.

In most cases, this would let us check for new documents with a single hit to the site, which makes the script better suited to ongoing maintenance than to a one-off run.
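
A rough sketch of the decision logic, for discussion rather than as the final implementation. It assumes a "conflict" means two entries on the initial page produce the same partial filename, so we can't tell which local file belongs to which entry. The names `local_stems` and `subpages_to_fetch`, and the `(stem, url)` pair layout, are hypothetical and not taken from the existing script.

```python
from collections import Counter
from pathlib import Path


def local_stems(directory):
    """Return the names of files in `directory` with the final extension removed.

    Note: Path.stem only strips the last suffix, so "a.tar.gz" becomes "a.tar".
    """
    return {p.stem for p in Path(directory).iterdir() if p.is_file()}


def subpages_to_fetch(index_entries, existing_stems):
    """Decide which subpages still need a request.

    index_entries: list of (stem, subpage_url) pairs built from the initial page,
        where stem is the filename we can predict, minus the extension.
    existing_stems: set of extension-less filenames we already have.
    """
    counts = Counter(stem for stem, _ in index_entries)
    if any(n > 1 for n in counts.values()):
        # Assumed meaning of "conflict": duplicate partial names on the
        # initial page. Fall back to traversing every subpage.
        return [url for _, url in index_entries]
    # No conflicts: only visit subpages with no matching local file.
    return [url for stem, url in index_entries if stem not in existing_stems]
```

With no conflicts and every stem already on disk, `subpages_to_fetch` returns an empty list, so the run costs only the one request for the initial page.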

stucka self-assigned this Jan 13, 2018