This repository has been archived by the owner on Sep 10, 2020. It is now read-only.

Implement our own page load timeout function for the crawler #87

Open
psivesely opened this issue Oct 29, 2016 · 1 comment

Comments

@psivesely
Contributor

Selenium's page load timeout function is highly unreliable. If it hasn't closed down a connection within 5s of when it was supposed to, we should stop the crawl by whatever means necessary (closing all circuits will probably be sufficient, but we already have a method for restarting Tor Browser if we need it). This will stop the crawler from wasting time stuck on sites that load for minutes at a time. See fpsd/tests/test_sketchy_sites.py for some good example sites and a good test case for this timeout function. A sketch of the idea follows below.
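
A minimal sketch of what this could look like, assuming a Python crawler driving Tor Browser through Selenium and a stem Controller reachable on the default control port; get_with_hard_timeout, the grace parameter, and the circuit-closing recovery step are illustrative, not existing fpsd code:

import threading

from selenium.common.exceptions import WebDriverException
from stem.control import Controller


def get_with_hard_timeout(driver, url, timeout, grace=5):
    """Load url, enforcing our own wall-clock timeout on top of
    Selenium's. timeout should mirror the value passed to
    driver.set_page_load_timeout(); grace is the extra 5s we give
    Selenium before concluding its own timeout has failed to fire.
    Returns True if Selenium returned in time, False if we intervened.
    """
    done = threading.Event()

    def _load():
        try:
            driver.get(url)
        except WebDriverException:
            # Selenium's own timeout (or any other load error) fired;
            # either way the call returned, so the watchdog stands down.
            pass
        finally:
            done.set()

    threading.Thread(target=_load, daemon=True).start()
    if done.wait(timeout + grace):
        return True
    # Selenium never returned: close every circuit, which should kill
    # the hung connection out from under it.
    with Controller.from_port() as controller:
        controller.authenticate()
        for circuit in controller.get_circuits():
            controller.close_circuit(circuit.id)
    return False

If closing all circuits doesn't unstick the load, the caller can fall back on the existing Tor Browser restart method.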

@psivesely
Contributor Author

I'm not sure when the features were last computed (I can't SSH into the VPSs right now for some reason, probably iptables), but it seems like this is definitely needed:

fpsd=> select * from features.cell_timings order by total_elapsed_time desc limit 3;
 exampleid | total_elapsed_time
-----------+--------------------
      1106 |         930.656736
      7567 |         449.786331
      1387 |         441.871928
(3 rows)
