This repository has been archived by the owner on Sep 10, 2020. It is now read-only.

Implement our own page load timeout function for the crawler #87

Open
psivesely opened this issue Oct 29, 2016 · 1 comment

Comments

@psivesely
Contributor

Selenium's page load timeout function is highly unreliable. If it hasn't closed down a connection within 5s of when it was supposed to, we should stop the crawl by whatever means necessary (closing all circuits will probably be sufficient, but we already have a method for restarting Tor Browser if we need it). This will stop the crawler from wasting time stuck on sites that load for minutes at a time. See fpsd/tests/test_sketchy_sites.py for some good example sites and a good test case for this timeout function. A sketch of the idea follows below.
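
A minimal sketch of what this could look like, assuming a Python crawler driving Tor Browser through Selenium and a stem Controller reachable on the default control port; get_with_hard_timeout, the grace parameter, and the circuit-closing recovery step are illustrative, not existing fpsd code:

import threading

from selenium.common.exceptions import WebDriverException
from stem.control import Controller


def get_with_hard_timeout(driver, url, timeout, grace=5):
    """Load url, enforcing our own wall-clock timeout on top of
    Selenium's. timeout should mirror the value passed to
    driver.set_page_load_timeout(); grace is the extra 5s we give
    Selenium before concluding its own timeout has failed to fire.
    Returns True if Selenium returned in time, False if we intervened.
    """
    done = threading.Event()

    def _load():
        try:
            driver.get(url)
        except WebDriverException:
            # Selenium's own timeout (or any other load error) fired;
            # either way the call returned, so the watchdog stands down.
            pass
        finally:
            done.set()

    threading.Thread(target=_load, daemon=True).start()
    if done.wait(timeout + grace):
        return True
    # Selenium never returned: close every circuit, which should kill
    # the hung connection out from under it.
    with Controller.from_port() as controller:
        controller.authenticate()
        for circuit in controller.get_circuits():
            controller.close_circuit(circuit.id)
    return False

If closing all circuits doesn't unstick the load, the caller can fall back on the existing Tor Browser restart method.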

@psivesely
Contributor Author

I'm not sure when the features were last computed (I can't SSH into the VPSs right now for some reason, probably iptables), but it seems like this is definitely needed:

fpsd=> select * from features.cell_timings order by total_elapsed_time desc limit 3;
 exampleid | total_elapsed_time
-----------+--------------------
      1106 |         930.656736
      7567 |         449.786331
      1387 |         441.871928
(3 rows)
