-
Notifications
You must be signed in to change notification settings - Fork 1
/
log.txt
83 lines (73 loc) · 5 KB
/
log.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
24/08/2015
Copied top 20 news websites from alexa.com
Searched for "*" on:
- http://query.nytimes.com/search/sitesearch/#/*/ - 15 520 441 results
- https://www.washingtonpost.com/newssearch/search.html?query=* - 12 447 results (all since 2005)
- http://timesofindia.indiatimes.com/topic/* - 0 results
- http://www.usatoday.com/search/*/ - 437 412 results
- http://economictimes.indiatimes.com/topic/* - 0 results
- http://www.latimes.com/search/dispatcher.front?Query=*&target=all - 0 results
- http://www.chron.com/search/?action=search&firstRequest=1&searchindex=gsa&query=* - 0 results
- http://indianexpress.com/?s=* - 0 results
- http://nypost.com/?s=* - 0 results
- http://www.thehindu.com/search/simple.do - 3 525 637 results
- http://www.sfgate.com/search/?action=search&firstRequest=1&searchindex=gsa&query=* - 0 results
- http://www.hindustantimes.com/Search/search.aspx?q=*&op=All&pt=all&auth=all - 0 results
- http://www.smh.com.au/execute_search.html?text=*&ss=smh.com.au%2Fexecute_search.html - 0 results
- not English
- not English
- http://www.hollywoodreporter.com/search/* - 633 results
- http://www.chicagotribune.com/search/dispatcher.front?Query=*&target=all&spell=on - 0 results
- not English
- http://www.examiner.com/search/google?query=*&cx=partner-pub-7479725245717969%3A9ze01gmnpyp&cof=FORID%3A9&ie=ISO-8859-1&sa=Search - not only articles
- not English
Searched for "the" on:
- http://query.nytimes.com/search/sitesearch/#/the/ - 0 results
- https://www.washingtonpost.com/newssearch/search.html?query=the - 1 098 011 results (all since 2005)
- http://timesofindia.indiatimes.com/topic/the - 0 results
- http://www.usatoday.com/search/the/ - 387 050 results
- http://economictimes.indiatimes.com/topic/the - 0 results
- http://www.latimes.com/search/dispatcher.front?Query=the&target=all - 29 880 results
- http://www.chron.com/search/?action=search&firstRequest=1&searchindex=gsa&query=the - 1 550 000 results
- http://indianexpress.com/?s=the - 0 results
- http://nypost.com/?s=the - 0 results
- http://www.thehindu.com/search/simple.do - 0 results
- http://www.sfgate.com/search/?action=search&firstRequest=1&searchindex=gsa&query=the - 1 300 000 results
- http://www.hindustantimes.com/Search/search.aspx?q=the&op=All&pt=all&auth=all - 0 results
- http://www.smh.com.au/execute_search.html?text=the&ss=smh.com.au%2Fexecute_search.html - 1 150 585 results
- not English
- not English
- http://www.hollywoodreporter.com/search/the - 0 results
- http://www.chicagotribune.com/search/dispatcher.front?Query=the&target=all&spell=on - 24 760 results
- not English
- http://www.examiner.com/search/google?query=the&cx=partner-pub-7479725245717969%3A9ze01gmnpyp&cof=FORID%3A9&ie=ISO-8859-1&sa=Search - not only articles
- not English
Summary:
"*": 5 websites with results, average of 3 899 314, total of 19 496 570
"the": 7 websites with results, average of 791 469, total of 5 540 286
25/08/2015
Tried to discover if http://query.nytimes.com/search/sitesearch/#/*/ uses ajax
to load search results to make scraping easier. To no avail.
Search results are contained like so:
<div id="searchResults">
<ol class="searchResultsList flush">
<li class="story"></li> // if class="... paidpost ..." > advertisement
...
<li></li>
</ol>
</div>
About the dates Adriatik used for WP, Laura says:
"(I think) on the 13th of August, so 13/08/2014 until 13/08/2015."
Started writing html results scraper.
Found nytimes ajax url
Finished writing json scraping script.
28/08/2015
The script seemed broken, but after careful inspection it turned out to be my
internet connection.
05/09/2015
400 error at 1010 urls was because of max page of 100 at nyt. Worked around this
by writing separate queries for all dates. After this 500 errors occurred, changed
script to skip over them too.
06/09/2015
Scrape result: 92165 lines in output file, 27 errors, lost 1022, 92138 urls succesfully.
Wrote catch for network timeout.