forked from NikolaiT/GoogleScraper
-
Notifications
You must be signed in to change notification settings - Fork 1
/
TODO
94 lines (71 loc) · 2.84 KB
/
TODO
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
- edit caching with decorator pattern
- add all google search params to config
- write functional tests
- add sqlalchemy support for results
- add better proxy handling
- extend parsing functionality
- update readme
- prevent parsing config two times
04.11.2014:
- Refactor code, change docstrings to google format
https://google-styleguide.googlecode.com/svn/trunk/pyguide.html#Commentss [done]
15.11.2014:
- add shell access with sqlalchemy session [done]
- test selenium mode thoroughly [done]
- double check selectors
- add alternative selectors
- Add gevent support
- make all modes workable through proxies [done for http and sel]
- update README [done]
- write blog post that illustrates usage of GoogleScraper
- some testing
- release version 0.2.0 on the cheeseshop
- released version 0.1.5 on pypy [done]
11.12.2014
- JSON output is still slightly corrupt
- CSV output probably also not ideal.
- Improve documentation after Google style guide
- Maybe add other search engines!
- finally implement async mode!!!
30.12.2014:
- Fixed issue #45 [done]
02.01.2015:
- Check output verbosity levels and modify them. [done]
13.01.2015:
- Handle sigint. Then close all open files (csv, json).
15.01.2015:
- Implement JSON static tests [done]
- Implement CSV static tests [done]
- Catch Resource warnings in testing [done]
- Add no_results_selectors for all SE [done]
- add test for no_results_selectors parsing [done]
- Add page number selectors for all SE [done]
- add static tests [done]
- add fabfile (google a basic template) for []
- adding & committing and uploading to master []
- push to the cheeseshop []
- add function in fabfile that pushes to cheeseshop only after all tests were successful []
- Add functionality that distinguishes the page number of serp pages when caching []
- implement async mode [done]
- reade 20 minutes about asyncio built in moduel and decide whether if feets my needs [done]
18.01.2015
- add four different examples:
- a basic usage [done]
- using selenium mode [done]
- using http mode [done]
- using async mode [done]
- scraping with a keywords.py module
- scraping images [done]
- finding plagiarized content [done]
- Add dynamic tests for selenium mode:
- Add event: No results for this query.
- Test Impossible query:
-> Cannot have next_results page
-> No results [done]
-> But still save serp page. [done]
-> add to missed keywords []
- What is the best way to detect that the page loaded????
-> Research, read about selenium
- Add test for duckduckgo
- Fix: If there's no internet connection, Malicious request detected is show. Show no internet connection instead.
- FIGURE OUT: WHY THE HELLO DOES DUCKDUCKGO NOT WORK IN PHANTOMJS?