Photon Library
Photon is available as a library for both Python 2 & Python 3.
To install Photon as a library, run:
pip install photon --user
import photon
result = photon.crawl('http://example.com')
The `crawl` function returns a dict by default, but you can pass the `format='json'` argument to get JSON output instead. This applies to both the `crawl` and `results` functions. A sample JSON output can be found here.
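To illustrate the difference between the two output forms, here is a minimal sketch using only the standard `json` module. The `result` dict and its keys are hypothetical placeholders, not Photon's actual output, and the sketch assumes that JSON output is a string that parses back into the same dict.

```python
import json

# Hypothetical stand-in for a dict returned by photon.crawl();
# the real keys depend on what Photon finds on the target site.
result = {
    'internal': ['http://example.com/about'],
    'external': ['http://example.org/'],
}

# Serialize the dict to a JSON string, e.g. to write it to a file.
as_json = json.dumps(result, indent=4, sort_keys=True)

# Parsing the JSON string gives back an equivalent dict.
round_trip = json.loads(as_json)
```

The dict form is convenient when the results are processed in the same program, while the JSON form is easier to store or hand to another tool.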
To make crawling as flexible as possible, the following optional arguments are available:
Argument | Type | Default |
---|---|---|
level | int | 2 |
threads | int | 2 |
timeout | float | 6 |
delay | float | 0 |
regex | str | None |
exclude | str | None |
seeds | list | None |
user_agent | list | random |
cookies | dict | None |
keys | boolean | False |
only_urls | boolean | False |
Please go through the Photon wiki for a detailed explanation of each option.
The results remain stored after a crawling session, and you can view them at any time as follows:
import photon
photon.crawl('http://example.com')
print(photon.results())
Why is there a separate function for it?
It is useful in asynchronous programming: you can view the results while crawling is still in progress.
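One way to read results during a crawl is to run the crawl in a background thread and poll from the main thread. The sketch below is only an illustration: `poll_while_crawling` is a hypothetical helper, not part of Photon, and it takes the crawl and results functions as arguments so that it does not assume anything about Photon's internals beyond the two calls shown above.

```python
import threading


def poll_while_crawling(crawl, results, url, interval=0.5):
    """Run crawl(url) in a background thread and snapshot results() until it ends.

    `crawl` and `results` are passed in by the caller, for example
    photon.crawl and photon.results (assumed safe to call concurrently).
    """
    worker = threading.Thread(target=crawl, args=(url,))
    worker.start()

    snapshots = []
    while worker.is_alive():
        snapshots.append(results())  # partial results so far
        worker.join(interval)        # wait up to `interval` seconds

    snapshots.append(results())      # final results after the crawl ends
    return snapshots
```

Under these assumptions it would be called as `poll_while_crawling(photon.crawl, photon.results, 'http://example.com')`; the last element of the returned list holds the complete results.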
If you are crawling different websites, you can easily clear the previous results by calling the `clear()` function as follows:
import photon
websites = ['https://google.com', 'https://github.com']
for website in websites:
    print(photon.crawl(website))
    photon.clear()
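If the per-site results need to be kept rather than printed, the same loop can collect them before clearing. This is a sketch under assumptions: `crawl_each` is a hypothetical helper, not a Photon function, and the `deepcopy` is a precaution in case the returned object is the same one that `clear()` empties, which the documentation above does not specify.

```python
import copy


def crawl_each(crawl, clear, websites, **options):
    """Crawl each website in turn and return a dict mapping URL -> results.

    `crawl` and `clear` are supplied by the caller, for example
    photon.crawl and photon.clear. Extra keyword arguments are
    forwarded unchanged to `crawl`.
    """
    collected = {}
    for website in websites:
        # Copy before clearing, in case crawl() returns a reference
        # to the internal store (an assumption, not confirmed).
        collected[website] = copy.deepcopy(crawl(website, **options))
        clear()
    return collected
```

A call might look like `crawl_each(photon.crawl, photon.clear, websites, level=3)`.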
The optional arguments can be combined in a single call:
import photon
result = photon.crawl('http://example.com', level=3, threads=10, keys=True, exclude='/blog/20(18|17)')
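When writing an `exclude` pattern, keep in mind that in a regular expression square brackets define a character class (any one of the listed characters), while parentheses with `|` express alternatives. The sketch below compares the two forms using the standard `re` module. The URLs are made up, and the use of `search` is an assumption, since how Photon applies the pattern internally is not documented here.

```python
import re

# Made-up URLs for illustration only.
urls = [
    'http://example.com/blog/2017/post',
    'http://example.com/blog/2018/post',
    'http://example.com/blog/2015/post',
    'http://example.com/about',
]

# [18|17] matches ONE character out of '1', '8', '|', '7',
# so '/blog/201' already matches and every 201x year is excluded.
char_class = re.compile(r'/blog/20[18|17]')

# (18|17) matches the two-character alternatives '18' or '17'.
group = re.compile(r'/blog/20(18|17)')


def kept(pattern, candidates):
    """Return the URLs that do NOT match the exclusion pattern."""
    return [url for url in candidates if not pattern.search(url)]


kept_by_class = kept(char_class, urls)  # only the /about URL remains
kept_by_group = kept(group, urls)       # the 2015 post and /about remain
```

The grouped form excludes only the 2017 and 2018 posts, whereas the character-class form also drops the 2015 post.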