
Error 403 - Forbidden for url: https://www.craigslist.org/about/sites #105

Open
luisandrecunha opened this issue Jan 5, 2021 · 22 comments

Comments

@luisandrecunha

Hi Julio,

I have used your code before (early 2020), but now I'm getting the error below when trying to import CraigslistHousing, using "from craigslist import CraigslistHousing":

HTTPError: 403 Client Error: Forbidden for url: https://www.craigslist.org/about/sites
[screenshot of the full traceback]

Not sure why; it seems it could be related to this issue: https://stackoverflow.com/questions/16627227/http-error-403-in-python-3-web-scraping.

Do you happen to know why this is happening?

Thanks,

@irahorecka
Contributor

irahorecka commented Jan 5, 2021

Seems like this works on my end. Did you upgrade python-craigslist to the latest version? I have a feeling this issue might be independent of the package version, but it doesn't hurt to try.

@luisandrecunha
Author

Yep, I did the upgrade and still have the same issue. I'm on v1.1.0 with Python 3.6, using Google Colab notebooks.

@irahorecka
Contributor

Ah, this looks to be a problem with the requests library in your environment, not python-craigslist, per se.
I'm guessing the same exception would be thrown if you executed this:

import requests
requests.get("https://www.craigslist.org/about/sites")

@luisandrecunha
Author

You are completely right; I also tried in a new Colab notebook and got "<Response [403]>".

If I run the code below, I get a successful response and the page source. I believe it's related to the web-scraping issue from the Stack Overflow question above.

from urllib.request import Request, urlopen
req = Request('https://www.craigslist.org/about/sites', headers={'User-Agent': 'XYZ/3.0'})
webpage = urlopen(req, timeout=10).read()

print(webpage)

@juliomalegria
Owner

Thanks for reporting @luisandrecunha.

Interesting. Seems like Craigslist is blocking requests coming from your IP (or Google's Colab IPs). I'm guessing the IP hit a max number of requests per day/hour/minute.

Do you mind running the code suggested by @irahorecka but setting a User-Agent like you did with urllib:

import requests
requests.get("https://www.craigslist.org/about/sites", headers={'User-Agent': 'python-craigslist/1.1.0'})

If this works fine, I'll add a default User-Agent to all requests to prevent this from happening in the future.
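
For context, here's a minimal sketch of what such a default User-Agent wrapper could look like; the header value and function name are just illustrative, not necessarily what will end up in utils.py:

import requests

DEFAULT_HEADERS = {'User-Agent': 'python-craigslist/1.1.0'}  # illustrative UA string

def requests_get(*args, **kwargs):
    # behaves like requests.get, but always sends a User-Agent unless the caller overrides it
    headers = kwargs.pop('headers', {})
    kwargs['headers'] = {**DEFAULT_HEADERS, **headers}
    return requests.get(*args, **kwargs)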

Thanks!

@luisandrecunha
Author

Hi @juliomalegria ,

It seems that Google's Colab IPs are blocked by Craigslist... I successfully ran the code in a local Jupyter notebook and it worked like a charm.

I tried the code you suggested in Colab and continued to get the 403 response... However, I do receive the correct page if I use the code below; not sure if the library could somehow be adapted.

from urllib.request import Request, urlopen
req = Request('https://www.craigslist.org/about/sites', headers={'User-Agent': 'XYZ/3.0'})
webpage = urlopen(req, timeout=10).read()

print(webpage)
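
In case it helps, here's a rough sketch of how that urllib call could be wrapped into a reusable helper; the function name and User-Agent string are purely illustrative, not anything from python-craigslist itself:

from urllib.request import Request, urlopen

def fetch_page(url, user_agent='XYZ/3.0', timeout=10):
    # fetch a URL with urllib, sending an explicit User-Agent header
    req = Request(url, headers={'User-Agent': user_agent})
    with urlopen(req, timeout=timeout) as resp:
        # decode the raw bytes so the caller gets text, similar to requests' .text
        return resp.read().decode('utf-8', errors='replace')

print(fetch_page('https://www.craigslist.org/about/sites')[:200])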

Thank you again,

@jraVette

jraVette commented Feb 18, 2021

Just a heads up, I've got the exact same issue. I've been running my code for more than a year and this only started happening this week, so something must have changed on the Craigslist side? I'll have to dig into the code. I can cut and paste the URL into a browser and it works fine. Just wanted to let you know there's another user with the same issue.

>>> import requests
>>> requests.get('https://boston.craigslist.org')
<Response [200]>
>>> requests.get('https://boston.craigslist.org/search')
<Response [403]>
>>> requests.get('https://boston.craigslist.org/search',headers={'User-Agent': 'XYZ/3.0'})
<Response [403]>

I tried it on a couple of computers, so I don't think it's IP-related. My guess is it comes down to how the servers see the requests library versus a regular browser.
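
For what it's worth, the snippet below shows what requests identifies itself as by default (the exact version in the User-Agent string will differ per install):

import requests

# requests advertises itself as 'python-requests/<version>' unless a User-Agent is supplied
print(requests.utils.default_headers())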

Thanks!

@juliomalegria
Owner

Hey everyone! Sorry for the inactivity. I've released a new version (1.1.1) that adds a User-Agent to requests.get. Hopefully that solves the issue; please report back whether it does or doesn't. If it doesn't, I'll have to switch from requests to urllib.
Thanks!

@cwittwer

I am still getting the 403 error with the updated utils.py.

@KeeonTabrizi

KeeonTabrizi commented Feb 21, 2021

+1, same behavior here: 403s on /search paths from a plain requests.get() call, so the library/class is not functioning either.

Also note I tried taking the headers from a cURL request to /search (which loads fine in a regular browser) and using them in the requests call, but that was blocked too.

I used a Selenium driver I had with some mods I've used in the past and was able to load /search just fine, so I don't suspect they are doing anything super sophisticated to block the request.
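
Roughly the kind of Selenium check I mean, minus the mods; this is only a sketch and assumes chromedriver is installed and on PATH:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
driver.get('https://boston.craigslist.org/search')
print(len(driver.page_source))  # a non-trivial length suggests the page actually loaded
driver.quit()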

@KeeonTabrizi

Okay, I've dug into it a bit more. I don't think this has anything to do with user agents or that kind of blocking. I recommend upgrading both the requests and urllib3 libraries: pip install urllib3 --upgrade and pip install requests --upgrade. Once I did that, things started working again. I'm not sure what the actual issue is, since older versions of those libraries were working before, but with the updates it looks fine to me.

After that I verified that utils.requests_get (which is effectively requests.get()) works:

>>> import requests
>>> import urllib3
>>> from craigslist import utils
>>> requests.__version__
'2.25.1'
>>> urllib3.__version__
'1.26.3'
>>> utils.requests_get('https://boston.craigslist.org/search')
<Response [200]>

@juliomalegria
Owner

Thanks @KeeonTabrizi! That's a very good point.
I've updated the requirements to include minimum versions for the dependencies (requests and beautifulsoup4).
Can anyone having issues try upgrading the library (pip install python-craigslist --upgrade) and let me know if this fixes the issue?
Thanks again!
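
For anyone curious, the change amounts to pinning minimum versions in setup.py; the sketch below is illustrative only, the version numbers are assumptions, check the repo for the actual pins:

# setup.py excerpt (illustrative)
from setuptools import setup

setup(
    name='python-craigslist',
    version='1.1.1',  # placeholder
    install_requires=[
        'requests>=2.25.0',       # assumed minimum, not the actual pin
        'beautifulsoup4>=4.9.0',  # assumed minimum, not the actual pin
    ],
)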

@usctzen

usctzen commented Feb 23, 2021 via email

@jraVette

jraVette commented Feb 23, 2021

Hey y'all, thanks so much for taking the time to fix this! It could just be how my packages were managed, but when I ran pip install python-craigslist --upgrade, it updated requests but not urllib3 (I guess urllib3 is used by requests), so upgrading just python-craigslist did not work for me. After updating both requests and urllib3 to the latest versions, I'm back up and running! Maybe consider adding urllib3 to the requirements? Thanks again!!

These versions are what got my code working:

>>> requests.__version__
'2.25.1'
>>> urllib3.__version__
'1.26.3'

PS: great module, it's helped me get some great deals on Craigslist!

@cwittwer

+1 this fixed everything. Good catch!

@irahorecka
Contributor

irahorecka commented Mar 30, 2021

@cwittwer, @jraVette, @usctzen, @KeeonTabrizi, @luisandrecunha If you guys are interested in an alternative Craigslist API, check out pycraigslist.
I enjoy python-craigslist, but there were some features I wanted to implement right away. Additional features are in the works.

@usctzen

usctzen commented Mar 30, 2021 via email

@usctzen

usctzen commented Mar 30, 2021 via email

@irahorecka
Contributor

Hey @usctzen, I always appreciate your feedback. Could you post the same issue in pycraigslist issues? I’ll address it there :)

@usctzen

usctzen commented Mar 30, 2021 via email

@juliomalegria
Owner

Hey everyone! Sorry for the delay. I've updated the requirements in 88a6b73 and pushed a new version to PyPI. Could anyone confirm whether this fixes the issue?
Thanks for all the patience!

@Agwebberley

I am still having this issue.
