fix: TLS fingerprinting prevents scraping #2888 #3384

@bilhert bilhert commented Mar 28, 2024

What type of PR is this?

  • bug

What this PR does / why we need it:

TLS fingerprinting can be used to protect websites from (scraper) bots. Cloudflare, for example, provides this as a service.

https://www.ah.nl/allerhande became unavailable to the scraper once the site started using this technique. This PR counters it by spoofing the TLS fingerprint.
To do so, httpx is replaced with curl-cffi (https://pypi.org/project/curl-cffi/0.2.1).
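
For reviewers unfamiliar with curl-cffi, here is a minimal sketch of the idea, not the exact code in this PR. It assumes a curl_cffi version that ships the requests-style API and accepts `"chrome110"` as an impersonation target; `build_request_kwargs` and `fetch_html` are hypothetical helper names used only for illustration.

```python
from typing import Any

# Assumption: "chrome110" is a valid impersonation target in the installed
# curl_cffi version; the list of supported browsers varies between releases.
DEFAULT_IMPERSONATE = "chrome110"


def build_request_kwargs(
    timeout: float = 15.0, impersonate: str = DEFAULT_IMPERSONATE
) -> dict[str, Any]:
    """Collect the keyword arguments passed to curl_cffi's requests.get()."""
    return {
        # Makes the TLS handshake look like the named browser's handshake.
        "impersonate": impersonate,
        "timeout": timeout,
        "allow_redirects": True,
    }


def fetch_html(url: str, **overrides: Any) -> str:
    """Fetch a page with a browser-like TLS fingerprint (hypothetical helper)."""
    # Imported lazily so this module still loads where curl_cffi is missing.
    from curl_cffi import requests

    kwargs = {**build_request_kwargs(), **overrides}
    response = requests.get(url, **kwargs)
    response.raise_for_status()
    return response.text
```

Calling `fetch_html("https://www.ah.nl/allerhande")` would then perform the request with the spoofed fingerprint. Whether a given site accepts it depends on the site's bot protection and the impersonation target chosen.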

Which issue(s) this PR fixes:

fixes #2888

Special notes for your reviewer:

Testing

A Docker build was used to build the project. Afterwards, only the import of recipes from URLs was tested, using ah.nl/allerhande, lidl-kochen.de, and various other sites.

@bilhert bilhert changed the title from "fix: TLS fingerprinting prveents scraping #2888" to "fix: TLS fingerprinting prevents scraping #2888" Mar 28, 2024
Author

bilhert commented Mar 29, 2024

Formatting has been fixed by poetry black

Collaborator

hay-kot commented Apr 1, 2024

FYI, we're addressing some security issues in #3368 so once that is merged you'll have some additional work to do here to also address those security issues.

@hay-kot hay-kot marked this pull request as draft April 2, 2024 15:07
Author

bilhert commented Apr 2, 2024

Thank you for mentioning this. I looked into that PR. Unfortunately, the method chosen to implement the security fix seems to be incompatible with curl_cffi: I could not find an option that mimics the transport parameter of httpx. I therefore expect that setting up a fully equipped dev environment, getting familiar with the project's structure, and doing the additional work properly would take more time than I have available at the moment. For now, I sadly have to leave this PR as an example of how TLS fingerprints can be spoofed within the project, for someone who is more up to speed with it.

Development

Successfully merging this pull request may close these issues.

Scaper can not recognise anything on Allerhande.nl (ah.nl) anymore while it did in the past