Support for adding Referer and User-agent #33

Open
shtrom opened this issue Jun 28, 2023 · 0 comments

When dealing with the ACM website (e.g., https://github.com/shtrom/ftr-site-config/blob/shtrom-s-master/cacm.acm.org.txt), the login URL only works if the HTTP Referer header is from an acm.org URL.

In this particular instance, it sets a cookie, and serves a redirect to the original page.

For example, this works:

$ curl -D - 'https://cacm.acm.org/login' -X POST -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/114.0'  -H 'referer: https://cacm.acm.org/oeuu'  --data-raw 'current_member%5Buser%5D=USER&current_member%5Bpasswd%5D=PASSWORD'
HTTP/2 302 
date: Wed, 28 Jun 2023 12:32:28 GMT
content-type: text/html; charset=utf-8
location: https://cacm.acm.org/oeuu
cf-ray: XXX-MEL
cf-cache-status: DYNAMIC
cache-control: no-cache
set-cookie: format=full; domain=acm.org; path=/
set-cookie: INDIV_CLIENT=XXX; domain=acm.org; path=/
set-cookie: _cacm_acm_session=XXX; domain=acm.org; path=/
status: 302 Found
x-powered-by: Phusion Passenger
x-runtime: 0.06599
server: cloudflare

<html><script src="/cdn-cgi/apps/head/nLYIPopMPWKseIlIthEH-UJkbT0.js"></script><body>You are being <a href="https://cacm.acm.org/oeuu">redirected</a>.</body></html>

But simply removing the `Referer` or `User-Agent` header leads to failures:

$ curl -D - 'https://cacm.acm.org/login' -X POST --data-raw 'current_member%5Buser%5D=USER&current_member%5Bpasswd%5D=PASSWORD'
HTTP/2 403 
date: Wed, 28 Jun 2023 12:35:50 GMT
content-type: text/plain; charset=UTF-8
content-length: 16
x-frame-options: SAMEORIGIN
referrer-policy: same-origin
cache-control: private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0
expires: Thu, 01 Jan 1970 00:00:01 GMT
server: cloudflare
cf-ray: 7de5f8d01f702ea6-MEL

error code: 1020
$ curl -D - 'https://cacm.acm.org/login' -X POST -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/114.0'  --data-raw 'current_member%5Buser%5D=USER&current_member%5Bpasswd%5D=PASSWORD'
HTTP/2 500
date: Wed, 28 Jun 2023 12:34:47 GMT
content-type: text/html
cf-ray: 7de5f73e49741f64-MEL
cf-cache-status: DYNAMIC
server: cloudflare

This means that despite the login_* variables in the site-config, fetching full articles fails, as those two headers are missing.

I think this can be solved by:

  1. letting guzzle-site-authenticator pass headers on demand
  2. making graby and/or wallabag pass the User-Agent override from the site-config if any
  3. making graby and/or wallabag pass the Referer to be the original URL to be fetched

This should fix the ACM issue, and I think it is sufficiently generic to be equally helpful (or at least not detrimental) on other sites. If this turns out to break things, we'd need additional site-config options to specify whether additional login_* headers should be included, and their values.
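
To make steps 2 and 3 of the proposal concrete, here is a minimal sketch of the header selection, written in Python so that it doesn't depend on any particular Guzzle version. The function name `build_login_headers` and the `site_config_headers` mapping are hypothetical; nothing like them exists in graby or guzzle-site-authenticator as far as I know.

```python
from typing import Dict, Mapping, Optional


def build_login_headers(
    original_url: str,
    site_config_headers: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return the extra headers to send with the login POST.

    Hypothetical helper: `site_config_headers` stands in for whatever
    header overrides the site-config provides, keyed by header name.
    """
    headers: Dict[str, str] = {}

    # Step 2: forward the User-Agent override from the site-config, if any.
    for name, value in (site_config_headers or {}).items():
        if name.lower() == "user-agent":
            headers["User-Agent"] = value

    # Step 3: use the URL originally being fetched as the Referer.
    headers["Referer"] = original_url
    return headers
```

The resulting mapping would then be handed to the authenticator (step 1), which would attach it to the login request. If no override is configured, only the `Referer` is added and the HTTP client's default `User-Agent` is left untouched.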

Now, this is all conjecture, as I haven't been able to successfully hack my wallabag instance to behave as described. I got lost jumping between wallabag, graby, and guzzle-site-authenticator.

I'm willing to keep going on this, but I would welcome pointers as to:

  1. where I can send headers from guzzle-site-authenticator (I unsuccessfully tried in LoginFormAuthenticator::login https://github.com/wallabag/guzzle-site-authenticator/blob/master/lib/Authenticator/LoginFormAuthenticator.php#L36-L37 by adding a headers array, but maybe I did it wrong)
  2. how to see debug messages from the Authenticator about the requests they are sending (at the moment, I see graby and wallabag determining that a login is needed, and then failure from the login page, but no more debug in between)
  3. how/where I could change/update the HttpClient that, I think, gets injected by wallabag or graby.
  4. any other simpler way to achieve all this?