Page title fetching is still failing in some cases #531
I tried the first 5 links. With the Firefox API, the title, URL and description are fetched correctly. However, with the bookmarklet (or the Shaarli-next Firefox addon), descriptions are not loaded.
Thanks for the dataset! It'll be easier to fix this for good. So far, here's what I got:
Encoding issues:
Not solved yet and/or probably won't fix:
Probably due to going through reverse proxies, but that's more likely a bad configuration on their side.
@alexisju: it doesn't work the same way. With bookmarklets, the page is already open, and Shaarli grabs the title with JS (or the Firefox API). Within Shaarli itself, it has to make an external request from your server to the targeted URL.
OK, the explanation seems legit, but does not convince me completely.
I do not agree with that. The correct title appears in a
On this one, the retrieved title has an extra space between the “-” and the preceding word. It looks like the carriage return is interpreted as a space character. Other title differences could be related to the server adapting to the browser's accepted language settings.
Yes, I know. I expected to get “403 Forbidden” as page title.
This is not a valid HTML page, but my browser (Firefox) still displays a title (Marcus Rohrmoser). It looks like it was extracted from the
I almost exclusively used the Shaarli web interface for adding links.
I ran
The message seems to depend on the User Agent (
I think Shaarli can live without RDF parsing.
My bad, I jumped to that conclusion too fast. I found another issue, and fixing it will probably solve some other ones: the URL is escaped too early, so Shaarli is trying to reach an invalid URL (
Yep, I thought you were talking about the translation. The replace function adds an additional space when replacing new lines, which is unnecessary. Also, I guess adding the locale to the HTTP request wouldn't hurt.
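As an illustration (a hypothetical helper, not Shaarli's actual code), collapsing each run of whitespace into a single space avoids the extra space left by replacing newlines individually:

// Hypothetical helper, not Shaarli's actual implementation: collapse any run of
// whitespace (including the CR/LF inside a <title> element) into a single space.
function normalize_title($title)
{
    return trim(preg_replace('/\s+/', ' ', $title));
}

// "Foo\r\n- Bar" becomes "Foo - Bar" instead of "Foo  - Bar".
echo normalize_title("Foo\r\n- Bar");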
No, Shaarli only downloads the page and retrieves the title if it gets a 200 OK HTTP code. I don't really see the point of setting 403 Forbidden as a title.
Right, NoScript was blocking the rendering. I agree with @nodiscc though. Anyway, there are a bunch of things to fix, but that will be a great improvement. Thanks!
Actually, I found something else for fr.atlassian.com: they use
EDIT: Another one. The bad certificate can be ignored.
It looks like your commit fixed most problems, as I tested the problematic links with that version.
Title appears, but is not the expected one:
host1.no no longer returns a 404 error; title fetching works well there. I found some new links causing other issues:
So there are two fixable things here: Unicode domains and support for pre-HTML5 charset declarations. For the links that work for me, it might be because I have a better connection than yours, although a proper server should have a better connection than mine.
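For illustration, a rough sketch of those two fixes (it assumes the PHP intl extension for idn_to_ascii(); the host and helper name are made up, and this is not the actual patch):

// Sketch only (assumes the intl extension); not the actual Shaarli patch.

// 1) Unicode (IDN) domains: convert the host to its punycode form before the request.
$host = 'éxample.example';                          // hypothetical IDN host
$asciiHost = idn_to_ascii($host, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
// $asciiHost now holds the xn--... form that DNS and HTTP clients expect.

// 2) Charset: also honour the pre-HTML5 declaration
//    <meta http-equiv="Content-Type" content="text/html; charset=...">
function detect_charset($html)
{
    if (preg_match('/<meta\s+charset=["\']?([^"\'>\s]+)/i', $html, $m)) {
        return strtolower($m[1]);   // HTML5 form
    }
    if (preg_match('/content=["\'][^"\']*charset=([^"\'>;\s]+)/i', $html, $m)) {
        return strtolower($m[1]);   // pre-HTML5 form
    }
    return 'utf-8';                 // assumed default
}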
They now work for me, after another try.
Oh right, good catch. Although:
Talking about the anti-bot policies on some websites, how do you explain that they detect Shaarli as a bot, given that it uses a desktop browser user agent? On the other hand, most links flagged as having an anti-bot policy worked for me with
To be honest, I have no idea. Playing with the headers didn't change anything. I'm not willing to spend more time on this. If anyone sees what Shaarli could do better, feel free to reopen this issue.
Fixes #531 - Title retrieving is failing with multiple use case
Hello again, I have made some discoveries about the links that are still broken. First of all, I analysed (using Wireshark) the HTTP request emitted by Shaarli (on my local instance) when looking for the title and compared it with
After digging through the code, it appears that the title fetching procedure is called from
In order to do more extensive testing, I extracted the code responsible for calling
$url = 'http://an-url-to-test.tld/';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_NOBODY, true);
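// CURLOPT_NOBODY turns this into a HEAD request; as noted later in the thread,
// some servers block HEAD requests as an anti-crawler measure.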
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0');
$r = curl_exec($ch);
return $r;
Running that on problematic URLs gave me those results:
There are several things to say:
In conclusion, I would say that
Thanks for the thorough investigation. I'm against adding a php5-curl dependency to work around a few edge cases, but this could benefit from more investigation. Note that
Thanks for the investigation. I've noticed in another project that cURL might be more reliable than the default
While we shouldn't add a server requirement for that (
Thanks everyone for the feedback! @nodiscc I contacted the server operator and was told that HEAD requests are blocked as a protection against crawlers. Talking about webpage fetching methods, I also considered using
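One possible alternative, sketched here under my own assumptions (placeholder URL, and not necessarily the approach Shaarli adopted): issue a GET request but abort the download after a few kilobytes, which is usually enough to reach the <title> element.

// Sketch: use GET instead of HEAD, but stop downloading after ~8 KB.
// Returning a value different from the chunk length from CURLOPT_WRITEFUNCTION
// makes cURL abort the transfer, so the whole page is never fetched.
$body = '';
$ch = curl_init('http://an-url-to-test.tld/');   // placeholder URL from the snippet above
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
curl_setopt($ch, CURLOPT_WRITEFUNCTION, function ($ch, $chunk) use (&$body) {
    $body .= $chunk;
    return strlen($body) < 8192 ? strlen($chunk) : 0;
});
curl_exec($ch);
curl_close($ch);

if (preg_match('!<title[^>]*>(.*?)</title>!is', $body, $matches)) {
    $title = html_entity_decode(trim($matches[1]), ENT_QUOTES, 'UTF-8');
}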
@julienCXX Since you already worked on this, feel free to submit a PR with cURL fetching and the current code as fallback, if you want to.
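A minimal sketch of what such a PR could look like (the fetch_page() wrapper is hypothetical, not the merged implementation): use cURL when the extension is available, and fall back to the existing stream-based approach otherwise.

// Hypothetical wrapper, not the merged implementation.
function fetch_page($url, $timeout = 30)
{
    $ua = 'Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0';

    if (function_exists('curl_init')) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
        curl_setopt($ch, CURLOPT_USERAGENT, $ua);
        $page = curl_exec($ch);
        curl_close($ch);
        return $page;
    }

    // Fallback: plain PHP streams, roughly what the current code does.
    $context = stream_context_create(array('http' => array(
        'method' => 'GET',
        'timeout' => $timeout,
        'user_agent' => $ua,
    )));
    return @file_get_contents($url, false, $context);
}

Keeping both paths avoids turning php-curl into a hard server requirement, as discussed above.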
I am thinking about it, but I have a few questions before starting work:
I think you should replace
I'm not sure what this TODO is for specifically, but there are TODOs application-wide regarding errors, because we lack a proper error system. It'll be refactored eventually, so don't worry about it. Also, it's PHP.
There are interesting discussions on how to catch PHP warnings/errors/fatals with a custom handler:
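As an illustration of that pattern (a sketch, not Shaarli's code): promote warnings to exceptions only around the fetch call, then restore the previous handler. Fatal errors still cannot be intercepted this way.

// Sketch: promote PHP warnings to exceptions around the fetch call only.
set_error_handler(function ($severity, $message, $file, $line) {
    throw new ErrorException($message, 0, $severity, $file, $line);
});

try {
    $page = file_get_contents('http://an-url-to-test.tld/');   // placeholder URL
} catch (ErrorException $e) {
    $page = false;   // e.g. DNS failure or timeout, which otherwise only raises a warning
}
restore_error_handler();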
I think the
It’s almost there. The latest broken links seem to work, but I discovered other bugs:
That's the expected behaviour, actually. We don't change user input (except for some parameter cleaning).
OK, that’s not up to me. Testing again with the cURL-based method, the title of http://android.izzysoft.de/ now appears in English.
Looks like something
Neat. That will remain a mystery though.
Works fine here with your PR and the fallback method.
This is not a mystery. It comes from the fact that cURL sends the Accept-Language header properly, whereas
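For comparison, a sketch of setting Accept-Language explicitly with both fetching methods (illustrative header value and placeholder URL):

// With cURL:
$ch = curl_init('http://an-url-to-test.tld/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Accept-Language: en-US,en;q=0.8'));
$page = curl_exec($ch);
curl_close($ch);

// With PHP streams (the fallback method):
$context = stream_context_create(array('http' => array(
    'header' => "Accept-Language: en-US,en;q=0.8\r\n",
)));
$page = file_get_contents('http://an-url-to-test.tld/', false, $context);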
Yes, I know. But I wanted to inform you that the issue was probably not related to a bad configuration on the webmaster’s side, as you stated in your first comment.
Revert "Fixes #531 - Title retrieving is failing with multiple use case". This reverts commit 112fb2c.
Hello, this is still not fixed in 2023! Page title fetching still does not work properly. Any updates? Regards
Hi,
Works properly on the demo instance https://demo.shaarli.org/admin/shaare?post=https%3A%2F%2Ftinast.fr%2F, and on my own instance. Make sure that:
If this still doesn't work, check your webserver/PHP logs for any possible errors (and provide them in a new issue), and/or enable
There are, to my knowledge, only a few remaining cases where page title fetching is still broken:
After rechecking all links in all comments of this issue, I was only able to reproduce the problem (no title retrieved) with these URLs on Shaarli v0.12.2. Each one may have a different cause, so we may have to open specific issues or update the documentation:
Before reporting other cases, make sure your Shaarli installation complies with the requirements above (latest Shaarli release, PHP curl extension installed and enabled, outgoing HTTP requests not blocked by your hosting provider, destination webserver reachable from the server hosting Shaarli).
Metadata retrieval does not work for YouTube videos, for example https://www.youtube.com/watch?v=e4TFD2PfVPw. The problem is that YouTube redirects (302) to https://www.youtube.com/supported_browsers?next_url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3De4TFD2PfVPw. The reason for this redirection seems to be that Shaarli uses the user agent string of Firefox 45, which YouTube apparently deems too ancient. I can confirm that changing the version in the user agent string from 45.0 to something more recent, like 121.0 (current release) or 115.0 (current extended support release), fixes the issue, at least for now. If you wish, I will submit a pull request to that effect.
It seems you are correct:
$ curl --silent --user-agent 'Mozilla/5.0 (Windows NT 10.0; rv:109.0) Gecko/20100101 Firefox/115.0' https://www.youtube.com/watch?v=e4TFD2PfVPw | grep --only-matching '<title>.*</title>'
<title>Parcels - Live Vol. 1 (Complete Footage) - YouTube</title>
$ curl --silent --user-agent 'Mozilla/5.0 (Windows NT 10.0; rv:109.0) Gecko/20100101 Firefox/45.0' https://www.youtube.com/watch?v=e4TFD2PfVPw | grep --only-matching '<title>.*</title>'
# nothing
It would be nice, thank you. YouTube used to be one of the few sites that used JavaScript to update the HTML
YouTube responds with a redirect if it thinks that the user agent is too old. This commit changes the user agent string to that of a current version of Firefox. See also shaarli#531 (comment).
Hi,
I recently installed Shaarli (v0.6.5) on a private server (running PHP 7.0.5) in order to save a large number of links from open browser tabs. As I imported the links manually, I noticed that the link titles did not always appear properly (nothing, only a partial title, or a broken UI). That happened quite often (about 90 broken titles out of more than 520 links). As these issues seemed to have been fixed in #410 and #512, I tried to repeat the process with a local instance (same PHP version), using the most recent development version at that time (11609d9). The result was a bit better (no UI breakage), but about 50 links are still broken, in different ways.
They fall into the following categories:
No title appears:
http://www.leboncoin.fr/annonces/offres/ile_de_france/occasions/?th=1&q=test (404)
https://www.amazon.fr/gp/cart/view.html/ref=nav_cart (binary output, not HTML)
http://ast2015.acroe-ica.org/ (website down/timeout)
https://static.societegenerale.fr/ (404)
http://antix.mepis.org/ (NXDOMAIN)
http://www.quickvz.com/ (NXDOMAIN)
http://www.kris3d.com/ (website down/timeout)
https://mro.name/foaf.rdf (404)
A title appears, but is truncated/not the right one:
A title appears, but non-ASCII characters are replaced with a question mark (encoding issue?):
Furthermore, title fetching fails on links to PDF files.