
Allow option to ignore redirects warnings #54

Closed
4 tasks done
dereknola opened this issue Oct 2, 2024 · 22 comments
Labels
💪 phase/solved Post is done

Comments

@dereknola

Initial checklist

Problem

There are many redirects that are considered valid for our website https://github.com/k3s-io/docs.
With the bump to 2.0.0 we are now seeing warnings such as:

30:3-30:80      warning Unexpected redirecting URL `https://etcd.io/docs/latest/op-guide/recovery/`, expected final URL `https://etcd.io/docs/v3.5/op-guide/recovery/` 

It is very common for /stable/ or /latest/ URLs to auto-redirect to the version in question.

See https://github.com/k3s-io/docs/actions/runs/11113288516 for all the specific errors we ran into suddenly.

Solution

Add an option to ignore/skip the check on whether a URL has been redirected. I don't care that the URL redirects; I only care whether the URL hits a 404 or does not exist.
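
For illustration, a sketch of what such an opt-out could look like in a remark config; the option name `checkRedirects` is hypothetical, not part of the plugin's current API:

    // Hypothetical sketch only: `checkRedirects` is an assumed option name,
    // not something remark-lint-no-dead-urls currently exposes.
    import remarkLintNoDeadUrls from 'remark-lint-no-dead-urls'

    const remarkConfig = {
      plugins: [
        // Only fail on 404/dead URLs; don't warn about redirects.
        [remarkLintNoDeadUrls, {checkRedirects: false}]
      ]
    }

    export default remarkConfig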

Alternatives

Currently we have simply started running this plugin without `--frail` and manually check whether or not some URLs are actually 404/dead.

wooorm (Member) commented Oct 2, 2024

Hmm, I can see that some redirects are useful indeed. Though, I am sure some are to be avoided too? Could there be a regex pattern system for redirects?

wooorm (Member) commented Oct 3, 2024

Also, could different status codes be used to differentiate?

jrfnl commented Oct 7, 2024

This is really a problematic change, as it basically prohibits canonical links, which always redirect to the latest version of something.

wooorm (Member) commented Oct 7, 2024

Can you read the above comments and add something useful to the conversation... You likely don’t want some redirects. You likely want to allow some redirects. How could we make that distinction?

jrfnl commented Oct 7, 2024

Can you read the above comments and add something useful to the conversation...

I can and did. Can you read the README and CHANGELOG?

Nowhere in the documentation of this project does it mention anything about forbidding redirection of URLs.

Now, I get that redirects should only be followed up to a point (2, maybe 3 redirects), but forbidding/failing on any redirect is an undocumented change in behaviour (not mentioned in the README or changelog) and should therefore be considered a bug.

It is also outside the scope of this plugin. People use this package for what it says it does: check for dead URLs. URLs which redirect to a valid URL are not dead.

You likely don’t want some redirects. You likely want to allow some redirects. How could we make that distinction?

You can't, but you shouldn't have to. The only distinction you need to make is whether a URL redirects to a 4xx or 5xx code or gets into a redirect loop. Anything else should not be considered problematic.
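
A minimal sketch of the check described here, assuming Node 18+ with the built-in fetch; the helper name `isDead` is illustrative, not the plugin's actual implementation:

    // Follow redirects up to a cap; flag only 4xx/5xx results or loops.
    async function isDead(url, maxHops = 5) {
      const seen = new Set()
      let current = url
      while (seen.size <= maxHops) {
        if (seen.has(current)) return true // redirect loop
        seen.add(current)
        const response = await fetch(current, {redirect: 'manual'})
        const location = response.headers.get('location')
        if (response.status >= 300 && response.status < 400 && location) {
          current = new URL(location, current).href // follow the redirect
          continue
        }
        return response.status >= 400 // dead only on 4xx/5xx
      }
      return true // too many redirects
    }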

wooorm (Member) commented Oct 7, 2024

ok well it’s an intended major change to improve docs — here’s your money back; feel free to use other things.

In the previous comments I was discussing how more useful features could be added.

You can't

I think it might be possible! There are differences between 301, 302, 303, 307, and 308.

jrfnl commented Oct 7, 2024

here’s your money back; feel free to use other things.

Seriously? Look, I'm a maintainer too and I feel your pain, but these types of remarks are not helpful and only serve to alienate people who do want to help (and understand how open source works).

I use this package in an action, often run on a weekly cronjob, to ensure that URLs in the README, CHANGELOG and other docs are not broken when end-users browse them.

I've had to make emergency fixes to a dozen or so projects over the last few days because of this undocumented change, as otherwise contributors to my projects would be blocked by failing builds.

However, as canonical links are now "forbidden", this means these "fixed" links will soon go out of date and the docs of my projects will point to outdated information, which I then have to fix again.

Now, I realize that that is not your concern, but I just want to give you some perspective of the consequences of this bug.

In the previous comments I was discussing how more useful features could be added.

You can't

I think it might be possible! There are differences between 301, 302, 303, 307, and 308.

Well, that presumes sites actually send the correct HTTP code ;-) Then again, I've seen 404 pages being served with a 200 status, so I suppose that's something we shouldn't even consider.

I think limiting to the 301 and 308 statuses ("Moved Permanently"/"Permanent Redirect") may already be an improvement, but a 308 might still flag canonical links.

Having said that, I would still prefer an option to turn the warning on redirection off completely, like the OP suggested above.

Add an option to ignore/skip the check on whether a URL has been redirected. I don't care that the URL redirects; I only care whether the URL hits a 404 or does not exist.

wooorm (Member) commented Oct 7, 2024

Yes, seriously. You should have used versioning: this was in a major release. I understand that the Python ecosystem has different traditions, but you should know how this works in JS too. This was a jump from an abandoned v1 to a maintained, rewritten v2 with breaking changes.

Users are good at raising problems. I don't think they are always good at coming up with solutions. I think I am in a better position, as a maintainer, for that.
To come up with a good solution, I would like to hear more about your situation.
I also believe in first investigating several possible solutions and weighing trade-offs. See also https://whatwg.org/faq#adding-new-features.

If you look at the OP's log, you will see that they indeed use several intentionally redirecting URLs. But there are also redirecting URLs that can be improved.

I would appreciate it if you could share what your inputs are, what the actual current output is, and what your expected output is. Perhaps you can even look at which status codes you are getting. Thanks!

jrfnl commented Oct 7, 2024

Yes, seriously. You should have used versioning: this was in a major release.

That's a fair point, were it not for predefined action runners and Docker containers that include this package, over which I have no control.

And even if I changed that setup to directly request the packages from within the action workflows (which I have done for some), Dependabot does not keep those up to date, which makes that a maintenance nightmare in the making.

I understand that the Python ecosystem has different traditions. But you should know that about JS too.

Interesting about the Python ecosystem; I have no idea about that at all. In the PHP world, using versioning is common and straightforward.

The problem with versioning in the JS world - IMO - is that it appears to be a rule that a package can't be released without at least 1000 dependencies, leading to regular dependency version conflicts, also known as "dependency hell".

Please note: this is not criticism of this package, this is just a generic observation on JS/Node and dependencies.

This is between an abandoned v1 up to a maintained rewritten v2 with breaking changes.

Well, thank you for that. I had no idea the package was abandoned and can only compliment you on taking it on.

Regarding breaking changes: it would be great if the consequences of those were annotated in a human understandable way in the changelog/release.

Users are good at raising problems. I don't think they are always good at coming up with solutions. I think I am in a better position, as a maintainer, for that. To come up with a good solution, I would like to hear more about your situation. I also believe in first investigating several possible solutions and weighing trade-offs. See also https://whatwg.org/faq#adding-new-features.

If you look at the OP's log, you will see that they indeed use several intentionally redirecting URLs. But there are also redirecting URLs that can be improved.

I would appreciate it if you could share what your inputs are, what the actual current output is, and what your expected output is. Perhaps you can even look at which status codes you are getting. Thanks!

Happy to. Would you like me to share some links to failing builds? Or would you prefer examples?

To start, here are some examples of what I would say are redirections which should be left alone:

| Canonical | Redirects to | HTTP status [*] |
| --- | --- | --- |
| https://keepachangelog.com/ | https://keepachangelog.com/en/1.1.0/ | 200 (with `<meta http-equiv=refresh...>` in the HTML header causing the redirect) |
| https://coveralls.io/repos/github/PHPCSStandards/PHPCSExtra/badge.svg | https://s3.amazonaws.com/assets.coveralls.io/badges/coveralls_100.svg | 302 |
| http://getcomposer.org/ | https://getcomposer.org/ | 302 |
| https://github.com/Yoast/PHPUnit-Polyfills/issues/new/choose | https://github.com/login?return_to=https%3A%2F%2Fgithub.com%2FYoast%2FPHPUnit-Polyfills%2Fissues%2Fnew%2Fchoose | 302 |

Here are some which can be argued should be reported (and are):

| Requested URL | Redirects to | HTTP status [*] |
| --- | --- | --- |
| https://eslint.org/docs/rules/no-lonely-if | https://eslint.org/docs/latest/rules/no-lonely-if | 301 |
| http://semver.org/ | https://semver.org/ | 301 |
| https://img.shields.io/packagist/php-v/phpcsstandards/phpcsextra.svg | https://img.shields.io/packagist/dependency-v/phpcsstandards/phpcsextra/php.svg | 301 (though to be honest, I'm not sure I want to care about the shields.io endpoint being changed when it still works via the redirect) |

[*]: Status codes determined via a curl request to the original URLs.
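
The same check works from Node too; a minimal sketch using the built-in fetch (Node 18+) with manual redirect handling, on one of the URLs from the table above:

    // Inspect the first redirect hop for a URL without following it.
    const response = await fetch('http://semver.org/', {redirect: 'manual'})
    console.log(response.status, response.headers.get('location'))
    // Expected, per the table above: 301 https://semver.org/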

I'm also seeing anchors being reported as missing when the target website uses href="#anchor" type links with anchors defined in data- attributes (and the site handles those correctly).

Example:

9:1-9:177  warning  Unexpected dead URL `https://packagist.org/packages/phpcsstandards/phpcsextra#dev-develop`, expected live URL  no-dead-urls  remark-lint
  [cause]:
    error  Unexpected missing anchor element on `https://packagist.org/packages/phpcsstandards/phpcsextra` for fragment `dev-develop`, remove if unneeded or refer to an existing element

The https://packagist.org/packages/phpcsstandards/phpcsextra#dev-develop URL works perfectly fine and is based on the following HTML code in the target page:

<li class="details-toggler version" data-version-id="dev-develop" data-load-more="/versions/3558491.json">
    <a href="#dev-develop" class="version-number">dev-develop / 1.x-dev</a>

Other errors I'm seeing intermittently which may be related:

Any links to https://poser.pugx.org/ (which are working perfectly fine) are marked as dead with an error along the lines of:

5:2-5:85  warning  Unexpected dead URL `https://poser.pugx.org/phpcsstandards/phpcsutils/v/stable`, expected live URL  no-dead-urls  remark-lint
  [cause]:
    error  Unexpected error fetching `https://poser.pugx.org/phpcsstandards/phpcsutils/v/stable`  fetch  dead-or-alive
  [cause]:
    AbortError: This operation was aborted
    at fetch (/home/runner/work/PHPCSUtils/PHPCSUtils/node_modules/undici/index.js:112:13)
    at runNextTicks (node:internal/process/task_queues:60:5)
    at listOnTimeout (node:internal/timers:545:9)
    at process.processTimers (node:internal/timers:519:7)
    at async deadOrAliveInternal (file:///home/runner/work/PHPCSUtils/PHPCSUtils/node_modules/dead-or-alive/lib/index.js:206:16)
    at async deadOrAlive (file:///home/runner/work/PHPCSUtils/PHPCSUtils/node_modules/dead-or-alive/lib/index.js:124:17)
    at async file:///home/runner/work/PHPCSUtils/PHPCSUtils/node_modules/remark-lint-no-dead-urls/lib/index.js:152:22
    at async Promise.all (index 0)
    at async rule (file:///home/runner/work/PHPCSUtils/PHPCSUtils/node_modules/remark-lint-no-dead-urls/lib/index.js:148:3)

Hope this helps.

wooorm (Member) commented Oct 8, 2024

I thought you were doing Python but it's PHP, ok!

I think you are wrong about JavaScript; take a look at the 100s of projects that are being maintained by us. Or see https://github.com/wooorm/npm-high-impact/blob/main/lib/top.js. They don’t have 1000s of dependencies. I also think that what you call “dependency hell” is the reason JavaScript is more popular than PHP/Ruby/etc. And the main problems that the JS ecosystem has — and it has many — I ascribe to being so popular.

If Docker makes it hard to version things: don't use Docker. You can `npm install remark-cli@9 remark-lint-no-dead-urls@1`.

I'm also seeing anchors being reported as missing when the target website uses

Pass checkAnchor: false. Or pass anchorAllowlist. Or get the website to support user-content- prefixes on ids instead of data attributes (see resolveClobberPrefix): https://github.com/wooorm/dead-or-alive#options

Other errors I'm seeing intermittently which may be related:

It may be. Or see the options at https://github.com/wooorm/dead-or-alive#options. If the website is slow, perhaps tweak the timeout option. Or ignore it if they are flaky.
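
For instance, a sketch of passing those options straight to dead-or-alive; the option names come from its README, the values here are illustrative:

    import {deadOrAlive} from 'dead-or-alive'

    // Sketch: skip fragment checking and give slow servers more time.
    const result = await deadOrAlive(
      'https://poser.pugx.org/phpcsstandards/phpcsutils/v/stable',
      {
        checkAnchor: false, // don't verify `#fragment` targets
        timeout: 10000 // milliseconds before a fetch is aborted (assumed unit)
      }
    )

    console.log(result.status) // 'alive' or 'dead'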

wooorm (Member) commented Oct 8, 2024

*: Status codes determined via a Curl request to the original URLs.

Thanks for these! So it looks like we can differentiate between 301 and 302!
302 is acceptable to keep in docs; 301 should be changed.

As for your note on shields.io: I kinda get it. But every HTTP request also takes time for your users. The HTTP spec on 301:

The server is suggesting that a user agent with link-editing capability can permanently replace references to the target URI with one of the new references sent by the server. However, this suggestion is usually ignored unless the user agent is actively editing references (e.g., engaged in authoring content) […]

https://datatracker.ietf.org/doc/html/rfc9110#section-15.4.2

wooorm (Member) commented Oct 8, 2024

http -> https for getcomposer.org is the same as for semver.org—both a good idea I think—but the former uses 302 and the latter 301. Sad!

wooorm (Member) commented Oct 8, 2024

OK, I'll publish a fix in a second. Right now this tool warns for each link in the following document:

# URLs

## Fine

[a](https://keepachangelog.com/)

[b](https://coveralls.io/repos/github/PHPCSStandards/PHPCSExtra/badge.svg)

[c](http://getcomposer.org/)

[e](https://github.com/Yoast/PHPUnit-Polyfills/issues/new/choose)

## Changeable

[f](https://eslint.org/docs/rules/no-lonely-if)

[g](http://semver.org/)

[h](https://img.shields.io/packagist/php-v/phpcsstandards/phpcsextra.svg)

It will then warn for the last 3 URLs, as they are permanently redirected (solely 301 or 308s). This tool then proposes:

# URLs

## Fine

[a](https://keepachangelog.com/)

[b](https://coveralls.io/repos/github/PHPCSStandards/PHPCSExtra/badge.svg)

[c](http://getcomposer.org/)

[e](https://github.com/Yoast/PHPUnit-Polyfills/issues/new/choose)

## Changeable

[f](https://eslint.org/docs/latest/rules/no-lonely-if)

[g](https://semver.org/)

[h](https://img.shields.io/packagist/dependency-v/phpcsstandards/phpcsextra/php.svg?)

Which then works without warnings

jrfnl commented Oct 8, 2024

@wooorm

You can `npm install remark-cli@9 remark-lint-no-dead-urls@1`

Will start using that more diligently in the repos where I've already converted the workflow away from Docker - in combination with watching releases of all those repos, as I won't be able to rely on Dependabot. Thanks.

Pass checkAnchor: false. Or pass anchorAllowlist.

Thanks for that suggestion. I think I'll go with the anchorAllowlist for now. I'll have a play to see if I can figure out how to configure that. (Sorry, YAML for JS configs is often still a mystery to me.)

Or get the website to support user-content- prefixes on ids instead of data attributes (see resolveClobberPrefix) https://github.com/wooorm/dead-or-alive#options

http -> https for getcomposer.org is the same as for semver.org—both a good idea I think—but the former uses 302 and the latter 301. Sad!

In both these cases, it's the same team behind them, so I'll open some issues for them and see if I can convince them to update both. Might take a while, but still worth a try ;-)

It may be. Or see the options at https://github.com/wooorm/dead-or-alive#options. If the website is slow, perhaps tweak the timeout option. Or ignore it if they are flaky.

Yes, currently ignoring. Retriggering the workflow normally fixes it. It was just weird, as I can't remember seeing that one before 2.0.

OK, I’ll publish a fix in a second. Where now this tool warns for each link in the following document:
...
It will then warn for the last 3 URLs, as they are permanently redirected (solely 301 or 308s). This tool then proposes:
...
Which then works without warnings

That's awesome! Thank you so much for listening and addressing this!

wooorm added a commit to wooorm/dead-or-alive that referenced this issue Oct 8, 2024
wooorm added a commit that referenced this issue Oct 8, 2024
wooorm (Member) commented Oct 8, 2024

OK, can you both please check out https://github.com/remarkjs/remark-lint-no-dead-urls/releases/tag/2.0.1? Non-permanent redirects are now fine. In both your logs there's a ton of things that I believe should be improved in the docs (or, sometimes, on the servers). Is this new behavior sufficient? Is something else needed? Is an option to allow permanent redirects still needed (if so, please provide strong arguments)? Thanks!

jrfnl commented Oct 8, 2024

@wooorm Currently on my way to a conference, so probably won't get to this until I'm back. I've made a note to revert the "fix" commits I made earlier after the 2.0 release and to re-evaluate the run results.
Will let you know how I get on.

wooorm (Member) commented Oct 10, 2024

OK, I'll close this as I think it's done. But open to further discussion!

jrfnl commented Oct 13, 2024

@wooorm Reporting back: looking good so far, but I didn't manage to get the anchorAllowlist working. Might be good to add some example code to the dead-or-alive README to show what the format should look like (for the non-JSON experts amongst us).

The below did not work for me:

    [
        "remark-lint-no-dead-urls",
        {
            "deadOrAliveOptions": {
                "anchorAllowlist": [
                    ["https://packagist.org/packages/phpcsstandards/phpcsextra", "dev-develop"]
                ],
                "maxRetries": 3
            }
        }
    ],

wooorm (Member) commented Oct 14, 2024

Right!

Regexes are expected. JSON and YAML don’t support regular expressions. In JS it would be something like:

anchorAllowlist: [
  [/^https:\/\/packagist\.org\/packages\/phpcsstandards\/phpcsextra\/?$/i, /^dev-develop$/i]
]

…but the point of using regexes is that you can use expressions to match different things instead of single, literal values.
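
Combining that with the earlier JSON attempt, a sketch of an equivalent JavaScript config (for example a `.remarkrc.mjs`), which can carry the RegExp values:

    // Sketch of the JSON config above rewritten as JavaScript so that
    // `anchorAllowlist` can hold actual RegExp values.
    import remarkLintNoDeadUrls from 'remark-lint-no-dead-urls'

    const remarkConfig = {
      plugins: [
        [
          remarkLintNoDeadUrls,
          {
            deadOrAliveOptions: {
              anchorAllowlist: [
                [
                  /^https:\/\/packagist\.org\/packages\/phpcsstandards\/phpcsextra\/?$/i,
                  /^dev-develop$/i
                ]
              ],
              maxRetries: 3
            }
          }
        ]
      ]
    }

    export default remarkConfig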

dereknola (Author) commented

Thanks for fixing this so quickly. You are correct, our website did have a bunch of links that actually needed to be fixed. Now on to figuring out why www.mysql.com returns a 403 🤔

ChristianMurphy (Member) commented

curl might offer some pointers: `curl -v -L` follows redirects and prints all the statuses along the way.

For your example:

curl -v -L www.mysql.com

I don't see a 403, but I do see a 301 Moved Permanently upgrading from HTTP to HTTPS.

wooorm (Member) commented Oct 15, 2024

I get a 403 as well. Some garbage old HTML with:

            <td width="68%" valign="top">
            <div align="center">
            <p align="justify"><font face="Arial, Helvetica, sans-serif">This site http://www.mysql.com/ </font><font face="Arial, Helvetica, sans-serif"> is experiencing technical difficulty. We are aware of the issue and are working as quick as possible to correct the issue. <br />
            <br />
            We apologize for any        inconvenience this may have caused. <br />
            <br />
            To speak with an Oracle sales representative: 1.800.ORACLE1.<br />
            <br />
            To contact Oracle Corporate Headquarters from anywhere in the world: 1.650.506.7000.<br />
            <br />
            To get technical support in the United States: 1.800.633.0738. </font><br />
            &nbsp;</p>
            </div>

Well, that's the web 🤷‍♂️ Maybe contact an Oracle rep 😉
