gh-93851: Add Tools/scripts/checkhtmllinks.py #93856

Closed
wants to merge 8 commits into from

Conversation

arhadthedev
Member

@arhadthedev arhadthedev commented Jun 15, 2022

Broken links from a parent issue were found using this tool.

C:\Users\oleg\Documents\dev\notmine\cpython>python Tools/scripts/checkhtmllinks.py -h
usage: checkhtmllinks.py [-h] [-r] [-l LIMIT] path

Check if specified HTML files have dead or redirected links.

positional arguments:
  path                  a glob pattern of file paths to scan

options:
  -h, --help            show this help message and exit
  -r, --allow-redirects
                        do not report HTTP 3xx links as kind-of-broken
  -l LIMIT, --limit LIMIT
                        skip files that contain more links than specified

Call this script on the HTML files of the rendered documentation. Even though the script is
multithreaded and caches findings for pages it has already processed, a full run over the whole
rendered documentation takes about an hour.
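For readers who want a feel for the "multithreaded plus cached" approach, here is a minimal sketch. It is not the code from this PR: `CachedChecker` and the injected `probe` callable are hypothetical names, and the real network request is deliberately left to the caller.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable
import threading


class CachedChecker:
    """Check each distinct URL once, even when many pages link to it."""

    def __init__(self, probe: Callable[[str], str], workers: int = 8) -> None:
        # Hypothetical callable: URL -> "ok" / "redirected" / "failed".
        self._probe = probe
        self._workers = workers
        self._cache: Dict[str, str] = {}
        self._lock = threading.Lock()

    def _check_one(self, url: str) -> str:
        with self._lock:
            if url in self._cache:
                return self._cache[url]
        status = self._probe(url)  # the slow call runs outside the lock
        with self._lock:
            # Keep the first stored result if two threads raced on one URL.
            return self._cache.setdefault(url, status)

    def check(self, urls: Iterable[str]) -> Dict[str, str]:
        unique = list(dict.fromkeys(urls))  # de-duplicate, keep order
        with ThreadPoolExecutor(max_workers=self._workers) as pool:
            statuses = list(pool.map(self._check_one, unique))
        return dict(zip(unique, statuses))
```

Reusing one `CachedChecker` instance across all pages means a link that appears in every page footer is probed only once per run.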
Example output for python Tools/scripts/checkhtmllinks.py -l50 build/doc-html/using/*.html

Note: [1/19] link to ... is not a bug; it comes from an <a href=''> in the top navigation bar, like this:

3.10.5 Documentation » The Python Standard Library » Debugging and Profiling » Audit events table
                                                                               ^^^^^^^^^^^^^^^^^^
                                                                               the empty link
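The link kinds that show up in the output (empty, fragment-only, root-relative, relative, external) can be told apart with the standard library alone. The sketch below is an illustration, not the script from this PR; `HrefCollector` and `classify` are hypothetical names, and the category labels are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class HrefCollector(HTMLParser):
    """Collect the href of every <a> tag, including empty ones."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    # A bare `href` attribute has value None; treat it as empty.
                    self.hrefs.append(value or "")


def classify(href: str) -> str:
    """Return an illustrative category label for one href value."""
    if href == "":
        return "empty"            # shown as "link to ..." in the output
    scheme = urlsplit(href).scheme
    if scheme in ("http", "https"):
        return "external"
    if scheme:
        return "other-scheme"     # for example mailto:
    if href.startswith("#"):
        return "fragment"
    if href.startswith("/"):
        return "absolute-path"    # reported as unsupported in the output
    return "relative"
```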
collecting filenames to check...

=======================================
[1/7] build\doc-html\using\cmdline.html
=======================================

skipped; 158 links is above the --limit threshold

=========================================
[2/7] build\doc-html\using\configure.html
=========================================

skipped; 171 links is above the --limit threshold

=======================================
[3/7] build\doc-html\using\editors.html
=======================================

[1/19] link to ...
[2/19] link to https://www.sphinx-doc.org/...
[3/19] link to https://www.python.org/...
[4/19] link to ../bugs.html...
[5/19] link to https://github.com/python/cpython/blob/main/Doc/using/editors.rst...
[6/19] link to ../index.html...
[7/19] link to https://www.python.org/psf/donations/...
[8/19] link to mac.html...
[9/19] link to /license.html...
[10/19] link to #editors-and-ides...
   skipped /license.html (absolute links are unsupported yet)
[11/19] link to ../genindex.html...
[13/19] link to ../copyright.html...
[12/19] link to /bugs.html...
   skipped /bugs.html (absolute links are unsupported yet)
[15/19] link to index.html...
[14/19] link to ../py-modindex.html...
[16/19] link to ../reference/index.html...
[17/19] link to https://peps.python.org/pep-0008/...
[18/19] link to https://wiki.python.org/moin/IntegratedDevelopmentEnvironments...
[19/19] link to https://wiki.python.org/moin/PythonEditors...
   redirected https://www.sphinx-doc.org/

=====================================
[4/7] build\doc-html\using\index.html
=====================================

skipped; 103 links is above the --limit threshold

===================================
[5/7] build\doc-html\using\mac.html
===================================

[1/47] link to ...
[2/47] link to https://www.sphinx-doc.org/...
[3/47] link to #running-scripts-with-a-gui...
[4/47] link to https://www.python.org/...
   redirected https://www.sphinx-doc.org/
[6/47] link to ../bugs.html...
[5/47] link to #other-resources...
[8/47] link to #how-to-run-a-python-script...
[7/47] link to http://aquamacs.org/...
[11/47] link to http://macvim-dev.github.io/macvim/...
[9/47] link to ../index.html...
[13/47] link to https://www.python.org/psf/donations/...
[14/47] link to http://www.hashcollision.org/hkn/python/idle_intro/index.html...
[15/47] link to https://github.com/python/cpython/blob/main/Doc/using/mac.rst...
[10/47] link to ../contents.html...
[16/47] link to /license.html...
[17/47] link to https://macromates.com/...
[12/47] link to ../library/tkinter.html#module-tkinter...
   skipped /license.html (absolute links are unsupported yet)
[18/47] link to mailto:bobsavage%40mac.com...
[19/47] link to #distributing-python-applications-on-the-mac...
[20/47] link to ../genindex.html...
[21/47] link to https://pypi.org/project/pyobjc/...
[22/47] link to https://www.wxpython.org...
[23/47] link to /bugs.html...
   redirected http://macvim-dev.github.io/macvim/
   skipped /bugs.html (absolute links are unsupported yet)
[24/47] link to https://pip.pypa.io/...
[25/47] link to #getting-and-installing-macpython...
[26/47] link to #using-python-on-a-mac...
[28/47] link to https://riverbankcomputing.com/software/pyqt/intro...
[27/47] link to ../copyright.html...
[29/47] link to #ide...
[30/47] link to ../py-modindex.html...
[31/47] link to https://www.python.org...
[32/47] link to #installing-additional-python-packages...
[33/47] link to https://pypi.org/project/py2app/...
[34/47] link to editors.html...
[35/47] link to index.html...
[36/47] link to http://www.barebones.com/products/bbedit/index.html...
[37/47] link to https://www.activestate.com...
[38/47] link to https://wiki.python.org/moin/MacPython...
   redirected https://pip.pypa.io/
[39/47] link to cmdline.html#envvar-PYTHONPATH...
[40/47] link to windows.html...
[41/47] link to #configuration...
[42/47] link to https://www.python.org/community/sigs/current/pythonmac-sig/...
[43/47] link to #the-ide...
[44/47] link to #gui-programming-on-the-mac...
[45/47] link to #mac-package-manager...
[46/47] link to https://www.tcl.tk...
[47/47] link to #...

====================================
[6/7] build\doc-html\using\unix.html
====================================

[1/36] link to ...
[2/36] link to #getting-and-installing-the-latest-version-of-python...
[3/36] link to https://www.sphinx-doc.org/...
[4/36] link to https://www.python.org/...
[5/36] link to ../bugs.html...
   redirected https://www.sphinx-doc.org/
[8/36] link to https://www.python.org/psf/donations/...
[6/36] link to ../index.html...
[7/36] link to ../contents.html...
[10/36] link to /license.html...
[9/36] link to #custom-openssl...
[11/36] link to https://devguide.python.org/setup/#getting-the-source-code...
[14/36] link to https://www.opencsw.org/...
[16/36] link to ../genindex.html...
[12/36] link to ../library/subprocess.html#module-subprocess...
[17/36] link to https://github.com/python/cpython/blob/main/Doc/using/unix.rst...
   skipped /license.html (absolute links are unsupported yet)
[15/36] link to https://www.debian.org/doc/manuals/maint-guide/first.en.html...
[18/36] link to /bugs.html...
[19/36] link to #using-python-on-unix-platforms...
[13/36] link to https://github.com/python/cpython/tree/main/README.rst...
   skipped /bugs.html (absolute links are unsupported yet)
[20/36] link to ../copyright.html...
[21/36] link to ../py-modindex.html...
[22/36] link to #on-freebsd-and-openbsd...
[23/36] link to #on-opensolaris...
[24/36] link to index.html...
[25/36] link to #building-python...
[26/36] link to https://www.python.org/downloads/source/...
[27/36] link to #miscellaneous...
[28/36] link to https://docs-old.fedoraproject.org/en-US/Fedora_Draft_Documentation/0.1/html/RPM_Guide/ch-creating-rpms.html...
[29/36] link to http://www.slackbook.org/html/package-management-making-packages.html...
   failed https://devguide.python.org/setup/#getting-the-source-code
[30/36] link to cmdline.html...
[31/36] link to configure.html...
[32/36] link to #on-linux...
[33/36] link to configure.html#configure-options...
[34/36] link to https://en.opensuse.org/Portal:Packaging...
   redirected https://github.com/python/cpython/tree/main/README.rst
[35/36] link to #...
[36/36] link to #python-related-paths-and-files...
   redirected https://docs-old.fedoraproject.org/en-US/Fedora_Draft_Documentation/0.1/html/RPM_Guide/ch-creating-rpms.html

=======================================
[7/7] build\doc-html\using\windows.html
=======================================

skipped; 117 links is above the --limit threshold

========================
Final report on problems
========================

build\doc-html\using\editors.html
  redirected https://www.sphinx-doc.org/ link; increased loading time
build\doc-html\using\mac.html
  redirected https://www.sphinx-doc.org/ link; increased loading time
  redirected http://macvim-dev.github.io/macvim/ link; increased loading time
  redirected https://pip.pypa.io/ link; increased loading time
build\doc-html\using\unix.html
  redirected https://www.sphinx-doc.org/ link; increased loading time
  broken https://devguide.python.org/setup/#getting-the-source-code link; check if #getting-the-source-code exists
  redirected https://github.com/python/cpython/tree/main/README.rst link; increased loading time
  redirected https://docs-old.fedoraproject.org/en-US/Fedora_Draft_Documentation/0.1/html/RPM_Guide/ch-creating-rpms.html link; increased loading time
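The "check if #getting-the-source-code exists" hint in the report suggests that a link can fail because its fragment is missing from the target page. One way such a check could work, assuming the target page's HTML has already been downloaded, is sketched below; `AnchorCollector` and `fragment_exists` are hypothetical names, not functions from this PR.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag


class AnchorCollector(HTMLParser):
    """Collect every value usable as a fragment target on a page."""

    def __init__(self) -> None:
        super().__init__()
        self.anchors = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Any element's id, or the legacy <a name="..."> form.
            if value and (name == "id" or (tag == "a" and name == "name")):
                self.anchors.add(value)


def fragment_exists(url: str, html_text: str) -> bool:
    """Return True if the URL has no fragment or the fragment is on the page."""
    _, fragment = urldefrag(url)
    if not fragment:
        return True
    collector = AnchorCollector()
    collector.feed(html_text)
    return fragment in collector.anchors
```

This only sees anchors present in the static HTML; targets created by JavaScript after page load would be reported as missing.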

@CAM-Gerlach
Member

Could you explain how this is necessary when we can just use Sphinx's built-in -n for internal links (which only takes a few minutes) and its linkcheck builder for external links (which takes a couple tens of minutes)?

@arhadthedev
Member Author

when we can just use Sphinx's built-in -n for internal links

Generated parts of HTML pages and external links also need to be checked. Given the size of gh-93853, we need either a tool like the one in this PR or a Sphinx extension.

@CAM-Gerlach
Member

CAM-Gerlach commented Oct 12, 2022

I'm afraid I'm still confused. If a link is an external link in the source (even generated by standard or most third-party roles), it will be caught by linkcheck if broken or redirected. If a link is an internal link, -n will immediately catch it at build time, far faster than scraping every link. Aside from possible very narrow corner cases (which I have yet to see conclusively illustrated), I still don't understand what this bespoke manual script usefully does that a combination of -n and linkcheck doesn't, faster, more efficiently and without having to maintain a bespoke solution.
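For reference, the combination described above could be configured roughly as follows. This is a sketch of a Sphinx `conf.py` fragment with illustrative values, not CPython's actual configuration; the redirect mapping and the ignore pattern in particular are assumptions made up for the example.

```python
# conf.py fragment (illustrative values only)

# Warn about unresolved internal references; same effect as `sphinx-build -n`.
nitpicky = True

# Options for the linkcheck builder, run with `sphinx-build -b linkcheck`.
linkcheck_workers = 8      # number of parallel worker threads
linkcheck_timeout = 10     # seconds to wait for each response
linkcheck_anchors = True   # also verify that #fragment targets exist

# Redirects to tolerate instead of reporting (requires Sphinx 4.1 or later).
# Maps a source URL pattern to the pattern of an acceptable target.
linkcheck_allowed_redirects = {
    r"https://www\.sphinx-doc\.org/": r"https://www\.sphinx-doc\.org/en/master/",
}

# URL patterns to skip entirely (hypothetical entry).
linkcheck_ignore = [
    r"https://example\.invalid/.*",
]
```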

@AA-Turner is there something I'm missing here?

@arhadthedev
Member Author

I'm afraid I'm still confused. If a link is an external link in the source (even generated by standard or most third-party roles), it will be caught by linkcheck if broken or redirected.

My apologies, I totally missed this part of your first comment:

[...] and its linkcheck builder for external links (which takes a couple tens of minutes)?

@arhadthedev arhadthedev deleted the script-checkhtml branch October 12, 2022 18:06