In the built documentation, replace duplicate files by symlinks #25111

Closed
jhpalmieri opened this issue Apr 6, 2018 · 30 comments

@jhpalmieri
Member

Until Sage 8.2, the _static directories in the generated HTML documentation of the reference manual were symlinks to a single master _static directory. Now all the files are copied, leading to a huge explosion in size of the built documentation (from 1.8GB in Sage 8.2 to about 20GB in Sage 8.3).

CC: @timokau

Component: documentation

Issue created by migration from https://trac.sagemath.org/ticket/25111

@jhpalmieri jhpalmieri added this to the sage-8.2 milestone Apr 6, 2018
@kiwifb
Member

kiwifb commented Apr 6, 2018

comment:1

I actually do this in sage-on-gentoo, but I do it at the packaging level rather than the building level.

@jhpalmieri
Member Author

comment:2

Here is some Python code which works for me. Is this the sort of thing you use?

from filecmp import dircmp
import os, shutil

def directories_equal(left, right, ignore=None):
    """
    True if and only if the directories ``left`` and ``right`` have
    the same contents, file by file. Ignore any files listed in
    ``ignore``.
    """
    dcmp = dircmp(left, right, ignore=ignore)
    return (not dcmp.left_only and not dcmp.right_only 
            and not dcmp.common_funny and not dcmp.funny_files
            and not dcmp.diff_files and 
            all(directories_equal(os.path.join(left, a), os.path.join(right, a), ignore=ignore) 
                for a in dcmp.common_dirs))


def replace_duplicates_with_symlinks(source, target):
    """
    INPUTS:

    - ``source``, ``target``: directories.

    If the two directories are identical, replace ``target`` with a
    symlink pointing to ``source``. Otherwise, for each file in
    ``target``, if a copy of it exists in ``source``, replace the copy
    in ``target`` with a symlink pointing to ``source``.  
    """
    if directories_equal(source, target, ignore=['pdf.png']):
        if not os.path.islink(target):
            shutil.rmtree(target)
            os.symlink(source, target)
    else:
        # compare file by file, doing the replacement
        dcmp = dircmp(source, target)
        for d in dcmp.common_dirs:
            replace_duplicates_with_symlinks(os.path.join(source, d),
                                             os.path.join(target, d))
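        # Only link files whose contents match; files that merely share a
        # name (for example a per-language translations.js) are kept.
        for f in dcmp.same_files:
            os.remove(os.path.join(target, f))
            os.symlink(os.path.join(source, f),
                       os.path.join(target, f))
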

def replace_with_master_directory(top_dir):
    """
    top_dir: top of html doc directory (so typically 
    top_dir = local/share/doc/sage/html)
    """
    master = os.path.join(top_dir, 'en', '_static')
    for lang in os.listdir(top_dir):
        lang_dir = os.path.join(top_dir, lang)
        if not os.path.isdir(lang_dir):
            # skip stray files at the top level
            continue
        for d in os.listdir(lang_dir):
            target = os.path.join(lang_dir, d, '_static')
            if (os.path.isdir(target)
                and not os.path.islink(target)
                and not os.path.samefile(master, target)):
                replace_duplicates_with_symlinks(master, target)

@jhpalmieri
Member Author

comment:3

This saves me almost 400 MB, by the way. ("This" = replace_with_master_directory(os.path.join(SAGE_LOCAL, 'share', 'doc', 'sage', 'html')).)
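
The code in comment 2 leans on filecmp.dircmp, so here is a minimal sketch of what its attributes report for two directories that share file names but not always contents. The helper name and the tiny directory layout are made up for illustration; only dircmp itself comes from the code above.

```python
import os
import tempfile
from filecmp import dircmp


def classify_common_files(left, right):
    """Return (same, different): names present in both directories,
    split by whether dircmp considers the two files identical."""
    dcmp = dircmp(left, right)
    return sorted(dcmp.same_files), sorted(dcmp.diff_files)


with tempfile.TemporaryDirectory() as top:
    en = os.path.join(top, 'en')
    fr = os.path.join(top, 'fr')
    os.mkdir(en)
    os.mkdir(fr)
    # basic.css is identical in both; translations.js differs per language.
    for directory, text in ((en, 'english'), (fr, 'francais')):
        with open(os.path.join(directory, 'basic.css'), 'w') as f:
            f.write('body {}\n')
        with open(os.path.join(directory, 'translations.js'), 'w') as f:
            f.write(text + '\n')
    same, different = classify_common_files(en, fr)
    print(same, different)
```

Note that dcmp.common_files would list both names, since it only looks at which names exist on both sides.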

@kiwifb
Member

kiwifb commented Apr 6, 2018

comment:4

No, I don't use Python code; I do it within the packaging script, in bash:

			# Prune _static folders
			cp -r build_doc/html/en/_static build_doc/html/ || die "failed to copy _static folder"
			for sdir in $(find build_doc/html -name _static) ; do
				if [ "$sdir" != "build_doc/html/_static" ] ; then
					rm -rf "$sdir" || die "failed to remove $sdir"
					ln -rst "${sdir%_static}" build_doc/html/_static
				fi
			done

Because I have the MathJax fonts by default and they are copied into all _static directories, the saving is measured in GB.

The last touch is replacing most of the MathJax folders in the master _static folder with symlinks:

			# Linking to local copy of mathjax folders rather than copying them
			local mathjax_folders="config extensions fonts jax localization unpacked"
			for sdir in ${mathjax_folders} ; do
				rm -rf build_doc/html/_static/${sdir} \
					|| die "failed to remove mathjax folder $sdir"
				ln -st build_doc/html/_static/ ../../../../mathjax/$sdir
			done
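
For comparison with the Python in comment 2, the pruning step above could be sketched in Python roughly as follows. This is only an illustration: the function name is hypothetical, it assumes a master _static directory already exists at the top of the HTML tree, and it throws away any per-document differences between _static directories.

```python
import os
import shutil
import tempfile


def link_static_dirs_to_master(html_root, name='_static'):
    """Replace every nested ``name`` directory under html_root by a relative
    symlink to html_root/name. Return the list of replaced paths."""
    master = os.path.join(html_root, name)
    if not os.path.isdir(master):
        raise FileNotFoundError(master)
    replaced = []
    for dirpath, dirnames, _ in os.walk(html_root):
        if name not in dirnames:
            continue
        dirnames.remove(name)  # never descend into a _static directory
        candidate = os.path.join(dirpath, name)
        if os.path.islink(candidate) or os.path.samefile(candidate, master):
            continue  # already a link, or the master itself
        shutil.rmtree(candidate)
        # A relative link keeps working if the whole tree is moved.
        os.symlink(os.path.relpath(master, dirpath), candidate)
        replaced.append(candidate)
    return replaced


with tempfile.TemporaryDirectory() as root:
    for sub in ('', 'en/tutorial', 'fr/tutorial'):
        static_dir = os.path.join(root, sub, '_static')
        os.makedirs(static_dir)
        with open(os.path.join(static_dir, 'basic.css'), 'w') as f:
            f.write('body {}\n')
    replaced = link_static_dirs_to_master(root)
    n_replaced = len(replaced)
    links_ok = all(os.path.islink(p)
                   and os.path.isfile(os.path.join(p, 'basic.css'))
                   for p in replaced)
```

The relative link plays the same role as the -r option of ln in the shell version.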

@slel
Member

slel commented Apr 7, 2018

comment:5

See possibly related discussion at #25089.

@embray
Contributor

embray commented Apr 9, 2018

comment:6

I'm confused by this ticket, because it already does that, per #25089...

@embray
Contributor

embray commented Apr 9, 2018

comment:7

I see the difference--it does already do this within the en/reference docs, where each "reference" section is treated as a sub-document of the reference "master document", and in that case the _static directories get symlinked up to the master document. My assumption was that all of the Sage docs (including "reference") were in turn treated as sub-documents of a higher-level master document but apparently that's not the case.

IMO treating the entire tree of Sage docs as such a hierarchy with shared static resources would be the best approach.

@jhpalmieri
Member Author

comment:8

Some parts of the documentation tree have slightly different _static directories, which is where the approach in comment 2 comes from: compare each _static directory to the top-level one, replacing files (and directories) with symlinks when possible.

@jhpalmieri
Member Author

comment:9

Is it worth pursuing this? It could be part of the docbuild process, or it could be done only when you use make to build all of the Sage docs. I'm leaning toward the latter approach. In either case, all of the _static directories will be produced and then cleaned up later, so disk usage will increase during the build process before dropping at the end, although disk usage grows throughout the build process anyway. (I don't know how to deal with the symlinks on the fly. I also don't know if there is a way to tell Sphinx to look mainly in one place for shared static resources. Since documentation in different languages has different _static/translations.js files, we can't rely solely on a single _static folder.)

I also don't know what to do about Windows/cygwin and symbolic links.

@jhpalmieri
Member Author

comment:10

Sphinx has a configuration option html_static_path which might do what we want. I'll look into it.

Edit: or maybe not: the documentation says that the files "are copied to the output’s _static directory after the theme’s static files". We don't want files copied, we want a single _static directory.

@embray
Contributor

embray commented May 31, 2018

comment:11

Couldn't a different _static/translations.js be used per language? That is, somehow namespace that file by the language in the first place. That or at least have an alternate location for it. html_static_path can be a list.

@jhpalmieri
Member Author

comment:12

I don't think that html_static_path will help: it provides a list from which the output _static directories are produced – it does not provide a list of directories to use instead of the output _static directories. In fact, we already set html_static_path in src/doc/common/conf.py.
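
To make the point concrete, here is a sketch of a conf.py fragment in which html_static_path is a list. The directory names are hypothetical and are not the ones in src/doc/common/conf.py. As the documentation quoted in comment 10 says, Sphinx copies the contents of every listed directory into each output _static directory, so this controls the inputs and not where the output lives.

```python
# conf.py fragment (sketch only; directory names are hypothetical)
import os

language = 'fr'  # hard-coded here for illustration

html_static_path = [
    'static',                          # resources shared by all documents
    os.path.join('static', language),  # language-specific resources
]
```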

@jdemeyer

comment:14

Too bad that I missed this ticket earlier. This should have been an 8.3 blocker.

@embray
Contributor

embray commented Aug 24, 2018

comment:15

As a workaround to #25089, for the Windows build I run a script that deletes all the duplicate _static directories and instead modifies links in the HTML to reference a single _static directory. My script just runs over the docs after they've been built, but it would probably be better to figure out how to do this directly in the Sphinx build.
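
The script itself is not shown in this thread. A minimal sketch of the link-rewriting half of such an approach, assuming pages refer to static files through relative href/src attributes that start with _static/ or ../_static/, might look like this. The function name and regular expression are illustrative guesses, and deleting the duplicate directories would be a separate step.

```python
import os
import re
import tempfile

# Matches href="..." or src="..." values pointing into a _static directory
# through a relative path such as "_static/x.css" or "../_static/x.js".
_STATIC_REF = re.compile(r'(?P<attr>\b(?:href|src)=")(?:\.\./)*_static/')


def point_html_at_master_static(html_root):
    """Rewrite _static references in every .html file under html_root so
    they resolve to html_root/_static. Return the number of files changed."""
    master = os.path.join(html_root, '_static')
    changed = 0
    for dirpath, dirnames, filenames in os.walk(html_root):
        dirnames[:] = [d for d in dirnames if d != '_static']
        # URL prefix leading from this page's directory to the master copy.
        prefix = os.path.relpath(master, dirpath).replace(os.sep, '/') + '/'
        for fname in filenames:
            if not fname.endswith('.html'):
                continue
            path = os.path.join(dirpath, fname)
            with open(path, encoding='utf-8') as f:
                text = f.read()
            new_text = _STATIC_REF.sub(lambda m: m.group('attr') + prefix, text)
            if new_text != text:
                with open(path, 'w', encoding='utf-8') as f:
                    f.write(new_text)
                changed += 1
    return changed


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, '_static'))
    sub = os.path.join(root, 'en', 'tutorial', 'sub')
    os.makedirs(sub)
    index = os.path.join(root, 'en', 'tutorial', 'index.html')
    page = os.path.join(sub, 'page.html')
    with open(index, 'w', encoding='utf-8') as f:
        f.write('<link rel="stylesheet" href="_static/basic.css">\n')
    with open(page, 'w', encoding='utf-8') as f:
        f.write('<script src="../_static/doctools.js"></script>\n')
    n_changed = point_html_at_master_static(root)
    with open(index, encoding='utf-8') as f:
        index_html = f.read()
    with open(page, encoding='utf-8') as f:
        page_html = f.read()
```

Unlike symlinks, this keeps working on file systems where symbolic links are unavailable or awkward.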

@kiwifb
Member

kiwifb commented Aug 24, 2018

comment:16

Replying to @embray:

As a workaround to #25089, for the Windows build I run a script that deletes all the duplicate _static directories and instead modifies links in the HTML to reference a single _static directory. My script just runs over the docs after they've been built, but it would probably be better to figure out how to do this directly in the Sphinx build.

I do exactly that in sage-on-gentoo as well. Would be great to know how to tell sphinx to create symlinks instead of copying.

@embray
Contributor

embray commented Aug 24, 2018

comment:18

Replying to @jdemeyer:

Too bad that I missed this ticket earlier. This should have been an 8.3 blocker.

By the way, that 20GB figure is not really how much disk space is being taken up. There is a bug (I have no idea where this comes from) that puts a "mathjax" directory in each _static, which is a hard link to the mathjax sources (i.e. under $SAGE_LOCAL/share/mathjax). This inexplicably contains a symlink to itself as $SAGE_LOCAL/share/mathjax/mathjax. In the _static directories, however, this symlink is being dereferenced and converted to a hard link as well, so you end up with an infinite loop of hardlinks, which tools like du don't handle well when counting (the size it's reporting is probably just being limited by some max depth parameter).

If I delete all those nonsense mathjax hardlinks I then get:

$ du -sh local/share/doc/sage/html/
779M    local/share/doc/sage/html/

and

$ du -sh local/share/doc/sage/
2.0G    local/share/doc/sage/

so I think it's not all as bad as it seems.
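
To double-check figures like these independently of du, one option is to add up file sizes while counting each inode only once and ignoring symlinks. The sketch below is an assumption about how one might do that, not something used in this ticket; it sums apparent sizes (st_size), so the result will not match du's block-based numbers exactly.

```python
import os
import stat
import tempfile


def disk_usage(top):
    """Sum apparent file sizes under top, counting each (device, inode)
    pair once and skipping symlinks."""
    seen = set()
    total = 0
    for dirpath, _, filenames in os.walk(top, followlinks=False):
        for fname in filenames:
            st = os.lstat(os.path.join(dirpath, fname))
            if stat.S_ISLNK(st.st_mode):
                continue  # a symlink occupies no data of its own
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue  # another hard link to a file already counted
            seen.add(key)
            total += st.st_size
    return total


with tempfile.TemporaryDirectory() as top:
    original = os.path.join(top, 'MathJax.js')
    with open(original, 'wb') as f:
        f.write(b'x' * 1000)
    nested = os.path.join(top, 'mathjax', 'mathjax')
    os.makedirs(nested)
    os.link(original, os.path.join(nested, 'MathJax.js'))  # hard link
    os.symlink(original, os.path.join(top, 'alias.js'))    # symlink
    # Naive sum: every directory entry is counted at full size.
    naive = sum(os.path.getsize(os.path.join(d, name))
                for d, _, names in os.walk(top) for name in names)
    deduplicated = disk_usage(top)
```

In this toy tree the naive sum counts the same 1000 bytes three times, while the deduplicated sum counts them once.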

@embray
Contributor

embray commented Aug 24, 2018

comment:20

I see now your sage-devel post where you reported the same.

@jdemeyer

comment:21

It seems to be more subtle: sometimes the symlinks are correctly generated and sometimes not.

@jhpalmieri
Member Author

comment:22

Replying to @embray:

If I delete all those nonsense mathjax hardlinks I then get:

$ du -sh local/share/doc/sage/html/
779M    local/share/doc/sage/html/

By the way, after using the script in comment:2, I get

$ du -s -h local/share/doc/sage/html/
315M	local/share/doc/sage/html/

There are lots of symlinks, though.

@jdemeyer

comment:23

I'm getting really confused here. Initially I thought that the problem was the _static directories in the reference manual were no longer symlinked, but that's not the problem.

The problem seems to be the few copies of _static for the various documents (one for each document). This was never a problem before, as long as _static remained small. But because of the mathjax issue, every _static directory contains a million copies of mathjax.

@jdemeyer

comment:24

Replying to @embray:

There is a bug (I have no idea where this comes from) that puts a "mathjax" directory in each _static, which is a hard link to the mathjax sources (i.e. under $SAGE_LOCAL/share/mathjax).

I'm not sure why you think that it's a hard link. On my system (and probably most Unix-like systems), creating a hardlink to a directory is not even allowed. See https://askubuntu.com/questions/210741/why-are-hard-links-not-allowed-for-directories/525129

@jdemeyer

comment:25

I created a new ticket #26152 specifically for the mathjax symlink issue.

@embray
Contributor

embray commented Aug 28, 2018

comment:26

Replying to @jdemeyer:

Replying to @embray:

There is a bug (I have no idea where this comes from) that puts a "mathjax" directory in each _static, which is a hard link to the mathjax sources (i.e. under $SAGE_LOCAL/share/mathjax).

I'm not sure why you think that it's a hard link. On my system (and probably most Unix-like systems), creating a hardlink to a directory is not even allowed. See https://askubuntu.com/questions/210741/why-are-hard-links-not-allowed-for-directories/525129

To clarify: The directory is not a hard link but the files under it are, and there were deeply nested directories (I didn't confirm how deep) each containing what I presume were hard links to the files (since deleting them did not actually release much usage of my disk).

@jhpalmieri
Member Author

comment:27

Replying to @jdemeyer:

I'm getting really confused here. Initially I thought that the problem was the _static directories in the reference manual were no longer symlinked, but that's not the problem.

The problem seems to be the few copies of _static for the various documents (one for each document). This was never a problem before, as long as _static remained small. But because of the mathjax issue, every _static directory contains a million copies of mathjax.

And as noted above, the _static directories for the different documents can actually differ, so we can't (I think) have a single one. But we can symlink the mathjax parts in each one. They take up the bulk of the disk space.

@jhpalmieri
Member Author

comment:28

In a recently built copy of the Sage documentation, there seem to be 27 copies of MathJax installed in various _static directories, which translates into about 430 MB of disk space on my computer.

@kwankyu
Collaborator

kwankyu commented Aug 24, 2023

This problem does not exist now. Roughly, the total size of all _static directories is 11 (number of languages) * 10 (number of documents) * 18M (size of _static directory for the reference manual in English), which is less than 2G.

Of 18M, mathjax takes 17M. Hence after #36098, the total size would be reduced to something much less than 100M.
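
The estimate can be reproduced as plain arithmetic. The inputs are the rough figures quoted in this comment, taken as upper bounds; the variable names are made up for illustration.

```python
languages = 11    # number of languages
documents = 10    # number of documents per language (upper bound)
static_mb = 18    # size of one _static directory, in MB
mathjax_mb = 17   # portion of that taken by MathJax, in MB

n_dirs = languages * documents
total_mb = n_dirs * static_mb                           # all _static directories
without_mathjax_mb = n_dirs * (static_mb - mathjax_mb)  # non-MathJax part only

print(total_mb, without_mathjax_mb)
```

With these inputs the total comes to 1980 MB and the non-MathJax part to 110 MB; the real numbers depend on how many _static directories actually exist.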

@kwankyu kwankyu closed this as completed Aug 24, 2023