Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Doc fixes and separate workflow for building docs via CI #3462

Merged
merged 8 commits into from
May 25, 2023

Conversation

pabs3
Copy link
Contributor

@pabs3 pabs3 commented Mar 18, 2023

This set of changes fixes the current docs build failures in GitHub workflows
and makes it easier for contributors to submit docs changes, by doing the docs
build for them and exporting the changes to the prebuilt docs to a patch,
which they can then incorporate into their pull request.

  • Use new smart_open compression parameter instead of ignore_ext when possible
  • Download the NLTK WordNet data before using the WordNet lemmatizer
  • Add missing documentation build dependency scikit-learn
  • Timeout GitHub workflow documentation build after two hours instead of 10 min
  • Upload the changes to the generated docs to GitHub artifacts

@pabs3 pabs3 force-pushed the fix-docs-build branch 11 times, most recently from 827e41b to 0c480a3 Compare March 19, 2023 05:54
@pabs3 pabs3 marked this pull request as ready for review March 19, 2023 06:01
@pabs3
Copy link
Contributor Author

pabs3 commented Mar 19, 2023

I believe this pull request is now ready for merging.

Without these commits I will not be able to complete #3456 (updating links).

@piskvorky piskvorky requested a review from mpenkov March 19, 2023 08:09
Copy link
Owner

@piskvorky piskvorky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you as always Paul!

@@ -12,7 +12,7 @@ jobs:

docs:
name: build documentation
timeout-minutes: 10
timeout-minutes: 120
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a significant increase. Why is this necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the runs for the update-links branch took over an hour, so I thought I would bump up the limit a bit.

https://github.com/RaRe-Technologies/gensim/actions/runs/4454218032/jobs/7823263904

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see now that the build from today only took a minute, so I'll drop the timeout bump commit.

https://github.com/RaRe-Technologies/gensim/actions/runs/4459623150/jobs/7832189964

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the problem here is that the build can take hours in theory. Some of the documentation examples take a really long time to build: that's why we cache their build output. So if somebody updates those examples, a rebuild will require a long time, and the build documentation action will time out.

We're kind of stuck between a rock and a hard place here.

  1. We don't want to run these build steps for hours all the time (e.g. for every commit)
  2. We do want to make it easier for people to contribute to documentation, which occasionally requires a long time to run

I think the best way to compromise between the above would be to move this build-and-stash-artifacts step to a separate action that people would trigger explicitly via a comment. We could set a long timeout (e.g. 2 hours, longer if necessary). We could restrict the range of people who can issue that trigger to e.g. project maintainers (if possible).

What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like a reasonable plan.

Do we know how long the build takes when starting from scratch
after you delete all of the cache from the git repository?

I don't know GitHub workflows well enough to implement the plan myself.

So perhaps you can merge the existing docs fixes commits,
then modify the workflows to implement the plan?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mpenkov mergeable here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite. If we merge this, then it will change our existing build procedure, which isn't what we want.

I need to extract this contribution out into a separate github action (as we discussed above) before merging.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So perhaps you can merge the existing docs fixes commits,
then modify the workflows to implement the plan?

I'd rather modify the workflow first, then merge. That way, we won't break the existing workflow.

@pabs3

This comment was marked as duplicate.

pabs3 added 7 commits April 30, 2023 07:32
…ossible

Keep compatibility with old versions of smart_open just in case.

1.8.1 is required by the deps, 5.1.0 introduced the compression parameter,
6.0.0 dropped the ignore_ext parameter.
Fixes the docs build for the run_lda.py and run_ensemblelda.py tutorials.
The run_compare_lda.py howto and run_word2vec.py tutorial import it.

It was removed from docs_testenv when the scikit-learn wrapper was removed.

Fixes: commit a21d9cc
Add the ensemblelda and scm tutorials added in 2021.

Remove the summarization tutorial as it was removed in 2020.

Use the order from the existing prebuilt docs files.

Without a defined order they will be placed non-deterministically,
which means commits not changing docs will change prebuilt docs.

Fixes: commit 76579b3
Fixes: commit ddeeb12
Fixes: commit 2dcaaf8
Print the .md5 file when it is the stale file.
Print the source path for each stale file.
Print paths relative to the source tree.
Print only one stale file pair per line.
Building the docs often takes too long locally, so this allows
pull request submitters to build on GitHub, download the changes
and incorporate the changes to the generated docs in a commit,
then update their pull request with the generated docs commit.

Also check that the changes to the prebuilt docs are committed,
except for docs that change for every single rebuild.
@mpenkov mpenkov added this to the 4.3.2 milestone Apr 30, 2023
@mpenkov
Copy link
Collaborator

mpenkov commented May 21, 2023

@pabs3 Can you please enable commits from maintainers to this branch? I'd like to push some changes before merging.

@pabs3
Copy link
Contributor Author

pabs3 commented May 22, 2023

Enabled maintainer commits.

@mpenkov mpenkov changed the title Fix building the documentation Implement separate workflow for building docs via CI May 25, 2023
@mpenkov mpenkov added the documentation Current issue related to documentation label May 25, 2023
@mpenkov mpenkov changed the title Implement separate workflow for building docs via CI Doc fixes and separate workflow for building docs via CI May 25, 2023
@mpenkov mpenkov merged commit 3ae286e into piskvorky:develop May 25, 2023
@pabs3
Copy link
Contributor Author

pabs3 commented May 26, 2023 via email

@pabs3 pabs3 deleted the fix-docs-build branch May 26, 2023 02:06
@mpenkov
Copy link
Collaborator

mpenkov commented May 26, 2023

My personal preference is to squash and merge. There's a wide range of contributors, and everyone does things at their own granularity: some people make lots of small commits, others a few large ones. The squash brings everything to the same granularity: one PR, one commit. So, all PRs from the past several years have been squashed into a single commit before merging.

You make a good point about reviews and bisections, but we rarely do those after code has been merged in.

I know this preference is not universal (e.g. @piskvorky has mentioned he has a slight preference for not squashing) but there hasn't been an acute reason to change so far.

@piskvorky
Copy link
Owner

piskvorky commented May 26, 2023

Exactly right – I prefer full branch merging (with no-ff too). But in case of gensim, w/e, no practical difference in reality. Whatever makes @mpenkov 's life easier :)

@pabs3
Copy link
Contributor Author

pabs3 commented May 27, 2023 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Current issue related to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants