Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC: Preview rendered documentation of PRs #50832

Closed
datapythonista opened this issue Jan 18, 2023 · 15 comments · Fixed by #57112
Closed

DOC: Preview rendered documentation of PRs #50832

datapythonista opened this issue Jan 18, 2023 · 15 comments · Fixed by #57112
Assignees
Labels
CI Continuous Integration Docs

Comments

@datapythonista
Copy link
Member

The idea is that when someone opens a PR 12345, after the CI finishes building the docs, we have a url https://pandas.pydata.org/docs/dev/pr/12345/ (we can probably think of a better url) to preview how are the docs with the pr changes.

We discussed this long ago, and couldn't find an easy way to implement it with the CI. The main challenge is that the CI of a PR doesn't have access to secrets, so we can't publish to a server in the same way we do for commits to main. But we've got options, like to add a webhook to the CI, and implement a small service in a server, that when it receives a request from a webhook, it finds the artifact for that job, downloads and decompresses it to a directory, which is served as a static http server to the desired url.

We will need a job to clean up docs after X days, since each copy of our web/docs is around 350M, but that should be easy.

In the future we can even try to detect where the changes happened, and add a comment to the PR with the links to the pages of interest. Not sure how difficult would that be for docstrings, but for regular rst files (e.g. the user guide) it should be quite easy.

Please let me know if there are ideas, or any objection. Otherwise I'll give this a try in few days.

@datapythonista datapythonista added Docs CI Continuous Integration labels Jan 18, 2023
@MarcoGorelli
Copy link
Member

no objections, this would be quite convenient!

@lithomas1
Copy link
Member

+1.

One thing I'm worried about though is PR docs showing up in Google search results, so we'll probably need to add in noindex or smth to prevent that.

@lithomas1
Copy link
Member

We discussed this long ago, and couldn't find an easy way to implement it with the CI. The main challenge is that the CI of a PR doesn't have access to secrets, so we can't publish to a server in the same way we do for commits to main. But we've got options, like to add a webhook to the CI, and implement a small service in a server, that when it receives a request from a webhook, it finds the artifact for that job, downloads and decompresses it to a directory, which is served as a static http server to the desired url.

We will need a job to clean up docs after X days, since each copy of our web/docs is around 350M, but that should be easy.

One way around this is to trigger the bot on a comment. GHA run these from main IIRC.

@datapythonista
Copy link
Member Author

One way around this is to trigger the bot on a comment. GHA run these from main IIRC.

That's a good point, thanks @lithomas1. I guess it can be done, but it's not immediately clear to me how.

One option is to build the docs when someone adds a comment /preview or something to a PR. Then I think you're right and we can probably use the secret to do an rsync of the web/docs to the preview server, as we do in production for commits to main.

What it's also not ideal, but maybe makes more sense is to use the already existing artifact when someone comments /perview, and simply uncompress it to the preview server (which should take seconds, instead of the one hours to run the docs). The problem with this, is that to find the artifact, I need the run id of the workflow that generated it I think. And I don't see how exactly to convert a PR number to the run id of the docs build of the last execution. I guess it can be done, but doesn't seem to be trivial.

I guess it makes sense to first discuss what are the options for the UX of this system. And depending on whether people prefer to autogenerate all them, do it on demand immediately once the CI is green, or do it on demand with a one hour delay, then we can probably focus the efforts for that solution. What do you think?

@jbrockmendel
Copy link
Member

ive seen people use the "/preview" comment. have people been happy with how that works?

@datapythonista
Copy link
Member Author

ive seen people use the "/preview" comment. have people been happy with how that works?

I've seen people using it, so I assume it's being useful at least to some (it is to me, but I implemented it, so very biased opinion). Do you find it useful @jbrockmendel, any feedback from your side? We're a bit constrained by what GitHub let us do, and the time it takes to complete the docs build, but happy to try to incorporate any feedback.

@jbrockmendel
Copy link
Member

I haven't used it myself

@mroeschke
Copy link
Member

I have been happy with /preview, I think that could sufficiently close this issue

@mroeschke mroeschke added the Closing Candidate May be closeable, needs more eyeballs label May 12, 2023
@datapythonista datapythonista removed the Closing Candidate May be closeable, needs more eyeballs label Oct 8, 2023
@datapythonista
Copy link
Member Author

I'll leave this issue open a bit longer, I'll be working on few more things:

  • Implement the approach described here by Thomas, which is much better than the one we've got now of publishing the preview for all PRs
  • Implement proper authentication / making it a GitHub app, as it's currently using my personal token expiring every 90 days
  • Make the whole thing a proper application, not the PoC it is now (it's running on a screen session, not much error control, the code is not published, as it contains my token in the previous point...)

@rhshadrach
Copy link
Member

it's currently using my personal token expiring every 90 days

Guessing this might be why it's currently broken? We haven't published a preview of a PR since the end of October.

@datapythonista
Copy link
Member Author

I just checked, and doesn't seem to be the case this time. Seems like GitHub allowed me to create a token with longer expiration last time, it'll expire on January 2024.

From the server logs it seems like there is a problem with the HTTP headers. I don't know where this is coming from yet. I don't think we changed anything on the docs preview site, so I guess what may have changed is either the request we do from GitHub actions (I don't think that changed), or maybe we did an upgrade of the docs server and the nginx we use to redirect the traffic changed, and there is something different on how the headers are sent (I remember we had to set up something in nginx for the headers when we first set up this service, so I guess it's related).

I'll have a look and try to identify and fix the problem, sorry for the inconvenience, I'll keep this issue updated with any finding.

@datapythonista
Copy link
Member Author

Seems like the headers problem are unrelated (probably requests made by bots). The problem seems to be that the cleaner process is not working properly, and the storage became full. I'm still on it, but I think the previews should be working again shortly. Note that the previews are still generated during the CI docs job, not when the /preview comment is added, so for old PRs it'll be needed to rerun that job.

I still have pending improving the preview service and make it production ready, besides implementing the approach that Thomas suggested which is better than the current implementation. I hope I have time in January to get it done.

@rhshadrach
Copy link
Member

Thanks @datapythonista - I was planning to use the /preview for work on #56511, but I've since realized that it's better (faster, less server traffic) for me to host my own docs for that PR on a local server. So it's not blocking anything on my end.

@datapythonista
Copy link
Member Author

It is fixed now, the previews are working again.

@datapythonista datapythonista self-assigned this Jan 22, 2024
@datapythonista
Copy link
Member Author

I'm finishing a new implementation of the preview server based on the approach described by @lithomas1 in #54589 (comment), and implemented in a more robust way.

The main change is that the preview is generated when the command /preview is called, not at the end of the workflow. I may release the new version tomorrow (some work will still be pending, but nothing that should impact you directly). In case you detect any problem, please let me know.

JFYI, the new service is project agnostic. If it's useful to other projects, it could be easily used without changes (other than enabling the repo in the server, as I don't think it's a good idea to just leave it open).

CC: @pandas-dev/pandas-core

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI Continuous Integration Docs
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants