Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hosting: manual integrations via build contract #10127

Merged
merged 43 commits into from
Mar 20, 2023

Conversation

humitos
Copy link
Member

@humitos humitos commented Mar 8, 2023

Integration and design for the future of the Read the Docs features integrations.

Requires generating the readthedocs-client.js from (https://github.com/humitos/readthedocs-client) and copying to this repository and then collecting the statics:

cp ../readthedocs-client/dist/readthedocs-client.js readthedocs/core/static/core/js/readthedocs-client.js
inv docker.manage 'collectstatic --noinput'

Goals

  • Integrate all the Read the Docs readers' features (e.g. flyout, doc-diff, warning banners, etc) with any doctool.
  • Remove the requirement to install extra Python packages to integrate with us (e.g. readthedocs-sphinx-ext, readthedocs-sphinx-search)
  • Generate the exact same output when running the doctool locally than on Read the Docs (except for conditionals using READTHEDOCS environment variables)
  • Reduce complexity from the current hosting integrations (some happen inside the builder, others inside the Sphinx extension, others in the Sphinx theme, and others in the frontend)
  • Move all the logic for these integrations to the frontend
  • Use modern JavaScript patterns to write all these integrations (see https://github.com/readthedocs/meta/discussions/114)

Workflow

  1. User triggers a build with any doctool (common process, build.jobs or custom commands via build.commands) 1
  2. (Optional) The doctool creates a $READTHEDOCS_OUTPUT/html/readthedocs-build.yaml file at build time with a pretty known structure (the "build contract" that we need to define, but we have an idea of what fields are required already) 2
  3. Before finishing the build process, the builder reads readthedocs-build.yaml and stores its data into the Version object. This will be used to communicate data from the builder to the frontend.
  4. El Proxito injects, at serving time 3, the Read the Docs javascript client (readthedocs-client.js)
  5. The readthedocs-client.js file loads all the extra required JS and CSS files dynamically.
  6. These javascript files use an API call (/_/readthedocs-config.json) to get all the config required to perform the hosting integrations. The data generated at build time via readthedocs-build.yaml is also returned by this endpoint. Note that some of this "hosting integrations" are already implemented in our Sphinx extensions and they have to be re-written into Javascript. I migrated some basic one for now and put them into the client repository as an example.
  7. These integrations are: version warning, pull request warning, search, doc-diff, hoverxref, ethicalads, search/traffic analytics, etc

Integration points from a user's perspective

Nothing.

Optionally, the user could configure the build process to generate the readthedocs-build.yaml at build time with a specific structure (to be defined) to communicate some extra data that's only known at build time.

v2 and future projections

  • Make each hosting feature to be enabled/disabled at build time
  • Make each hosting feature to be enabled/disabled by the reader
  • Allow override the HTML templates injected (e.g. those used to generate the warning banners)
  • Allow override the logic used to inject those HTML blobs (e.g. use a different function to decide if the warning banner has to be injected or not --useful when not using semver)
  • Validate readthedocs-build.yaml and enforce a specific structure

Related

External related issues

Footnotes

  1. Read the Docs pass data to the doctool via environment variables. Exactly as we are doing now. However, we may want to add some extra variables (see Build: expose VCS-related environment variables #9423)

  2. This YAML file is the way for the doctool to pass back data to Read the Docs and configure its integrations

  3. It's done at NGINX for now, but it can be feature flagged and done at CloudFlare

@humitos humitos self-assigned this Mar 8, 2023
@ericholscher
Copy link
Member

This is a really interesting example. I like how it basically just duct tapes the pieces together, and doesn't require any user interaction. There's a bunch of design discussions downstream of this, but just having an example of how this might work is great. It's a lot less hacky if we ship integrations.js to the users and themes, and they optionally integrate it.

Let's chat more about this tomorrow, since I don't know what level of feedback is particularly useful at this stage, but I think we can get pretty close to something we could share with users as a demo at least.

humitos added 8 commits March 9, 2023 11:40
Instead of prepending all the commands with the `PATH=` variable, we pass the
environment variable directly to the Docker container.

This allow us to run multi-line shell script without failing with weird syntax
errors. Besides, the implementation is cleaner since all the environment
variables are passed to the commands in the same way.

I added some _default paths_ that I found by checking the local Docker
container. I'm also passing the users' path, depending if we are working locally
as root or in production.

This is not 100% complete and there may be some other issues that I'm not seeing
yet, but I think it's a first step to behave in a way our users are expecting.

Closes #10103
Locally it tries to reverted back 🤷
…s/readthedocs.org into humitos/hosting-integrations
X-RTD-Hosting-Integration: true/false

This can be used from CloudFlare to decide whether or not inject a `<script>`
into the resulting HTML or not.
@humitos
Copy link
Member Author

humitos commented Mar 10, 2023

It's a lot less hacky if we ship integrations.js to the users and themes, and they optionally integrate it.

We touched on this in our call already, but I want to make my position clear about this here. IMO, we have to force the injection of our own JavaScript file. Otherwise, many projects won't include our JavaScript and most of the good features we offer won't be present in their documentation. Even people wanting to integrate with us, will have a bigger barrier to do it. In particular, commercial customers. Besides, on the community side, most projects won't have ads --which is not a minor.

I think if we don't inject the JavaScript automatically, we won't differentiate ourselves from GitHub Pages or any other plain-text hosting, I'd say.

@humitos
Copy link
Member Author

humitos commented Mar 10, 2023

I updated the description of this PR with a lot more context and a more polished idea of how all these pieces fit together.

@humitos humitos changed the base branch from main to humitos/build-cmd-environment March 10, 2023 22:03
@agjohnson
Copy link
Contributor

agjohnson commented Mar 10, 2023

Overall, this implementation looks great, especially considering the boundaries on goals. I think it's helpful to get some experimentation here on the backend implementation, and have some direction at a high level on how we'll inject JS.

All of my questions, and all the places I want to put a bit more thought on patterns and API design are in your v2 list. The goals seem great, and for that, I'd 100% skip the V2 parts of the JS implementation -- HTML templates, API overrides, etc. This is where I'll have the most technical input, there are features from Webpack we'll use here, and extra libraries we'll rely on.


It's still fuzzy to me why a readthedocs-build.yaml file is required for this. I'm not against this design, it's just not clear to me. I probably haven't moved much here since #9088

Do you have a better idea now of what configuration options we need to be in an contract file?

The features list in the contract file stand out still -- and I'm just talking in context of a contract file, not feature enabling/disabling API design or patterns for the next iteration of this work. A contract file may help with reproducible builds somewhat, but there's also very real cases where configuring features at the request level will be much better UX:

  • How would an author disable search/docdiff/etc across multiple versions? They update an option in the admin UI and rebuild all of their versions?
  • If this is controlled in a version controlled file, how does an author disable search/docdiff/etc on tagged versions? They have to alter the tags and rebuild all tag versions?

I see some benefit to controlling features at the request level and configuring them in the admin dashboard. We'd use the same data structure, but included in the documentation via a https://docs.example.com/_/features.json type file that is dynamically generated.

IMO, we have to force the injection of our own JavaScript file

I'd agree, and I think we can very easily keep the footprint of this file very minimal as well.

@humitos
Copy link
Member Author

humitos commented Mar 11, 2023

@agjohnson

It's still fuzzy to me why a readthedocs-build.yaml file is required for this. I'm not against this design, it's just not clear to me. I probably haven't moved much here since #9088
Do you have a better idea now of what configuration options we need to be in an contract file?

This space is reserved for all the configuration that only the doctool knows at build time (or that needs to be generated dynamically). I understand why you are not finding it required and that's probably because all the data that I put as example there is not mandatory.

We need more feedback from other doctools (and users!) here to find out if there are things we strictly require from them. I don't have that information yet and we should keep trying out other doctools as I was doing previously and keeping this in mind.

For now, I refactored a little the readthedocs-build.yaml proposed to have only two main keys:

  1. telemetry: contains all the data the doctool can share with us for telemetry purposes 1
  2. features: communicate us what are the features the author wants to integrate with us 2

I see some benefit to controlling features at the request level and configuring them in the admin dashboard. We'd use the same data structure, but included in the documentation via a https://docs.example.com/_/features.json type file that is dynamically generated.

@ericholscher and I briefly touched on this idea on Thursday. I thought it could be a good idea to have the ability to decide things dynamically at request time. However, I didn't come up with good arguments while talking about this. Also, Eric mentioned it was simpler to just serve a .json file directly from S3, which was a good argument and we just moved on 😄 .

I think the questions you are rising here are important and may make us to use an endpoint as you are suggesting. If that's the case tho, all the feature key to decide the integrations don't make sense to be in the YAML file anymore and they would become db fields (at Project or Version level). That would be another long conversation to have that I'm not sure if it makes a lot of sense at this immediate time. I'm happy to have it, tho, but I don't want to detour too much from the other bigger concepts and be able to start moving forward making the first step of the plan.


That said, I understand that you agree with 80% of this proposal --without considering the specific JavaScript technical aspects. That missing 20% would be:

  • reduce complexity by skipping the readthedocs-build.yaml for now until we find it's strictly required
  • make the data coming dynamically from an endpoint under /_/ at docs domain instead of just a file in S3
  • decide how we will inject the JS (at build time on each HTML, NGINX, CloudFlare, etc)

Footnotes

  1. I know you said "we'd have to consider if we want community provided telemetry even" in https://github.com/readthedocs/readthedocs.org/issues/9088#issuecomment-1331246443. We can skip it for now if we want to, but probably worth to start having that conversation at least.

  2. you said this could be moved directly into the .readthedocs.yaml; which I see it totally doable

Copy link
Member

@ericholscher ericholscher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a really simple, beautiful implementation. I think we can trim down the amount in this PR already, and get something shipped.

dockerfiles/nginx/proxito.conf.template Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to ship this with the app, or version it directly on S3, like we do with the ad client? I think keeping it deployable outside of the application seems right to me.

https://github.com/readthedocs/ethical-ad-client/tags

https://media.ethicalads.io/media/client/v1.4.0/ethicalads.min.js

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't thought too much about this and I don't have experience deploying the script with a different process than the normal one. I'm not sure about the pros/cons here.

I put this file here because we need it for development as well. We could just put this file in the local MinIO S3, tho, as well.

Note this file is generated by running npm run build from another repository.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to also think about what version are we going to serve by default. The latest one? Would people be able to pin to a particular version? Are we going to support multiple versions at the same time? How do we deploy new features to those that are pinned to an older version? Do we care? Too many questions 🤷🏼

Copy link
Member

@ericholscher ericholscher Mar 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we always deploy the latest, but if the goal is for other people to integrate the library, then it can be versioned based on their needs. The versioning is primarily valuable for strict SRI protection, like PyPI does, to validate the hash of the library hasn't changed for security reasons.

I think we should definitely deploy the client outside of our application. We don't need to decide on a proper deployment pattern yet though, but I think we should keep it out of the application from the start.

For now, we can just manually upload it to S3, and use that everywhere?

readthedocs/doc_builder/director.py Show resolved Hide resolved
Comment on lines +657 to +660
# Copy the YAML data into `Version.build_data`.
# It will be saved when the API is hit.
# This data will be used by the `/_/readthedocs-config.json` API endpoint.
self.data.version.build_data = data
Copy link
Member

@ericholscher ericholscher Mar 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to actually parse the data into memory, or just store the file contents directly as a string? I think we just want to store a string to start? I'd like to avoid as much YAML parsing as possible...

I also wonder if we should make this file JSON, instead of YAML? If the goal is for it to be aligned with the JSON data returned via the API, I think that makes more sense. But if it's closer to our .readthedocs.yaml config, then YAML makes sense 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to actually parse the data into memory, or just store the file contents directly as a string? I think we just want to store a string to start? I'd like to avoid as much YAML parsing as possible...

I think we will need to parse it so we can validate it at some point, anyways.

This also allows us to use a JSON field in the database that we can query in the future, looking for answers.

I also wonder if we should make this file JSON, instead of YAML? If the goal is for it to be aligned with the JSON data returned via the API, I think that makes more sense. But if it's closer to our .readthedocs.yaml config, then YAML makes sense thinking

I decide to use YAML here on purpose. It's a lot easier to write than JSON, a lot less nit picking (e.g. requires no trailing comma in the last element of a list), supports comments, works better with more data types, and others. The structure is going to be just a dictionary, YAML is going to be just the representation/serialization of it

In particular, being able to put comments in .readthedocs.yaml is important for ourselves and our users as well. That was one of the reasons why I picked YAML for this file as well. Otherwise, you end up with things like this in JSON:

"comment": (
"THIS RESPONSE IS IN ALPHA FOR TEST PURPOSES ONLY"
" AND IT'S GOING TO CHANGE COMPLETELY -- DO NOT USE IT!"
),

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea.. let's use YAML for anything human-writable, and JSON for machine-writable 👍

readthedocs/proxito/middleware.py Outdated Show resolved Hide resolved
Comment on lines 21 to 24
# TODO: why the UnresolvedURL object is not injected in the `request` by the middleware.
# Is is fine to calculate it here?
unresolved_url = unresolver.unresolve_url(request.headers.get("Referer"))
version = unresolved_url.version
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stsewd seems like one for you.

@humitos I see we're using the referer here. Is this because we don't want to pass in the domain explicitly as a GET arg? I think we likely want to actually use the request.host here, instead of a header that the browser is setting.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@humitos I see we're using the referer here. Is this because we don't want to pass in the domain explicitly as a GET arg? I think we likely want to actually use the request.host here, instead of a header that the browser is setting.

Not strong preference. We can use something different here. I used the Referer because it was what I had at hand and easy to get.

A url= GET attribute probably seems fine as well, since it's what we are using in other API endpoints (analytics and embed api, for example). Referer also makes testing harder since "copying and pasting a URL doesn't work" --I've ended up hardcoding the URL in the code because of this 😅

I think we should combine both, url= GET attribute with a validation with request.host on that URL.

# TODO: define how it will be the exact JSON object returned here
# NOTE: we could use the APIv3 serializers for some of these objects if we want to keep consistency.
# However, those may require some extra db calls that we probably want to avoid.
data = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we're re-creating the footer API, but with JSON instead of HTML. That seems reasonable, but I do think we'll want to think more about what this API is returning.

It seems like we actually want the client to be able to tell us exactly what data points they want? Something similar to what we're doing with apiv3 with the expand and fields parameters? That way we give the caller control over how much data they want...

That might be an over-optimization at this stage, but I think we want to design this with expandability and performance in mind.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think we'll want to think more about what this API is returning.

Yes. This is one part of the design decisions I mentioned we have to sit down, talk together, think about and decide what's the best structure.

It seems like we actually want the client to be able to tell us exactly what data points they want? Something similar to what we're doing with apiv3 with the expand and fields parameters? That way we give the caller control over how much data they want...

This is a good point to start considering and thinking about, yeah. However, I don't think we will require that complexity because:

  • the API endpoint will be cached
  • most of the values are static
  • there are a few db queries only (Project and Project.versions(active=True))
  • we could decide whether or not return some specific data based on the feature being enabled. If non_latest_version_warning is disabled, we don't calculate/query nor return the list of active versions this feature consumes in the client.

In any case, I think we will arrive at the best answer by exploration, experimentation and brainstorming. Also, once we deploy this, we will have some data in NR to know how much time this endpoint takes on a real production db.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I don't think the performance of this approach is inherently bad, but more that we might want to make this look more like an API for additional UX options in the future. No user will expect passing GET args to a .json file. However, if the file doesn't grow too large, always returning all of it too bad. But I do expect we'll feel constrained by the design in the future, as we want to make this endpoint smarter, so starting with an API instead of a JSON file seems like the proper design to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, I'm fine shipping this as a test for now.

But I do think over time if this is the "one endpoint to get all the data you need in the page", there will be demands on this functionality that are resource intensive, vary by user, etc. That is where we'll start to get value from the API approach.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point about the .json name in the URL endpoint. We should definitely change that as first step.

I will think a little more about the expand argument. I don't have a strong opinion on that yet 😅

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we get to discussing the API structure in the next phase, this would be a good conversation. This API could help us avoid APIv3 complexity.

Also on the difference between this and our API, would it make sense to use DRF APIv3 resources directly in the response here, or make this an APIv3 endpoint? This way weren't not making a 4th API version of sorts. The response for versions could be identical to the APIv3 response for versions as well.

We'll come back to the data security question around private projects/versions pretty quick here, but we need to already for this API endpoint and private projects.

Base automatically changed from humitos/build-cmd-environment to main March 16, 2023 11:44
Copy link
Member

@ericholscher ericholscher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm good to ship this initial experimentation next week so we can play with it in production.

Alternatively, we could just use defaults in the readthedocs-client, and not require a server response for initial testing? We could also just return some JSON data via a CF Worker to make it faster to iterate with this, instead of requiring a full Django deploy.

readthedocs/doc_builder/director.py Show resolved Hide resolved
readthedocs/proxito/views/hosting.py Outdated Show resolved Hide resolved
@humitos
Copy link
Member Author

humitos commented Mar 17, 2023

Alternatively, we could just use defaults in the readthedocs-client, and not require a server response for initial testing?

I'd prefer if we can get this from the server itself. This will also help me to understand how much time this endpoint takes to return via NR and also confirm that everything is on it's place.

We could also just return some JSON data via a CF Worker to make it faster to iterate with this, instead of requiring a full Django deploy.

This is a good option in case we need it, yeah. I'm not sure how many times we will be changing this response. It's definitely gonna change, and big, probably. But maybe it changes once, it's done. The main changes are gonna happen in the js client, without requiring too much changes in the API response, I'd guess.

In any case, we can keep doing all the test of these stuffs locally while developing. We can also deploy more often if required 😉

@ericholscher
Copy link
Member

In any case, we can keep doing all the test of these stuffs locally while developing. We can also deploy more often if required 😉

But then I can't play with it :) I guess there's always ngrok, or I just get it running locally :)

@humitos
Copy link
Member Author

humitos commented Mar 17, 2023

He he. Note that you can add/update data from the build process itself, by creating a YAML file 😉. That will give us some extra control to perform changes in the API response without deploying 😏

@humitos
Copy link
Member Author

humitos commented Mar 17, 2023

So, we can change the js client by just uploading a new js file to S3.

We can change the API response by generating a YAML file in the build process

@humitos
Copy link
Member Author

humitos commented Mar 20, 2023

Latest updates:

  • rename the endpoint to remove .json from it
  • fix the serializer to save Version.build_data
  • remove readthedocs-client.js from this repository
  • use development URL for js client to work locally

I'm going to merge this PR once all tests pass.

@humitos humitos enabled auto-merge (squash) March 20, 2023 10:41
@humitos
Copy link
Member Author

humitos commented Mar 20, 2023

I don't understand why I'm receiving eslint issues now 🤷🏼 . I'm merging this PR for now because those errors look pretty unrelated to my work here. We can solve the issue in another PR.

@humitos humitos disabled auto-merge March 20, 2023 13:04
@humitos humitos merged commit ed732c2 into main Mar 20, 2023
@humitos humitos deleted the humitos/hosting-integrations branch March 20, 2023 13:04
Copy link
Member

@stsewd stsewd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is one call to open that we should replace with safe_open before deploying this change

return

try:
with open(yaml_path, "r") as f:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use readthedocs.core.utils.filesystem.safe_open here

Comment on lines +29 to +30
# TODO: why the UnresolvedURL object is not injected in the `request` by the middleware.
# Is is fine to calculate it here?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only the unresolved domain is injected for all proxito views, the unresolved URL is needed only when serving docs, so you need to call it manually only when you need it. And you are parsing the URL from a query param, so you need to call it manually anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

4 participants