Support custom robots.txt #5086
Conversation
readthedocs/core/views/serve.py
Outdated
default_robots_fullpath = os.path.join(settings.MEDIA_ROOT, 'robots.txt')

if not version_slug:
    version_slug = project.get_default_version()
We should consider making the same decision as for custom 404 pages here (#2551 (comment)) about which version to use. This logic could be extended for the …, I believe.
I think the implementation here needs to be at the nginx level, and we can only serve one per project, so it either needs to be configured in the YAML/DB, or come from the "default version". Not sure of the best implementation.
This is only for the .org. I think our existing robots.txt file makes sense on the .org, but I don't believe we need anything custom for ourselves on subdomains. This is indeed a bug.
Serving from the default version makes sense. Otherwise, we would end up with another setting in the DB allowing the user to choose which version they want to serve it from.
I was actually thinking of a text field in the DB with the contents of the `robots.txt`.
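For illustration, a minimal sketch of that DB-field alternative (not what this PR implements; the field name `robots_txt` is an assumption):

```python
from django.db import models


class Project(models.Model):
    # ... existing Project fields ...

    # Raw robots.txt contents entered by the project owner; empty means
    # "no custom robots.txt" (hypothetical field, only for illustration).
    robots_txt = models.TextField(blank=True, default='')
```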
readthedocs/core/views/serve.py
Outdated
symlink = PublicSymlink(project)
if (settings.DEBUG or constants.PRIVATE in serve_docs) and privacy_level == constants.PRIVATE:  # yapf: disable  # noqa
    symlink = PrivateSymlink(project)
basepath = symlink.project_root
Will this file ever exist? I feel like we should be finding it from the default version's HTML root, not from the `project_root`.
The default version is appended by `resolve_path`. At this point, `filename` is `/en/latest/robots.txt` in my case. Then, I remove the initial `/` and join it with `project_root`, which ends up being `/home/humitos/rtfd/code/readthedocs.org/public_web_root/test-builds/en/latest/robots.txt` (in my local instance), and that file does exist.
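As a standalone illustration of the path handling described above (the real code goes through `resolve_path()` and the symlink classes, but the join itself boils down to this):

```python
import os

# Values from the example above (local instance).
filename = '/en/latest/robots.txt'
project_root = '/home/humitos/rtfd/code/readthedocs.org/public_web_root/test-builds'

# Drop the initial '/' so os.path.join doesn't discard project_root,
# then build the absolute path that gets checked for existence.
fullpath = os.path.join(project_root, filename.lstrip('/'))
print(fullpath)
# /home/humitos/rtfd/code/readthedocs.org/public_web_root/test-builds/en/latest/robots.txt
```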
Ah, I see. 👍
Force-pushed from e233b13 to 0590d04
readthedocs/core/urls/subdomain.py
Outdated
@@ -22,6 +22,10 @@
handler404 = server_error_404

subdomain_urls = [
    url((r'robots.txt$'.format(**pattern_opts)),
Don't believe we need the format here.
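A sketch of what the pattern could look like without the `.format()` call (the imported view name is an assumption, not necessarily what the PR uses):

```python
from django.conf.urls import url

from readthedocs.core.views import serve  # module under review in this PR

subdomain_urls = [
    # The regex has no placeholders, so .format(**pattern_opts) isn't needed;
    # escaping the dot and anchoring the start also makes the match stricter.
    url(r'^robots\.txt$', serve.robots_txt, name='robots_txt'),  # view name assumed
]
```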
readthedocs/core/views/serve.py
Outdated
if os.path.exists(fullpath):
    return HttpResponse(open(fullpath).read(), content_type='text/plain')

raise Http404()
I wonder if we want to 404 here, or return a default `Allow: *`?
If the `robots.txt` is not found, it's assumed that the crawler can access all the content. That said, I think it's better to make it explicit.
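A minimal sketch of what the explicit fallback discussed here could look like (not the code merged in this PR; `Allow: /` is the standard spelling for "allow everything"):

```python
from django.http import HttpResponse

# Functionally the same as serving no robots.txt at all, but it states
# the "crawl everything" intent explicitly.
DEFAULT_ROBOTS_TXT = "User-agent: *\nAllow: /\n"


def default_robots_txt(request):
    return HttpResponse(DEFAULT_ROBOTS_TXT, content_type='text/plain')
```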
This looks good. Needs a guide for users, and then I think we can ship it. 👍
If we add similar logic for the 404, we should also write up a blog post about it.
readthedocs/core/views/serve.py
Outdated
""" | ||
if project.privacy_level == constants.PRIVATE: | ||
# If project is private, there is nothing to communicate to the bots. | ||
raise Http404() |
Related, what do we do if the project's default version is private? Seems we'll be exposing something potentially private without this check?
Good point.

I think we need to make a decision here:

1. expose the `robots.txt` (I don't really think that this will "expose anything sensitive") -- even if your default version is private, you will want to communicate what to do with the other ones
2. disallow the whole site (doesn't make too much sense to me)
3. other?

I'd go for 1).
Exposing the `robots.txt` exposes the fact that the project exists, which is definitely a security issue in some cases.
Mmm... good point.

So, for those cases (default version private or project private), we should probably return 404. What do you think?
Seems safest, especially since it's a new feature. If we get more requests from users we can add more logic here, but doing the safest thing to start feels right.
I added more cases, like when the version is not active or not built, which I think also makes sense to return 404 for.

I wrote the documentation as a FAQ. Please take a look and let me know if you think it should be written in another way.
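A condensed sketch of the 404 cases agreed on in this thread (private project, private or missing default version, version not active, version not built); the helper name is an assumption, not the PR's exact code:

```python
from django.http import Http404

from readthedocs.projects import constants  # PRIVATE, as used in the diff above


def check_robots_txt_allowed(project, version):
    """Raise Http404 for every case where a robots.txt must not be served."""
    if project.privacy_level == constants.PRIVATE:
        # A private project must not reveal that it exists.
        raise Http404()
    if version is None or version.privacy_level == constants.PRIVATE:
        # Same reasoning for a missing or private default version.
        raise Http404()
    if not version.active or not version.built:
        # Nothing has been published for this version yet.
        raise Http404()
```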
I think this could pretty easily be a Guide instead of a FAQ, though I don't feel strongly about it. FAQs just feel like something that won't be as easily found as a guide on the topic. Not going to block shipping on it.
This looks 💯 for the .org, but @agjohnson might have other concerns around privacy, so probably good to get his thoughts before merge.
It could also be extended for …
This looks great and I'm excited by this as I think it will let docs authors have a little bit more power over their SEO and appearance to search engines. For small projects, that probably isn't a huge deal but for bigger ones I think it's important.
Yes! We have an issue for this at #557. I want to work on this sooner rather than later.
Force-pushed from ae1e1b8 to 7921ade
OK! Now that we have consensus on this, I will add some test cases to be safe with the logic and merge it after that.
Check for a custom `robots.txt` on the default version and, if it exists, serve it. Otherwise, return 404.
Co-Authored-By: humitos <[email protected]>
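Putting the pieces from this conversation together, a rough sketch of the behavior in the commit message (the path construction is simplified; the real code resolves it through `resolve_path()` and the symlink classes shown above):

```python
import os

from django.http import Http404, HttpResponse


def serve_robots_txt(request, project, public_web_root):
    # Always look at the default version, as decided in the review.
    version_slug = project.get_default_version()
    # Simplified path for illustration only.
    fullpath = os.path.join(
        public_web_root, project.slug, project.language, version_slug, 'robots.txt',
    )
    if os.path.exists(fullpath):
        with open(fullpath) as fh:
            return HttpResponse(fh.read(), content_type='text/plain')
    raise Http404()
```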
Force-pushed from ba139e0 to 335b99c
Tests added. I'm merging after tests pass.
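The added tests aren't shown in this conversation view; a sketch of the kind of assertions they likely cover (host names and fixtures here are made up):

```python
from django.test import TestCase


class RobotsTxtTests(TestCase):
    # Assumes fixtures that create a public project (whose default version
    # ships a robots.txt) and a private project on these subdomains.

    def test_custom_robots_txt_is_served(self):
        resp = self.client.get('/robots.txt', HTTP_HOST='public-project.readthedocs.io')
        self.assertEqual(resp.status_code, 200)
        self.assertEqual(resp['Content-Type'], 'text/plain')

    def test_private_project_returns_404(self):
        resp = self.client.get('/robots.txt', HTTP_HOST='private-project.readthedocs.io')
        self.assertEqual(resp.status_code, 404)
```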
Force-pushed from 335b99c to 3e4b1a4
My idea behind supporting this is:

- serve the user's custom `robots.txt` file, by appending our own at the end; we need to disallow `/sustainability/click/` (check Support custom robots.txt #5086 (comment))
- if it returns 404, allow all the agents and pages

If we agree on this, we will need to remove our NGINX rules from here and here and here.
Another thing to consider is that we are adding `/builds/` to the `robots.txt` file (see Eric's comment below), so if the user has a `/builds/` directory in their documentation it will be ignored by robots. We should probably split our `robots.txt` into one for `readthedocs.org` and another one for `readthedocs.io`.

Closes #3161