Fix sitemap indexing #3708
Comments
Taking the liberty here of prioritizing this as High.
Maybe a manual reindex is what's required here? Or a submission of a sitemap? https://developers.google.com/search/docs/crawling-indexing/ask-google-to-recrawl
I do the site map reindex via Google Search Console all the time.
We had a URL prefix property, so we only requested a DNS change to LF AI & Data: https://jira.linuxfoundation.org/plugins/servlet/desk/portal/2/IT-26615
https://support.google.com/webmasters/answer/7440203#indexed_though_blocked_by_robots_txt
https://developers.google.com/search/docs/crawling-indexing/block-indexing
Previous discussion about this on RTD: readthedocs/readthedocs.org#10648
We got some good advice in readthedocs/readthedocs.org#10648 (comment), but blocking this on #3586.
Potentially related:
`robots.txt`
@astrojuanlu It would be very helpful to have access to the Google Search Console; can we catch up sometime this week? In addition, despite #3729, it appears the `robots.txt` isn't updated. I am not super clear about the RTD build; do we need to manually refresh the `robots.txt`?
https://docs.readthedocs.io/en/stable/guides/technical-docs-seo-guide.html#use-a-robots-txt-file The default version (currently
We need to make sure the sitemap is crawled. See example of
Ours is blocked currently. This isn't the primary goal of this ticket, but we can also look into it. The main goal of the ticket is "Why do URLs that we don't want to be indexed get indexed?", though we would definitely love to improve the opposite: "Why aren't URLs that we want to be indexed being indexed?"
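As a side note on getting the sitemap crawled: `robots.txt` supports a `Sitemap:` directive that advertises the sitemap location to crawlers. A minimal sketch, assuming the Read the Docs-generated `sitemap.xml` at the project root (the exact URL here is illustrative, not taken from the actual file):

```
# Illustrative robots.txt fragment, not the actual Kedro file
User-agent: *
Sitemap: https://docs.kedro.org/sitemap.xml
```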
Mind you, we don't want to index
Updated
Our sitemap still cannot be indexed
Renaming this issue, because there's nothing else to investigate: search engines (well, Google) will index pages blocked by `robots.txt`.
Addressed in #3885, keeping this open until we're certain the sitemap has been indexed.
Description

Even with `robots.txt`, search engines still index pages that are listed as disallowed.

Task

"We need to upskill ourselves on how Google indexes the pages. RTD staff suggested we add a conditional `<meta>` tag for older versions, but there's a chance this requires rebuilding versions that are really old, which might be completely impossible. At least I'd like engineering to get familiar with the docs building process, formulate what can reasonably be done, and state whether we need to make any changes going forward." @astrojuanlu

Context and example
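The conditional `<meta>` tag idea could be sketched in a Sphinx `conf.py`, relying on the `READTHEDOCS_VERSION` environment variable that Read the Docs sets during builds. The `noindex` flag and the list of indexable versions below are assumptions for illustration, not the actual Kedro setup:

```python
# conf.py sketch (hypothetical): flag non-default versions for a noindex meta tag.
import os

# Read the Docs sets READTHEDOCS_VERSION during builds; empty when building locally.
version_slug = os.environ.get("READTHEDOCS_VERSION", "")

# Assumption: only "stable" and "latest" (and local builds) should be indexed.
html_context = {
    "noindex": version_slug not in ("", "stable", "latest"),
}
```

A theme override (for example a `_templates/layout.html` extending the theme's `extrahead` block) would then emit `<meta name="robots" content="noindex">` whenever `html_context["noindex"]` is true. The caveat from the task stands: this only takes effect for versions that get rebuilt.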
https://www.google.com/search?q=kedro+parquet+dataset&sca_esv=febbb2d9e55257df&sxsrf=ACQVn0-RnsYyvwV7QoZA7qtz0NLUXLTsjw%3A1710343831093&ei=l8bxZfueBdSU2roPgdabgAk&ved=0ahUKEwi7xvujx_GEAxVUilYBHQHrBpAQ4dUDCBA&uact=5&oq=kedro+parquet+dataset&gs_lp=Egxnd3Mtd2l6LXNlcnAiFWtlZHJvIHBhcnF1ZXQgZGF0YXNldDILEAAYgAQYywEYsAMyCRAAGAgYHhiwAzIJEAAYCBgeGLADMgkQABgIGB4YsANI-BBQ6A9Y6A9wA3gAkAEAmAEAoAEAqgEAuAEDyAEA-AEBmAIDoAIDmAMAiAYBkAYEkgcBM6AHAA&sclient=gws-wiz-serp (thanks @noklam)
Result: https://docs.kedro.org/en/0.18.5/kedro.datasets.pandas.ParquetDataSet.html
However, that version is no longer allowed in our `robots.txt` (kedro/docs/source/robots.txt, lines 1 to 9 at commit 1f2adf1):
And in fact, according to https://technicalseo.com/tools/robots-txt/,
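For illustration only (the actual rules are in the file referenced above), a `Disallow` rule of the kind being discussed looks like this; the path is hypothetical:

```
# Hypothetical fragment: block crawling of an old version's pages
User-agent: *
Disallow: /en/0.18.5/
```

Note that `Disallow` blocks crawling, not indexing, which is exactly why Google can still list such pages: it sees the URL via links elsewhere but is forbidden from fetching it, so it can never see a `noindex` directive on the page.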