
Document search.gov xml sitemap requirements #5501

Closed
Tracked by #226
patphongs opened this issue Nov 30, 2022 · 4 comments
patphongs commented Nov 30, 2022

Summary
Determine/document search.gov xml sitemap requirements

Related issue: Research sitemap generation: #5499

  • document how we're using similar indexing in other parts of the website (e.g. policy guidance search) and what search.gov needs

Completion criteria

  • documentation of requirements
  • confirmation from search.gov that our documentation and plan will work
@patphongs patphongs added this to the Sprint 20.1 milestone Nov 30, 2022
@rfultz rfultz changed the title Search.gov document xml sitemap requirements Document search.gov xml sitemap requirements Dec 1, 2022
@pkfec pkfec modified the milestones: Sprint 20.2, Sprint 20.3 Jan 3, 2023
@pkfec pkfec modified the milestones: Sprint 20.3, Sprint 20.4 Jan 17, 2023
johnnyporkchops commented Jan 25, 2023

search.gov docs:

https://search.gov/indexing/sitemaps.html
https://search.gov/indexing/robotstxt.html

Correspondence with search.gov support:

"Overall, we just need to make sure that the sitemaps contain the comprehensive list of URLs that you want searchable. We can accept multiple sitemaps without issue, so there shouldn't be any impact to the existing sites.
Once you have the sitemap(s) created, either post them on your robots.txt file or just email our team and we can add those manually. We'll begin indexing content immediately once that happens. After that's good to go, you should delete the i14y drawers and stop sending requests via i14y.

Once we do have all this content in our index though, anything we have indexed for fec.gov will appear in these search sites you've set up. You can limit by subfolder if needed via the Domains tab."

Important caveats:

crawl delay:
From search.gov docs https://search.gov/indexing/robotstxt.html
"We recommend a crawl-delay of 2 seconds for our usasearch user agent, and setting a higher crawl delay for all other bots. The lower the crawl delay, the faster Search.gov will be able to index your site. In the robots.txt file, it would look like this:"

User-agent: usasearch  
Crawl-delay: 2

User-agent: *
Crawl-delay: 10
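
The support reply above also mentions posting the sitemap location on the robots.txt file; in standard robots.txt syntax that is a Sitemap directive (the URL here is illustrative):

Sitemap: https://www.fec.gov/sitemap.xml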

Metadata for sitemaps supported by search.gov:
From search.gov docs https://search.gov/indexing/sitemaps.html#what-metadata-does-searchgov-require-for-each-xml-sitemap-url
"For each URL, we recommend including the <lastmod> value (the date of last modification of the file) whenever possible, to indicate when a file has been updated and needs to be re-indexed.

We do not have plans to support the <priority> tag, which is no longer used by search engines like Google. We may support the <changefreq> tag in the future, but the <lastmod> tag is more accurate and supported by more search engines."
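
For illustration, a sitemap <url> entry carrying <lastmod> in the standard sitemap protocol (the URL and date are made up):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.fec.gov/legal-resources/</loc>
    <lastmod>2023-01-25</lastmod>
  </url>
</urlset>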

https:
From the Wagtail sitemap docs: "If you change the site's port to 443, the https scheme will be used. Find out more about working with Sites."

  • Change Settings > Sites > fec.gov (or localhost for local) to port 443
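
For context, a minimal sketch of how the Wagtail sitemap view is typically wired up (assuming the project uses wagtail.contrib.sitemaps; nothing here is FEC-specific):

# urls.py -- minimal sketch; the generated URLs use the scheme implied
# by the Wagtail Site's port (443 -> https)
from django.urls import path
from wagtail.contrib.sitemaps.views import sitemap

urlpatterns = [
    path("sitemap.xml", sitemap),
]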

Changes to commands that we make to the i14y endpoint:
Antxra at search.gov wrote:
"We do have a bit of confusing naming for our API endpoints, I'll admit, but you can continue calling the "GET" requests to i14y. Our Results API documentation (https://open.gsa.gov/api/searchgov-results/) lists the endpoint we prefer that users hit, but the response should be the same as what your application is expecting. The only API calls we would want you to stop are the PUT/POST/DELETE calls to edit the content in your index."

johnnyporkchops commented Jan 25, 2023

Issue with policy-guidance search:

From Amani at Search.gov Support ([email protected]):
"The only way you'd be able to limit the policy and guidance search is via folder paths. Would it be possible to put the s3 docs that should belong to this search in their own subfolder? That way, you can indicate the folder paths in the "Domains" section of your admin center (https://search.gov/admin-center/content/domains.html )and limit the content.
Let me know if that makes sense. Also happy to hop on a call if that's easier!"


One thought is to use an S3 script to move them into a folder after upload, since Wagtail does not allow users to specify a folder at upload time.
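
A rough sketch of that idea with boto3 (the bucket name and prefixes are hypothetical):

# Relocate uploaded docs into a dedicated subfolder so the policy-guidance
# search can be limited by folder path. Names here are hypothetical.
import boto3

s3 = boto3.client("s3")
bucket = "fec-docs-bucket"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix="uploads/"):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        new_key = "policy-guidance/" + key.split("/")[-1]
        # S3 has no move operation; copy to the new key, then delete the original.
        s3.copy_object(Bucket=bucket, CopySource={"Bucket": bucket, "Key": key}, Key=new_key)
        s3.delete_object(Bucket=bucket, Key=key)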

johnnyporkchops commented

Based on correspondence with search.gov support (Amani), we will put lastmod dates on the static sitemap only when applicable, i.e. when we make a change to a static template that we want to be re-indexed. There is also the option to email search.gov to request that a URL be re-indexed.

Amani wrote:

Having <lastmod> on only some of your sitemap items would still be better than nothing! That'll help us at least trigger reindexes for those resources when updated.
AND
Without the lastmod, we won't have an automatic way to reindex those, so if there are a bunch of changes to the non-CMS pages, just email our team and we can trigger a reindex of a subset of URLs.
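
One way the partial-lastmod approach could look with Django's sitemap framework (a sketch; the URL names and dates are illustrative):

# Static-page sitemap that only emits <lastmod> when we have a known
# change date for a template. URL names and dates are illustrative.
from datetime import date
from django.contrib.sitemaps import Sitemap
from django.urls import reverse

# Maintained by hand: bump a date when its static template changes.
STATIC_LASTMOD = {
    "contact": date(2023, 1, 25),
    # no entry for "about" -> no <lastmod> emitted for that URL
}

class StaticViewSitemap(Sitemap):
    def items(self):
        return ["contact", "about"]

    def location(self, item):
        return reverse(item)

    def lastmod(self, item):
        return STATIC_LASTMOD.get(item)  # None omits <lastmod>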

@rfultz rfultz removed this from the Sprint 21.2 milestone Apr 18, 2023
pkfec commented Jun 15, 2023

Research done! Closing this issue
