Automate updates to guidance search sitemap #3804

Open
4 tasks
Tracked by #173
dorothyyeager opened this issue Jun 2, 2020 · 3 comments
Labels
Guidance Search Work associated with Executive Order requiring Policy and Guidance search page Pipeline: PI Backlog Technical debt

Comments

@dorothyyeager

dorothyyeager commented Jun 2, 2020

Summary

What we are after:
As a user, I want the latest version of a guidance document to appear in the search results that the sitemap feeds. As a content team member, I find it hard to remember to update the sitemap manually each time a revised Form or Guide PDF is uploaded.

Background: Right now the guidance docs have two sitemaps: one for HTML files and one for PDF files. Whenever a file is altered, we need to manually edit the code for the sitemap and then re-upload the code into Wagtail. Because this happens only occasionally, we risk the content team forgetting this step whenever it uploads a new or replacement PDF or edits an HTML page that is included in the guidance search.

Related issues

#3793 - Update guidance sitemap for updated date of one of the documents (example of what the content team has to remember to do each time)

How tos: https://docs.google.com/document/d/1hfvlYhGVNF0Km5vrAYtJak2cdAYLYQPbiPUfLxbRN1E/edit#

Completion criteria

  • Sitemaps are automatically updated with new modification dates when revised document files are uploaded or edited.

@rfultz

rfultz commented Mar 12, 2021

Knowledge sharing

Sitemaps are a way to point search crawlers at pages and content that they are skipping for whatever reason.

Our search.gov data is boosted by data we feed it through sitemap files, though those sitemaps don't actually go to our website—they're only fed into search.gov for its purposes.

Sitemaps don't include information about files other than url and, optionally, a date, change frequency, and/or importance. There are no titles, descriptions, keywords, etc.
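To make the point above concrete, here is a minimal sketch of what a generated sitemap entry could look like, using only the Python standard library. The `build_urlset` helper and the example.gov URLs are hypothetical and only for illustration; the element names (`loc`, `lastmod`, `changefreq`, `priority`) and the namespace come from the sitemap protocol.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_urlset(entries):
    """Build a <urlset> document from a list of entry dicts.

    Each entry needs "loc"; "lastmod" (a date), "changefreq", and
    "priority" are optional, matching the sitemap protocol.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = entry["loc"]
        if "lastmod" in entry:
            ET.SubElement(url, "lastmod").text = entry["lastmod"].isoformat()
        if "changefreq" in entry:
            ET.SubElement(url, "changefreq").text = entry["changefreq"]
        if "priority" in entry:
            ET.SubElement(url, "priority").text = f"{entry['priority']:.1f}"
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical entries; the URLs are placeholders, not real site paths.
example_xml = build_urlset([
    {"loc": "https://www.example.gov/resources/guide.pdf",
     "lastmod": date(2021, 3, 12), "priority": 0.8},
    {"loc": "https://www.example.gov/resources/form.pdf"},
])
print(example_xml)
```

Note that there is nowhere to put a title or description: an entry is just a URL plus up to three optional hints.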

/sitemap.xml is the only sitemap file name/path that matters; it's the standard location for a sitemap, if one exists.

Sitemaps can link to other sitemaps. My original goal was to have /sitemap.xml only include links to other sitemap files, and those sitemaps would be named/divided however made most sense to us human reviewers.
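A rough sketch of that idea, assuming the child sitemaps are reachable by URL: the sitemap protocol defines a `<sitemapindex>` element for exactly this purpose. The `build_sitemap_index` helper and the child file names below are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(sitemap_urls):
    """Build a <sitemapindex> document that links to other sitemap files."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for sitemap_url in sitemap_urls:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = sitemap_url
    return ET.tostring(index, encoding="unicode")


# Hypothetical child sitemaps, split by content type for easier human review.
index_xml = build_sitemap_index([
    "https://www.example.gov/sitemap-guidance-pdf.xml",
    "https://www.example.gov/sitemap-guidance-html.xml",
])
print(index_xml)
```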

Discovery (@dorothyyeager @kathycarothers @djgarr @zsmith-fec)

For Guidance,
- are ALL Guidance PDFs items/pages/docs/whatever inside Wagtail?
- would there be any reason not to use Wagtail's datestamp in the sitemap?
- are there any Guidance PDFs included in the sitemap that are NOT in Wagtail? Is there a reason?
- should any Guidance PDFs in Wagtail NOT be included in the sitemap?

For other PDFs
- are there others we want to include in other sitemaps? (Is there a reason for a separate file, other than easier human review?)
- are they all files inside Wagtail?
- any that aren't?

Would there be a use for a kind of admin interface to update data that gets included in sitemaps? e.g. URL-date pairs for resources that live outside Wagtail
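As a sketch of what such an interface might feed into, assume the admin-entered URL-date pairs end up as a simple mapping that gets merged with the dates pulled from the CMS. The function name and data shapes here are assumptions for illustration, not an existing API.

```python
from datetime import date


def merge_sitemap_entries(cms_entries, external_entries):
    """Combine CMS-derived and manually maintained URL -> date pairs.

    Both arguments map a URL to its last-modified date. When a URL
    appears in both, the more recent date wins.
    """
    merged = dict(cms_entries)
    for url, modified in external_entries.items():
        if url not in merged or modified > merged[url]:
            merged[url] = modified
    # Sort by URL so the generated sitemap is stable between runs.
    return sorted(merged.items())


# Hypothetical data: one document tracked in the CMS, one that lives elsewhere.
cms = {"https://www.example.gov/resources/guide.pdf": date(2020, 6, 2)}
external = {
    "https://www.example.gov/legal/notice.pdf": date(2021, 1, 15),
    "https://www.example.gov/resources/guide.pdf": date(2021, 3, 12),
}
merged_entries = merge_sitemap_entries(cms, external)
for url, modified in merged_entries:
    print(url, modified.isoformat())
```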

If we're going to automate some sitemaps, we could assign a priority to some pages that are already known but whose importance we'd like to emphasize.

Possible complications

Right now, sitemap_xml and sitemap_html are used for search.gov's data. Would search.gov scream if we renamed those files or made it so they're not, say, saved on our local drives, but generated by the website itself? @patphongs ?

@kathycarothers

For Guidance,

  • are ALL Guidance PDFs items/pages/docs/whatever inside Wagtail? Yes, they are all in Wagtail
  • would there be any reason not to use Wagtail's datestamp in the sitemap? I can't answer this question
  • are there any Guidance PDFs included in the sitemap that are NOT in Wagtail? Is there a reason? All in Wagtail
  • should any Guidance PDFs in Wagtail NOT be included in the sitemap? I think this is more for Pat and Dorothy

For other PDFs

  • are there others we want to include in other sitemaps? (Is there a reason for a separate file, other than easier human review?)
  • are they all files inside Wagtail? I believe all files are in Wagtail
  • any that aren't?

@dorothyyeager

dorothyyeager commented Mar 12, 2021

Answers in bold for @rfultz
For Guidance,

  • are ALL Guidance PDFs items/pages/docs/whatever inside Wagtail? **No. There's one webform that's not. Content can give you the full list of what's in there.**
  • would there be any reason not to use Wagtail's datestamp in the sitemap? **Wagtail doesn't update timestamps for replacement files (which is actually really annoying because we can't tell when a document was replaced).**
  • are there any Guidance PDFs included in the sitemap that are NOT in Wagtail? Is there a reason? **No - all PDFs needed for the guidance search project were uploaded into Wagtail if they weren't there already. Note that not all PDFs in Wagtail should show up in the Guidance search; the great majority should not. Guidance search is limited to a very specific set of documents.**
  • should any Guidance PDFs in Wagtail NOT be included in the sitemap? **No. All of the guidance PDFs should be in that sitemap. (Note there's also a separate sitemap for HTML pages that are in guidance search too.)**
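Since the stored timestamp doesn't move when a file is replaced, one possible workaround is to detect replacements by comparing a hash of the file's bytes and recording a date whenever the hash changes. This is only a sketch under that assumption; `update_lastmod` and the `records` store are hypothetical names, and where the hash and date would actually live is left open.

```python
import hashlib
from datetime import date


def content_hash(data: bytes) -> str:
    """Return a SHA-256 hex digest of the file contents."""
    return hashlib.sha256(data).hexdigest()


def update_lastmod(records, url, data, today):
    """Record a new lastmod date only when the file's bytes have changed.

    `records` maps url -> {"hash": str, "lastmod": date}.
    Returns True when a record was created or updated.
    """
    new_hash = content_hash(data)
    existing = records.get(url)
    if existing is not None and existing["hash"] == new_hash:
        return False  # Same bytes as before: keep the old date.
    records[url] = {"hash": new_hash, "lastmod": today}
    return True


# Hypothetical walk-through: upload, re-save of the same file, then a replacement.
records = {}
url = "https://www.example.gov/resources/guide.pdf"
first = update_lastmod(records, url, b"original PDF bytes", date(2020, 6, 2))
same = update_lastmod(records, url, b"original PDF bytes", date(2020, 7, 1))
replaced = update_lastmod(records, url, b"revised PDF bytes", date(2021, 3, 12))
print(first, same, replaced, records[url]["lastmod"].isoformat())
```

The recorded date could then be used as the `lastmod` value for that URL when the sitemap is regenerated.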

For other PDFs

  • are there others we want to include in other sitemaps? (Is there a reason for a separate file, other than easier human review?) **The sitemaps (PDF and HTML) in Wagtail right now are only for guidance search documents and files, because we need to ensure they show up in that particular search, which is powered by search.gov. The GH tickets on Guidance search may be helpful on why it needed to be a separate file (I only know that it does). If another sitemap would be beneficial, we're all for it, but we would definitely love to automate updating the guidance search sitemaps.**
  • are they all files inside Wagtail? **There are many PDFs that are not. PDFs live in an S3 bucket, the legal database, SERS, SAOS, and EQS. Definitely not all in Wagtail.**
  • any that aren't? **Many, see last answer.**

Would there be a use for a kind of admin interface to update data that gets included in sitemaps? e.g. URL-date pairs for resources that live outside Wagtail

If we're going to automate some sitemaps, we could assign a priority to some pages that are already known but whose importance we'd like to emphasize.

Projects
Status: 🗄️ PI backlog

5 participants