Research sitemap generation #5499

Closed
Tracked by #226 ...
patphongs opened this issue Nov 30, 2022 · 2 comments

@patphongs
Member

patphongs commented Nov 30, 2022

Investigate if there are sitemap generation changes for Wagtail V3 or V4.

Related issue: Document search.gov sitemap requirements #5501

Completion criteria

  • Looked into sitemap changes from v3 to v4
  • Any changes (additional functionality) have been implemented
  • New version of sitemap has been/will be generated
@johnnyporkchops
Contributor

johnnyporkchops commented Dec 10, 2022

  • Wagtail CMS pages: Using the Wagtail and Django sitemap docs, we can create a sitemap of all pages in the Wagtail CMS, which will be automatically updated and available at fec.gov/sitemap.xml
  • Data and legal pages: We can create another sitemap for /data and /legal pages manually, with an online sitemap generator, or with the Django static-pages sitemap generator (not recommended)
    • The online generator referenced on search.gov (I used this one) limits the number of pages based on current server load, starting at 3500 pages and going up to about 5500 (on a good day). When running our site, I got up to 4650, which is less than our 10,000+ pages. However, I think it did capture all /data pages, and most if not all of the data/legal and /regulations pages. So it might work to use the free online generator to pull out the pages we need for the static non-Wagtail sitemap. We could also consider paying to generate a one-time sitemap that we use just to pull out the data and legal pages.
    • Another online generator, not suggested by search.gov: https://xml-sitemaps.com/
    • The Django static-pages sitemap generator (not recommended) would require us to name each path in data/urls.py and then add each one to a static sitemap view class. This worked OK when testing locally, but seems unnecessary and less inclusive than the above options
  • Sitemap index option: We can create a sitemap index to reference the above sitemaps by either:
    • using the Django/Wagtail sitemap-index utility to point search.gov and search engines to both of our sitemaps
    • or, more simply, creating it manually using the protocol on sitemaps.org (which search.gov suggests): https://www.sitemaps.org/protocol.html#index
    • NOTE: Creating the index manually works better for us because the Django auto-creation process is more complicated and expects all the sitemaps in the index to be auto-generated by the Wagtail/Django sitemap framework (I think?), and we will have at least one manually created sitemap.
  • No sitemap index option: We can also forgo the sitemap index, since we will likely only have two sitemaps, and just reference both in robots.txt instead of referencing the index in robots.txt
    Robots.txt:
    • Since we currently conditionally expose a robots.txt in non-prod envs to keep search engines from crawling them, we would need to conditionally expose a prod-only robots.txt that allows crawling and lists the sitemaps.
    • We can also leave our robots.txt config alone, just point search.gov to the new sitemaps, and continue with our current non-sitemap approach to search engines
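
To make the two options above concrete, here is a rough sketch (not what is in the WIP branch) of building the sitemap index by hand per the sitemaps.org protocol, and of a robots.txt that differs between prod and non-prod. The function names, the `https://www.fec.gov` host prefix, and the `static-sitemap.xml` filename are assumptions for illustration only.

```python
from xml.etree import ElementTree as ET

# Namespace required by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(sitemap_urls):
    """Return a sitemap index document (as a string) listing each sitemap URL."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in sitemap_urls:
        sitemap = ET.SubElement(root, "sitemap")
        ET.SubElement(sitemap, "loc").text = url
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


def build_robots_txt(is_production, sitemap_urls):
    """Block all crawling outside prod; in prod, allow crawling and list sitemaps."""
    if not is_production:
        return "User-agent: *\nDisallow: /\n"
    # An empty Disallow value means nothing is blocked.
    lines = ["User-agent: *", "Disallow:"]
    lines += [f"Sitemap: {url}" for url in sitemap_urls]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical URLs: the second filename is made up for this example.
    urls = [
        "https://www.fec.gov/sitemap.xml",
        "https://www.fec.gov/static-sitemap.xml",
    ]
    print(build_sitemap_index(urls))
    print(build_robots_txt(True, urls))
```

With the index option, robots.txt would list only the index URL; with the no-index option, it would list both sitemaps directly, as in the example call.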

I have a WIP branch that generates the Wagtail cms-sitemap and the partial online-generated one mentioned above saved if you would like to see that.
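
For the "pull out the pages we need" step, something like the following could trim an online-generated sitemap down to just the /data and /legal URLs. This is a minimal sketch, assuming the generator emits a standard `urlset` document; `filter_sitemap` and the sample URLs are hypothetical and not taken from the WIP branch.

```python
from urllib.parse import urlparse
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def _matches(path, prefixes):
    """True if path is one of the prefixes or sits underneath one of them."""
    return any(path == p or path.startswith(p + "/") for p in prefixes)


def filter_sitemap(xml_text, path_prefixes):
    """Keep only <url> entries whose path falls under one of path_prefixes."""
    # Serialize with the sitemap namespace as the default (no ns0: prefixes).
    ET.register_namespace("", SITEMAP_NS)
    root = ET.fromstring(xml_text)
    for url in list(root.findall(f"{{{SITEMAP_NS}}}url")):
        loc = url.findtext(f"{{{SITEMAP_NS}}}loc", default="").strip()
        if not _matches(urlparse(loc).path, path_prefixes):
            root.remove(url)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://www.fec.gov/data/</loc></url>"
        "<url><loc>https://www.fec.gov/about/</loc></url>"
        "</urlset>"
    )
    print(filter_sitemap(sample, ["/data", "/legal"]))
```

Filtering this way would also drop any Wagtail pages the online generator picked up, so the static sitemap does not duplicate entries from the auto-generated one.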


Related issue: Document search.gov sitemap requirements #5501

@johnnyporkchops
Contributor

Options have been documented above and the process has been tested in the WIP branch.
Waiting to hear back from search.gov support for the next issue: #5501 - Document search.gov XML sitemap requirements
