Implement prototype search for guidance policy #3554

Closed
10 tasks done
Tracked by #173
jason-upchurch opened this issue Feb 17, 2020 · 8 comments
Labels: Guidance Search (Work associated with Executive Order requiring Policy and Guidance search page), Work: CMS back-end, Work: Content, Work: Front-end

Comments

@jason-upchurch
Contributor

jason-upchurch commented Feb 17, 2020

Summary

This is a prototype implementation ticket that continues the work done under issues #3489, #3488, and #3527. The main focus is to test the searchability of a handful of pdfs, with limited scope and functionality, as a proof of concept/prototype. Follow-on work and tickets are identified in the high-level completion criteria below.

High-level completion criteria

  • identify some (e.g., 5-10) native pdfs from content-s3 that should be made searchable (content work, cc @dorothyyeager @kathycarothers )
    • technical considerations: should these be ones already uploaded through wagtail? Is it easiest to just identify the files and defer organizational questions (e.g., file system layout, name mapping/convention) for the prototype?
  • provide sitemap to search.gov (back-end cms: cc @jason-upchurch )
    • technical considerations: which tags and metadata should the sitemap include?
  • use search.gov platform as initial platform for returning search results (unsure, possibly coordination between design, back-end cms, front-end cms: cc @JonellaCulmer @patphongs @johnnyporkchops @jason-upchurch )
    • technical considerations: does the platform sufficiently allow user feedback on functionality or search expectations?
  • open additional tickets as needed. Some examples (not necessarily in sequential order):
@dorothyyeager
Contributor

With the list above and a related spreadsheet shared with @jason-upchurch, I'm passing this issue on to him to work on the rest! I'm available for any content-type questions that might come up. cc: @JonellaCulmer

@jason-upchurch
Contributor Author

jason-upchurch commented Feb 18, 2020

@dorothyyeager @JonellaCulmer thank you. I have the issue now. The pdfs @dorothyyeager identified live at the following locations:

For the time being, I can limit scope to pdfs. I'm not sure the press release directories given above map to the s3 bucket at all (for example, there is no statement-on-carey-fec "directory" in the bucket), so I'm not sure of the underlying document type or location, and they may be out of scope for the prototype. That is, I don't think the preceding research ticket investigated anything beyond making pdfs in s3 searchable.

@jason-upchurch
Contributor Author

jason-upchurch commented Feb 19, 2020

Tentative XML sitemap below. It has been sent to search.gov, and we are awaiting next steps to begin search testing.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://cg-47928592-406c-4536-8234-99b896e8d57d.s3-us-gov-west-1.amazonaws.com/cms-content/documents/fecfrm1m.pdf</loc>
<lastmod>2017-08-17T09:25:48+00:00</lastmod>
</url>
<url>
<loc>https://cg-47928592-406c-4536-8234-99b896e8d57d.s3-us-gov-west-1.amazonaws.com/cms-content/documents/fedreg_notice_2019-10_07292019.pdf</loc>
<lastmod>2019-09-17T08:00:33+00:00</lastmod>
</url>
<url>
<loc>https://cg-47928592-406c-4536-8234-99b896e8d57d.s3-us-gov-west-1.amazonaws.com/cms-content/documents/candgui.pdf</loc>
<lastmod>2018-10-03T11:07:29+00:00</lastmod>
</url>
<url>
<loc>https://cg-47928592-406c-4536-8234-99b896e8d57d.s3-us-gov-west-1.amazonaws.com/cms-content/documents/guideline-for-presentation-good-order.pdf</loc>
<lastmod>2020-02-05T13:12:21+00:00</lastmod>
</url>
<url>
<loc>https://cg-47928592-406c-4536-8234-99b896e8d57d.s3-us-gov-west-1.amazonaws.com/cms-content/documents/fecfrm1mi.pdf</loc>
<lastmod>2018-09-21T14:50:43+00:00</lastmod>
</url>
<url>
<loc>https://cg-47928592-406c-4536-8234-99b896e8d57d.s3-us-gov-west-1.amazonaws.com/cms-content/documents/fecfrm5.pdf</loc>
<lastmod>2018-03-16T10:58:46+00:00</lastmod>
</url>
<url>
<loc>https://cg-47928592-406c-4536-8234-99b896e8d57d.s3-us-gov-west-1.amazonaws.com/cms-content/documents/fedreg_notice2003-9.pdf</loc>
</url>
<url>
<loc>https://cg-47928592-406c-4536-8234-99b896e8d57d.s3-us-gov-west-1.amazonaws.com/cms-content/documents/enforcementprocedures_hearingtranscript-6-11-2003.pdf</loc>
</url>
<url>
<loc>https://cg-47928592-406c-4536-8234-99b896e8d57d.s3-us-gov-west-1.amazonaws.com/cms-content/documents/comment_democracy21_05222019.pdf</loc>
</url>
<url>
<loc>https://cg-47928592-406c-4536-8234-99b896e8d57d.s3-us-gov-west-1.amazonaws.com/cms-content/documents/comment_campaign_legal_center_05232019.pdf</loc>
</url>
</urlset>
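
For reference, a minimal sketch of how a sitemap like the one above could be generated instead of written by hand. This is not the script used for this ticket: `build_sitemap` and `entries` are hypothetical names. The sketch assumes the sitemaps.org 0.9 namespace on `<urlset>` and leaves out `<lastmod>` when no date is known, since that element is optional.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Optional, Tuple

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries: Iterable[Tuple[str, Optional[str]]]) -> str:
    """Return sitemap XML for (url, lastmod) pairs.

    lastmod is a W3C datetime string, or None when the date is unknown.
    """
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        # Only emit <lastmod> when there is an actual date to report.
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Two of the documents listed above: one with a known date, one without.
BASE = ("https://cg-47928592-406c-4536-8234-99b896e8d57d"
        ".s3-us-gov-west-1.amazonaws.com/cms-content/documents/")
entries = [
    (BASE + "fecfrm1m.pdf", "2017-08-17T09:25:48+00:00"),
    (BASE + "fedreg_notice2003-9.pdf", None),
]
sitemap_xml = build_sitemap(entries)
```

Building the tree with `xml.etree.ElementTree` rather than string concatenation means any special characters in a URL are escaped automatically.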

@jason-upchurch
Contributor Author

jason-upchurch commented Feb 27, 2020

For posterity: search.gov attempted to index the pdfs using the sitemap above but hit a 403 error. We asked cloud.gov to whitelist search.gov's IP address. This has been done, and we are now waiting for indexing to run so we can test whether it succeeds.
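
A rough sketch of how the status of each sitemap URL could be spot-checked, in case it helps with debugging responses like the 403. The names `sitemap_locs`, `head_status`, and `check_sitemap` are hypothetical, not part of any existing tooling here. Note the limitation: this only shows the status returned to the machine running the check, so it cannot confirm what search.gov's crawler sees from its own IP address.

```python
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List


def sitemap_locs(sitemap_xml: str) -> List[str]:
    """Extract every <loc> URL, with or without the sitemap namespace."""
    root = ET.fromstring(sitemap_xml.encode("utf-8"))
    locs = []
    for el in root.iter():
        # Namespaced tags look like "{namespace}loc"; strip that prefix.
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text:
            locs.append(el.text.strip())
    return locs


def head_status(url: str, timeout: float = 10.0) -> int:
    """Send a HEAD request and return the HTTP status code."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise HTTPError; the code is what we want.
        return err.code


def check_sitemap(
    sitemap_xml: str,
    fetch: Callable[[str], int] = head_status,
) -> Dict[str, int]:
    """Map each sitemap URL to the status code returned by fetch."""
    return {loc: fetch(loc) for loc in sitemap_locs(sitemap_xml)}
```

Calling `check_sitemap(sitemap_text)` on the sitemap contents returns a dict of URL to status code. The `fetch` parameter exists so the check can be exercised without network access. Connection-level failures (DNS, timeouts) are not caught and will raise.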

Coordinating with @dorothyyeager @AmyKort to plan next steps per @PaulClark2. cc @pkfec

@dorothyyeager
Contributor

sitemap ticket that @rfultz started: #3280

@dorothyyeager
Contributor

robots ticket from @rfultz that could play into this work: #2849

@jason-upchurch
Contributor Author

Prototype complete and functional. Continuing feature work under new issue(s), e.g., #3580 and #3581
