Research making PDF files in S3 searchable #3489
Comments
I suggest that we research using search.gov for this capability first. We currently use it as part of our site-wide global search.
@patphongs do we have existing documentation or references on how we integrate search.gov into the site-wide global search? I'm looking into this in parallel, but your answer may expedite that.
@dorothyyeager @patphongs cc @johnnyporkchops @PaulClark2 , question on this issue: does the research involve not only making the PDFs searchable, but also how to upload them? Or are the PDFs already there? If the former, maybe we can discuss whether or not it makes sense to break this into 2 separate issues: 1) how to upload (perhaps by automated process) and 2) how to search what is uploaded. What do you think?
@jason-upchurch The PDFs are all over the place (some in the CMS and some in transition, with inconsistent names and file directories, and lots of accessibility issues). We're working on cleaning them up now in my current ticket (#3488). There are a lot of them, so @AmyKort and @patphongs and I decided we'd figure out the best place to store them for purposes of searchability and making the new page, and then once my current ticket is done, the cleaned-up files will be moved to wherever is determined best. Pat, let me know if I summarized this correctly.
Thank you @dorothyyeager! Depending on what everyone (@AmyKort @patphongs @PaulClark2 ) decides, I might suggest splitting this in that case. A "search" issue could assume the PDFs exist, and perhaps the "cleaning up" or storing component is addressed by #3488 or a second issue? I missed yesterday, so I'm doing some catch-up on this.
Thank you @dorothyyeager. @jason-upchurch just to add, these PDF files either already exist or will exist in our content-S3 bucket. I imagine we would like to move them from our transition bucket to our content-S3 bucket. These PDFs can be uploaded to our content-S3 bucket via Wagtail or via direct S3 upload if there are a lot of files that need to be uploaded. These resources are mapped to www.fec.gov/resources/*. If you look inside our fec-proxy github prod manifest files, you'll see the exact content buckets and how they're mapped.
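For reference, a bulk direct-to-S3 upload could look roughly like the sketch below. This is a minimal sketch assuming boto3; the bucket name, key prefix, and local directory are placeholders, not the real values from the fec-proxy manifests.

```python
import pathlib

import boto3

# Assumptions (not the real fec-proxy values): bucket name and key prefix.
BUCKET = "fec-content"   # hypothetical content-S3 bucket name
PREFIX = "resources/"    # hypothetical prefix mapped to www.fec.gov/resources/*

s3 = boto3.client("s3")

def upload_pdfs(local_dir: str) -> None:
    """Upload every PDF under local_dir, preserving relative paths as S3 keys."""
    root = pathlib.Path(local_dir)
    for pdf in root.rglob("*.pdf"):
        key = PREFIX + pdf.relative_to(root).as_posix()
        s3.upload_file(
            str(pdf),
            BUCKET,
            key,
            ExtraArgs={"ContentType": "application/pdf"},
        )
        print(f"uploaded s3://{BUCKET}/{key}")

# upload_pdfs("./cleaned-pdfs")  # hypothetical local directory of cleaned files
```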
docs for site-wide search @patphongs referenced (search.gov): https://github.com/fecgov/fec-cms/blob/fe4399c2ae538540eadf35177b47a2f8e8ecd2bf/fec/search/management/instructions.md |
Possible wagtail solution: https://github.com/fourdigits/wagtail_textract |
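Related to the wagtail_textract suggestion, one early research step could be gauging how many of our PDFs are scanned images (no text layer) versus already text-searchable. Below is a minimal sketch using pdfminer.six as a stand-in (it does not use wagtail_textract's own API, which I haven't verified); the local directory path is hypothetical.

```python
from pathlib import Path

from pdfminer.high_level import extract_text  # pip install pdfminer.six

def needs_ocr(pdf_path: str, min_chars: int = 100) -> bool:
    """Heuristic: if we can't extract at least min_chars of text, treat it as a scan."""
    text = extract_text(pdf_path) or ""
    return len(text.strip()) < min_chars

# "./cleaned-pdfs" is a hypothetical local directory holding the inventoried PDFs.
for pdf in Path("./cleaned-pdfs").rglob("*.pdf"):
    if needs_ocr(str(pdf)):
        print(f"likely needs OCR before it is text-searchable: {pdf}")
```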
research questions:
Right now the PDFs on our website live in many different areas. I am not thinking of everything, and not every PDF will be needed for the new Policy and Guidance page. For the Policy and Guidance page, we will need to make a judgment call since some of the required documents live in the CMS and others now live on transition but will need to be moved. Also, we need to determine whether we just move the Policy Statement or all of the related files (for example, public comments received and drafts that were in the form of agenda documents). cc: @AmyKort @patphongs @kathycarothers
From @dorothyyeager 's google doc inventory, I found a case where one of the document names matched one in the s3 bucket: /cms-content/documents/fedreg_notice_2019-10_07292019.pdf Note this particular doc is located in the
@dorothyyeager thank you for this! I don't want to make too big a deal of identifying everything in precise detail as a component of the research ticket, but one recommendation I will probably make is that as a team we eventually decide precisely how all of the pdfs we want to make searchable are organized (I can help with this effort). The reason is that this is exactly the addressing we'll need to give search.gov so it can index our documents (and we'll probably need to script the daily building of a new sitemap, which will require a predictable place for new documents to get stored as well). I know this issue is much bigger than my research ticket, but I wanted to give at least some background on why I'm digging around on that issue. An ultimate goal of designing this functionality could be that when someone uploads a new document to Wagtail that we want to be searchable, the indexing is automated so that we don't have to keep a running inventory and manually add/remove pdfs from our sitemap.
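To make the daily sitemap idea more concrete, here is a rough sketch assuming boto3, a hypothetical bucket name and prefix, and a one-to-one mapping of resources/ keys to www.fec.gov/resources/* URLs; search.gov could then be pointed at the generated sitemap.

```python
import xml.etree.ElementTree as ET

import boto3

# Assumptions: bucket and prefix are hypothetical; keys under resources/ are
# assumed to map one-to-one to https://www.fec.gov/resources/* URLs.
BUCKET = "fec-content"
PREFIX = "resources/"
BASE_URL = "https://www.fec.gov/"

def build_sitemap(outfile: str = "sitemap.xml") -> None:
    """List every PDF in the bucket and write a sitemap that search.gov can crawl."""
    s3 = boto3.client("s3")
    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            if not obj["Key"].lower().endswith(".pdf"):
                continue
            url = ET.SubElement(urlset, "url")
            ET.SubElement(url, "loc").text = BASE_URL + obj["Key"]
            ET.SubElement(url, "lastmod").text = obj["LastModified"].strftime("%Y-%m-%d")
    ET.ElementTree(urlset).write(outfile, encoding="utf-8", xml_declaration=True)

# build_sitemap()  # e.g. run daily as a scheduled task
```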
[This section is WIP @dorothyyeager @patphongs]
Conclusions:
- Wagtail
- s3 bucket (content-s3)
- search.gov
- pdfs
Recommended next steps: build a prototype to test search functionality (a rough query sketch follows this list)
Recommended longer-term steps (can be done in parallel to above): create a workflow/framework/logical organization for pdfs
Unknowns
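A rough sketch of what a search prototype could query, assuming the search.gov results API; the endpoint, affiliate handle, access key, and response shape below are assumptions that would need to be confirmed against the search.gov documentation and our existing site-wide search configuration.

```python
import requests

# Assumptions: endpoint, affiliate handle, and response shape should be verified
# against current search.gov documentation and our existing configuration.
SEARCH_ENDPOINT = "https://search.usa.gov/api/v2/search/i14y"
AFFILIATE = "fec"            # hypothetical affiliate handle
ACCESS_KEY = "REPLACE_ME"    # issued by search.gov

def search_pdfs(query: str):
    """Return raw web results for a query so we can check whether PDFs show up."""
    resp = requests.get(
        SEARCH_ENDPOINT,
        params={"affiliate": AFFILIATE, "access_key": ACCESS_KEY, "query": query},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("web", {}).get("results", [])

# for hit in search_pdfs("policy statement"):
#     print(hit.get("title"), hit.get("url"))
```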
@jason-upchurch One consideration: especially for things like the Guides and the forms, they get changed occasionally. They are also linked widely and bookmarked by people. So I just want to ensure their URLs remain the same as now, and also that content/Info Div staff can be the ones who upload them (we don't want to stick the front end with a task we've always been able to do). When we've been going through the docs for an accessibility check and fix, we have been selecting the "make searchable" option on the ones that were scanned images (most were not). So I'm not sure we will need to do anything else? cc: @patphongs
@dorothyyeager thank you for noting this. These are somewhat independent issues (yet with our current setup one cannot quite ignore the other). One issue (this issue) has to do with searches and ensuring that search results are based on specified pdfs whose location is explicit. The other has to do with keeping url links to specified resources in place. I think both can be satisfied, but it's important to note the implication of what you say, i.e., a search solution cannot necessarily rely on wholesale moving a file from one place to another. The thing I'm most apprehensive about is that no matter what the solution is, we'll need exact locations of all the documents that need to be searchable, and ideally a way to systematize things so that permanence and searchability are both satisfied.
Summary
What we are after:
To comply with the executive order and best practices, we need to figure out how to make our PDF files (ones uploaded via Wagtail and ones uploaded directly to the S3 bucket) searchable and how to index them for search.gov.
Related issues
List any relevant related issue(s)
Completion criteria
List action items that need to be done to complete this task. List in checklist formatting.
Future work