Decision: How to handle resources search

Britta edited this page Apr 25, 2023 · 56 revisions
Relevant features: Full-text resources search
Date started: 2023-04-11
Date finished:
Decision status: Working on it
Summary of outcome:

Background/context

Outside of the regulation text itself, CMCS policy staff need to find and reference Medicaid & CHIP policy information in a wide variety of materials that are hosted on many websites, including statute (uscode.house.gov; govinfo.gov; congress.gov), rules (federalregister.gov), subregulatory guidance and implementation resources (medicaid.gov), GAO reports (gao.gov), and other informational materials (cms.gov; hhs.gov). It can take a lot of experience and time to find what you need. People also sometimes use unreliable websites when they don't have experience with better options.

We're trying to reduce their burden by providing a "one stop shop" experience for Medicaid & CHIP policy information. We've already made a curated, annotated, cross-referenced list of about 2600 links to resources.

We need to improve our resources search system, because it currently meets only some user needs.

We have two technologies we can use for resources search. Each has significant benefits and significant limitations.

Core questions

How do we provide a resources search experience that meets most of our user needs?

What we know

User needs

When searching guidance, rules, and other subregulatory or supplemental materials, policy researchers need relevant and comprehensive search results.

Relevant means: When a user enters a query and looks at the top results, those documents should be substantially related to the topic described in the query. Lower results should be moderately related to the query. In other words: if we have anything in our collection that answers their question or provides what they're looking for, it should be in the top results.

Comprehensive means: When a user enters a query, our search system returns results from across our entire collection of resources. Within our entire collection, the system looks for the query words in the entire document text. No documents in our collection are omitted from the search.

We don't have to provide a perfect search system. But before we change our search system, we need to be confident that the change makes it better.

Reference materials: supplemental content search stories on Dovetail, some quotes about related needs, comparison of results for recent searches.

In-house metadata search

Using Postgres search, we return results from our collection of document names and descriptions.

Pros:

  • Relevance:
    • It produces highly relevant results for many organic searches, because if a keyword is in a document name or description, that's a very strong signal that the document is relevant to that keyword.
    • We control the metadata that we're searching (by editing items in our admin panel), which enables us to improve relevance. Examples:
      • Some documents, such as older SMDLs, don't have titles in the document, so we write brief descriptions in our admin panel.
      • If a document title only uses an abbreviation for a term, we can write a modified description that spells out the term and includes the abbreviation as well.
      • We have many links to extremely long PDFs of old Federal Register documents, which were scanned and may not even be OCRed. We hand-wrote the descriptions for those documents into our database, and we index that metadata, so our search consistently returns those results when their descriptions match query keywords.
  • Comprehensiveness:
    • It reliably produces results from our complete index of documents.
  • Cost:
    • This is our existing low-cost solution.

Cons:

  • Relevance:
    • None
  • Comprehensiveness:
    • Because search is limited to document metadata, not the text of the documents, this search is not sufficiently comprehensive. Many organic queries return zero or few results, even when we have documents with contents that contain the query keywords, because the keywords are not in the document name or description. This challenge means that this search frequently does not produce the results that our users are looking for.
  • Cost:
    • None
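
The comprehensiveness gap above can be illustrated with a minimal Python sketch. This is not our Postgres implementation; the `Resource` class, its field names, and the sample records are all made up for illustration. It only shows why a search limited to names and descriptions returns nothing when the keyword appears only in the document body.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Resource:
    # Hypothetical record shape; field names are assumptions for illustration.
    name: str
    description: str
    body_text: str  # full document text, which metadata search never sees


def words(text: str) -> set[str]:
    """Lowercase the text and split it into a set of words."""
    return {w.strip('.,;:()"').lower() for w in text.split()}


def metadata_search(query: str, resources: list[Resource]) -> list[Resource]:
    """Return resources whose name or description contains every query word."""
    terms = words(query)
    return [r for r in resources if terms <= words(r.name + " " + r.description)]


# Made-up sample records, not real items from our collection.
resources = [
    Resource(
        "SMDL on dental coverage",
        "Guidance on dental benefits",
        "Full text of the letter.",
    ),
    Resource(
        "State Health Official letter",
        "CHIP eligibility update",
        "This letter also covers postpartum coverage.",
    ),
]

dental_hits = metadata_search("dental", resources)
postpartum_hits = metadata_search("postpartum", resources)
```

In this sketch, "dental" matches the first record through its name and description, while "postpartum" matches nothing, even though the second record's body text contains it.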

Search.gov full-text search

Search.gov tries to index the full text of our complete collection of document links.

Pros:

  • Relevance:
    • Due to its comprehensiveness, it produces relevant results for many organic searches that produce zero or few results with metadata search.
  • Comprehensiveness:
    • It indexes the complete text of HTML pages, PDFs, Word docs, Excel sheets, and more. This comprehensiveness is super valuable for our users.
  • Cost:
    • This service is free to us, including crawling, document storage, and indexing.
  • The Search.gov team plans to improve this tool. They have their own engineering team with specialized skill in search engines.

Cons:

  • Relevance:
    • For some queries, such as "postpartum" or "dental", the results have a confusing ranking order -- the top result seems less relevant than the last result (estimated by counting the number of times the term appears in the document compared to its number of pages).
    • For multi-word queries, the initial results can be mostly irrelevant, even though a quoted search for that phrase generates relevant results.
    • It has a naive form of stemming that creates irrelevant matches. Example: search for "community first choice" (without quotes) and get results with the term "communications".
    • It indexes navigation menus for the Federal Register and other websites, so it produces a lot of irrelevant results if your query word happens to be in the navigation menu.
  • Comprehensiveness:
    • It is not able to consistently index our entire collection of documents. About half of our documents seem to be missing from its results. Many organic queries return incomplete results, missing many documents with descriptions and contents that contain the query keywords. We don't fully understand why this is happening. See "What we don't know" below for opportunities to learn more.
    • It cannot index certain items hosted on sites that block its crawler, mainly MACPro training videos hosted on YouTube and Streamlined Modular Certification Word/Excel documents hosted on GitHub.
  • Cost:
    • None
  • The Search.gov team does not plan to do substantial work on this cherry-picked index tool this quarter. This cherry-picked list of URLs (across many websites) is a less-common use case for them than "search all of the pages within this one website", so it's a lower priority, but they do plan to improve its relevance calculations (such as by not indexing navigation menus).
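
The rough relevance estimate mentioned above (how often the term appears in a document relative to its page count) can be sketched in Python as follows. The function name `term_density` and the choice of whole-word, case-insensitive matching are assumptions for illustration, not a description of how Search.gov ranks results.

```python
import re


def term_density(term: str, text: str, page_count: int) -> float:
    """Occurrences of `term` as a whole word (case-insensitive) per page."""
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    # re.escape keeps punctuation in the term from being read as regex syntax.
    pattern = r"\b" + re.escape(term) + r"\b"
    hits = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return hits / page_count
```

Comparing this number for the top and bottom results of a query gives a quick, informal check on whether the ranking order looks reasonable.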

Hypothetical hybrid search

We have an experiment showing that if you combine the results from both systems, you can get relevant and comprehensive results. The concept:

  • Our metadata results show up as the top items, because they always have the strongest relevance.
  • After that, we display the Search.gov results, because they give comprehensiveness to the results. (We remove any items that already appeared in the metadata results, to avoid duplicates.)
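
The concept above can be sketched in Python as a simple ordered merge. This is a minimal illustration, not our prototype: the function name `merge_results` is hypothetical, and it assumes each result is identified by a URL string that is written the same way in both systems.

```python
def merge_results(metadata_urls, fulltext_urls):
    """Metadata results first, then full-text results not already listed."""
    merged = []
    seen = set()
    # Walk the metadata results first so they keep the top positions.
    for url in list(metadata_urls) + list(fulltext_urls):
        if url not in seen:
            seen.add(url)
            merged.append(url)
    return merged


# Placeholder URLs for illustration only.
metadata_hits = ["https://example.gov/a", "https://example.gov/b"]
fulltext_hits = ["https://example.gov/b", "https://example.gov/c"]
combined = merge_results(metadata_hits, fulltext_hits)
```

If the two systems format the same URL differently (for example, with and without a trailing slash), the duplicate check would need a normalization step before comparing.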

Pros:

  • Always delivers results that are at least as relevant and comprehensive as our current results, while meeting user needs for increased comprehensiveness.

Cons:

  • Not a best practice from an engineering perspective; it could be fragile and tricky to maintain (see "What we don't know" below).

Hypothetical development of our own custom full-text resources search system

If we wanted to crawl and index documents ourselves, we would need to estimate the potential AWS costs of that work and review it before proceeding. We are not likely to get approved for any non-trivial AWS cost increases.

What we don't know

About Search.gov

  • When Search.gov tries to index Medicaid.gov documents, why does Medicaid.gov often return a 403 error even for documents available to the public, which prevents indexing of those documents?
    • We're scheduled to learn more about this on Wednesday. We're hoping we can resolve this.
  • Why does Search.gov not index many of the pages in the RSS feed that we send them, especially in the second half of the feed?
    • We'll try a delete-and-reindex after we learn more about the 403 error issue. We've resolved a lot of issues with our RSS feed since their initial indexing of it, so we're hoping a fresh start will increase the comprehensiveness of their index.
  • If we get the comprehensiveness issues fixed, would the Search.gov method become sufficient?
    • We'll have to try this and see, but it'll probably still have relevancy downsides when compared to our metadata search.
  • How long will it take for Search.gov to improve relevancy in the feature that we're using?
    • Not this quarter, maybe next quarter, but they don't know for sure.
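
One way to track the comprehensiveness questions above, before and after a delete-and-reindex, is to compare the URLs in our feed against the URLs the index returns. The Python sketch below assumes we already have both lists; how to obtain the indexed list is not covered here, and the function name `index_coverage` is hypothetical.

```python
def index_coverage(feed_urls, indexed_urls):
    """Report which feed URLs are missing from the index, with feed positions."""
    indexed = set(indexed_urls)
    missing = [
        (pos, url) for pos, url in enumerate(feed_urls) if url not in indexed
    ]
    half = len(feed_urls) / 2
    # Count missing items whose position falls in the second half of the feed.
    missing_second_half = sum(1 for pos, _ in missing if pos >= half)
    return {
        "coverage": 1 - len(missing) / len(feed_urls) if feed_urls else 0.0,
        "missing": missing,
        "missing_in_second_half": missing_second_half,
    }
```

Recording the positions of missing items makes it easier to see whether gaps still cluster in the second half of the feed after a reindex.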

About impacts on our system

  • What would be the maintenance and technical debt implications of a hybrid option? (We only have a prototype implementation right now; we haven't determined what a production implementation would look like.)
    • Would a hybrid implementation make it harder for us to apply routine updates to Postgres, Django, Vue, or any of our other components?
    • If a hybrid implementation didn't end up working well for us, could we remove it relatively easily and revert to our metadata-only search?
    • Would the complexity make it buggy and hard to debug?
    • Would the complexity make the system hard to learn for new developers?
    • Would the relevancy combination be confusing for users?
    • Could we create clear requirements and architectural design for this feature, and maintain our coding standards?
    • Are there things we could do to mitigate technical downsides?
  • Could we produce an interim hybrid solution that would hold us over until Search.gov modernizes its system?
  • What would be the effort, time, costs, and opportunity costs involved in an alternate option?
    • We have a hypothesis that we could crawl, store, and use Postgres full-text search on the URLs (PDFs, HTML, etc), in a relatively low-cost way.
    • There are open source search solutions that we could run, such as Elasticsearch (Haystack for Django).

Things we need to decide + options for them

How do we best meet our user needs?

Decision

Consequences
