Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Search service proposal #2203

Merged
merged 2 commits into from
Nov 17, 2022
Merged

Search service proposal #2203

merged 2 commits into from
Nov 17, 2022

Conversation

pawel-big-lebowski
Copy link
Collaborator

Signed-off-by: Pawel Leszczynski [email protected]

Problem

Search service proposal

Closes #2180

Solution

Within PR.

Note: All database schema changes require discussion. Please link the issue for context.

Checklist

  • You've signed-off your work
  • [] Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • You've updated the CHANGELOG.md with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
  • You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
  • You've included a header in any source code files (if relevant)

@codecov
Copy link

codecov bot commented Oct 19, 2022

Codecov Report

Merging #2203 (727e6a9) into main (7885c8c) will increase coverage by 0.01%.
The diff coverage is n/a.

❗ Current head 727e6a9 differs from pull request most recent head fc851d3. Consider uploading reports for the commit fc851d3 to get more accurate results

@@             Coverage Diff              @@
##               main    #2203      +/-   ##
============================================
+ Coverage     76.72%   76.73%   +0.01%     
- Complexity     1147     1148       +1     
============================================
  Files           219      219              
  Lines          5318     5318              
  Branches        423      423              
============================================
+ Hits           4080     4081       +1     
  Misses          763      763              
+ Partials        475      474       -1     
Impacted Files Coverage Δ
...pi/src/main/java/marquez/db/mappers/RunMapper.java 90.56% <0.00%> (+1.88%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@pawel-big-lebowski pawel-big-lebowski marked this pull request as ready for review October 20, 2022 10:04
proposals/2180-search-service.md Outdated Show resolved Hide resolved
proposals/2180-search-service.md Outdated Show resolved Hide resolved
proposals/2180-search-service.md Outdated Show resolved Hide resolved
```
* In order to find datasets ES will query `dataset` index. When searching for jobs, `run` and `job` subtrees of `lineage` index will be queried.
* A response from ES will contain runId for jobs, and (namespace, name) tuples for datasets.
* Based on that, additional query to PostgreSQL will be sent to convert this into list of `SearchResult` entries.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be returned by querying postgres? Wouldn't the object be stored in ES and returned to the caller?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, it's not necessary. SearchResult can be evaluated from ElasticSearch backend.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think as a follow up, we can discuss how we might want to revisit the SearchResult model and was to extend it.

proposals/2180-search-service.md Outdated Show resolved Hide resolved
@mobuchowski
Copy link
Contributor

I discussed this with Paweł offline, but I believe there are few things missing:

  • we shouldn't index everything, but whitelisted facets
  • we should know where we've found the keyword and somehow describe it in the search result

@pawel-big-lebowski pawel-big-lebowski force-pushed the proposals/search-service-proposal branch 2 times, most recently from 5961a04 to 8344749 Compare October 28, 2022 12:41
@pawel-big-lebowski
Copy link
Collaborator Author

pawel-big-lebowski commented Oct 28, 2022

I discussed this with Paweł offline, but I believe there are few things missing:

  • we shouldn't index everything, but whitelisted facets

Blacklisting spark_plan and spark_unknown definitely makes sense. I would be in favor of blacklisting just to allow searching on user's facets.

  • we should know where we've found the keyword and somehow describe it in the search result

As discussed, ElasticSeach highlight feature enriches document with <em> tag which is not super-clean and is tightly-coupled with ElasticSearch backend. I would prefer to start without it, we can add event preview on a list of search result to help user find out how the search lead to such results.

Signed-off-by: Pawel Leszczynski <[email protected]>
Copy link
Member

@wslulciuc wslulciuc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A well thought-out and written proposal. Great job, @pawel-big-lebowski 💯 💯 🥇

@wslulciuc wslulciuc enabled auto-merge (squash) November 17, 2022 08:14
@wslulciuc wslulciuc merged commit 1d8334b into main Nov 17, 2022
@wslulciuc wslulciuc deleted the proposals/search-service-proposal branch November 17, 2022 08:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Search service proposal
3 participants