Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(seer grouping): Add Seer-related ingest helpers #70999

Merged
merged 4 commits into from
May 16, 2024

Conversation

lobsterkatie
Copy link
Member

@lobsterkatie lobsterkatie commented May 16, 2024

This adds two helpers, should_call_seer_for_grouping and get_seer_similar_issues, to be used when we (maybe) call Seer as part of event ingestion.

should_call_seer_for_grouping does exactly what you'd think given the name, right now only basing the decision on feature flags and whether or not the event has a usable title and/or stacktrace. In the future we'll also include rate limit and killswitch checks, and any other criteria which it makes sense to add.

get_seer_similar_issues is a wrapper around get_similarity_data_from_seer (which is what actually makes the API call to Seer). It extracts request data from the given event, makes the request, pulls together metadata about the results, and if a matching group is found and the flag is on, pulls the Group record out of the database. (I chose to put the feature flag check there rather than in the code where the the the grouping actually happens so that we can save the trip to the database if we're not going to end up using the results for grouping.)

Code to actually use these helpers is added in #71026.

@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label May 16, 2024
Copy link

codecov bot commented May 16, 2024

Codecov Report

Attention: Patch coverage is 92.59259% with 2 lines in your changes are missing coverage. Please review.

Project coverage is 77.89%. Comparing base (1e61cb7) to head (2d9484b).
Report is 1 commits behind head on master.

Current head 2d9484b differs from pull request most recent head b2bc90b

Please upload reports for the commit b2bc90b to get more accurate results.

Additional details and impacted files
@@             Coverage Diff             @@
##           master   #70999       +/-   ##
===========================================
+ Coverage        0   77.89%   +77.89%     
===========================================
  Files           0     6515     +6515     
  Lines           0   290393   +290393     
  Branches        0    50252    +50252     
===========================================
+ Hits            0   226215   +226215     
- Misses          0    57936    +57936     
- Partials        0     6242     +6242     
Files Coverage Δ
src/sentry/grouping/ingest/seer.py 92.59% <92.59%> (ø)

... and 6514 files with indirect coverage changes

num_neighbors: int = 1,
) -> tuple[
dict[
str, str | list[dict[str, float | bool | int | str]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

worth making a TypedDict for this type?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sigh This is why I say Python is the IHOP of typing. I agree this isn't great, but there's no way that I've been able to figure out (and googling suggests it's because no way exists) to create a typeddict out of a dataclass.

So I can, but it'll just mean hard coding a second copy of all of the attributes from both the SeerSimilarIssuesMetadata and SeerSimilarIssueData dataclasses, which feels worse to me (I think because a change to either one of those classes would be guaranteed to make the hypothetical typeddict need updating whereas there's a good chance they could be changed without this type needing to be any different).

WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah totally. its good as is!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Googling is so 2010s :-P Just ask Copilot ;-)

def dataclass_to_typeddict(cls):
    return TypedDict(cls.__name__ + 'Dict', {k: v for k, v in get_type_hints(cls).items()})

SeerSimilarIssuesMetadataDict = dataclass_to_typeddict(SeerSimilarIssuesMetadata)
SeerSimilarIssueDataDict = dataclass_to_typeddict(SeerSimilarIssueData)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If only, @vartec. TIL about get_type_hints, but my very first attempt to wrestle with the dataclass/typeddict divide was to do something like this:

fields = {"a": int, "b": str}
DerivedTypedDict = TypedDict("DerivedTypedDict", fields)

and then construct the dataclass from the same fields value. Alas, mypy will have none of it: error: TypedDict() expects a dictionary literal as the second argument.

Here's an open issue about it in the mypy repo: python/mypy#4128.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mypy is the worst...

@lobsterkatie lobsterkatie force-pushed the kmclb-add-seer-ingest-helpers branch from 7df4e9e to 49ec31c Compare May 16, 2024 17:17
Base automatically changed from kmclb-add-SeerSimilarIssuesMetadata-type to master May 16, 2024 17:28
@lobsterkatie lobsterkatie force-pushed the kmclb-add-seer-ingest-helpers branch from 49ec31c to ebf05ed Compare May 16, 2024 17:32
Comment on lines +30 to +36
if (
event.title in PLACEHOLDER_EVENT_TITLES
and not get_path(event.data, "exception", "values", -1, "stacktrace", "frames")
and not get_path(event.data, "threads", "values", -1, "stacktrace", "frames")
):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment explaining this condition, please?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (
event.title in PLACEHOLDER_EVENT_TITLES
and not get_path(event.data, "exception", "values", -1, "stacktrace", "frames")
and not get_path(event.data, "threads", "values", -1, "stacktrace", "frames")
):
if not (
get_path(event.data, "exception", "values", -1, "stacktrace", "frames")
or get_path(event.data, "threads", "values", -1, "stacktrace", "frames")
or event.title not in PLACEHOLDER_EVENT_TITLES
):

@lobsterkatie lobsterkatie force-pushed the kmclb-add-seer-ingest-helpers branch from 2d9484b to b2bc90b Compare May 16, 2024 19:56
@lobsterkatie lobsterkatie merged commit 08a47c9 into master May 16, 2024
48 checks passed
@lobsterkatie lobsterkatie deleted the kmclb-add-seer-ingest-helpers branch May 16, 2024 20:56
lobsterkatie added a commit that referenced this pull request May 17, 2024
This uses the helpers added in #70999 to - depending on the state of the `projects:similarity-embeddings-metadata` and `projects:similarity-embeddings-grouping` flags - decide whether we should call Seer before creating a new group, make the API call if so, and then store the results and/or use them to actually prevent new group creation in favor of using an existing similar issue. The behavior is as follows:


| metadata  | grouping | call  | metadata in | metadata in | use Seer-matched |
|   flag    |  flag    | Seer? |   event?    |   group?    |  group, if any?  |
|-----------|----------|-------|-------------|-------------|------------------|
| off       | off      | no    | -           | -           | -                |
| on        | off      | yes   | yes *       | yes         | no               |
| on or off | on       | yes   | yes *       | only if new | yes              |

* For now, the only event with the data will be the event which triggers the Seer 
call, not subsequent events with that hash. In the long run we will probably need
to store the data on the `GroupHash` record itself. 
See #70454.


This should be enough for us to run a POC on S4S and measure the effect on grouping.
cmanallen pushed a commit that referenced this pull request May 21, 2024
This adds two helpers, `should_call_seer_for_grouping` and `get_seer_similar_issues`, to be used when we (maybe) call Seer as part of event ingestion. 

`should_call_seer_for_grouping` does exactly what you'd think given the name, right now only basing the decision on feature flags and whether or not the event has a usable title and/or stacktrace. In the future we'll also include rate limit and killswitch checks, and any other criteria which it makes sense to add.

`get_seer_similar_issues` is a wrapper around `get_similarity_data_from_seer` (which is what actually makes the API call to Seer). It extracts request data from the given event, makes the request, pulls together metadata about the results, and if a matching group is found and the flag is on, pulls the `Group` record out of the database. (I chose to put the feature flag check there rather than in the code where the the the grouping actually happens so that we can save the trip to the database if we're not going to end up using the results for grouping.)

Code to actually use these helpers is added in #71026.
cmanallen pushed a commit that referenced this pull request May 21, 2024
This uses the helpers added in #70999 to - depending on the state of the `projects:similarity-embeddings-metadata` and `projects:similarity-embeddings-grouping` flags - decide whether we should call Seer before creating a new group, make the API call if so, and then store the results and/or use them to actually prevent new group creation in favor of using an existing similar issue. The behavior is as follows:


| metadata  | grouping | call  | metadata in | metadata in | use Seer-matched |
|   flag    |  flag    | Seer? |   event?    |   group?    |  group, if any?  |
|-----------|----------|-------|-------------|-------------|------------------|
| off       | off      | no    | -           | -           | -                |
| on        | off      | yes   | yes *       | yes         | no               |
| on or off | on       | yes   | yes *       | only if new | yes              |

* For now, the only event with the data will be the event which triggers the Seer 
call, not subsequent events with that hash. In the long run we will probably need
to store the data on the `GroupHash` record itself. 
See #70454.


This should be enough for us to run a POC on S4S and measure the effect on grouping.
@github-actions github-actions bot locked and limited conversation to collaborators Jun 1, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Scope: Backend Automatically applied to PRs that change backend components
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants