Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Line deduplication filter #3110

Closed

Conversation

dannykopping
Copy link
Contributor

@dannykopping dannykopping commented Jan 3, 2021

What this PR does / why we need it:

Introduces a dedup filter to reduce a log stream by grouping on unique label dimensions.

Which issue(s) this PR fixes:

None. I implemented this quickly on a whim so have not discussed whether we actually want this; feel free to reject on this basis.

Special notes for your reviewer:

I wrote this feature to scratch an itch I had when using Loki as a datasource for annotations in Grafana; LogQL only offers line/label filtering for reducing the returned lines in a log stream, which doesn't quite fit my use-case.

In my use-case, I have multiple log lines that are "duplicated", with only a unique cluster label in each; I didn't want to force the user of my dashboard to have to choose a cluster to filter on since this data is generally consistent across clusters (and choosing an arbitrary cluster to filter on to achieve the same result doesn't express my intent clearly); figured this feature might be useful to other folks too.

I tried to mirror the nomenclature used in the Grafana Explore view, although this may cause confusion:
image
Open to suggestions on this.

If this PR is approved, I'm happy to implement the deduplication by exact/number/signature logic in LogQL.

Checklist

  • Documentation added
  • Tests updated

@CLAassistant
Copy link

CLAassistant commented Jan 3, 2021

CLA assistant check
All committers have signed the CLA.

@dannykopping dannykopping changed the title Dannykopping/line dedup filter Line deduplication filter Jan 3, 2021
@dannykopping dannykopping force-pushed the dannykopping/line_dedup_filter branch from c1aed95 to b9b54e4 Compare January 3, 2021 22:49
@dannykopping dannykopping marked this pull request as ready for review January 4, 2021 08:29
@cyriltovena cyriltovena self-assigned this Jan 5, 2021
@owen-d owen-d self-assigned this Jan 5, 2021
Copy link
Member

@owen-d owen-d left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left a few comments/suggestions here and there but overall it's well done. Great job figuring out how the sausage is made and extending this.

Todo (Owen/Cyril):

  1. How will this affect sharding? Presumably dedup would need to be applied twice (at the edge and at the aggregation layer.
  2. The new node types need to be added to the new Walk functions I think.


func NewLineDedupFilter(labelFilters []string, inverted bool) *LineDedupFilter {
// create a map of labelFilters for O(1) lookups instead of O(n)
var filterMap = make(map[string]interface{})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can use map[string]struct{} here.

Copy link
Contributor Author

@dannykopping dannykopping Jan 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, I missed this in my refactor - thanks 👍

pkg/logql/log/dedup_filter.go Outdated Show resolved Hide resolved
pkg/logql/log/dedup_filter_test.go Outdated Show resolved Hide resolved
pkg/logql/log/dedup_filter_test.go Outdated Show resolved Hide resolved
pkg/logql/parser_test.go Outdated Show resolved Hide resolved
pkg/logql/parser_test.go Outdated Show resolved Hide resolved
Danny Kopping added 8 commits January 5, 2021 22:05
Signed-off-by: Danny Kopping <[email protected]>
Signed-off-by: Danny Kopping <[email protected]>
Signed-off-by: Danny Kopping <[email protected]>
Signed-off-by: Danny Kopping <[email protected]>
Signed-off-by: Danny Kopping <[email protected]>
Signed-off-by: Danny Kopping <[email protected]>
Signed-off-by: Danny Kopping <[email protected]>
Signed-off-by: Danny Kopping <[email protected]>
@dannykopping dannykopping force-pushed the dannykopping/line_dedup_filter branch from 0c37f18 to 198640e Compare January 5, 2021 20:06
@dannykopping
Copy link
Contributor Author

@owen-d thought I'd also just drop in the benchmarking results to see if this is any cause for concern:

goos: linux
goarch: amd64
pkg: github.com/grafana/loki/pkg/logql/log
Benchmark_Deduplication
Benchmark_Deduplication     	10071492	       102 ns/op	      32 B/op	       1 allocs/op
Benchmark_Deduplication-2   	13297992	        93.6 ns/op	      32 B/op	       1 allocs/op
Benchmark_Deduplication-4   	11089706	       101 ns/op	      32 B/op	       1 allocs/op
Benchmark_Deduplication-8   	11677495	        97.4 ns/op	      32 B/op	       1 allocs/op
PASS

Signed-off-by: Danny Kopping <[email protected]>

func NewLineDedupFilter(labelFilters []string, inverted bool) *LineDedupFilter {
// create a map of labelFilters for O(1) lookups instead of O(n)
var filterMap = make(map[string]struct{})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
var filterMap = make(map[string]struct{})
var filterMap = make(map[string]struct{}, len(labelFilters))

@cyriltovena
Copy link
Contributor

I think this is a great, I wanted to implement something similar.

I'm wondering though, wouldn't be this more useful if it includes somehow the number of duplicates reduced for each line ?

I think this would help the client to emphasis on a specific message, but then there some UI work.

I'm also wondering how it will looks like in Explore histogram, we could get this deployed in dev and give it a try.

In my use-case, I have multiple log lines that are "duplicated", with only a unique cluster label in each; I didn't want to force the user of my dashboard to have to choose a cluster to filter on since this data is generally consistent across clusters (and choosing an arbitrary cluster to filter on to achieve the same result doesn't express my intent clearly); figured this feature might be useful to other folks too.

Can you share some example of that use-case ? I'm curious to compare to my use-cases, filtering duplicate errors like below using the dedupe by (msg):

image

Signed-off-by: Danny Kopping <[email protected]>
@dannykopping
Copy link
Contributor Author

I'm wondering though, wouldn't be this more useful if it includes somehow the number of duplicates reduced for each line ?

I think this would help the client to emphasis on a specific message, but then there some UI work.

I'm also wondering how it will looks like in Explore histogram, we could get this deployed in dev and give it a try.

Funnily enough, I implemented that initially and then disregarded it.

How do you see this working though? I'm thinking of a meta-label like __dedup_count__ (similar in nature to __error__).

Can you share some example of that use-case ? I'm curious to compare to my use-cases, filtering duplicate errors like below using the dedupe by (msg):

That's almost exactly my use-case too; in my use-case I want to reduce the stream of logs by using dedup without (cluster) as it's the only unique label in an otherwise identical log entry (reducing cardinality would be the overarching goal, you could say).

@dannykopping
Copy link
Contributor Author

dannykopping commented Jan 6, 2021

I tried to quickly hack together a dedup counter, with that __dedup_count__ meta-label attached to the returned line, but I've run into an issue I don't understand yet.

dannykopping/loki@dannykopping/line_dedup_filter...dannykopping:dannykopping/line_dedup_filter_count

Even if I keep a reference to the LabelsBuilder instance associated to the line that is eventually returned, I cannot update it later on (I always end up with __dedup_count__=1).

Update:

I see now that p.builder.Reset() is being called in func (p *streamPipeline) Process(line []byte), and the labels are processed to a result once the line is returned, so that explains it:

return line, p.builder.LabelsResult(), true

What would suggest here? I can't see a way right now to attach a meta-label to the 1 returned line using my existing mechanism of deduplication.

@cyriltovena
Copy link
Contributor

I tried to quickly hack together a dedup counter, with that __dedup_count__ meta-label attached to the returned line, but I've run into an issue I don't understand yet.

dannykopping/loki@dannykopping/line_dedup_filter...dannykopping:dannykopping/line_dedup_filter_count

Even if I keep a reference to the LabelsBuilder instance associated to the line that is eventually returned, I cannot update it later on (I always end up with __dedup_count__=1).

Update:

I see now that p.builder.Reset() is being called in func (p *streamPipeline) Process(line []byte), and the labels are processed to a result once the line is returned, so that explains it:

return line, p.builder.LabelsResult(), true

What would suggest here? I can't see a way right now to attach a meta-label to the 1 returned line using my existing mechanism of deduplication.

Not sure yet, but this is not all your problems. Right now, those pipelines are run on ingesters too, so you will be deduping in multiple place, you'll need to merge result a bit differently, this is trickier than you would expect.

There's also dedupe from replication that needs to happens before otherwise they will count tower duplicates and they should not, currently it's not a big problem because we only add labels, but I want to solve this separately.

I'm fine with the meta label solution, but we should see with everyone in the team, you should join our weekly or community meeting.

@cyriltovena
Copy link
Contributor

2. The new node types need to be added to the new Walk functions I think.

children(expr SampleExpr) []SampleExpr it's only for sample for now.

@stale
Copy link

stale bot commented Feb 9, 2021

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale A stale issue or PR that will automatically be closed. label Feb 9, 2021
@stale stale bot closed this Feb 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size/XXL stale A stale issue or PR that will automatically be closed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants