pkg/chunkenc: ignore duplicate lines pushed to a stream #1519
Conversation
Instead of ignoring, should we consider fudging the timestamp?
In addition to not applying to flushed chunks, it also doesn't apply to the last entry in the previous block, though we could probably address that?
I'm not sure I understand why we'd want to do that? Consider what happens today:
There are a few issues with this:
We could; I didn't bother because it requires decompressing the entire chunk. It probably wouldn't be required that often, but it sounds like a non-zero performance hit, right?
There is a different case here: the client sends two different log lines with the same timestamp. That is the case where I think we could consider fudging the timestamp for them.
Yeah, I don't know, that sounds hard to get right. What if all their logs are only precise to the second and they send 50 log lines with the same timestamp? We'd have to track the fudged timestamp over time and keep that running across cut chunks.
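As a minimal sketch of the fudging idea under discussion (hypothetical helper name, not code from this PR), the `last` timestamp would have to be persisted per stream and carried across cut chunks, which is exactly the bookkeeping being objected to:

```go
package sketch

import "time"

// fudgeTimestamp is a hypothetical helper: if an incoming entry would not
// sort strictly after the last stored entry, nudge it one nanosecond past it.
// `last` must be tracked per stream and survive chunk cuts, otherwise a run
// of same-second entries would collide again right after a cut.
func fudgeTimestamp(ts, last time.Time) time.Time {
	if ts.After(last) {
		return ts
	}
	return last.Add(time.Nanosecond)
}
```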
I think this is why nobody has addressed this before: currently we would store these logs for them, and this would start dropping them.
This only drops the lines if the content matches the last stored line, though?
FWIW, Cortex drops all duplicates with the same timestamp, even if the value doesn't match. I don't think that's the right thing for us to do, though.
Ah right, yeah, this would be beneficial and less risky/impactful.
After talking to Ed, I'm going to enhance this commit to work across flushes and cut chunks by tracking the most recently written line and timestamp per stream.
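A minimal sketch of that approach, assuming illustrative names rather than Loki's actual pkg/chunkenc or pkg/ingester types:

```go
package main

import "time"

// stream tracks the most recently written entry so the duplicate check
// survives chunk cuts and flushes (field names here are illustrative only).
type stream struct {
	lastTs   time.Time
	lastLine string
}

// Push drops an incoming entry whose timestamp and content both match the
// previous entry — the shape of a client retrying a push that returned a 500.
func (s *stream) Push(ts time.Time, line string) {
	if ts.Equal(s.lastTs) && line == s.lastLine {
		return // exact duplicate of the last stored line: ignore it
	}
	s.lastTs, s.lastLine = ts, line
	// ... append the entry to the head chunk here ...
}
```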
Force-pushed from 4d9884b to 3c6a05f.
Force-pushed from 3c6a05f to ad8bcde.
@slim-bean PTAL, this should now cover more edge cases.
LGTM!
We're running into a similar issue, except the exact same timestamp applies to two legitimately different log lines.
In Loki, we may see the logs ordered like so
I think this could be solved in a more traditional index-heavy log aggregator by incorporating a …
This commit changes the behavior of appending to a stream to ignore an incoming line if its timestamp and contents match the previous line received. This should reduce the chances of ingesters storing duplicate log lines when clients retry push requests whenever a 500 is returned.
Fixes #1517.
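Continuing the illustrative sketch from earlier (this completes that hypothetical `stream` example; it is not the PR's actual code), the resulting behavior on a retry would be:

```go
func main() {
	s := &stream{}
	ts := time.Now()
	s.Push(ts, `msg="hello"`) // stored
	s.Push(ts, `msg="hello"`) // retried duplicate: ignored
	s.Push(ts, `msg="bye"`)   // same timestamp, different content: still stored
}
```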