-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(file source)!: use uncompressed content for fingerprinting files (lines and ignored_header_bytes) #22050
Conversation
ccdfe60
to
ae39cc0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work @roykim98 ! I left a couple of comments about error handling, but overall this approach seems good to me. I appreciate the added tests.
Made some final tests here:
Made a 3-line file called
Loaded them up in the source, and then compressed it to a new inode, Created a similarly named file
All had the same fingerprint
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @roykim98 ! I refactored the check
function, but otherwise this looks good to me! I appreciate the thorough testing.
Head branch was pushed to by a user without write access
One of the workflows was failing due to some whitespaces in the *.cue files, which I just removed, but it revoked the automerge label and build checks. Looks like you may have to re-add the automerge label, sorry for the inconvenience @jszwedko . Ran |
These additional checks pass locally. |
Also ran these |
Apologies if I missed something here, @jszwedko it seems like all your review points were addressed in this one and we are willing to break the existing behavior? |
@pront @jszwedko It looks like one check failed, but I suspect it actually doesn't have anything to do with my PR given the only thing that changed is the year date. There is probably some logic that uses an unmocked timestamp and current year. |
True. Can you please |
Signed-off-by: Jesse Szwedko <[email protected]>
Signed-off-by: Jesse Szwedko <[email protected]>
c14294c
to
ef6809f
Compare
@pront I just rebased off the latest master and force-pushed. No conflicts so no new changed lines. |
Apologies, I still have to circle back to this. I think some the comment you left, @roykim98 , is relevant and the error handling isn't quite right at the moment. |
I presume you mean this comment? #22050 (comment)
If this is the final thing, I can add a quick commit that will change this to:
|
Right, yeah, something like that is what I was thinking; however, thinking about it more, I think the caller needs to know if the error occurred during the seek or during the |
Regarding the breaking change, I think it is an appropriate one to make the fingerprinting currently isn't working as expected for compressed input files. |
@jszwedko Added a doc comment and added the attempt to reset on read error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, thanks @roykim98 !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Left a couple of nits.
Co-authored-by: Pavlos Rontidis <[email protected]>
@pront Addressed your changes, |
@pront Is this good to merge now? Anything else required on my end? |
Doing a final check now. |
…(lines and ignored_header_bytes) (vectordotdev#22050) * Check the crc of uncompressed lines when reading from compressed gzip files * Update documentation * Handle header bytes properly and add a test * Use fingerprint instead of fingerprinter * Clarify that only gzip is supported at this time * Propagate errors upward as alluded to in @jszwedko comments * Refactor check Signed-off-by: Jesse Szwedko <[email protected]> * Have to seek after finding algorithm too Signed-off-by: Jesse Szwedko <[email protected]> * Remove whitespace * Clippy * make check-component-docs * Attempt to reset the position and also add a doc comment * Update changelog.d/22050-fingerprint-uncompressed-file-content.fix.md Co-authored-by: Pavlos Rontidis <[email protected]> * Use GZIP_MAGIC * Update lib/file-source/src/fingerprinter.rs --------- Signed-off-by: Jesse Szwedko <[email protected]> Co-authored-by: Jesse Szwedko <[email protected]> Co-authored-by: Pavlos Rontidis <[email protected]>
Summary
This PR addresses #13193.
This PR will handle the case where attempting to fingerprint compressed content. It will also handle a use case with log rotation, where an uncompressed active file might be monitored and then rotated into a compressed file with a new inode by a log rotation service. By comparing the uncompressed lines as a source of truth, it prevents messages from processing the files 2X.
I chose to explicitly check the magic bytes header by enumerating the signatures to make checking the different headers consistent. I checked a couple compression libraries, and I figured it would be complicated to try to move the ownership of a reader into multiple decoders with inconsistent functions that did not all permit easy checking of the magic header. Of course, if only gzip is expected to be supported for the foreseeable future, this is a moot point and we can just fall back to instantiating the gzip library to check
BREAKING CHANGE: When sourcing from compressed files,
ignored_header_bytes
will no longer look at the compressed file's bytes, which would include the magic bytes for the compression header. Instead, it will ignore the bytes from the uncompressed content. Similarly,lines
will no longer look for new line delimiters in the compressed content, but the uncompressed content. Arguably, both of these current mechanisms as a bug, as compressed content would not have any explicit lines or intentional header aside from the magic bytes.Change Type
Is this a breaking change?
How did you test this PR?
Unit tests
Does this PR include user facing changes?
Checklist
References
file
source:checksum
fingerprint is not correct with gzipped files #13193