
[BUG] S3 Source fails to load all records for large compressed newline-delimited logs #1568

Closed
dlvenable opened this issue Jul 5, 2022 · 2 comments · Fixed by #1570
Labels
bug Something isn't working

Comments


dlvenable commented Jul 5, 2022

Describe the bug

When Data Prepper reads large gzip-compressed log files from S3, not all records come through.

One example log file (from an Application Load Balancer) contains about 80k lines, but Data Prepper reports only about 2,000 of them.

@dlvenable added the bug (Something isn't working) and untriaged labels and removed the untriaged label on Jul 5, 2022
@dlvenable
Member Author

From testing, large JSON files process all lines, and uncompressed log files also process all lines. So this appears to be isolated to compressed files using the newline codec.

@dlvenable
Member Author

From our investigation, it appears that Java's GZIPInputStream does not support concatenated gzip files.

See the documentation for GzipCompressorInputStream at https://commons.apache.org/proper/commons-compress/javadocs/api-1.21/index.html for information on the problem with Java's implementation.

See the following post for an explanation of the gzip file structure: https://stackoverflow.com/a/8005155/650176
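For reference, here is a minimal sketch of the workaround using commons-compress. The class and method names (`ConcatenatedGzipExample`, `countLines`) are illustrative only, not Data Prepper's actual code; the key point is the second constructor argument, `decompressConcatenated`, which tells `GzipCompressorInputStream` to keep reading past the first gzip member instead of stopping there:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;

public class ConcatenatedGzipExample {
    // Counts the lines in a possibly multi-member gzip stream.
    // With decompressConcatenated=true, commons-compress continues
    // decompressing after the first gzip member ends; reading the same
    // stream through java.util.zip.GZIPInputStream can stop at the first
    // member boundary, dropping the remaining records.
    static long countLines(final InputStream compressed) throws IOException {
        try (GzipCompressorInputStream gzip = new GzipCompressorInputStream(compressed, true);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(gzip, StandardCharsets.UTF_8))) {
            return reader.lines().count();
        }
    }
}
```

Run against the example ALB log above, a reader built this way should see all ~80k lines rather than only the first member's ~2,000.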
