
[BUG] S3 Source fails to load all records for large compressed newline-delimited logs #1568

Closed
dlvenable opened this issue Jul 5, 2022 · 2 comments · Fixed by #1570
Labels
bug Something isn't working

Comments


dlvenable commented Jul 5, 2022

Describe the bug

When Data Prepper reads large gzip-compressed log files from S3, not all records come through.

One example log file (from an Application Load Balancer) contains about 80k lines, but Data Prepper reports only about 2,000 of them.

@dlvenable added the bug (Something isn't working) and untriaged labels and removed the untriaged label on Jul 5, 2022
@dlvenable
Member Author

From testing, large JSON files process all lines, and uncompressed log files also process all lines. So this appears to be isolated to compressed files using the newline codec.

@dlvenable
Member Author

From our investigation, it appears that Java's GZIPInputStream does not support concatenated gzip files.

See the documentation for GzipCompressorInputStream at https://commons.apache.org/proper/commons-compress/javadocs/api-1.21/index.html for information on the problem with Java's implementation.

See the following post for an explanation of the gzip file structure: https://stackoverflow.com/a/8005155/650176
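For reference, here is a minimal sketch of the workaround using commons-compress. The class and method names (`ConcatenatedGzipExample`, `countLines`) are illustrative only, not Data Prepper's actual code; the key point is the second constructor argument, `decompressConcatenated`, which tells `GzipCompressorInputStream` to keep reading past the first gzip member instead of stopping there:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;

public class ConcatenatedGzipExample {
    // Counts the lines in a possibly multi-member gzip stream.
    // With decompressConcatenated=true, commons-compress continues
    // decompressing after the first gzip member ends; reading the same
    // stream through java.util.zip.GZIPInputStream can stop at the first
    // member boundary, dropping the remaining records.
    static long countLines(final InputStream compressed) throws IOException {
        try (GzipCompressorInputStream gzip = new GzipCompressorInputStream(compressed, true);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(gzip, StandardCharsets.UTF_8))) {
            return reader.lines().count();
        }
    }
}
```

Run against the example ALB log above, a reader built this way should see all ~80k lines rather than only the first member's ~2,000.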
