
Unable to properly read Gzipped access.log file #301

Closed
dmilith opened this issue May 10, 2022 · 3 comments · Fixed by #347

Comments

@dmilith

dmilith commented May 10, 2022

To make a long story short… I used the code examples to load text from a gzipped Nginx log file… but 3k lines of text are completely gone after loading it via GzDecoder.

So I wrote a test case for the issue, and it confirmed that the output is incomplete (missing ~3k lines). After removing the GzDecoder and loading the plaintext access.log, the whole input is fine.

#[test]
fn decode_file_test() {
    let access_log = Config::access_log();
    let decoded_log = File::open(&access_log).and_then(decode_file);
    let maybe_log = decoded_log
        .map(|input_contents| {
            String::from_utf8(input_contents)
                .unwrap_or_default()
                .split('\n')
                .filter_map(|line| {
                    if line.is_empty() || is_partial(line) {
                        None
                    } else {
                        Some(line.to_string())
                    }
                })
                .collect::<Vec<_>>()
        })
        .unwrap_or_default();

    let mut file = OpenOptions::new()
        .create(true)
        .write(true)
        .open("log1.log")
        .expect("log1.log has to be writable!");
    file.write_all(maybe_log.join("\n").as_bytes())
        .expect("Couldn't write log1.log file!!");

    assert_eq!(maybe_log.len(), 407166);
}

My decode_file function basically does this:

fn decode_file(mut file: File) -> io::Result<Vec<u8>> {
    let mut buf = vec![];
    match file.read_to_end(&mut buf) {
        Ok(bytes_read) => {
            info!("Input file read bytes: {bytes_read}");
            let mut gzipper = GzDecoder::new(&*buf);
            let mut output_buf = vec![];
            gzipper.read_to_end(&mut output_buf)?;
            Ok(output_buf)
        }
        Err(err) => Err(err),
    }
}

The access.log is a standard Nginx access log with over 60 MiB of text inside.

@tatsuya6502

tatsuya6502 commented Aug 9, 2022

I ran into the same issue. I tried to decode a database dump from Wikipedia, but I got only the first line (7 bytes) decoded by the read method of GzDecoder. The file is in XML format and has ~2 million lines. It is compressed by gzip.

Here is the reproducible code:

Cargo.toml

[dependencies]
flate2 = "1.0.24"
# flate2 = { version = "1.0.24", default-features = false, features = ["zlib-ng"] }

src/main.rs

use flate2::read::GzDecoder;
use std::{fs::File, io::prelude::*};

// Download the file from:
// https://dumps.wikimedia.org/enwiki/20220801/enwiki-20220801-abstract27.xml.gz
//
// The size of the file will be 20.4MiB.
//
const DATA_FILE: &str = "./data/enwiki-20220801-abstract27.xml.gz";

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open(DATA_FILE)?;
    let mut decoder = GzDecoder::new(file);
    let mut buf = vec![0u8; 1024];

    for _ in 0..5 {
        match decoder.read(&mut buf) {
            Ok(n) => println!("{} bytes: {:?}", n, String::from_utf8_lossy(&buf[..n])),
            Err(e) => {
                eprintln!("Error: {}", e);
                break;
            }
        }
    }

    Ok(())
}

Download the file from https://dumps.wikimedia.org/enwiki/20220801/enwiki-20220801-abstract27.xml.gz (20.4MiB), and place it in the data directory.

You can use the gunzip command to see the first few lines of the file:

$ gunzip -dkc data/enwiki-20220801-abstract27.xml.gz | head
<feed>
<doc>
<title>Wikipedia: Kalkandereh</title>
<url>https://en.wikipedia.org/wiki/Kalkandereh</url>
<abstract>Kalkandereh may refer to:</abstract>
<links>
<sublink linktype="nav"><anchor>All article disambiguation pages</anchor><link>https://en.wikipedia.org/wiki/Category:All_article_disambiguation_pages</link></sublink>
<sublink linktype="nav"><anchor>All disambiguation pages</anchor><link>https://en.wikipedia.org/wiki/Category:All_disambiguation_pages</link></sublink>
<sublink linktype="nav"><anchor>Disambiguation pages</anchor><link>https://en.wikipedia.org/wiki/Category:Disambiguation_pages</link></sublink>
<sublink linktype="nav"><anchor>Disambiguation pages with short descriptions</anchor><link>https://en.wikipedia.org/wiki/Category:Disambiguation_pages_with_short_descriptions</link></sublink>

When you run the program, you will get the following output, showing only the first line of the decompressed file:

7 bytes: "<feed>\n"
0 bytes: ""
0 bytes: ""
0 bytes: ""
0 bytes: ""

Note that the buffer can hold up to 1,024 bytes, so the 7-byte result is not a buffer-size limitation.

    let mut buf = vec![0u8; 1024];

    for _ in 0..5 {
        match decoder.read(&mut buf) {

I also tried zlib-ng backend, but it did not solve the issue.

I found that the same program can decode other gzip files. For example, it can decode the file I created from src/main.rs.

$ gzip -k src/main.rs
$ mv src/main.rs.gz data/
746 bytes: "use flate2::read::GzDecoder;\nuse std::{fs::File, io::prelude::*};\n\n// Download this file from:\n// https://dumps.wikimedia.org/enwiki/20220801/enwiki-20220801-abstract27.xml.gz\n//\n// The size of the file will be 20.4MiB.\n//\nconst DATA_FILE: &str = \"./data/enwiki-20220801-abstract27.xml.gz\";\n\nfn main() -> Result<(), Box<dyn std::error::Error>> {\n    let file = File::open(DATA_FILE)?;\n    let mut decoder = GzDecoder::new(file);\n    let mut buf = vec![0u8; 1024];\n\n    for _ in 0..5 {\n        match decoder.read(&mut buf) {\n            Ok(n) => println!(\"{} bytes: {:?}\", n, String::from_utf8_lossy(&buf[..n])),\n            Err(e) => {\n                eprintln!(\"Error: {}\", e);\n                break;\n            }\n        }\n    }\n\n    Ok(())\n}\n"
0 bytes: ""
0 bytes: ""
0 bytes: ""
0 bytes: ""

Environment

  • macOS 12.5 (arm64)
  • Rust 1.62.1
  • flate2 1.0.24

@tatsuya6502

Somebody told me that he was able to read all the contents of the Wikipedia database dump by replacing GzDecoder with MultiGzDecoder. I confirmed it myself.

The download page for the dumps does not say whether this .gz file has multiple streams, though it does note that some of the .bz2 files do. So it seems I should have used MultiGzDecoder, and my comment above is invalid.

@dmilith — If you have a chance, can you please check whether MultiGzDecoder can read your Nginx log files? Thanks!

@oyvindln
Contributor

oyvindln commented Aug 9, 2022

Given that this seems to come up quite a bit, we might want to add a note about it to the GzDecoder docs.

Byron added a commit to JohnTitor/flate2-rs that referenced this issue May 22, 2023
This may help dealing with multi-stream gzip files.
`MultiGzDecoder` documentation was also improved to further clarify
why such files would exist.