
Unable to properly read Gzipped access.log file #301

Closed
dmilith opened this issue May 10, 2022 · 3 comments · Fixed by #347

Comments

@dmilith

dmilith commented May 10, 2022

To make a long story short… I used the code examples to load text from a gzipped Nginx log file… but 3k lines of text are completely gone after loading it via GzDecoder.

So I wrote a test case for the issue, and it confirmed that the output is incomplete (missing ~3k lines). After removing the GzDecoder and loading the plaintext access.log, the whole input is fine.

#[test]
fn decode_file_test() {
    let access_log = Config::access_log();
    let decoded_log = File::open(&access_log).and_then(decode_file);
    let maybe_log = decoded_log
        .map(|input_contents| {
            String::from_utf8(input_contents)
                .unwrap_or_default()
                .split('\n')
                .filter_map(|line| {
                    if line.is_empty() || is_partial(line) {
                        None
                    } else {
                        Some(line.to_string())
                    }
                })
                .collect::<Vec<_>>()
        })
        .unwrap_or_default();

    let mut file = OpenOptions::new()
        .create(true)
        .write(true)
        .open("log1.log")
        .expect("log1.log has to be writable!");
    file.write_all(maybe_log.join("\n").as_bytes())
        .expect("Couldn't write log1.log file!!");

    assert_eq!(maybe_log.len(), 407166);
}

My decode_file function basically does this:

fn decode_file(mut file: File) -> io::Result<Vec<u8>> {
    let mut buf = vec![];
    match file.read_to_end(&mut buf) {
        Ok(bytes_read) => {
            info!("Input file read bytes: {bytes_read}");
            let mut gzipper = GzDecoder::new(&*buf);
            let mut output_buf = vec![];
            gzipper.read_to_end(&mut output_buf)?;
            Ok(output_buf)
        }
        Err(err) => Err(err),
    }
}

The access.log is a standard Nginx access log with over 60 MiB of text inside.

@tatsuya6502

tatsuya6502 commented Aug 9, 2022

I ran into the same issue. I tried to decode a database dump from Wikipedia, but I got only the first line (7 bytes) decoded by the read method of GzDecoder. The file is in XML format and has ~2 million lines. It is compressed by gzip.

Here is the reproducible code:

Cargo.toml

[dependencies]
flate2 = "1.0.24"
# flate2 = { version = "1.0.24", default-features = false, features = ["zlib-ng"] }

src/main.rs

use flate2::read::GzDecoder;
use std::{fs::File, io::prelude::*};

// Download the file from:
// https://dumps.wikimedia.org/enwiki/20220801/enwiki-20220801-abstract27.xml.gz
//
// The size of the file will be 20.4MiB.
//
const DATA_FILE: &str = "./data/enwiki-20220801-abstract27.xml.gz";

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open(DATA_FILE)?;
    let mut decoder = GzDecoder::new(file);
    let mut buf = vec![0u8; 1024];

    for _ in 0..5 {
        match decoder.read(&mut buf) {
            Ok(n) => println!("{} bytes: {:?}", n, String::from_utf8_lossy(&buf[..n])),
            Err(e) => {
                eprintln!("Error: {}", e);
                break;
            }
        }
    }

    Ok(())
}

Download the file from https://dumps.wikimedia.org/enwiki/20220801/enwiki-20220801-abstract27.xml.gz (20.4MiB), and place it in the data directory.

You can use the gunzip command to see the first few lines of the file:

$ gunzip -dkc data/enwiki-20220801-abstract27.xml.gz | head
<feed>
<doc>
<title>Wikipedia: Kalkandereh</title>
<url>https://en.wikipedia.org/wiki/Kalkandereh</url>
<abstract>Kalkandereh may refer to:</abstract>
<links>
<sublink linktype="nav"><anchor>All article disambiguation pages</anchor><link>https://en.wikipedia.org/wiki/Category:All_article_disambiguation_pages</link></sublink>
<sublink linktype="nav"><anchor>All disambiguation pages</anchor><link>https://en.wikipedia.org/wiki/Category:All_disambiguation_pages</link></sublink>
<sublink linktype="nav"><anchor>Disambiguation pages</anchor><link>https://en.wikipedia.org/wiki/Category:Disambiguation_pages</link></sublink>
<sublink linktype="nav"><anchor>Disambiguation pages with short descriptions</anchor><link>https://en.wikipedia.org/wiki/Category:Disambiguation_pages_with_short_descriptions</link></sublink>

When you run the program, you will get the following output, showing only the first line of the decompressed file:

7 bytes: "<feed>\n"
0 bytes: ""
0 bytes: ""
0 bytes: ""
0 bytes: ""

Note that the buffer can hold up to 1,024 bytes, so the 7-byte result is not a buffer-size limitation.

    let mut buf = vec![0u8; 1024];

    for _ in 0..5 {
        match decoder.read(&mut buf) {

I also tried zlib-ng backend, but it did not solve the issue.

I found that the same program can decode other gzip files. For example, it can decode the file I created from src/main.rs.

$ gzip -k src/main.rs
$ mv src/main.rs.gz data/
746 bytes: "use flate2::read::GzDecoder;\nuse std::{fs::File, io::prelude::*};\n\n// Download this file from:\n// https://dumps.wikimedia.org/enwiki/20220801/enwiki-20220801-abstract27.xml.gz\n//\n// The size of the file will be 20.4MiB.\n//\nconst DATA_FILE: &str = \"./data/enwiki-20220801-abstract27.xml.gz\";\n\nfn main() -> Result<(), Box<dyn std::error::Error>> {\n    let file = File::open(DATA_FILE)?;\n    let mut decoder = GzDecoder::new(file);\n    let mut buf = vec![0u8; 1024];\n\n    for _ in 0..5 {\n        match decoder.read(&mut buf) {\n            Ok(n) => println!(\"{} bytes: {:?}\", n, String::from_utf8_lossy(&buf[..n])),\n            Err(e) => {\n                eprintln!(\"Error: {}\", e);\n                break;\n            }\n        }\n    }\n\n    Ok(())\n}\n"
0 bytes: ""
0 bytes: ""
0 bytes: ""
0 bytes: ""

Environment

  • macOS 12.5 (arm64)
  • Rust 1.62.1
  • flate2 1.0.24

@tatsuya6502

Somebody told me that he was able to read all the contents of the Wikipedia database dump by replacing GzDecoder with MultiGzDecoder. I confirmed it myself.

The download page for the dumps does not say whether this .gz file has multiple streams, though it does note that some of the .bz2 files do. So it seems I should have used MultiGzDecoder, and my comment above is invalid.

@dmilith — If you have a chance, can you please check whether MultiGzDecoder can read your Nginx log files? Thanks!

@oyvindln
Contributor

oyvindln commented Aug 9, 2022

Given that this seems to come up quite a bit, we might want to add a note about it to the GzDecoder docs.

Byron added a commit to JohnTitor/flate2-rs that referenced this issue May 22, 2023
This may help dealing with multi-stream gzip files.
`MultiGzDecoder` documentation was also improved to further clarify
why such files would exist.