Unable to properly read Gzipped access.log file #301
Comments
I ran into the same issue. I tried to decode a database dump from Wikipedia, but only the first line (7 bytes) was decoded by the GzDecoder. Here is the reproducible code:
[dependencies]
flate2 = "1.0.24"
# flate2 = { version = "1.0.24", default-features = false, features = ["zlib-ng"] }
use flate2::read::GzDecoder;
use std::{fs::File, io::prelude::*};
// Download the file from:
// https://dumps.wikimedia.org/enwiki/20220801/enwiki-20220801-abstract27.xml.gz
//
// The size of the file will be 20.4MiB.
//
const DATA_FILE: &str = "./data/enwiki-20220801-abstract27.xml.gz";
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open(DATA_FILE)?;
    let mut decoder = GzDecoder::new(file);
    let mut buf = vec![0u8; 1024];

    for _ in 0..5 {
        match decoder.read(&mut buf) {
            Ok(n) => println!("{} bytes: {:?}", n, String::from_utf8_lossy(&buf[..n])),
            Err(e) => {
                eprintln!("Error: {}", e);
                break;
            }
        }
    }

    Ok(())
}

Download the file from https://dumps.wikimedia.org/enwiki/20220801/enwiki-20220801-abstract27.xml.gz (20.4 MiB) and place it in the ./data directory. You can use gunzip to check the beginning of the decompressed content:

$ gunzip -dkc data/enwiki-20220801-abstract27.xml.gz | head
<feed>
<doc>
<title>Wikipedia: Kalkandereh</title>
<url>https://en.wikipedia.org/wiki/Kalkandereh</url>
<abstract>Kalkandereh may refer to:</abstract>
<links>
<sublink linktype="nav"><anchor>All article disambiguation pages</anchor><link>https://en.wikipedia.org/wiki/Category:All_article_disambiguation_pages</link></sublink>
<sublink linktype="nav"><anchor>All disambiguation pages</anchor><link>https://en.wikipedia.org/wiki/Category:All_disambiguation_pages</link></sublink>
<sublink linktype="nav"><anchor>Disambiguation pages</anchor><link>https://en.wikipedia.org/wiki/Category:Disambiguation_pages</link></sublink>
<sublink linktype="nav"><anchor>Disambiguation pages with short descriptions</anchor><link>https://en.wikipedia.org/wiki/Category:Disambiguation_pages_with_short_descriptions</link></sublink> When you run the program, you will get the following output, showing only the first line of the decompressed file: 7 bytes: "<feed>\n"
0 bytes: ""
0 bytes: ""
0 bytes: ""
0 bytes: "" Note that the buffer can contain up to 1,024 bytes, not 7 bytes. let mut buf = vec![0u8; 1024];
for _ in 0..5 {
match decoder.read(&mut buf) { I also tried I found that the same program can decode other gzip files. For example, it can decode the file I created from $ gzip -k src/main.rs
$ mv src/main.rs.gz data/

746 bytes: "use flate2::read::GzDecoder;\nuse std::{fs::File, io::prelude::*};\n\n// Download this file from:\n// https://dumps.wikimedia.org/enwiki/20220801/enwiki-20220801-abstract27.xml.gz\n//\n// The size of the file will be 20.4MiB.\n//\nconst DATA_FILE: &str = \"./data/enwiki-20220801-abstract27.xml.gz\";\n\nfn main() -> Result<(), Box<dyn std::error::Error>> {\n    let file = File::open(DATA_FILE)?;\n    let mut decoder = GzDecoder::new(file);\n    let mut buf = vec![0u8; 1024];\n\n    for _ in 0..5 {\n        match decoder.read(&mut buf) {\n            Ok(n) => println!(\"{} bytes: {:?}\", n, String::from_utf8_lossy(&buf[..n])),\n            Err(e) => {\n                eprintln!(\"Error: {}\", e);\n                break;\n            }\n        }\n    }\n\n    Ok(())\n}\n"
0 bytes: ""
0 bytes: ""
0 bytes: ""
0 bytes: "" Environment
|
Somebody told me that he was able to read all the contents of the Wikipedia database dump by replacing GzDecoder with MultiGzDecoder. The download page of the dumps does not say whether this file contains multiple gzip members. @dmilith — if you have a chance, can you please check whether MultiGzDecoder also fixes the access.log case?
Given that this seems to come up quite a bit, we might want to add a note about it in the GzDecoder docs.
This may help when dealing with multi-stream gzip files. The MultiGzDecoder documentation was also improved to further clarify why such files exist.
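For reference, here is a minimal sketch of the workaround (the file path is the one from the reproduction above; MultiGzDecoder keeps decoding across concatenated gzip members, whereas GzDecoder stops after the first one):

use flate2::read::MultiGzDecoder;
use std::{fs::File, io::prelude::*};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("./data/enwiki-20220801-abstract27.xml.gz")?;
    // Unlike GzDecoder, MultiGzDecoder continues past the end of the
    // first gzip member and decodes every member in the file.
    let mut decoder = MultiGzDecoder::new(file);
    let mut text = String::new();
    let n = decoder.read_to_string(&mut text)?;
    println!("decoded {} bytes", n);
    Ok(())
}

Concatenating gzip files is valid per RFC 1952, which is presumably how tools that compress data in chunks (such as the Wikipedia dump pipeline) end up producing multi-member archives.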
To make a long story short… I used the code examples to load text from an Nginx gzipped log file…
But 3k lines of text are completely gone after loading it via GzDecoder.
So I wrote a test case for that issue… and it confirmed that the output is incomplete (missing 3k lines). After removing the GzDecoder and loading the plaintext access.log, the whole input is fine.
My decode file function basically does this:
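A minimal sketch of what such a function could look like (an assumed reconstruction; the name decode_file and its signature are illustrative, not the original code):

use flate2::read::GzDecoder;
use std::{fs::File, io::Read};

// Illustrative reconstruction: decompress a gzipped log file into a String.
fn decode_file(path: &str) -> std::io::Result<String> {
    let file = File::open(path)?;
    let mut decoder = GzDecoder::new(file);
    let mut contents = String::new();
    // If access.log.gz contains multiple gzip members, GzDecoder stops
    // at the end of the first one, which would explain the missing lines.
    decoder.read_to_string(&mut contents)?;
    Ok(contents)
}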
The access.log is a standard Nginx access log, with over 60 MiB of text inside.