Problems reading GZIPPED file with libz-sys > 1.1.5 and zlib-ng-compat #104

Open
ghuls opened this issue Jul 5, 2022 · 0 comments

Reading a gzipped file fails with libz-sys > 1.1.5 and zlib-ng-compat.
With zlib it works fine. With libz-sys == 1.1.5 and zlib-ng-compat it also works fine.

See: pola-rs/polars#3895

Problematic gzipped CSV file: https://temp.aertslab.org/.tsv/atac_fragments.head40000000.tsv.gz

import polars as pl

df = pl.read_csv(
    'atac_fragments.head40000000.tsv.gz',
    skip_rows=52,
    has_headers=False, 
    sep="\t",
    use_pyarrow=False
)
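
As a cross-check that does not go through flate2 at all, the file can be streamed through Python's standard gzip module (built on CPython's zlib module) and validated as UTF-8. This is only a minimal sketch: `first_invalid_utf8_offset` is a hypothetical helper written for this report, not part of polars, and it assumes the file has been downloaded into the working directory.

```python
import codecs
import gzip


def first_invalid_utf8_offset(path, chunk_size=1 << 20):
    """Return the decompressed byte offset of the first invalid UTF-8
    sequence in a gzipped file, or None if the whole stream decodes.

    Hypothetical helper for illustration; not part of polars.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    offset = 0  # decompressed bytes read before the current chunk
    with gzip.open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            final = not chunk  # an empty read marks the end of the stream
            # Bytes of an incomplete multi-byte sequence carried over from
            # the previous chunk; error positions are relative to these.
            pending = len(decoder.getstate()[0])
            try:
                decoder.decode(chunk, final=final)
            except UnicodeDecodeError as exc:
                return offset - pending + exc.start
            if final:
                return None
            offset += len(chunk)
```

If `first_invalid_utf8_offset("atac_fragments.head40000000.tsv.gz")` returns `None`, the decompressed data is valid UTF-8 when read this way, which would point at the decompression backend rather than at the file.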

Cargo.toml

[package]
name = "polars_read_csv"
version = "0.1.0"
edition = "2021"


[dependencies]
polars-core = { path = "../polars/polars/polars-core", default-features = false }

[features]


[dependencies.polars]
path = "../polars/polars"
default-features = false
features = [
  "simd",
  "strings",
  "csv-file",
  "performant",
  "decompress-fast",
]


[profile.release]
target-cpu = "native"
debug = 1

src/main.rs

use polars::prelude::*;

fn main() -> Result<()> {
    let df = CsvReader::from_path(
        "atac_fragments.head40000000.tsv.gz",
    )?
    .with_delimiter(b'\t')
    .with_skip_rows(52)
    .finish()?;

    println!("{:?}", df.shape());
    println!("{}", df.tail(Some(30)));

    Ok(())
}

Get polars:

cd ..
git clone --depth 100 https://github.com/pola-rs/polars

Try different flate2 backends by changing the decompress-fast feature in polars-io:

❯ git diff polars/polars-io/Cargo.toml
diff --git a/polars/polars-io/Cargo.toml b/polars/polars-io/Cargo.toml
index 3461c37e7d..a39b89b06c 100644
--- a/polars/polars-io/Cargo.toml
+++ b/polars/polars-io/Cargo.toml
@@ -28,7 +28,9 @@ dtype-categorical = ["polars-core/dtype-categorical"]
 csv-file = ["csv-core", "memmap", "lexical", "polars-core/rows", "lexical-core"]
 fmt = ["polars-core/fmt"]
 decompress = ["flate2/miniz_oxide"]
-decompress-fast = ["flate2/zlib-ng-compat"]
+#decompress-fast = ["flate2/zlib"]
+#decompress-fast = ["flate2/zlib-ng"]
 temporal = ["dtype-datetime", "dtype-date", "dtype-time"]
 partition = ["polars-core/partition_by"]
 # don't use this
# Compile each version with different decompress-fast settings:
cargo build --release

# with "flate2/zlib-ng-compat"
$ target/release/polars_read_csv 
Error: ComputeError("invalid utf8 data in csv")

# with "flate2/zlib-ng"
$ target/release/polars_read_csv 
Error: ComputeError("invalid utf8 data in csv")

# with "flate2/zlib"
$ target/release/polars_read_csv
(39999947, 5)
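
The shape reported by the zlib build can be sanity-checked against the line count of the file. The sketch below rests on two assumptions that are not confirmed above: that the file holds 40000000 lines, as its name suggests, and that the Rust reader treats the first line after the 52 skipped rows as a header, since the example does not disable headers. `count_lines` and `expected_rows` are hypothetical helpers, not polars functions.

```python
import gzip


def count_lines(path, chunk_size=1 << 20):
    """Count newline bytes in the decompressed stream of a gzipped file."""
    total = 0
    with gzip.open(path, "rb") as fh:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            total += chunk.count(b"\n")
    return total


def expected_rows(total_lines, skip_rows=52, has_header=True):
    """Rows a CSV reader should report after skipping rows and a header."""
    return total_lines - skip_rows - (1 if has_header else 0)
```

Under those assumptions, `expected_rows(40_000_000)` gives 39999947, which matches the row count in the shape printed by the zlib build. Running `count_lines` on the downloaded file would confirm or refute the assumed line count.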