Streaming join after scan_parquet failing #9605

Closed
2 tasks done
mishpat opened this issue Jun 28, 2023 · 0 comments · Fixed by #9612
Labels
bug (Something isn't working), python (Related to Python Polars)

mishpat commented Jun 28, 2023

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

A streaming join returns only a small fraction of the expected rows when its inputs are scanned from Parquet files.
The same join works correctly on in-memory lazy frames, e.g. lf0.collect().lazy().join(lf1.collect().lazy(), ...).collect(streaming=True).

Reproducible example

import numpy as np
import polars as pl

np.random.seed(111)

n = 500_000
k = 100
d0 = {f"x{i}": np.random.random(n) for i in range(k)}
d0.update({"id": np.arange(n)})
df0 = pl.DataFrame(d0)
df1 = df0.clone().select(pl.all().shuffle(111))
print("lf0 shape:", df0.shape)
df0.write_parquet("df0.parquet")
df1.write_parquet("df1.parquet")

lf0 = pl.scan_parquet("df0.parquet")
lf1 = pl.scan_parquet("df1.parquet").select(pl.all().suffix("_r"))  # haven't merged the fix to this yet
df2 = lf0.join(lf1, left_on="id", right_on="id_r").collect(streaming=True)
print("post-streaming join shape:", df2.shape)

lf0 shape: (500000, 101)
post-streaming join shape: (816, 201)

Expected behavior

post-streaming join shape: (500000, 201)

Installed versions

--------Version info---------
Polars:      0.18.4
Index type:  UInt32
Platform:    Windows-10-10.0.19044-SP0
Python:      3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:34:57) [MSC v.1936 64 bit (AMD64)]

----Optional dependencies----
numpy:       1.25.0
pandas:      <not installed>
pyarrow:     12.0.1
connectorx:  <not installed>
deltalake:   <not installed>
fsspec:      <not installed>
matplotlib:  <not installed>
xlsx2csv:    <not installed>
xlsxwriter:  <not installed>

mishpat added the bug and python labels on Jun 28, 2023