Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

    import polars as pl
    from tempfile import mkdtemp
    from pathlib import Path

    root = Path(mkdtemp())
    pl.DataFrame({"a": []}).write_parquet(root / "a.parquet")
    pl.DataFrame({"b": []}).write_parquet(root / "b.parquet")

    dfs = [
        pl.scan_parquet(path).with_row_count("idx")
        for path in (root / "a.parquet", root / "b.parquet")
    ]

    df = pl.concat(dfs, how="align")
    df.collect()

Log output

    join parallel: true
    Traceback (most recent call last):
      File "example.py", line 15, in <module>
        df.collect()
      File ".virtualenvs/default/lib/python3.11/site-packages/polars/utils/deprecation.py", line 100, in wrapper
        return function(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^
      File ".virtualenvs/default/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 1787, in collect
        return wrap_df(ldf.collect())
                       ^^^^^^^^^^^^^
    polars.exceptions.ColumnNotFoundError: idx
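
Issue description

The example above works when using pl.read_parquet instead of pl.scan_parquet, or when passing the row count name directly: pl.scan_parquet(..., row_count_name="idx") or pl.read_parquet(..., row_count_name="idx").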

Expected behavior

The expected behaviour is consistency across these variants: collecting should return an empty frame with the joined schema from both frames.
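
To make the expected result concrete, here is a minimal pure-Python sketch of the schema an aligned concat of the two frames would be expected to carry. It does not call Polars: `aligned_schema` is a hypothetical helper, dtypes are written as plain strings, the `Float32` dtype for `b` is assumed to match the one reported for `a`, and the column order (shared key columns first) is an assumption.

```python
def aligned_schema(*schemas):
    """Merge column -> dtype mappings the way an aligned concat is expected to.

    Hypothetical helper for illustration only; not part of Polars.
    """
    if not schemas:
        return {}
    first, rest = schemas[0], schemas[1:]
    # Columns present in every frame act as the alignment keys.
    keys = [name for name in first if all(name in other for other in rest)]
    merged = {name: first[name] for name in keys}
    # Remaining columns follow in order of first appearance.
    for schema in schemas:
        for name, dtype in schema.items():
            merged.setdefault(name, dtype)
    return merged


# Schemas of the two lazy frames from the report (dtypes as plain strings).
schema_a = {"idx": "UInt32", "a": "Float32"}
schema_b = {"idx": "UInt32", "b": "Float32"}  # dtype of "b" is assumed

expected = aligned_schema(schema_a, schema_b)
print(expected)  # {'idx': 'UInt32', 'a': 'Float32', 'b': 'Float32'}
```

An empty frame with these three columns is the result the report asks for, instead of the ColumnNotFoundError.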

Installed versions

    In [2]: pl.show_versions()
    --------Version info---------
    Polars: 0.19.12
    Index type: UInt32
    Platform: macOS-11.7.10-x86_64-i386-64bit
    Python: 3.11.5 (main, Aug 24 2023, 15:23:14) [Clang 13.0.0 (clang-1300.0.29.30)]
    ----Optional dependencies----
    adbc_driver_sqlite: <not installed>
    cloudpickle: <not installed>
    connectorx: <not installed>
    deltalake: <not installed>
    fsspec: <not installed>
    gevent: <not installed>
    matplotlib: <not installed>
    numpy: 1.26.1
    openpyxl: <not installed>
    pandas: <not installed>
    pyarrow: <not installed>
    pydantic: <not installed>
    pyiceberg: <not installed>
    pyxlsb: <not installed>
    sqlalchemy: <not installed>
    xlsx2csv: <not installed>
    xlsxwriter: <not installed>

I should add that the following also works:

    dfs = [
        pl.read_parquet(path).lazy().with_row_count("idx")
        for path in (root / "a.parquet", root / "b.parquet")
    ]

    df = pl.concat(dfs, how="align")
    df.collect()
It seems the problem is that .scan_parquet cannot add a row count when the input is empty?
The schema is correct, but the column is never added.

    dfs[0].schema
    # OrderedDict([('idx', UInt32), ('a', Float32)])

    dfs[0].collect()
    # shape: (0, 1)
    # ┌─────┐
    # │ a   │
    # │ --- │
    # │ f32 │
    # ╞═════╡
    # └─────┘
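
The mismatch above can be checked mechanically by comparing the column names the lazy schema declares with the columns that actually come back from collect(). A minimal sketch in plain Python follows; `missing_columns` is a hypothetical helper, not a Polars function, and the two name lists are copied from the output above.

```python
def missing_columns(declared, materialized):
    """Return declared column names that are absent from the collected frame."""
    present = set(materialized)
    return [name for name in declared if name not in present]


declared = ["idx", "a"]  # names from dfs[0].schema
materialized = ["a"]     # names from dfs[0].collect()

print(missing_columns(declared, materialized))  # ['idx']
```

Against real frames this would presumably be called as missing_columns(lf.schema, lf.collect().columns), since iterating the schema mapping yields the column names.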
Ai, it should add the column regardless.