
fix(duckdb): disable batching for calls to arrow #8942

Closed
wants to merge 1 commit

Conversation

gforsyth
Member

xref #8896

This one took me a _while_ to figure out.

The `arrow` method has a `batch_size` kwarg that defaults to `1_000_000`, but the batch sizes aren't actually deterministic; they're just close to that size. If the result set is larger than `batch_size`, you can get slightly different results depending on where the batch boundaries fall.

I can add a test for this, but it would require a query that returns north of a million rows, and I don't know if we want that in our unit tests.

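A minimal sketch of the suspected failure mode, in pure Python rather than DuckDB's actual engine (the operator and timing model here are hypothetical, for illustration only): if batch boundaries are only *approximately* at `batch_size` and a downstream consumer emits batches in completion order, the final row order depends on both where the boundaries fall and which batch finishes first.

```python
def run(rows, boundaries, completion_order):
    """Split `rows` at `boundaries`, then emit batches in the order
    they hypothetically finish (simulating parallel consumption)."""
    starts = [0] + boundaries
    ends = boundaries + [len(rows)]
    batches = [rows[i:j] for i, j in zip(starts, ends)]
    return [r for k in completion_order for r in batches[k]]

rows = list(range(10))
out1 = run(rows, [5], [0, 1])  # boundary at 5, batch 0 finishes first
out2 = run(rows, [4], [1, 0])  # boundary at 4, batch 1 finishes first

assert sorted(out1) == sorted(out2)  # same rows in both runs...
assert out1 != out2                  # ...but in a different order
```

Same multiset of rows every time; only the observed ordering shifts between runs.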
@cpcloud
Member

cpcloud commented Apr 11, 2024

Thanks for the PR.

Any idea why batch size affects ... output schema? I think I'm having some trouble following what the underlying problem is here.

@gforsyth gforsyth linked an issue Apr 11, 2024 that may be closed by this pull request
@gforsyth
Member Author

dammit, this is fixing the wrong thing

@gforsyth
Member Author

`batch_size=0` just zeroes out all the results

@gforsyth
Member Author

Here's an example (I think) of the failure mode:

[ins] In [1]: import duckdb

[ins] In [2]: query = """SELECT
         ...:   "t5"."c_custkey",
         ...: FROM (
         ...:   SELECT DISTINCT * FROM (
         ...:       SELECT *
         ...:       FROM "customer" AS "t2"
         ...:       INNER JOIN "orders" AS "t3"
         ...:         ON "t2"."c_custkey" = "t3"."o_custkey"
         ...:       ORDER BY ALL
         ...:       )
         ...:   ) as "t5"
         ...: ORDER BY
         ...:   "t5"."c_nationkey" ASC
         ...: """

[ins] In [3]: con = duckdb.connect()

[ins] In [4]: con.load_extension("tpch")

[ins] In [5]: con.sql("CALL dbgen(sf=1)")
100% ▕████████████████████████████████████████████████████████████▏ 
Out[5]: 
┌─────────┐
│ Success │
│ boolean │
├─────────┤
│ 0 rows  │
└─────────┘

[nav] In [6]: results = [con.sql(query).arrow() for _ in range(5)]

[ins] In [7]: results
Out[7]: 
[pyarrow.Table
 c_custkey: int32
 ----
 c_custkey: [[147511,147511,147511,147511,147617,...,86623,86623,86623,86650,86650],[86650,86747,86795,86837,86926,...,147272,147272,147274,147400,147400]],
 pyarrow.Table
 c_custkey: int32
 ----
 c_custkey: [[147511,147511,147511,147511,147511,...,89101,89158,89254,89254,89294],[89300,89399,89426,89441,89488,...,110575,110575,110575,110584,110584]],
 pyarrow.Table
 c_custkey: int32
 ----
 c_custkey: [[147511,147511,147511,147511,147511,...,30229,30229,30251,30251,30251],[30304,30304,30304,30311,30322,...,135104,135112,135112,135155,135155]],
 pyarrow.Table
 c_custkey: int32
 ----
 c_custkey: [[147511,147511,147511,147511,147511,...,36109,36139,36139,36139,36139],[36175,36175,36221,36221,36221,...,122704,122770,122803,122803,122834]],
 pyarrow.Table
 c_custkey: int32
 ----
 c_custkey: [[147511,147511,147511,147511,147511,...,75778,75778,75806,75907,75907],[75907,75907,75958,75958,75958,...,122803,122803,122803,122834,122834]]]
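One way to make the comparison above concrete (a sketch with illustrative stand-in values, not the real query output): flatten each run's `c_custkey` column to a list and compare runs as multisets, since the rows agree and only their order differs.

```python
from collections import Counter

# Illustrative stand-ins for two runs' flattened c_custkey values;
# these are NOT the actual TPC-H results shown above.
run_a = [147511, 147511, 86623, 86650]
run_b = [86650, 147511, 86623, 147511]

assert Counter(run_a) == Counter(run_b)  # same rows in both runs
assert run_a != run_b                    # but a different ordering
```

With real pyarrow tables, the same idea applies after flattening the chunked column to Python values.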

@gforsyth
Member Author

yeah, `batch_size` is a red herring... back to the drawing board

@gforsyth gforsyth closed this Apr 11, 2024
Successfully merging this pull request may close these issues.

bug(duckdb): inconsistent ordering with large result sets