bug(duckdb): inconsistent ordering with large result sets #8943

gforsyth · 2024-04-11T16:13:49Z

Discussed in #8896

^{Originally posted by ivirshup April 5, 2024}
Hi,

I would like to make a query where the result is ordered by some columns which won't actually end up in the result. E.g.:

(
    table
    .order_by([col1, col2])
    .drop([col1, col2])
    .execute()
)

However, at least with the duckdb backend, this doesn't seem to maintain the requested ordering. Is there any way to enforce this?

I've included a reproducible example below, apologies for not being able to cut it down further, but I think the query has to already be a bit complicated to see this behaviour. I also can't seem to grab the demo data from the docs (update: that was #8874)

Example

import ibis
from pathlib import Path

DB_PTH = Path("EnsDb.Hsapiens.v108.sqlite")

if not DB_PTH.exists():
    !wget "https://bioconductorhubs.blob.core.windows.net/annotationhub/AHEnsDbs/v108/EnsDb.Hsapiens.v108.sqlite"

conn = ibis.duckdb.connect(":memory:", extensions=["sqlite"])
conn.attach_sqlite(DB_PTH)

genes = conn.table("gene")
tx = conn.table("tx")

query = (
    genes
    .join(tx, predicates=["gene_id"])
    .order_by(["seq_name", "tx_seq_start", "tx_seq_end"])
    .select(["gene_id", "tx_id"])
)

def check_consistent_order(table):
    return table.execute().equals(table.execute())

assert not all(check_consistent_order(query) for _ in range(5))

Thanks for any help!

gforsyth · 2024-04-11T16:16:20Z

copied over from discussion in #8896 (minus a few missteps in diagnosing this)
Thanks for raising this @ivirshup !

So, yes, this should absolutely be possible -- I'm still poking at this but I think it's a DuckDB issue, and it might be specific to using their arrow format for returning results (which we always do, even if out eventual target is a pandas dataframe).

See this duckdb-only reproducer:

[ins] In [1]: import duckdb

[ins] In [2]: con = duckdb.connect()

[ins] In [3]: con.sql("ATTACH 'EnsDb.Hsapiens.v108.sqlite' (TYPE SQLITE)")

[ins] In [4]: con.sql("USE EnsDb")

[ins] In [5]: query = """SELECT
         ...:   "t4"."gene_id",
         ...:   "t4"."tx_id"
         ...: FROM (
         ...:   SELECT
         ...:     "t2"."gene_id",
         ...:     "t2"."gene_name",
         ...:     "t2"."gene_biotype",
         ...:     "t2"."gene_seq_start",
         ...:     "t2"."gene_seq_end",
         ...:     "t2"."seq_name",
         ...:     "t2"."seq_strand",
         ...:     "t2"."seq_coord_system",
         ...:     "t2"."description",
         ...:     "t2"."gene_id_version",
         ...:     "t2"."canonical_transcript",
         ...:     "t3"."tx_id",
         ...:     "t3"."tx_biotype",
         ...:     "t3"."tx_seq_start",
         ...:     "t3"."tx_seq_end",
         ...:     "t3"."tx_cds_seq_start",
         ...:     "t3"."tx_cds_seq_end",
         ...:     "t3"."tx_support_level",
         ...:     "t3"."tx_id_version",
         ...:     "t3"."gc_content",
         ...:     "t3"."tx_external_name",
         ...:     "t3"."tx_is_canonical"
         ...:   FROM "gene" AS "t2"
         ...:   INNER JOIN "tx" AS "t3"
         ...:     ON "t2"."gene_id" = "t3"."gene_id"
         ...: ) AS "t4"
         ...: ORDER BY
         ...:   "t4"."seq_name" ASC,
         ...:   "t4"."tx_seq_start" ASC,
         ...:   "t4"."tx_seq_end" ASC"""

[ins] In [6]: results= [con.sql(query).arrow() for i in range(5)]

[ins] In [7]: for i in range(1, 5):
         ...:     print(results[0].equals(results[i]))
True
False
True
False

I'm going to try to see if I can reproduce this using DuckDB tables, then we can report it upstream

Ok! Latest update!

I'm pretty sure this is an Ibis bug, or a bit of missing Ibis surface area because the arrow() method in duckdb accepts a batch_size argument that we weren't passing, so getting back slightly nondeterministic batch sizes was leading to the mismatches. PR incoming

Sweet! Here's the original reproducer, now with the expected AssertionError:

[ins] In [1]: import ibis
         ...: from pathlib import Path
         ...: 
         ...: DB_PTH = Path("EnsDb.Hsapiens.v108.sqlite")
         ...: 
         ...: if not DB_PTH.exists():
         ...:     !wget "https://bioconductorhubs.blob.core.windows.net/annotationhub
         ...: /AHEnsDbs/v108/EnsDb.Hsapiens.v108.sqlite"
         ...: 
         ...: conn = ibis.duckdb.connect(":memory:", extensions=["sqlite"])
         ...: conn.attach_sqlite(DB_PTH)
         ...: 
         ...: genes = conn.table("gene")
         ...: tx = conn.table("tx")
         ...: 
         ...: query = (
         ...:     genes
         ...:     .join(tx, predicates=["gene_id"])
         ...:     .order_by(["seq_name", "tx_seq_start", "tx_seq_end"])
         ...:     .select(["gene_id", "tx_id"])
         ...: )
         ...: 
         ...: def check_consistent_order(table):
         ...:     return table.execute().equals(table.execute())
         ...: 
         ...: assert not all(check_consistent_order(query) for _ in range(5))
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[1], line 25
     22 def check_consistent_order(table):
     23     return table.execute().equals(table.execute())
---> 25 assert not all(check_consistent_order(query) for _ in range(5))

AssertionError:

gforsyth · 2024-04-11T16:23:55Z

Yet another update -- the "same results" in the comment above are because batch_size=0 zeroes out all result sets, it doesn't disable the batching.

So I still don't know what's going on here.

gforsyth · 2024-04-12T15:31:27Z

Ok, this was a bit of a journey and thanks to @cpcloud for providing some much-needed clarity on it.

The selected columns gene_tx and tx_id are pairwise unique, so you would expect the output to be consistently ordered. However, the ordering columns seq_name, tx_seq_start, tx_seq_end are not pairwise unique, there are a few overlaps, so the ordering in those overlaps can vary because there's not sufficient information for a deterministic ordering.

To get around that, we can add row numbers before the initial ordering, and then use the row number as the final ordering column to provide a tie-break when the other three columns aren't unique. With that added (below) this query is fully deterministic.

import ibis
from pathlib import Path

DB_PTH = Path("EnsDb.Hsapiens.v108.sqlite")

if not DB_PTH.exists():
    !wget "https://bioconductorhubs.blob.core.windows.net/annotationhub/AHEnsDbs/v108/EnsDb.Hsapiens.v108.sqlite"

conn = ibis.duckdb.connect(":memory:", extensions=["sqlite"])
conn.attach_sqlite(DB_PTH)

genes = conn.table("gene")
tx = conn.table("tx")

query = (
    genes
    .join(tx, predicates=["gene_id"]).mutate(_id=ibis.row_number())
    .order_by(["seq_name", "tx_seq_start", "tx_seq_end", "_id"])
    .select(["gene_id", "tx_id"])
)

def check_consistent_order(table):
    return table.execute().equals(table.execute())

assert not all(check_consistent_order(query) for _ in range(5))

I'm closing this out now as it's not a bug, but it was more than a bit hairy. I'll also post the answer over in the original discussion.

ivirshup · 2024-04-12T16:30:20Z

Thank you for the in depth investigation and response!

I was quite confused, as I thought I had checked to make sure that results were distinct by the columns being sorted on, but am now realizing there was a bug in that code, and I was always returning distinct cases.

However, investigating this made find something, if I add a distinct call here, the results are not ordered correctly:

import ibis
from pathlib import Path

DB_PTH = Path("EnsDb.Hsapiens.v108.sqlite")

if not DB_PTH.exists():
    !wget "https://bioconductorhubs.blob.core.windows.net/annotationhub/AHEnsDbs/v108/EnsDb.Hsapiens.v108.sqlite"

conn = ibis.duckdb.connect(":memory:", extensions=["sqlite"])
conn.attach_sqlite(DB_PTH)


genes = conn.table("gene")
tx = conn.table("tx")
tx2exon = conn.table("tx2exon")
exon = conn.table("exon")

cols = ["seq_name", "gene_seq_start", "gene_seq_end", "tx_seq_start", "tx_seq_end", "exon_seq_start", "exon_seq_end"]

query = (
    genes
    .join(tx, predicates=["gene_id"])
    .join(tx2exon, predicates=["tx_id"])
    .join(exon, predicates=["exon_id"])
    .order_by(cols)
    .select(["gene_id", "tx_id"] + cols)
)

distinct_query = query.distinct()

# Note that the results are already distinct
assert distinct_query.count().execute() == query.count().execute()

query_result = query.execute()
assert query_result.equals(query_result.sort_values(cols, kind="stable"))

distinct_result = distinct_query.execute()
assert distinct_result.equals(distinct_result.sort_values(cols, kind="stable"))

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[1], [line 37](vscode-notebook-cell:?execution_count=1&line=37)
     [34](vscode-notebook-cell:?execution_count=1&line=34) assert query_result.equals(query_result.sort_values(cols, kind="stable"))
     [36](vscode-notebook-cell:?execution_count=1&line=36) distinct_result = distinct_query.execute()
---> [37](vscode-notebook-cell:?execution_count=1&line=37) assert distinct_result.equals(distinct_result.sort_values(cols, kind="stable"))

AssertionError:

In my initial testing, I was getting results that just weren't in the order I was expecting at all (e.g. not ordered by seq_name) which would be explained by this.

Is this behaviour expected?

gforsyth · 2024-04-12T16:56:51Z

In that case, yes, the distinct call is going to be the outermost query and most engines are going to handle that operation without ordering guarantees (beyond things like "keep_first" or "keep_last" in the implicit groupby that occurs there).

You'll need to move the order_by to after the distinct call to maintain the ordering in the output.

gforsyth added bug Incorrect behavior inside of ibis duckdb The DuckDB backend labels Apr 11, 2024

gforsyth added this to the 9.0 milestone Apr 11, 2024

gforsyth self-assigned this Apr 11, 2024

github-project-automation bot added this to Ibis planning and roadmap Apr 11, 2024

github-project-automation bot moved this to backlog in Ibis planning and roadmap Apr 11, 2024

gforsyth linked a pull request Apr 11, 2024 that will close this issue

fix(duckdb): disable batching for calls to arrow #8942

Closed

gforsyth moved this from backlog to cooking in Ibis planning and roadmap Apr 11, 2024

gforsyth closed this as completed Apr 12, 2024

github-project-automation bot moved this from cooking to done in Ibis planning and roadmap Apr 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bug(duckdb): inconsistent ordering with large result sets #8943

bug(duckdb): inconsistent ordering with large result sets #8943

gforsyth commented Apr 11, 2024

gforsyth commented Apr 11, 2024

gforsyth commented Apr 11, 2024

gforsyth commented Apr 12, 2024

ivirshup commented Apr 12, 2024

gforsyth commented Apr 12, 2024

bug(duckdb): inconsistent ordering with large result sets #8943

bug(duckdb): inconsistent ordering with large result sets #8943

Comments

gforsyth commented Apr 11, 2024

Discussed in #8896

gforsyth commented Apr 11, 2024

gforsyth commented Apr 11, 2024

gforsyth commented Apr 12, 2024

ivirshup commented Apr 12, 2024

gforsyth commented Apr 12, 2024