bug(duckdb): inconsistent ordering with large result sets #8943
Copied over from the discussion in #8896 (minus a few missteps in diagnosing this).
So, yes, this should absolutely be possible. I'm still poking at this, but I think it's a DuckDB issue, and it might be specific to using their
See this duckdb-only reproducer:
[ins] In [1]: import duckdb
[ins] In [2]: con = duckdb.connect()
[ins] In [3]: con.sql("ATTACH 'EnsDb.Hsapiens.v108.sqlite' (TYPE SQLITE)")
[ins] In [4]: con.sql("USE EnsDb")
[ins] In [5]: query = """SELECT
...:     "t4"."gene_id",
...:     "t4"."tx_id"
...: FROM (
...:     SELECT
...:         "t2"."gene_id",
...:         "t2"."gene_name",
...:         "t2"."gene_biotype",
...:         "t2"."gene_seq_start",
...:         "t2"."gene_seq_end",
...:         "t2"."seq_name",
...:         "t2"."seq_strand",
...:         "t2"."seq_coord_system",
...:         "t2"."description",
...:         "t2"."gene_id_version",
...:         "t2"."canonical_transcript",
...:         "t3"."tx_id",
...:         "t3"."tx_biotype",
...:         "t3"."tx_seq_start",
...:         "t3"."tx_seq_end",
...:         "t3"."tx_cds_seq_start",
...:         "t3"."tx_cds_seq_end",
...:         "t3"."tx_support_level",
...:         "t3"."tx_id_version",
...:         "t3"."gc_content",
...:         "t3"."tx_external_name",
...:         "t3"."tx_is_canonical"
...:     FROM "gene" AS "t2"
...:     INNER JOIN "tx" AS "t3"
...:         ON "t2"."gene_id" = "t3"."gene_id"
...: ) AS "t4"
...: ORDER BY
...:     "t4"."seq_name" ASC,
...:     "t4"."tx_seq_start" ASC,
...:     "t4"."tx_seq_end" ASC"""
[ins] In [6]: results = [con.sql(query).arrow() for i in range(5)]
[ins] In [7]: for i in range(1, 5):
...:     print(results[0].equals(results[i]))
True
False
True
False
I'm going to try to see if I can reproduce this using DuckDB tables, then we can report it upstream.
Ok! Latest update! I'm pretty sure this is an Ibis bug, or a bit of missing Ibis surface area because the
Sweet! Here's the original reproducer, now with the expected AssertionError:
[ins] In [1]: import ibis
...: from pathlib import Path
...:
...: DB_PTH = Path("EnsDb.Hsapiens.v108.sqlite")
...:
...: if not DB_PTH.exists():
...:     !wget "https://bioconductorhubs.blob.core.windows.net/annotationhub/AHEnsDbs/v108/EnsDb.Hsapiens.v108.sqlite"
...:
...: conn = ibis.duckdb.connect(":memory:", extensions=["sqlite"])
...: conn.attach_sqlite(DB_PTH)
...:
...: genes = conn.table("gene")
...: tx = conn.table("tx")
...:
...: query = (
...:     genes
...:     .join(tx, predicates=["gene_id"])
...:     .order_by(["seq_name", "tx_seq_start", "tx_seq_end"])
...:     .select(["gene_id", "tx_id"])
...: )
...:
...: def check_consistent_order(table):
...:     return table.execute().equals(table.execute())
...:
...: assert not all(check_consistent_order(query) for _ in range(5))
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
Cell In[1], line 25
     22 def check_consistent_order(table):
     23     return table.execute().equals(table.execute())
---> 25 assert not all(check_consistent_order(query) for _ in range(5))
AssertionError:
Yet another update: the "same results" in the comment above are because
So I still don't know what's going on here.
Ok, this was a bit of a journey, and thanks to @cpcloud for providing some much-needed clarity on it.
The selected columns
To get around that, we can add row numbers before the initial ordering, and then use the row number as the final ordering column to provide a tie-break when the other three columns aren't unique. With that added (below), this query is fully deterministic.
import ibis
from pathlib import Path
DB_PTH = Path("EnsDb.Hsapiens.v108.sqlite")
if not DB_PTH.exists():
    !wget "https://bioconductorhubs.blob.core.windows.net/annotationhub/AHEnsDbs/v108/EnsDb.Hsapiens.v108.sqlite"
conn = ibis.duckdb.connect(":memory:", extensions=["sqlite"])
conn.attach_sqlite(DB_PTH)
genes = conn.table("gene")
tx = conn.table("tx")
query = (
    genes
    .join(tx, predicates=["gene_id"])
    .mutate(_id=ibis.row_number())
    .order_by(["seq_name", "tx_seq_start", "tx_seq_end", "_id"])
    .select(["gene_id", "tx_id"])
)
def check_consistent_order(table):
    return table.execute().equals(table.execute())
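# As in the original reproducer, this assertion raises AssertionError when every run returns the same order.
assert not all(check_consistent_order(query) for _ in range(5))
I'm closing this out now as it's not a bug, but it was more than a bit hairy. I'll also post the answer over in the original discussion.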
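For anyone landing here later, the tie-break idea can be sketched without a database. This is a toy illustration in plain Python, not what DuckDB or Ibis do internally: the rows are made up, the two "arrival orders" stand in for a join that emits rows in no fixed order, and the tie-breaker is a `tx_id` column that is assumed to be unique (rather than the row number used in the comment above).

```python
# Hypothetical rows: (seq_name, tx_seq_start, tx_seq_end, tx_id).
# tx_a and tx_b tie on all three sort columns.
rows = [
    ("1", 100, 200, "tx_a"),
    ("1", 100, 200, "tx_b"),
    ("1", 150, 300, "tx_c"),
    ("2", 50, 80, "tx_d"),
]


def three_cols(row):
    """Sort key made of the three ordering columns only (ties possible)."""
    return row[:3]


def three_cols_plus_id(row):
    """Same key plus tx_id, assumed unique here, as the final tie-breaker."""
    return row[:3] + (row[3],)


def tx_ids_sorted(input_rows, key):
    """Sort the rows with the given key and return just the tx_id column."""
    return [row[3] for row in sorted(input_rows, key=key)]


# Two arrival orders standing in for an unordered join output.
arrival_a = rows
arrival_b = list(reversed(rows))

# Without a tie-breaker, tied rows keep whatever order they arrived in.
order_plain_a = tx_ids_sorted(arrival_a, three_cols)
order_plain_b = tx_ids_sorted(arrival_b, three_cols)

# With a unique tie-breaker, the arrival order no longer matters.
order_tiebreak_a = tx_ids_sorted(arrival_a, three_cols_plus_id)
order_tiebreak_b = tx_ids_sorted(arrival_b, three_cols_plus_id)

print(order_plain_a, order_plain_b)
print(order_tiebreak_a, order_tiebreak_b)
```

The point is only that an ordering on non-unique columns leaves the relative order of tied rows up to the engine, so any unique final sort column pins it down.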
Thank you for the in-depth investigation and response! I was quite confused, as I thought I had checked to make sure that results were distinct by the columns being sorted on, but am now realizing there was a bug in that code, and I was always returning distinct cases.
However, investigating this made me find something: if I add a
import ibis
from pathlib import Path
DB_PTH = Path("EnsDb.Hsapiens.v108.sqlite")
if not DB_PTH.exists():
    !wget "https://bioconductorhubs.blob.core.windows.net/annotationhub/AHEnsDbs/v108/EnsDb.Hsapiens.v108.sqlite"
conn = ibis.duckdb.connect(":memory:", extensions=["sqlite"])
conn.attach_sqlite(DB_PTH)
genes = conn.table("gene")
tx = conn.table("tx")
tx2exon = conn.table("tx2exon")
exon = conn.table("exon")
cols = ["seq_name", "gene_seq_start", "gene_seq_end", "tx_seq_start", "tx_seq_end", "exon_seq_start", "exon_seq_end"]
query = (
    genes
    .join(tx, predicates=["gene_id"])
    .join(tx2exon, predicates=["tx_id"])
    .join(exon, predicates=["exon_id"])
    .order_by(cols)
    .select(["gene_id", "tx_id"] + cols)
)
distinct_query = query.distinct()
# Note that the results are already distinct
assert distinct_query.count().execute() == query.count().execute()
query_result = query.execute()
assert query_result.equals(query_result.sort_values(cols, kind="stable"))
distinct_result = distinct_query.execute()
assert distinct_result.equals(distinct_result.sort_values(cols, kind="stable"))
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
Cell In[1], line 37
     34 assert query_result.equals(query_result.sort_values(cols, kind="stable"))
     36 distinct_result = distinct_query.execute()
---> 37 assert distinct_result.equals(distinct_result.sort_values(cols, kind="stable"))
AssertionError:
In my initial testing, I was getting results that just weren't in the order I was expecting at all (e.g. not ordered by
Is this behaviour expected?
In that case, yes, the
You'll need to move the
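The reply above is truncated, so the following is only one reading of the exchange: in SQL generally, only the outermost ordering is guaranteed to survive, and a deduplication step applied after a sort is free to emit rows in another order. Below is a toy model of a hash-based distinct in plain Python. The bucket function and data are made up for illustration and this is not DuckDB's actual implementation.

```python
def hash_distinct(rows, n_buckets=2):
    """Toy hash-based DISTINCT: partition rows into buckets, dedupe per bucket,
    then concatenate the buckets. `row[0] % n_buckets` stands in for a hash."""
    buckets = [dict() for _ in range(n_buckets)]
    for row in rows:
        # dict keys dedupe the rows and remember first-seen order per bucket
        buckets[row[0] % n_buckets].setdefault(row, None)
    return [row for bucket in buckets for row in bucket]


# Already sorted input containing one duplicate row.
sorted_rows = [(1, "a"), (2, "b"), (2, "b"), (3, "c"), (4, "d")]

# Sort first, then distinct: duplicates are gone, but so is the global order.
distinct_after_sort = hash_distinct(sorted_rows)

# Distinct first, then sort: the ordering is applied last, so it holds.
sort_after_distinct = sorted(hash_distinct(sorted_rows))

print(distinct_after_sort)
print(sort_after_distinct)
```

In Ibis terms this would presumably correspond to calling the ordering step after the distinct step rather than before it, though the exact wording of the original reply is not recoverable here.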
Discussed in #8896
Originally posted by ivirshup April 5, 2024
Hi,
I would like to make a query where the result is ordered by some columns which won't actually end up in the result. E.g.:
However, at least with the duckdb backend, this doesn't seem to maintain the requested ordering. Is there any way to enforce this?
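As a point of reference, ordering by columns that are not in the select list is valid in plain SQL. Here is a minimal sketch using Python's built-in `sqlite3` module and a made-up three-row `tx` table (the column names are borrowed from this thread, the data is hypothetical). It includes a unique final sort column so the result does not depend on how ties are resolved.

```python
import sqlite3

# In-memory toy database; not the EnsDb file from the reproducer.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE tx (tx_id TEXT, gene_id TEXT, seq_name TEXT, tx_seq_start INTEGER)"
)
con.executemany(
    "INSERT INTO tx VALUES (?, ?, ?, ?)",
    [
        ("tx_c", "g2", "2", 10),
        ("tx_a", "g1", "1", 5),
        ("tx_b", "g1", "1", 50),
    ],
)

# seq_name and tx_seq_start drive the ordering but are not selected;
# tx_id is added last as a unique tie-breaker.
result = con.execute(
    "SELECT gene_id, tx_id FROM tx ORDER BY seq_name, tx_seq_start, tx_id"
).fetchall()
con.close()

print(result)
```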
I've included a reproducible example below, apologies for not being able to cut it down further, but I think the query has to already be a bit complicated to see this behaviour. I also can't seem to grab the demo data from the docs (update: that was #8874)
Example
Thanks for any help!