Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug(duckdb): inconsistent ordering with large result sets #8943

Closed
gforsyth opened this issue Apr 11, 2024 Discussed in #8896 · 5 comments
Closed

bug(duckdb): inconsistent ordering with large result sets #8943

gforsyth opened this issue Apr 11, 2024 Discussed in #8896 · 5 comments
Assignees
Labels
bug Incorrect behavior inside of ibis duckdb The DuckDB backend
Milestone

Comments

@gforsyth
Copy link
Member

Discussed in #8896

Originally posted by ivirshup April 5, 2024
Hi,

I would like to make a query where the result is ordered by some columns which won't actually end up in the result. E.g.:

(
    table
    .order_by([col1, col2])
    .drop([col1, col2])
    .execute()
)

However, at least with the duckdb backend, this doesn't seem to maintain the requested ordering. Is there any way to enforce this?

I've included a reproducible example below, apologies for not being able to cut it down further, but I think the query has to already be a bit complicated to see this behaviour. I also can't seem to grab the demo data from the docs (update: that was #8874)

Example
import ibis
from pathlib import Path

DB_PTH = Path("EnsDb.Hsapiens.v108.sqlite")

if not DB_PTH.exists():
    !wget "https://bioconductorhubs.blob.core.windows.net/annotationhub/AHEnsDbs/v108/EnsDb.Hsapiens.v108.sqlite"

conn = ibis.duckdb.connect(":memory:", extensions=["sqlite"])
conn.attach_sqlite(DB_PTH)

genes = conn.table("gene")
tx = conn.table("tx")

query = (
    genes
    .join(tx, predicates=["gene_id"])
    .order_by(["seq_name", "tx_seq_start", "tx_seq_end"])
    .select(["gene_id", "tx_id"])
)

def check_consistent_order(table):
    return table.execute().equals(table.execute())

assert not all(check_consistent_order(query) for _ in range(5))

Thanks for any help!

@gforsyth gforsyth added bug Incorrect behavior inside of ibis duckdb The DuckDB backend labels Apr 11, 2024
@gforsyth gforsyth added this to the 9.0 milestone Apr 11, 2024
@gforsyth gforsyth self-assigned this Apr 11, 2024
@gforsyth
Copy link
Member Author

copied over from discussion in #8896 (minus a few missteps in diagnosing this)
Thanks for raising this @ivirshup !

So, yes, this should absolutely be possible -- I'm still poking at this but I think it's a DuckDB issue, and it might be specific to using their arrow format for returning results (which we always do, even if out eventual target is a pandas dataframe).

See this duckdb-only reproducer:

[ins] In [1]: import duckdb

[ins] In [2]: con = duckdb.connect()

[ins] In [3]: con.sql("ATTACH 'EnsDb.Hsapiens.v108.sqlite' (TYPE SQLITE)")

[ins] In [4]: con.sql("USE EnsDb")

[ins] In [5]: query = """SELECT
         ...:   "t4"."gene_id",
         ...:   "t4"."tx_id"
         ...: FROM (
         ...:   SELECT
         ...:     "t2"."gene_id",
         ...:     "t2"."gene_name",
         ...:     "t2"."gene_biotype",
         ...:     "t2"."gene_seq_start",
         ...:     "t2"."gene_seq_end",
         ...:     "t2"."seq_name",
         ...:     "t2"."seq_strand",
         ...:     "t2"."seq_coord_system",
         ...:     "t2"."description",
         ...:     "t2"."gene_id_version",
         ...:     "t2"."canonical_transcript",
         ...:     "t3"."tx_id",
         ...:     "t3"."tx_biotype",
         ...:     "t3"."tx_seq_start",
         ...:     "t3"."tx_seq_end",
         ...:     "t3"."tx_cds_seq_start",
         ...:     "t3"."tx_cds_seq_end",
         ...:     "t3"."tx_support_level",
         ...:     "t3"."tx_id_version",
         ...:     "t3"."gc_content",
         ...:     "t3"."tx_external_name",
         ...:     "t3"."tx_is_canonical"
         ...:   FROM "gene" AS "t2"
         ...:   INNER JOIN "tx" AS "t3"
         ...:     ON "t2"."gene_id" = "t3"."gene_id"
         ...: ) AS "t4"
         ...: ORDER BY
         ...:   "t4"."seq_name" ASC,
         ...:   "t4"."tx_seq_start" ASC,
         ...:   "t4"."tx_seq_end" ASC"""

[ins] In [6]: results= [con.sql(query).arrow() for i in range(5)]

[ins] In [7]: for i in range(1, 5):
         ...:     print(results[0].equals(results[i]))
True
False
True
False

I'm going to try to see if I can reproduce this using DuckDB tables, then we can report it upstream

Ok! Latest update!

I'm pretty sure this is an Ibis bug, or a bit of missing Ibis surface area because the arrow() method in duckdb accepts a batch_size argument that we weren't passing, so getting back slightly nondeterministic batch sizes was leading to the mismatches. PR incoming

Sweet! Here's the original reproducer, now with the expected AssertionError:

[ins] In [1]: import ibis
         ...: from pathlib import Path
         ...: 
         ...: DB_PTH = Path("EnsDb.Hsapiens.v108.sqlite")
         ...: 
         ...: if not DB_PTH.exists():
         ...:     !wget "https://bioconductorhubs.blob.core.windows.net/annotationhub
         ...: /AHEnsDbs/v108/EnsDb.Hsapiens.v108.sqlite"
         ...: 
         ...: conn = ibis.duckdb.connect(":memory:", extensions=["sqlite"])
         ...: conn.attach_sqlite(DB_PTH)
         ...: 
         ...: genes = conn.table("gene")
         ...: tx = conn.table("tx")
         ...: 
         ...: query = (
         ...:     genes
         ...:     .join(tx, predicates=["gene_id"])
         ...:     .order_by(["seq_name", "tx_seq_start", "tx_seq_end"])
         ...:     .select(["gene_id", "tx_id"])
         ...: )
         ...: 
         ...: def check_consistent_order(table):
         ...:     return table.execute().equals(table.execute())
         ...: 
         ...: assert not all(check_consistent_order(query) for _ in range(5))
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[1], line 25
     22 def check_consistent_order(table):
     23     return table.execute().equals(table.execute())
---> 25 assert not all(check_consistent_order(query) for _ in range(5))

AssertionError: 

@gforsyth gforsyth linked a pull request Apr 11, 2024 that will close this issue
@gforsyth gforsyth moved this from backlog to cooking in Ibis planning and roadmap Apr 11, 2024
@gforsyth
Copy link
Member Author

Yet another update -- the "same results" in the comment above are because batch_size=0 zeroes out all result sets, it doesn't disable the batching.

So I still don't know what's going on here.

@gforsyth
Copy link
Member Author

Ok, this was a bit of a journey and thanks to @cpcloud for providing some much-needed clarity on it.

The selected columns gene_tx and tx_id are pairwise unique, so you would expect the output to be consistently ordered. However, the ordering columns seq_name, tx_seq_start, tx_seq_end are not pairwise unique, there are a few overlaps, so the ordering in those overlaps can vary because there's not sufficient information for a deterministic ordering.

To get around that, we can add row numbers before the initial ordering, and then use the row number as the final ordering column to provide a tie-break when the other three columns aren't unique. With that added (below) this query is fully deterministic.

import ibis
from pathlib import Path

DB_PTH = Path("EnsDb.Hsapiens.v108.sqlite")

if not DB_PTH.exists():
    !wget "https://bioconductorhubs.blob.core.windows.net/annotationhub/AHEnsDbs/v108/EnsDb.Hsapiens.v108.sqlite"

conn = ibis.duckdb.connect(":memory:", extensions=["sqlite"])
conn.attach_sqlite(DB_PTH)

genes = conn.table("gene")
tx = conn.table("tx")

query = (
    genes
    .join(tx, predicates=["gene_id"]).mutate(_id=ibis.row_number())
    .order_by(["seq_name", "tx_seq_start", "tx_seq_end", "_id"])
    .select(["gene_id", "tx_id"])
)

def check_consistent_order(table):
    return table.execute().equals(table.execute())

assert not all(check_consistent_order(query) for _ in range(5))

I'm closing this out now as it's not a bug, but it was more than a bit hairy. I'll also post the answer over in the original discussion.

@github-project-automation github-project-automation bot moved this from cooking to done in Ibis planning and roadmap Apr 12, 2024
@ivirshup
Copy link

Thank you for the in depth investigation and response!

I was quite confused, as I thought I had checked to make sure that results were distinct by the columns being sorted on, but am now realizing there was a bug in that code, and I was always returning distinct cases.

However, investigating this made find something, if I add a distinct call here, the results are not ordered correctly:

import ibis
from pathlib import Path

DB_PTH = Path("EnsDb.Hsapiens.v108.sqlite")

if not DB_PTH.exists():
    !wget "https://bioconductorhubs.blob.core.windows.net/annotationhub/AHEnsDbs/v108/EnsDb.Hsapiens.v108.sqlite"

conn = ibis.duckdb.connect(":memory:", extensions=["sqlite"])
conn.attach_sqlite(DB_PTH)


genes = conn.table("gene")
tx = conn.table("tx")
tx2exon = conn.table("tx2exon")
exon = conn.table("exon")

cols = ["seq_name", "gene_seq_start", "gene_seq_end", "tx_seq_start", "tx_seq_end", "exon_seq_start", "exon_seq_end"]

query = (
    genes
    .join(tx, predicates=["gene_id"])
    .join(tx2exon, predicates=["tx_id"])
    .join(exon, predicates=["exon_id"])
    .order_by(cols)
    .select(["gene_id", "tx_id"] + cols)
)

distinct_query = query.distinct()

# Note that the results are already distinct
assert distinct_query.count().execute() == query.count().execute()

query_result = query.execute()
assert query_result.equals(query_result.sort_values(cols, kind="stable"))

distinct_result = distinct_query.execute()
assert distinct_result.equals(distinct_result.sort_values(cols, kind="stable"))
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[1], [line 37](vscode-notebook-cell:?execution_count=1&line=37)
     [34](vscode-notebook-cell:?execution_count=1&line=34) assert query_result.equals(query_result.sort_values(cols, kind="stable"))
     [36](vscode-notebook-cell:?execution_count=1&line=36) distinct_result = distinct_query.execute()
---> [37](vscode-notebook-cell:?execution_count=1&line=37) assert distinct_result.equals(distinct_result.sort_values(cols, kind="stable"))

AssertionError:

In my initial testing, I was getting results that just weren't in the order I was expecting at all (e.g. not ordered by seq_name) which would be explained by this.

Is this behaviour expected?

@gforsyth
Copy link
Member Author

In that case, yes, the distinct call is going to be the outermost query and most engines are going to handle that operation without ordering guarantees (beyond things like "keep_first" or "keep_last" in the implicit groupby that occurs there).

You'll need to move the order_by to after the distinct call to maintain the ordering in the output.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior inside of ibis duckdb The DuckDB backend
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

2 participants