Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MSSQL Incorrect Translation of Boolean and row_number while Filtering Zero Rows #1233

Closed
kmishra9 opened this issue Mar 30, 2023 · 3 comments · Fixed by #1288
Closed

MSSQL Incorrect Translation of Boolean and row_number while Filtering Zero Rows #1233

kmishra9 opened this issue Mar 30, 2023 · 3 comments · Fixed by #1288
Labels
help wanted ❤️ we'd love your help! MSSQL

Comments

@kmishra9
Copy link

I'm trying to filter all rows from a table to handle certain edge cases where I want to pass a table forward for use in joins but just have it be empty if necessary. Typically, a filter(FALSE) is all that I need to do, but when connected to an MS SQL Server via odbc and the freetds driver, there's some translation hiccups that prevent the two most common patterns for this from happening.

My guess is that, TRUE and FALSE are being incorrectly translated to numerics instead of booleans? And the row_number() translation could probably rely on some sort of sensible default windowing scheme

Reprex:

suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(dbplyr))

df <- tibble(a = 1:10) 

db_rs <- tbl_lazy(df = df, con = simulate_redshift())
db_sq <- tbl_lazy(df = df, con = simulate_sqlite())
db_ms <- tbl_lazy(df = df, con = simulate_mssql())

# Successful in their respective DBs
db_rs %>% filter(FALSE)
#> <SQL>
#> SELECT *
#> FROM `df`
#> WHERE (FALSE)
db_sq %>% filter(FALSE)
#> <SQL>
#> SELECT *
#> FROM `df`
#> WHERE (0)

# Fails in Microsft SQL Server 
db_ms %>% filter(FALSE)
#> <SQL>
#> SELECT *
#> FROM `df`
#> WHERE (0)
db_ms %>% filter(row_number() == -1)
#> <SQL>
#> SELECT `a`
#> FROM (
#>   SELECT *, ROW_NUMBER() OVER () AS `q01`
#>   FROM `df`
#> ) `q01`
#> WHERE (`q01` = -1.0)

# Workarounds
db_ms %>% filter(row_number(a) == -1)
#> <SQL>
#> SELECT `a`
#> FROM (
#>   SELECT
#>     *,
#>     CASE
#> WHEN (NOT((`a` IS NULL))) THEN ROW_NUMBER() OVER (PARTITION BY (CASE WHEN ((`a` IS NULL)) THEN 1 ELSE 0 END) ORDER BY `a`)
#> END AS `q02`
#>   FROM `df`
#> ) `q01`
#> WHERE (`q02` = -1.0)
db_ms %>% filter(1 != 1)
#> <SQL>
#> SELECT *
#> FROM `df`
#> WHERE (1.0 != 1.0)

Created on 2023-03-30 with reprex v2.0.2

Specifically, the two errors for the failing filter queries in my environment are:

Error in `collect()`:
! Failed to collect lazy table.
Caused by error:
! nanodbc/nanodbc.cpp:1752: 00000: [FreeTDS][SQL Server]Statement(s) could not be prepared.  [FreeTDS][SQL Server]An expression of non-boolean type specified in a context where a condition is expected, near ')'. 
<SQL> 'SELECT TOP 11 *
FROM "dbo"."test_df"
WHERE (0)'

and

Error in `collect()`:
! Failed to collect lazy table.
Caused by error:
! nanodbc/nanodbc.cpp:1752: 00000: [FreeTDS][SQL Server]Statement(s) could not be prepared.  [FreeTDS][SQL Server]The function 'ROW_NUMBER' must have an OVER clause with ORDER BY. 
<SQL> 'SELECT TOP 11
  "year",
  "month",
  "day",
  "dep_time",
  "sched_dep_time",
  "dep_delay",
  "arr_time",
  "sched_arr_time",
  "arr_delay",
  "carrier",
  "flight",
  "tailnum",
  "origin",
  "dest",
  "air_time",
  "distance",
  "hour",
  "minute",
  "time_hour"
FROM (
  SELECT *, ROW_NUMBER() OVER () AS "q01"
  FROM "dbo"."test_df"
) "q01"
WHERE ("q01" = -1.0)'
@kmishra9 kmishra9 changed the title Microsoft SQL Server - Incorrect Translation Filtering Zero Rows MSSQL Incorrect Translation of Boolean and row_number while Filtering Zero Rows Apr 11, 2023
@mgirlich
Copy link
Collaborator

I don't know much about MS SQL translation. From what I see in the dbplyr code it is a bit complicated handling logical values.

@mgirlich mgirlich added help wanted ❤️ we'd love your help! MSSQL labels Apr 26, 2023
@ejneer
Copy link
Contributor

ejneer commented Apr 28, 2023

re row_number:
It looks like this issue stems from MS SQL's odd syntax when an order isn't specified/desired (see this SO answer)

The query below succeeds against my test mssql db with nycflights data:

SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS "x"
FROM "flights"

This only happens when no argument is passed to row_number and there has been no prior sorting with arrange/window_order (since that ordering will be reused if nothing is given to row_number)

library(dplyr)
con <- DBI::dbConnect(...)

# ***** FAILS *****
tbl(con, "flights") |>
  transmute(year, month, x = row_number()) |>
  show_query()
#> <SQL>
#> SELECT "year", "month", ROW_NUMBER() OVER () AS "x"
#> FROM "flights"

# ***** SUCCEEDS *****
tbl(con, "flights") |>
  arrange(year) |>
  transmute(year, month, x = row_number()) |>
  show_query()
#> <SQL>
#> SELECT "year", "month", ROW_NUMBER() OVER (ORDER BY "year") AS "x"
#> FROM "flights"
#> ORDER BY "year"

# ***** SUCCEEDS *****
tbl(con, "flights") |>
  dbplyr::window_order(year) |>
  transmute(year, month, x = row_number()) |>
  show_query()
#> <SQL>
#> SELECT "year", "month", ROW_NUMBER() OVER (ORDER BY "year") AS "x"
#> FROM "flights"

# ***** SUCCEEDS *****
tbl(con, "flights") |>
  transmute(year, month, x = row_number(year)) |>
  show_query()
#> <SQL>
#> SELECT
#>   "year",
#>   "month",
#>   CASE
#> WHEN (NOT(("year" IS NULL))) THEN ROW_NUMBER() OVER (PARTITION BY (CASE WHEN (("year" IS NULL)) THEN 1 ELSE 0 END) ORDER BY "year")
#> END AS "x"
#> FROM "flights"

Created on 2023-04-28 with reprex v2.0.2

I've got a possible fix on my fork (6a9ad63) that patches win_rank to allow different handling of this empty ordering case. That way the mssql translator can customize it to return its strange syntax. TBH this solution feels a little wonky for such a specific case that has pretty easy workarounds... @mgirlich do you want a PR for this one?

on 6a9ad63 :

library(dplyr)
con <- DBI::dbConnect(...)

tbl(con, "flights") |> filter(row_number() == -1)
#> # Source:   SQL [0 x 19]
#> # Database: Microsoft SQL Server 15.00.2000[dbo@ac43d9cc0a62/test_dbplyr]
#> # … with 19 variables: year <int>, month <int>, day <int>, dep_time <int>,
#> #   sched_dep_time <int>, dep_delay <dbl>, arr_time <int>,
#> #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> #   hour <dbl>, minute <dbl>, time_hour <dttm>

tbl(con, "flights") |>
  transmute(year, month, x = row_number()) |>
  show_query()
#> <SQL>
#> SELECT "year", "month", ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS "x"
#> FROM "flights"

Created on 2023-04-28 with reprex v2.0.2

@mgirlich
Copy link
Collaborator

Closed by #1331.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted ❤️ we'd love your help! MSSQL
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants