Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for Databricks backend #1377

Closed
edgararuiz opened this issue Oct 24, 2023 · 5 comments · Fixed by #1379
Closed

Add support for Databricks backend #1377

edgararuiz opened this issue Oct 24, 2023 · 5 comments · Fixed by #1379

Comments

@edgararuiz
Copy link
Collaborator

Thanks to the default "DBI" translation, most of the dplyr/dbplyr operations work. There are Databricks specific operations, for example var(), currently marked as not-supported by dbplyr, due to the fact that var() is not in the default translation. Databricks uses "Spark SQL", which is underpinned by the Hive SQL syntax. My suggestion would be to use the same SQL variance currently in use for the Hive backend support.

The second issue uncovered, is that copy_to() does not work. The Databricks back-end does not support transactions. So I think a custom db_copy_to() function will be needed here.

Reprex

The following reprex contains the specific code that can be used to connect to Databricks using a PAT. It also contains confirmation that var() is supported in Databricks. The confirmation is in calling SQL directly via DBI, and then attempting
to do the same via dplyr, which errors out. It also contains the error received when trying to use copy_to()

# https://docs.databricks.com/en/sql/language-manual/index.html

library(dbplyr)
library(dplyr)
library(DBI)

con <- dbConnect(
  odbc::odbc(),
  Driver = "/Library/simba/spark/lib/libsparkodbc_sb64-universal.dylib",
  Host = "rstudio-partner-posit-default.cloud.databricks.com",
  Port = 443,
  AuthMech = 3,
  HTTPPath = "/sql/1.0/warehouses/300bd24ba12adf8e",
  Protocol = "https",
  ThriftTransport = 2,
  SSL = 1,
  UID = "token",
  PWD = Sys.getenv("DATABRICKS_TOKEN")
)

# https://docs.databricks.com/en/sql/language-manual/functions/variance.html

dbGetQuery(con, "Select variance(trip_distance) as var1 from samples.nyctaxi.trips where trip_distance > 20")
#>       var1
#> 1 3.369306

trips <- tbl(con, in_catalog("samples", "nyctaxi", "trips"))

trips %>% 
  filter(trip_distance > 20) %>% 
  summarise(var1 = var(trip_distance, na.rm = TRUE))
#> Error in `var()`:
#> ! `var()` is not available in this SQL variant.
#> Backtrace:
#>      ▆
#>   1. ├─base::tryCatch(...)
#>   2. │ └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>   3. │   ├─base (local) tryCatchOne(...)
#>   4. │   │ └─base (local) doTryCatch(return(expr), name, parentenv, handler)
#>   5. │   └─base (local) tryCatchList(expr, names[-nh], parentenv, handlers[-nh])
#>   6. │     └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>   7. │       └─base (local) doTryCatch(return(expr), name, parentenv, handler)
#>   8. ├─base::withCallingHandlers(...)
#>   9. ├─base::saveRDS(...)
#>  10. ├─base::do.call(...)
#>  11. ├─base (local) `<fn>`(...)
#>  12. ├─global `<fn>`(input = base::quote("hexed-tuna_reprex.R"))
#>  13. │ └─rmarkdown::render(input, quiet = TRUE, envir = globalenv(), encoding = "UTF-8")
#>  14. │   └─knitr::knit(knit_input, knit_output, envir = envir, quiet = quiet)
#>  15. │     └─knitr:::process_file(text, output)
#>  16. │       ├─base::withCallingHandlers(...)
#>  17. │       ├─base::withCallingHandlers(...)
#>  18. │       ├─knitr:::process_group(group)
#>  19. │       └─knitr:::process_group.block(group)
#>  20. │         └─knitr:::call_block(x)
#>  21. │           └─knitr:::block_exec(params)
#>  22. │             └─knitr:::eng_r(options)
#>  23. │               ├─knitr:::in_input_dir(...)
#>  24. │               │ └─knitr:::in_dir(input_dir(), expr)
#>  25. │               └─knitr (local) evaluate(...)
#>  26. │                 └─evaluate::evaluate(...)
#>  27. │                   └─evaluate:::evaluate_call(...)
#>  28. │                     ├─evaluate (local) handle(...)
#>  29. │                     │ └─base::try(f, silent = TRUE)
#>  30. │                     │   └─base::tryCatch(...)
#>  31. │                     │     └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>  32. │                     │       └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>  33. │                     │         └─base (local) doTryCatch(return(expr), name, parentenv, handler)
#>  34. │                     ├─base::withCallingHandlers(...)
#>  35. │                     ├─base::withVisible(value_fun(ev$value, ev$visible))
#>  36. │                     └─knitr (local) value_fun(ev$value, ev$visible)
#>  37. │                       └─knitr (local) fun(x, options = options)
#>  38. │                         ├─base::withVisible(knit_print(x, ...))
#>  39. │                         ├─knitr::knit_print(x, ...)
#>  40. │                         └─rmarkdown:::knit_print.tbl_sql(x, ...)
#>  41. │                           ├─context$df_print(x)
#>  42. │                           └─dbplyr:::print.tbl_sql(x)
#>  43. │                             ├─dbplyr:::cat_line(format(x, ..., n = n, width = width, n_extra = n_extra))
#>  44. │                             │ ├─base::cat(paste0(..., "\n"), sep = "")
#>  45. │                             │ └─base::paste0(..., "\n")
#>  46. │                             ├─base::format(x, ..., n = n, width = width, n_extra = n_extra)
#>  47. │                             └─pillar:::format.tbl(x, ..., n = n, width = width, n_extra = n_extra)
#>  48. │                               └─pillar:::format_tbl(...)
#>  49. │                                 └─pillar::tbl_format_setup(...)
#>  50. │                                   ├─pillar:::tbl_format_setup_dispatch(...)
#>  51. │                                   └─pillar:::tbl_format_setup.tbl(...)
#>  52. │                                     └─pillar:::df_head(x, n + 1)
#>  53. │                                       ├─base::as.data.frame(head(x, n))
#>  54. │                                       └─dbplyr:::as.data.frame.tbl_sql(head(x, n))
#>  55. │                                         ├─base::as.data.frame(collect(x, n = n))
#>  56. │                                         ├─dplyr::collect(x, n = n)
#>  57. │                                         └─dbplyr:::collect.tbl_sql(x, n = n)
#>  58. │                                           ├─dbplyr::db_sql_render(x$src$con, x, cte = cte)
#>  59. │                                           └─dbplyr:::db_sql_render.DBIConnection(x$src$con, x, cte = cte)
#>  60. │                                             ├─dbplyr::sql_render(sql, con = con, ..., cte = cte)
#>  61. │                                             └─dbplyr:::sql_render.tbl_lazy(sql, con = con, ..., cte = cte)
#>  62. │                                               ├─dbplyr::sql_render(...)
#>  63. │                                               └─dbplyr:::sql_render.lazy_query(...)
#>  64. │                                                 ├─dbplyr::sql_build(query, con = con, ...)
#>  65. │                                                 └─dbplyr:::sql_build.lazy_select_query(query, con = con, ...)
#>  66. │                                                   └─dbplyr:::get_select_sql(...)
#>  67. │                                                     └─dbplyr::translate_sql_(select_expr, con, window = FALSE, context = list(clause = "SELECT"))
#>  68. │                                                       └─base::lapply(...)
#>  69. │                                                         └─dbplyr (local) FUN(X[[i]], ...)
#>  70. │                                                           ├─dbplyr::escape(eval_tidy(x, mask), con = con)
#>  71. │                                                           └─rlang::eval_tidy(x, mask)
#>  72. └─dbplyr (local) var(trip_distance, na.rm = TRUE)
#>  73.   └─cli::cli_abort("{.fun {f}} is not available in this SQL variant.")
#>  74.     └─rlang::abort(...)

# Copying data does not work

tbl_mtcars <- copy_to(con, mtcars)
#> Error in eval(expr, envir, enclos): nanodbc/nanodbc.cpp:1296: 00000: [Simba][ODBC] (11470) Transactions are not supported.

dbDisconnect(con)

Created on 2023-10-24 with reprex v2.0.2

@edgararuiz
Copy link
Collaborator Author

con class is:

> class(con)
[1] "Spark SQL"
attr(,"package")
[1] ".GlobalEnv"

@edgararuiz
Copy link
Collaborator Author

Adding this two links I included in the issue for odbc:

@hadley
Copy link
Member

hadley commented Oct 24, 2023

For copy_to(temporary = TRUE) support need to save copy_inline() into a temporary view.

hadley added a commit that referenced this issue Oct 24, 2023
hadley added a commit that referenced this issue Oct 25, 2023
@kmishra9
Copy link

First off, thanks -- super helpful in a lot of ways.

Do we anticipate never being able to upload data to databricks in a permanent way (i.e. when setting temporary = FALSE in copy_to?). DBI::dbWriteTable didn't look like it supported this either, which I assume is the source reason.

And then something I just noticed is that it wasn't possible to pass a name for a temporary view using catalogs with this just yet either, in case that's something worth addressing

small_df_ref <-
    copy_to(
        dest = db,
        df = nycflights13::airlines,
        name = in_catalog(
            catalog = 'predictive_rewrite',
            schema = 'test',
            table = 'small_df_airlines'
        )
    )

=>

[TEMP_VIEW_NAME_TOO_MANY_NAME_PARTS] CREATE TEMPORARY VIEW or the corresponding Dataset APIs only accept single-part view names, but got: `predictive_rewrite`.`test`.`small_df_airlines`.(line 1, pos 0)

@hadley
Copy link
Member

hadley commented Nov 28, 2023

@kmishra9 can you please file a new issue? You got lucky this time, but generally you shouldn't expect folks to be reading closed issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants