Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add support for Microsoft SQL Server (MSSQL) #77

Merged
merged 31 commits into from
Jan 10, 2024

Conversation

marcusmunch
Copy link
Collaborator

@marcusmunch marcusmunch commented Jan 3, 2024

Intent

This PR adds support for Microsoft SQL Server (MSSQL).

Approach

I have been careful to add functionality to existing S3 methods. Simultaneously, some of these have been "promoted" from default methods to methods for DBIConnection objects as this is more specific while retaining use.

Testing has been enabled by treating ODBC connections a little more specifically, as these cannot be made with no arguments and it is no secret that the test suite is due for an overhaul. If one wishes to test ODBC backends, they can be defined in an environment variable SCDB_ODBC_JSON, which is a text string parseable by jsonlite::fromJSON. This can be set in the user's Renviron file1 and (in a good way) forgotten about 🙂

MSSQL seems to have revealed a potential issue in how dbplyr handles temporary tables, as it fails to overwrite a temporary table even if dplyr::copy_to(overwrite=TRUE) is given.2

Known issues

db_joins.R currently does not pass tests, but as these functions are next for a more or less complete S3 makeover, I consider them outside the scope of the PR.

Checklist

  • The PR passes all local unit tests (except one, see "Known issues")
  • I have documented any new features introduced
  • If the PR adds a new feature, please add an entry in NEWS.md
  • A reviewer is assigned to this PR

Footnotes

  1. Example line:
    SCDB_ODBC_JSON='{"MSSQL": {"driver": "SQL Server", "server": "localhost", "database": "my_MSSQL_db", "trusted_connection": "true"}}'

  2. See e.g tests/testthat/test-update_snapshot.R. Tables are temporary by default in dplyr::copy_to.

@marcusmunch marcusmunch added the enhancement New feature or request label Jan 3, 2024
@marcusmunch marcusmunch self-assigned this Jan 3, 2024
@RasmusSkytte
Copy link
Contributor

There seems to be some missing imports in DESCRIPTION:

See these messages from R-CMD-check:

❯ checking for unstated dependencies in ‘tests’ ... WARNING
  '::' or ':::' imports not declared from:
    ‘jsonlite’ ‘odbc’

@RasmusSkytte
Copy link
Contributor

The test-coverage workflows throws an error for digest_to_checksum

══ Failed tests ════════════════════════════════════════════════════════════════
── Failure ('test-db_manipulating_functions.R:121:3'): digest_to_checksum() works ──
checksums[1] == checksums[2] is not FALSE

`actual`:   TRUE 
`expected`: FALSE

@marcusmunch marcusmunch changed the title chore: Change defaults to DBIConnection feat: Add support for Microsoft SQL Server (MSSQL) Jan 8, 2024
@marcusmunch marcusmunch requested review from RasmusSkytte and removed request for RasmusSkytte January 8, 2024 15:08
@RasmusSkytte
Copy link
Contributor

Also, there are some flagged possible spelling mistakes:

< Potential spelling errors:
<   WORD     FOUND IN
< ODBC     NEWS.md:3
< funder   SCDB-package.Rd:31

@RasmusSkytte
Copy link
Contributor

And with all this hard work, you can also add some more to the NEWS.md :)

## Features

* Added support for Microsoft SQL Server using ODBC

## Minor Improvements and Fixes

* Implementation of `*_joins` improved and no longer masks `dplyr::*_joins`. 

Marcus Munch Grünewald added 2 commits January 10, 2024 09:27
(also sorted one existing entry)
Comment on lines 30 to 33
filter_keys <- function(.data, filters, by = NULL, na_by = NULL) {

# Check arguments
assert_data_like(.data)
assert_data_like(filters, null.ok = TRUE)
checkmate::assert_character(by, null.ok = TRUE)
checkmate::assert_character(na_by, null.ok = TRUE)

if (is.null(filters)) {
return(.data)
} else {
if (is.null(by) && is.null(na_by)) {
# Determine key types
key_types <- filters |>
dplyr::ungroup() |>
dplyr::summarise(dplyr::across(.cols = tidyselect::everything(), .fns = ~ any(is.na(.), na.rm = TRUE))) |>
tidyr::pivot_longer(tidyselect::everything(), names_to = "column_name", values_to = "is_na")

by <- key_types |> dplyr::filter(!.data$is_na) |> dplyr::pull("column_name")
na_by <- key_types |> dplyr::filter(.data$is_na) |> dplyr::pull("column_name")

if (length(by) == 0) by <- NULL
if (length(na_by) == 0) na_by <- NULL
}
return(inner_join(.data, filters, by = by, na_by = na_by))
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason all checkmate:: checks have ben removed?
I can see why the assert_data_like are redundant now that it is S3 method.

In fact, I would suggest adding a check that all columns are defined in by and na_by.
(since we say so in the description of the function)

checkmate::check_set_equal(c(by, na_by), colnames(filters))

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert_character(by) most likely disappeared when I removed assert_character(na_by) due to na_by no longer being defined in the function signature.

I reinstated the checks, but since it is possible to skip to the dplyr::inner_join call, but have opted for checkmate::check_subset instead of check_set_equal (which would require the user to specify all columns in by or na_by if any were given).

Comment on lines 50 to 81
)) |>
tidyr::pivot_longer(tidyselect::everything(), names_to = "column_name", values_to = "is_na")

by <- key_types |> dplyr::filter(.data$is_na > 0) |> dplyr::pull("column_name")
na_by <- key_types |> dplyr::filter(.data$is_na == 0) |> dplyr::pull("column_name")

if (length(by) == 0) by <- NULL
if (length(na_by) == 0) na_by <- NULL
}
return(dplyr::inner_join(.data, filters, by = by, na_by = na_by))
}

#' @export
filter_keys.data.frame <- function(.data, filters, by = NULL, ...) {
.dots <- list(...)

args <- list(
x = .data,
y = filters,
by = by
) |>
append(.dots)

if ("na_by" %in% names(args)) {
args$na_matches <- "na"
args$na_by <- NULL
}

if (is.null(by)) args$by <- colnames(filters)

return(do.call(dplyr::inner_join, args = args))
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be simplified a lot.

In my view, SCDB joins_* and by extension filter_keys are designed to mimic base R joins as much as possible. Since R by default joins NA with NA, we can simplify this function:

#' @export
filter_keys.data.frame <- function(.data, filters, by = NULL, na_by = NULL) {
  if (is.null(by) && is.null(na_by)) {
    by <- colnames(filters)
  }

  return(dplyr::inner_join(.data, filters, by = c(by, na_by), na_matches = "na"))
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

na_matches defaults to "na" for dplyr:::inner_join.data.frame, so I have now added an even more simplified version of your suggestion 😇

@marcusmunch marcusmunch merged commit 7ab7a6d into ssi-dk:main Jan 10, 2024
13 checks passed
@marcusmunch marcusmunch deleted the backend_mssql branch January 10, 2024 12:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants