Check that functions work with invalid Unicode strings #156
- Fix `dbQuoteString()` and `dbQuoteIdentifier()` to ignore invalid UTF-8 strings (r-dbi/DBItest#156).
- Fix `dbQuoteIdentifier()` to ignore invalid UTF-8 strings (r-dbi/DBItest#156).
@krlmlr What kind of failures in particular? |
revdepcheck was aborting when checking the testthat package with the error shown in the reprex above. (Not a failure in a downstream package, but in revdepcheck.) |
utf8_sanitize <- function(x) {
  ok <- utf8::utf8_valid(x)
  x[ok] <- utf8::as_utf8(x[ok])
  Encoding(x[!ok]) <- "bytes"
  x
}
xx <- gsub("'", "''", utf8_sanitize(x), fixed = TRUE)
I'll put something like this in the next release of |
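The same idea can be sketched without the utf8 package. This is only an illustration: `sanitize_base()` and `quote_string()` are hypothetical names, not DBI functions, and base R's `validUTF8()` stands in for `utf8::utf8_valid()` on the assumption that the two agree for the inputs of interest.

```r
# Hypothetical base-R variant of utf8_sanitize(); not part of DBI or utf8.
sanitize_base <- function(x) {
  ok <- validUTF8(x)              # TRUE where the bytes form valid UTF-8
  x[ok] <- enc2utf8(x[ok])        # valid strings are converted to UTF-8
  Encoding(x[!ok]) <- "bytes"     # invalid strings are marked as raw bytes
  x
}

# Hypothetical quoting helper: double embedded single quotes, then wrap.
quote_string <- function(x) {
  escaped <- gsub("'", "''", sanitize_base(x), fixed = TRUE)
  paste0("'", escaped, "'")
}

# An invalid byte sequence that is wrongly marked as UTF-8.
bad <- rawToChar(as.raw(c(0x66, 0xff, 0x6f)))
Encoding(bad) <- "UTF-8"

quote_string(c("it's", bad))      # no error; the second element is passed through as bytes
```

Marking the invalid elements as "bytes" makes `gsub()` operate bytewise instead of rejecting the input, which is the behaviour described as "ignore invalid UTF-8 strings".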
( |
If the user somehow managed to produce an invalid UTF-8 string, I don't think it's the job of
|
Agreed. If you get an invalid string, then that's probably an error that should be fixed upstream, either by using encoding "bytes", or just creating a valid string. |
Do you also agree that DBI should pass that verbatim to the database? |
You mean a string with an invalid encoding? Maybe it would make sense to throw an error? As databases can behave differently in this case, I would not rely on them. |
As I understand this is already happening, and you get an error because the |
We should work with UTF-8 internally. We rely on the |
I think DBI should error on that. The "bytes" encoding can still be used to pass a string to the DB, as is, right? |
Not if we're always calling
Encoding(enc2utf8(rawToChar(as.raw(runif(100) * 255 + 1))))
#> [1] "UTF-8"
|
On the other hand:
Encoding(enc2utf8("Encoding<-"(rawToChar(as.raw(runif(100) * 255 + 1)), "bytes")))
#> [1] "bytes"
So maybe we should support bytes and check UTF-8 sanity? |
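A minimal sketch of what "support bytes and check UTF-8 sanity" could look like, assuming the policy is to raise an error on invalid input rather than pass it to the database. `check_utf8()` is a hypothetical helper, not an existing DBI function; it relies only on base R's `Encoding()`, `enc2utf8()` and `validUTF8()`.

```r
# Hypothetical helper, not part of DBI: convert to UTF-8, leave strings
# marked as "bytes" untouched, and raise an error for anything else that
# is not valid UTF-8.
check_utf8 <- function(x) {
  is_bytes <- Encoding(x) == "bytes"
  out <- x
  out[!is_bytes] <- enc2utf8(x[!is_bytes])
  bad <- !is_bytes & !is.na(x) & !validUTF8(out)
  if (any(bad)) {
    stop(
      "Invalid UTF-8 in element(s): ",
      paste(which(bad), collapse = ", "),
      call. = FALSE
    )
  }
  out
}

# An invalid byte sequence, once wrongly marked as UTF-8, once marked as bytes.
bad <- rawToChar(as.raw(c(0x66, 0xff, 0x6f)))
Encoding(bad) <- "UTF-8"
raw_ok <- bad
Encoding(raw_ok) <- "bytes"

check_utf8(c("abc", raw_ok))   # accepted: the "bytes" element is passed as is
tryCatch(check_utf8(bad), error = function(e) conditionMessage(e))
```

Under this policy the caller opts in to raw pass-through explicitly by setting the encoding to "bytes"; everything else is guaranteed to be valid UTF-8 before it reaches the backend.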
Converting to UTF-8 is fine, and the right thing to do imo. As I see the questions are
I would say
Looks like |
in particular `dbQuoteString()`:
(Also, results should always be in UTF-8.)
@gaborcsardi @hadley: I believe this is the reason for revdepcheck failures.
@patperry: Do you have a workaround for this in utf8?