Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check that functions work with invalid Unicode strings #156

Open
krlmlr opened this issue Nov 8, 2017 · 17 comments
Open

Check that functions work with invalid Unicode strings #156

krlmlr opened this issue Nov 8, 2017 · 17 comments

Comments

@krlmlr
Copy link
Member

krlmlr commented Nov 8, 2017

in particular dbQuoteString():

set.seed(20171108)
x <- rawToChar(as.raw(runif(100) * 255+1))
xx <- gsub("'", "''", x)
xx <- gsub("'", "''", x, fixed = TRUE, useBytes = TRUE)
xx <- gsub("'", "''", x, fixed = TRUE)
#> Error in gsub("'", "''", x, fixed = TRUE): input string 1 is invalid in this locale

(Also, results should always be in UTF-8.)

@gaborcsardi @hadley: I believe this is the reason for revdepcheck failures.

@patperry: Do you have a workaround for this in utf8?

krlmlr added a commit to r-dbi/DBI that referenced this issue Nov 8, 2017
- Fix `dbQuoteString()` and `dbQuoteIdentifier()` to ignore invalid UTF-8 strings (r-dbi/DBItest#156).
krlmlr added a commit to r-dbi/RSQLite that referenced this issue Nov 8, 2017
- Fix `dbQuoteIdentifier()` to ignore invalid UTF-8 strings (r-dbi/DBItest#156).
@gaborcsardi
Copy link

@krlmlr What kind of failures in particular?

@krlmlr
Copy link
Member Author

krlmlr commented Nov 8, 2017

revdepcheck was aborting when checking the testthat package with the error shown in the reprex above. (Not a failure in a downstream package, but in revdepcheck.)

@gaborcsardi

This comment has been minimized.

@krlmlr

This comment has been minimized.

@gaborcsardi

This comment has been minimized.

@patperry
Copy link

patperry commented Nov 8, 2017

utf8::utf8_format and utf8::utf8_encode do the equivalent of

utf8_sanitize <- function(x) {
    ok <- utf8::utf8_valid(x)
    x[ok] <- utf8::as_utf8(x[ok])
    Encoding(x[!ok]) <- "bytes"
    x
}
xx <- gsub("'", "''", utf8_sanitize(x), fixed = TRUE)

I'll put something like this in the next release of utf8. (Thanks for the feature request!)

@patperry
Copy link

patperry commented Nov 8, 2017

(utf8_sanitize does what enc2utf8 is supposed to do, but the latter is buggy on Windows. See, e.g. this thread on r-devel: https://stat.ethz.ch/pipermail/r-devel/2017-September/074908.html )

@krlmlr
Copy link
Member Author

krlmlr commented Nov 9, 2017

If the user somehow managed to produce an invalid UTF-8 string, I don't think it's the job of dbQuoteString() to fix this. Also, I'd rather avoid adding dependencies to the DBI package, but backends are free to do so.

gsub(fixed = TRUE, useBytes = TRUE) does change the encoding of strings with invalid UTF-8, but plain gsub() does not.

@gaborcsardi
Copy link

Agreed. If you get an invalid string, then that's probably an error that should be fixed upstream, either by using encoding "bytes", or just creating a valid string.

@krlmlr
Copy link
Member Author

krlmlr commented Nov 9, 2017

Do you also agree that DBI should pass that verbatim to the database?

@gaborcsardi
Copy link

You mean a string with an invalid encoding? Maybe it would make sense to throw an error? As databases can behave differently in this case, I would not rely on them.

@gaborcsardi
Copy link

As I understand this is already happening, and you get an error because the gsub fails. Right?

@krlmlr
Copy link
Member Author

krlmlr commented Nov 9, 2017

We should work with UTF-8 internally. We rely on the Encoding() and on being able to get UTF-8 strings via enc2utf8(). But what if the user passes an invalid UTF-8 string? Do we pass that verbatim to the DB driver? Apparently SQLite is fine, other databases might be not.

@gaborcsardi
Copy link

But what if the user passes an invalid UTF-8 string?

I think DBI should error on that. The "bytes" encoding can still be used to pass a string to the DB, as is, right?

@krlmlr
Copy link
Member Author

krlmlr commented Nov 9, 2017

Not if we're always calling enc2utf8(): here, the "bytes" encoding becomes "UTF-8". I think we need to call enc2utf8() to align with the encoding of the database connection, because (especially on Windows) users might pass strings in the system encoding which is not UTF-8 there.

Encoding(enc2utf8(rawToChar(as.raw(runif(100) * 255 + 1))))
#> [1] "UTF-8"

@krlmlr
Copy link
Member Author

krlmlr commented Nov 9, 2017

On the other hand:

Encoding(enc2utf8("Encoding<-"(rawToChar(as.raw(runif(100) * 255+1)), "bytes")))
#> [1] "bytes"

So maybe we should support bytes and check UTF-8 sanity?

@gaborcsardi
Copy link

Converting to UTF-8 is fine, and the right thing to do imo. As I see the questions are

  1. what to do with "bytes" encoding?
  2. whether to check UTF-8 validity and error if invalid?
  3. what to do with "unknown" encoding?

I would say

  1. pass it through (maybe)
  2. yes, after enc2utf8 probably.
  3. enc2utf8 assumes native encoding for "unknown", AFAICT, so that is fine, but then we need to check if the result is valid.

Looks like iconv(x, "UTF-8", "UTF-8") can be used to check for valid UTF-8.

krlmlr added a commit to r-dbi/DBI that referenced this issue Nov 27, 2017
- The deprecated `print.list.pairs()` has been removed.
- Fix `dbDataType()` for `AsIs` object (#198, @yutannihilation).
- Point to db.rstudio.com (@wibeasley, #209).
- Reflect new 'r-dbi' organization in `DESCRIPTION` (@wibeasley, #207).
- Using switchpatch on the second argument for default implementations of `dbQuoteString()` and `dbQuoteIdentifier()`.
- New `dbQuoteLiteral()` generic. The default implementation uses switchpatch to avoid dispatch ambiguities, and forwards to `dbQuoteString()` for character vectors. Backends may override methods that also dispatch on the second argument, but in this case also an override for the `"SQL"` class is necessary (#172).
- Fix `dbQuoteString()` and `dbQuoteIdentifier()` to ignore invalid UTF-8 strings (r-dbi/DBItest#156).
krlmlr added a commit to r-dbi/DBI that referenced this issue Mar 9, 2018
…ps the names from the output if the `names` argument is unset. - The `dbReadTable()`, `dbWriteTable()`, `dbExistsTable()`, `dbRemoveTable()`, and `dbListFields()` generics now specialize over the first two arguments to support implementations with the `Id` S4 class as type for the second argument. Some packages may need to update their documentation to satisfy R CMD check again. New generics ------------ - Schema support: Export `Id()`, new generics `dbListObjects()` and `dbUnquoteIdentifier()`, methods for `Id` that call `dbQuoteIdentifier()` and then forward (#220). - New `dbQuoteLiteral()` generic. The default implementation uses switchpatch to avoid dispatch ambiguities, and forwards to `dbQuoteString()` for character vectors. Backends may override methods that also dispatch on the second argument, but in this case also an override for the `"SQL"` class is necessary (#172). Default implementations ----------------------- - Default implementations of `dbQuoteIdentifier()` and `dbQuoteLiteral()` preserve names, default implementation of `dbQuoteString()` strips names (#173). - Specialized methods for `dbQuoteString()` and `dbQuoteIdentifier()` are available again, for compatibility with clients that use `getMethod()` to access them (#218). - Add default implementation of `dbListFields()`. - The default implementation of `dbReadTable()` now has `row.names = FALSE` as default and also supports `row.names = NULL` (#186). API changes ----------- - The `SQL()` function gains an optional `names` argument which can be used to assign names to SQL strings. Deprecated generics ------------------- - `dbListConnections()` is soft-deprecated by documentation. - `dbListResults()` is deprecated by documentation (#58). - `dbGetException()` is soft-deprecated by documentation (#51). - The deprecated `print.list.pairs()` has been removed. Bug fixes --------- - Fix `dbDataType()` for `AsIs` object (#198, @yutannihilation). - Fix `dbQuoteString()` and `dbQuoteIdentifier()` to ignore invalid UTF-8 strings (r-dbi/DBItest#156). Documentation ------------- - Help pages for generics now contain a dynamic list of methods implemented by DBI backends (#162). - `sqlInterpolate()` now supports both named and positional variables (#216, @hannesmuehleisen). - Point to db.rstudio.com (@wibeasley, #209). - Reflect new 'r-dbi' organization in `DESCRIPTION` (@wibeasley, #207). Internal -------- - Using switchpatch on the second argument for default implementations of `dbQuoteString()` and `dbQuoteIdentifier()`.
krlmlr added a commit to r-dbi/RSQLite that referenced this issue Mar 25, 2018
- Values of class `"integer64"` are now supported for `dbWriteTable()` and `dbBind()` (#243).
- New connections now automatically load default RSQLite extensions (#236).
- Implement `dbUnquoteIdentifier()`.
- Update bundled sqlite3 library to 3.22 (#252).
- Names in the `x` argument to `dbQuoteIdentifier()` are preserved in the output (r-lib/DBI#173).
- Fix rchk warnings on CRAN (#250).
- `dbRowsAffected()` and `dbExecute()` return zero after a `DROP TABLE` statement, and not the number of rows affected by the last `INSERT`, `UPDATE`, or `DELETE` (#238).
- Refactor connection and result handling to be more similar to other backends.
- Fix `dbQuoteIdentifier()` to ignore invalid UTF-8 strings (r-dbi/DBItest#156).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants