Unexpected Behavior: Empty dataframe when reading from project with special characters #354

Closed · lrasmus opened this issue Sep 30, 2021 · 2 comments
Labels: nonascii (accommodate non-ASCII characters)


lrasmus commented Sep 30, 2021

We have loaded data that contains non-ASCII characters. When calling read(), the data frame comes back empty:

project <- redcap_project$new(redcap_uri = uri, token = token)

# 'title' and some other fields cannot be read due to encoding issues; read only abstract and full-text review information
df <- project$read(fields = c(....))

I have traced this back: httr::POST correctly returns data from the API, but the call to httr::content uses the default UTF-8 encoding.
The conversion fails, and the data frame is returned empty.
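
For illustration, a minimal sketch of the diagnosis (the form-encoded body and the Latin-1 guess are assumptions about our project, not REDCapR internals):

# The raw bytes come back from the API intact.
response <- httr::POST(url = uri, body = list(token = token, content = "record", format = "csv"), encode = "form")

# Decoding with the hard-coded UTF-8 default fails on the non-ASCII bytes.
httr::content(response, as = "text", encoding = "UTF-8")

# Naming the actual encoding explicitly (Latin-1 here, as a guess) succeeds.
httr::content(response, as = "text", encoding = "latin1")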

The current read calls don't appear to allow overriding the encoding. I'm happy to propose a fix and submit a PR, but I was wondering whether we want the user to always set the encoding explicitly, or whether the package should try to detect the encoding using something like the uchardet package:

detect_raw_enc(response$content)
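
Roughly, detection and decoding could combine like this (assuming the uchardet package is installed):

# Detect the encoding from the raw response bytes, then decode with it.
enc  <- uchardet::detect_raw_enc(response$content)  # e.g., "WINDOWS-1252"
text <- httr::content(response, as = "text", encoding = enc)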
@wibeasley wibeasley self-assigned this Sep 30, 2021
@wibeasley (Member) commented:

@lrasmus, thanks for digging into the problem.

  1. I completely agree with your dissection: redcap_read_oneshot() (and redcap_read(), which calls it) could accept an encoding argument and pass it to the part of kernel_api() that you highlighted. If this enhancement would help you, I'd love to accept a PR.
  2. Please send me a CSV that I can upload to a new test project for the REDCapR unit tests.
  3. So far, I haven't really encountered any non-ASCII values (at least none that were intentionally non-ASCII). What do you (and anyone else) think about allowing automatic detection of the encoding? Do you think it's an important capability (instead of forcing people to decide & specify each time they want something non-UTF-8)?
  4. If so, my initial preference (see the sketch after this list) is:
    1. to allow redcap_read_oneshot() to pass through an encoding value (i.e., the PR you described above),
    2. to keep the default at "UTF-8" (I'm guessing the encoding can sometimes be detected falsely, so I'd prefer to be locked in by default), and
    3. when the new encoding argument is explicitly set to NULL, to let uchardet decide. Where would this happen? Inside the kernel_api() function?
  5. I'm always reluctant to increase dependencies, but currently, the uchardet package would add only two dependencies: uchardet and indirectly tinytest (which doesn't have any dependencies outside parallel & utils, which install with R). All the other uchardet dependencies are existing REDCapR dependencies.
  6. Does anyone have ideas or experience for the unit tests? I experimented a little a few years ago, and got derailed because the same REDCap project represented Russian characters differently, depending on whether the client was Linux or Windows. I never came back to it. @teigentler reported similar discrepancies with German characters in Special (German) characters #296.
  7. This doesn't seem like something that should require probabilistic detection. Is encoding a setting in REDCap, or even in MySQL? If so, maybe a new API function should be added to REDCap so it reports the encoding (which is then used for subsequent executions of redcap_read_oneshot()). It should be the encoding of the column redcap_data.value. I saw some chatter on the forums about recent changes to MySQL & MariaDB & REDCap.
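
A rough sketch of items 4.1-4.3 (the signature below is illustrative, not REDCapR's actual internals):

# Illustrative only: argument names here are assumptions.
kernel_api <- function(redcap_uri, post_body, encoding = "UTF-8") {
  response <- httr::POST(url = redcap_uri, body = post_body, encode = "form")

  # NULL means "let uchardet decide"; any other value is trusted as-is.
  if (is.null(encoding)) {
    encoding <- uchardet::detect_raw_enc(response$content)
  }

  httr::content(response, as = "text", encoding = encoding)
}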

@wibeasley (Member) commented:

@lrasmus, I implemented your idea and added the http_response_encoding parameter to redcap_read() and redcap_read_oneshot().

Judging from the lack of responses, I'm guessing you and others don't feel too strongly about the need for an automatic encoding-detection mechanism, so I went with the simpler route that doesn't require another package dependency.

If you or anyone else would like to open a new issue (and reference this one), I'm happy to revisit it. If you do, please send me a few simple-ish dictionaries & PHI-free datasets that can be included in the test suite.
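
For reference, a usage sketch of the new parameter (the "latin1" value is an assumption about the project's data):

# Pass the response encoding through to httr::content().
result <- REDCapR::redcap_read(
  redcap_uri             = uri,
  token                  = token,
  http_response_encoding = "latin1"
)
df <- result$data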
