check for missing records #400

Closed
wibeasley opened this issue Jul 21, 2022 · 4 comments

wibeasley commented Jul 21, 2022

In a conversation w/ @TimMonahan, we found a scenario where something prevented records from being returned. It looks like it's happening in REDCap's API PHP code, as if something is choking on large sparse datasets. There's a relatively simple way to at least detect it in REDCapR: make sure that every id returned by initial_call has 1+ records in ds_stacked.

REDCapR/R/redcap-read.R

Lines 266 to 281 in e5994dc

initial_call <- REDCapR::redcap_read_oneshot(
  redcap_uri                 = redcap_uri,
  token                      = token,
  records_collapsed          = records_collapsed,
  fields_collapsed           = metadata$data$field_name[id_position],
  forms_collapsed            = forms_collapsed,
  events_collapsed           = events_collapsed,
  filter_logic               = filter_logic,
  datetime_range_begin       = datetime_range_begin,
  datetime_range_end         = datetime_range_end,
  guess_type                 = guess_type,
  http_response_encoding     = http_response_encoding,
  locale                     = locale,
  verbose                    = verbose,
  config_options             = config_options
)

ds_stacked <- as.data.frame(dplyr::bind_rows(lst_batch))

@wibeasley wibeasley self-assigned this Jul 21, 2022

wibeasley commented Jul 21, 2022

@TimMonahan, here's a new version we can try with your dataset.

The error message could be better. I can add more info once we discover why the records aren't being exported from the server.

Warning message:
There are 43 subject(s) that are missing rows in the final dataset.
Check for funny values that could trip up REDCap's PHP code:
1; 2; 3; 4; 5; 6; 7; 8; 9; 10; 11; 12; 13; 14; 15; 16; 17; 18; 19; 20; 21; 22; 23; 24; 25; 26; 27; 28; 29; 30; 31; 32; 33; 34; 35; 36; 37; 38; 39; 40; 41; 42; 43.

For now, I'm throwing a warning instead of a full error.

  unique_ids_actual <- sort(unique(ds_stacked[[id_position]]))
  ids_missing_rows  <- setdiff(unique_ids, unique_ids_actual)

  if (0L < length(ids_missing_rows)) {
    warning(sprintf(
      "There are %i subject(s) that are missing rows in the final dataset.\nCheck for funny values that could trip up REDCap's PHP code:\n%s.",
      length(ids_missing_rows),
      paste(ids_missing_rows, collapse="; ")
    ))
  }
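
The same check can be tried on toy data without a REDCap server. This is only a sketch: the helper name `find_ids_missing_rows()` and the two toy objects are made up for illustration and are not part of REDCapR.

```r
# Hypothetical helper (not part of REDCapR): return the ids that were
# expected from the id-only call but have no rows in the stacked batches.
find_ids_missing_rows <- function(ids_expected, ds_stacked, id_name) {
  ids_actual <- unique(ds_stacked[[id_name]])
  setdiff(unique(ids_expected), ids_actual)
}

# Toy stand-in for the id-only initial call: five subjects are expected.
ids_expected <- 1:5

# Toy stand-in for the stacked batches: subjects 2 and 4 were silently dropped.
ds_stacked <- data.frame(
  record_id = c(1L, 1L, 3L, 5L),
  score     = c(10, 11, 30, 50)
)

ids_missing_rows <- find_ids_missing_rows(ids_expected, ds_stacked, "record_id")

if (0L < length(ids_missing_rows)) {
  message(sprintf(
    "%i subject(s) are missing rows: %s",
    length(ids_missing_rows),
    paste(ids_missing_rows, collapse = "; ")
  ))
}
```

Comparing against the ids from a separate id-only call is what makes the check work: the full response on its own gives no hint that rows were dropped.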

To install the dev version: remotes::install_github("OuhscBbmc/REDCapR", ref = "dev")

wibeasley added a commit that referenced this issue Jul 21, 2022
wibeasley added a commit that referenced this issue Jul 22, 2022
loosely associated w/ #400

wibeasley commented Jul 22, 2022

Advice to people browsing this issue: if you suspect you're losing data through the API, request fewer rows and columns. Ditch any columns you don't need, and reduce batch_size until the errors stop. (A version I'm about to pull into main will throw errors if it detects missing records. But right now redcap_read_oneshot() doesn't have a way to double-check.)
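
One way to automate "reduce batch_size until the errors stop" is sketched below. The helper `read_with_shrinking_batches()` and the fake reader are hypothetical (not REDCapR functions), and the sketch assumes the reading function signals an error when it detects missing records.

```r
# Hypothetical helper (not part of REDCapR): try progressively smaller batch
# sizes until `read_fn` succeeds. `read_fn` must accept `batch_size` and
# signal an error (e.g., via stop()) when records are missing.
read_with_shrinking_batches <- function(read_fn, batch_sizes = c(100L, 50L, 25L, 10L)) {
  for (batch_size in batch_sizes) {
    result <- tryCatch(
      read_fn(batch_size = batch_size),
      error = function(e) NULL  # Treat any error as "try a smaller batch".
    )
    if (!is.null(result)) {
      return(list(data = result, batch_size = batch_size))
    }
  }
  stop("Every batch size failed; try requesting fewer fields or forms.")
}

# Fake reader for demonstration: pretends the server chokes above 25 records.
fake_read <- function(batch_size) {
  if (25L < batch_size) stop("missing records")
  data.frame(record_id = seq_len(3L))
}

outcome <- read_with_shrinking_batches(fake_read)
outcome$batch_size
```

With a real project, `read_fn` could be a small wrapper around `REDCapR::redcap_read()` that passes a trimmed `fields` vector along with `batch_size` and returns the `$data` element, assuming the installed version stops when it detects missing records.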

If anyone wants to try themselves, the url is "https://bbmc.ouhsc.edu/redcap/api/" and the token (to this PHI-free dataset) is "5C1526186C4D04AE0A0630743E69B53C".


@TimMonahan, a little follow-up to yesterday's session: I believe I have isolated the originating problem to the PHP side. I was exploring similar scenarios on my own with large datasets (including the super wide 3 dataset). An incomplete dataset is returned in the playground and Postman, even though both also return a 200 http status code.

I've upgraded REDCapR's warning to an error/stop. I'll also add advice to the message about decreasing the batch size or dropping some forms/events.

I got completely upstream of R/httr/readr/REDCapR, and used Postman & bash. I made four identical calls, saved the responses to files "raw-text-postman-2.csv" ... "raw-text-postman-5.csv", and then looked at the first 100 characters in each file. They weren't consistent, and all of them were missing the first 20,000+ variables, which is bad. But there was a repeating pattern: some started at variable_30253 and some started at variable_25495.

When I use the DataExport feature in the browser, the file starts with the initial variables, which is good.

Back to Postman: when I ask for only record_id and variable_00002, it never works; it returns only record_id. But when I ask for record_id and variable_25495, it works half the time (and the other half it returns only record_id).
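
Because the server answers 200 either way, one possible client-side check is to compare the requested fields against the header of the returned CSV. The sketch below is illustrative only: `fields_not_returned()` is a hypothetical helper, the two responses are toy strings modeled on the outcomes described above, and it assumes a simple unquoted header line.

```r
# Hypothetical helper: given the raw CSV text of a response and the fields
# that were requested, return the requested fields absent from the header.
# Checkbox fields are exported as `name___1`, `name___2`, ..., so a requested
# field also counts as present when a suffixed column exists.
fields_not_returned <- function(raw_text, fields_requested) {
  header_line <- strsplit(raw_text, "\n", fixed = TRUE)[[1]][1]
  columns     <- strsplit(header_line, ",", fixed = TRUE)[[1]]
  columns     <- sub("___.*$", "", columns)  # Strip checkbox suffixes.
  setdiff(fields_requested, columns)
}

# Toy responses modeled on the two outcomes described above.
raw_good <- "record_id,variable_25495\n1,a\n2,b"
raw_bad  <- "record_id\n1\n2"

fields_not_returned(raw_good, c("record_id", "variable_25495"))  # character(0)
fields_not_returned(raw_bad,  c("record_id", "variable_25495"))  # "variable_25495"
```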

Before I post something to the REDCap Community site, is there any other diagnostic information that would be helpful for people to know? Any holes/flaws in this examination?

Snippet of first 100 characters from four identical Postman calls:

$ head -c 100 raw-text-postman-2.csv
# record_id,variable_30253,variable_30254,variable_30255___1,variable_30255___2,variable_30255___3,var
$ head -c 100 raw-text-postman-3.csv
# record_id,variable_25495,variable_25496,variable_25497___1,variable_25497___2,variable_25497___3,var
$ head -c 100 raw-text-postman-4.csv
# record_id,variable_30253,variable_30254,variable_30255___1,variable_30255___2,variable_30255___3,var
$ head -c 100 raw-text-postman-5.csv
# record_id,variable_25495,variable_25496,variable_25497___1,variable_25497___2,variable_25497___3,var
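
For anyone who prefers to stay in R, a rough equivalent of the `head -c 100` comparison is sketched here. The files are toy stand-ins written to a temporary directory with shortened headers; in practice `paths` would point at the saved Postman responses. Note that `substr()` counts characters on the first line, whereas `head -c` counts bytes.

```r
# Write toy stand-ins for the saved responses (headers shortened from the
# thread). In practice these would be the files saved from Postman.
dir_demo <- tempfile("postman-")
dir.create(dir_demo)
headers <- c(
  "record_id,variable_30253,variable_30254",
  "record_id,variable_25495,variable_25496",
  "record_id,variable_30253,variable_30254",
  "record_id,variable_25495,variable_25496"
)
paths <- file.path(dir_demo, sprintf("raw-text-postman-%i.csv", 2:5))
for (i in seq_along(paths)) writeLines(headers[i], paths[i])

# Read the first line of each file and keep its first 100 characters.
first_100 <- vapply(
  paths,
  function(p) substr(readLines(p, n = 1L), 1L, 100L),
  character(1)
)

# Tabulate the distinct openings; identical calls should give exactly one.
openings <- table(unname(first_100))
responses_consistent <- length(openings) == 1L
```

Here `responses_consistent` comes out `FALSE` because the toy files alternate between two different openings, mirroring the pattern in the Postman output.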

Snippet of first 100 characters of REDCap export:

$ head -c 100 export-1.csv
# record_id,variable_00002,variable_00003___1,variable_00003___2,variable_00003___3,variable_00004,var

Screenshot of Postman call:
image

Similar alternating pattern I saw earlier inside kernel-api.R. I think you reported seeing a similar alternation with your dataset yesterday.

Browse[2]> substr(raw_text, 1, 100)
[1] "record_id,variable_25495,variable_25496,variable_25497___1,variable_25497___2,variable_25497___3,var"

image

Asking for variable_25495, which works half of the time:
image

wibeasley added a commit that referenced this issue Jul 22, 2022

wibeasley commented Jul 22, 2022

@TimMonahan, Anything you'd change about this error message? (And I'm submitting the previous post to the Community site. EDIT: https://community.projectredcap.org/questions/131318/api-returns-200-ok-despite-returning-an-incomplete.html)

Error: There are 32 subject(s) that are missing rows in the returned dataset. REDCap's PHP code is likely trying to process too much text in one bite.

Common solutions to this problem are:
  - specifying only the records you need (w/ `records`)
  - specifying only the fields you need (w/ `fields`)
  - specifying only the forms you need (w/ `forms`)
  - specifying a subset w/ `filter_logic`
  - reducing `batch_size`

The missing ids are:
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32.

wibeasley added a commit that referenced this issue Jul 22, 2022

TimMonahan commented Jul 22, 2022 via email
