check for missing records #400

wibeasley · 2022-07-21T20:06:35Z

In a conversation w/ @TimMonahan, we found a scenario where something prevented records from being returned. It looks like it's happening in REDCap's API PHP code, as if something is choking on large sparse datasets. There's a relatively simple way to at least detect it in REDCapR: make sure that every ID contained in initial_call has 1+ records in ds_stacked.

REDCapR/R/redcap-read.R

Lines 266 to 281 in e5994dc

    
    initial_call <- REDCapR::redcap_read_oneshot(
      redcap_uri             = redcap_uri,
      token                  = token,
      records_collapsed      = records_collapsed,
      fields_collapsed       = metadata$data$field_name[id_position],
      forms_collapsed        = forms_collapsed,
      events_collapsed       = events_collapsed,
      filter_logic           = filter_logic,
      datetime_range_begin   = datetime_range_begin,
      datetime_range_end     = datetime_range_end,
      guess_type             = guess_type,
      http_response_encoding = http_response_encoding,
      locale                 = locale,
      verbose                = verbose,
      config_options         = config_options
    )

REDCapR/R/redcap-read.R

Line 384 in e5994dc

ds_stacked <- as.data.frame(dplyr::bind_rows(lst_batch))


wibeasley · 2022-07-21T23:23:27Z

@TimMonahan, here's a new version we can try with your dataset.

The error message could be better. I can add more info when we discover the reason why they aren't being exported from the server.

Warning message:
There are 43 subject(s) that are missing rows in the final dataset.
Check for funny values that could trip up REDCap's PHP code:
1; 2; 3; 4; 5; 6; 7; 8; 9; 10; 11; 12; 13; 14; 15; 16; 17; 18; 19; 20; 21; 22; 23; 24; 25; 26; 27; 28; 29; 30; 31; 32; 33; 34; 35; 36; 37; 38; 39; 40; 41; 42; 43.

For now, I'm throwing a warning instead of a full error.

  unique_ids_actual <- sort(unique(ds_stacked[[id_position]]))
  ids_missing_rows  <- setdiff(unique_ids, unique_ids_actual)

  if (0L < length(ids_missing_rows)) {
    warning(sprintf(
      "There are %i subject(s) that are missing rows in the final dataset.\nCheck for funny values that could trip up REDCap's PHP code:\n%s.",
      length(ids_missing_rows),
      paste(ids_missing_rows, collapse="; ")
    ))
  }

To install the dev version: remotes::install_github("OuhscBbmc/REDCapR", ref = "dev")


wibeasley · 2022-07-22T18:31:06Z

Advice to people browsing this issue: if you suspect you're losing data through the API, request fewer rows and columns. Ditch any columns you don't need, and reduce batch_size until the errors stop. (A version I'm about to pull into main will throw errors if it detects missing records. But right now redcap_read_oneshot() doesn't have a way to double-check.)
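Until that check is available, one way to double-check a `redcap_read_oneshot()` result by hand is to make a second, narrow call that requests only the ID field, then compare the two sets of IDs. This is a minimal sketch, assuming the ID field is named `record_id`; `find_missing_ids()` is a hypothetical helper, not part of REDCapR, and the toy data frames stand in for the two API calls.

```r
# Hypothetical helper (not part of REDCapR): returns the IDs that appear in the
# narrow, ID-only pull but have no rows in the wide pull.
find_missing_ids <- function(ds_ids, ds_wide, id_name = "record_id") {
  stopifnot(id_name %in% names(ds_ids), id_name %in% names(ds_wide))
  ids_expected <- unique(ds_ids[[id_name]])
  ids_actual   <- unique(ds_wide[[id_name]])
  sort(setdiff(ids_expected, ids_actual))
}

# Toy data standing in for the two API calls (assumed shapes, not real output).
ds_ids  <- data.frame(record_id = 1:5)
ds_wide <- data.frame(
  record_id      = c(3L, 4L, 5L),
  variable_00002 = c("a", "b", "c")
)

ids_missing <- find_missing_ids(ds_ids, ds_wide)
ids_missing
```

With a real project, `ds_ids` could come from something like `REDCapR::redcap_read_oneshot(redcap_uri, token, fields = "record_id")$data` and `ds_wide` from the full call. This only catches missing records; it says nothing about missing columns.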

If anyone wants to try themselves, the url is "https://bbmc.ouhsc.edu/redcap/api/" and the token (to this PHI-free dataset) is "5C1526186C4D04AE0A0630743E69B53C".

@TimMonahan, a little follow-up to yesterday's session: I believe I have isolated the originating problem to the PHP side. I was exploring similar scenarios on my own with large datasets (including the super wide 3 dataset). An incomplete dataset is returned in the playground and Postman, despite also returning a 200 http status code.

I've upgraded REDCapR's warning to an error/stop. I'll also add advice to the message about decreasing the batch size or dropping some forms/events.

I got completely upstream of R/httr/readr/REDCapR, and used Postman & bash. I made four identical calls and saved to files "raw-text-postman-2.csv" ... "raw-text-postman-5.csv" and then looked at the first 100 characters in each file. They weren't consistent and all of them were missing the first 20,000+ variables --which is bad. But there was a repeating pattern. Some started at variable_30253 and some started at variable_25495.

When I use the DataExport feature in the browser, I start with the initial variables --which is good.

Back to Postman, when I ask for only record_id and variable_00002, it never works --it returns only record_id. But when I ask for record_id and variable_25495, it works half the time (and the other half is only record_id).

Before I post something to the REDCap Community site, any other diagnostic information that would be helpful for people to know? Any holes/flaws in this examination?

Snippet of first 100 characters from four identical Postman calls:

$ head -c 100 raw-text-postman-2.csv
# record_id,variable_30253,variable_30254,variable_30255___1,variable_30255___2,variable_30255___3,var
$ head -c 100 raw-text-postman-3.csv
# record_id,variable_25495,variable_25496,variable_25497___1,variable_25497___2,variable_25497___3,var
$ head -c 100 raw-text-postman-4.csv
# record_id,variable_30253,variable_30254,variable_30255___1,variable_30255___2,variable_30255___3,var
$ head -c 100 raw-text-postman-5.csv
# record_id,variable_25495,variable_25496,variable_25497___1,variable_25497___2,variable_25497___3,var

Snippet of first 100 characters of REDCap export:

$ head -c 100 export-1.csv
# record_id,variable_00002,variable_00003___1,variable_00003___2,variable_00003___3,variable_00004,var
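The comparison above could be automated by checking the header line of the raw CSV against the fields that were requested. This is a rough sketch, assuming unquoted header names and checkbox columns that follow the `field___code` pattern seen in the snippets; `missing_header_fields()` is a hypothetical helper, not part of REDCapR, and the two strings are shortened stand-ins for real responses.

```r
# Hypothetical helper (not part of REDCapR): lists the expected fields that
# are absent from the header line of a raw CSV response.
missing_header_fields <- function(raw_text, fields_expected) {
  # Keep only the first line of the response (the CSV header).
  header_line <- strsplit(raw_text, "\r?\n")[[1]][1]
  columns     <- strsplit(header_line, ",", fixed = TRUE)[[1]]
  # Collapse checkbox columns (e.g., variable_00003___1) to their field name.
  fields_actual <- unique(sub("___.+$", "", columns))
  setdiff(fields_expected, fields_actual)
}

# Shortened stand-ins for an incomplete API response and a complete export.
raw_text_bad  <- "record_id,variable_30253,variable_30254,variable_30255___1,variable_30255___2\n1,a,b,0,1"
raw_text_good <- "record_id,variable_00002,variable_00003___1,variable_00003___2,variable_00004\n1,a,0,1,b"
fields_expected <- c("record_id", "variable_00002", "variable_00003", "variable_00004")

missing_header_fields(raw_text_bad , fields_expected)
missing_header_fields(raw_text_good, fields_expected)
```

The first call reports the three leading variables as missing; the second returns an empty vector.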

Screenshot of Postman call:

Similar alternating pattern I saw earlier inside kernel-api.R. I think you reported seeing a similar alternation with your dataset yesterday.

Browse[2]> substr(raw_text, 1, 100)
[1] "record_id,variable_25495,variable_25496,variable_25497___1,variable_25497___2,variable_25497___3,var"

Asking for variable_25495, which works half of the time:


wibeasley · 2022-07-22T22:43:23Z

@TimMonahan, Anything you'd change about this error message? (And I'm submitting the previous post to the Community site. EDIT: https://community.projectredcap.org/questions/131318/api-returns-200-ok-despite-returning-an-incomplete.html)

Error: There are 32 subject(s) that are missing rows in the returned dataset. REDCap's PHP code is likely trying to process too much text in one bite.

Common solutions this problem are:
  - specifying only the records you need (w/ `records`)
  - specifying only the fields you need (w/ `fields`)
  - specifying only the forms you need (w/ `forms`)
  - specifying a subset w/ `filter_logic`
  - reduce `batch_size`

The missing ids are:
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32.
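The last suggestion in that message could be automated by retrying with a smaller batch whenever the read fails. This is a minimal sketch, not something REDCapR does itself; `read_with_shrinking_batches()` and its `read_fn` argument are hypothetical, and the fake reader below only stands in for a real call.

```r
# Hypothetical wrapper (not part of REDCapR): halves the batch size after each
# failed read until the read succeeds or the minimum is passed.
read_with_shrinking_batches <- function(read_fn, batch_size = 100L, batch_size_min = 1L) {
  while (batch_size_min <= batch_size) {
    result <- tryCatch(
      read_fn(batch_size),
      error = function(e) {
        message("batch_size ", batch_size, " failed: ", conditionMessage(e))
        NULL
      }
    )
    if (!is.null(result)) {
      return(list(data = result, batch_size = batch_size))
    }
    batch_size <- batch_size %/% 2L
  }
  stop("The read failed at every batch size tried.")
}

# Fake reader standing in for the API: it fails when a batch exceeds 30 records.
fake_read <- function(batch_size) {
  if (30L < batch_size) {
    stop("There are subject(s) that are missing rows in the returned dataset.")
  }
  data.frame(record_id = 1:3)
}

outcome <- read_with_shrinking_batches(fake_read, batch_size = 100L)
outcome$batch_size
```

With a real project, `read_fn` might be `function(b) REDCapR::redcap_read(batch_size = b, redcap_uri = redcap_uri, token = token)$data`, assuming a REDCapR version that raises an error when records are missing.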


TimMonahan · 2022-07-22T23:08:02Z

Looks great Will! ship it. And thanks again!


wibeasley self-assigned this Jul 21, 2022

wibeasley added a commit that referenced this issue Jul 21, 2022

check for missing records

374cd0f

ref #400

wibeasley added a commit that referenced this issue Jul 22, 2022

more debugging info

0cf513d

ref #400

wibeasley added a commit that referenced this issue Jul 22, 2022

make it friendly for record_collapsed

0911e3b

ref #400 cc: @TimMonahan

wibeasley added a commit that referenced this issue Jul 22, 2022

avoid including forms/fields in redcap_read()'s initial

c022b5e

loosely associated w/ #400

wibeasley added a commit that referenced this issue Jul 22, 2022

check unique_ids isn't empty

1d591ef

losely associated w/ #400

wibeasley added a commit that referenced this issue Jul 22, 2022

upload data for superwide-3

ef3e353

ref #400

wibeasley added a commit that referenced this issue Jul 22, 2022

testing more superwide datasets

da15592

ref #400

wibeasley added a commit that referenced this issue Jul 22, 2022

better error message when a batch is too large

91a1324

closes #400

wibeasley mentioned this issue Jul 22, 2022

checking for partially-returned datasets #401

Merged

wibeasley closed this as completed in 09fb518 Jul 22, 2022

wibeasley mentioned this issue Jul 23, 2022

Function to export logfile #383

Merged

OuhscBbmc deleted a comment from TimMonahan Oct 11, 2022
