leveraging metadata to determine data type in R. #294

wibeasley · 2020-01-31T19:38:43Z

@pbchase re-energized the conversation about redcap_read() using metadata to determine the data type. This may be less important that it was before October's enhancements with types (#257 & #258), bt I agree we should still think about it.

I pulled the stock validation options with SELECT * FROM redcap_validation_types; and displayed a few relevant columns. (I excluded a few huge and/or obvious ones that distort the table.) The 2nd column is my best guess how readr columns should be specified. The validation_name column is exposed through REDCapR::redcap_metadata_read().

I don't know how the dates can work reliably. Most of them may have to be read as character. The JavaScript regex for the 2nd row below is /^((29([-\/])02\3(\d{2}([13579][26]\|[2468][048]\|04\|08)\|(1600\|2[048]00)))\|((((0[1-9]\|1\d\|2[0-8])([-\/])(0[1-9]\|1[012]))\|((29\|30)([-\/])(0[13-9]\|1[012]))\|(31([-\/])(0[13578]\|1[02])))(\11\|\15\|\18)\d{4}))$/.

Notice the character class [-\/] that's sprinkled in, which accepts either - or / between date parts. I don't think readr can be that flexible. You must specify either - or / in readr, but from the REDCap regex, there's no way to determine what it's going to be. In fact, the same column can contain cells that mix & match. (And I think within a cell, you can mix & match, such as "4/3-2001", because the regex isn't using a back reference.)

A related option is that REDCap suggests the col_types, by printing them to the console. The user then adjusts them and specifies it in the code right before redcap_read(). I have had scucess with a very similar strategy with OuhscMunge::readr_spec_aligned().

@pbchase and others, what are your thoughts?

	col_type	validation_name	validation_label	regex_js
⭕	`col_character()`	alpha_only	Letters only
⭕	`col_date("%d-%m-%Y")`	date_dmy	Date (D-M-Y)
⭕	`col_date("%m-%d-%Y")`	date_mdy	Date (M-D-Y)
⭕	`col_date()`	date_ymd	Date (Y-M-D)
⭕	`col_datetime("%d-%m-%Y %H:%M")`	datetime_dmy	Datetime (D-M-Y H:M)
⭕	`col_datetime("%m-%d-%Y %H:%M")`	datetime_mdy	Datetime (M-D-Y H:M)
⭕	`col_datetime("%d-%m-%Y %H:%M:%S")`	datetime_seconds_dmy	Datetime w/ seconds (D-M-Y H:M:S)
⭕	`col_datetime("%m-%d-%Y %H:%M:%S")`	datetime_seconds_mdy	Datetime w/ seconds (M-D-Y H:M:S)
⭕	`col_datetime("%Y-%m-%d %H:%M:%S")`	datetime_seconds_ymd	Datetime w/ seconds (Y-M-D H:M:S)
⭕	`col_datetime("%Y-%m-%d %H:%M")`	datetime_ymd	Datetime (Y-M-D H:M)
⭕	`col_character()`	email	Email
⭕	`col_integer()`	integer	Integer	`/^[-+]?\b\d+\b$/`
⭕	`col_character()`	mrn_10d	MRN (10 digits)	`/^\d{10}$/`
⭕	`col_character()`	mrn_generic	MRN (generic)	`/^[a-z0-9-_]+$/i`
⭕	`col_double()`	number	Number	`/^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$/`
⭕	`col_double()`	number_1dp	Number (1 decimal place)	`/^-?\d+\.\d$/`
⭕	`col_double()`	number_1dp_comma_decimal	Number (1 decimal place - comma as decimal)	`/^-?\d+,\d$/`
⭕	`col_double()`	number_2dp	Number (2 decimal places)	`/^-?\d+\.\d{2}$/`
⭕	`col_double()`	number_2dp_comma_decimal	Number (2 decimal places - comma as decimal)	`/^-?\d+,\d{2}$/`
⭕	`col_double()`	number_3dp	Number (3 decimal places)	`/^-?\d+\.\d{3}$/`
⭕	`col_double()`	number_3dp_comma_decimal	Number (3 decimal places - comma as decimal)	`/^-?\d+,\d{3}$/`
⭕	`col_double()`	number_4dp	Number (4 decimal places)	`/^-?\d+\.\d{4}$/`
⭕	`col_double()`	number_4dp_comma_decimal	Number (4 decimal places - comma as decimal)	`/^-?\d+,\d{4}$/`
⭕	`col_double()`	number_comma_decimal	Number (comma as decimal)	`/^[-+]?[0-9]*,?[0-9]+([eE][-+]?[0-9]+)?$/`
⭕	`col_character()`	phone	Phone (North America)
⭕	`col_character()`	phone_australia	Phone (Australia)	`/^((0[2-8])
⭕	`col_character()`	postalcode_australia	Postal Code (Australia)	`/^\d{4}$/`
⭕	`col_character()`	postalcode_canada	Postal Code (Canada)	`/^[ABCEGHJKLMNPRSTVXY]{1}\d{1}[A-Z]{1}\s*\d{1}[A-Z]{1}\d{1}$/i`
⭕	`col_character()`	postalcode_french	Code Postal 5 caracteres (France)	`/^((0?[1-9])
⭕	`col_character()`	postalcode_germany	Postal Code (Germany)	`/^(0[1-9]
⭕	`col_character()`	ssn	Social Security Number (U.S.)	`/^\d{3}-\d\d-\d{4}$/`
⭕	`col_time("%H:%M")`	time	Time (HH:MM)	`/^([0-9]
⭕	`col_time("%M:%H")`	time_mm_ss	Time (MM:SS)	`/^[0-5]\d:[0-5]\d$/`
⭕	`col_character()`	vmrn	Vanderbilt MRN	`/^[0-9]{4,9}$/`
⭕	`col_character()`	zipcode	Zipcode (U.S.)	`/^\d{5}(-\d{4})?$/`

The text was updated successfully, but these errors were encountered:

pbchase · 2020-02-03T14:02:16Z

Are you sure your concerns about the variability in the date types is warranted? I know the backend storage for date-time types is always YMD. I've attached an export of a test project that uses all 9 date-time stamps. The CSV export is always ymd. I don't see a problem here.

date_test.zip

nutterb · 2020-02-03T14:11:35Z

@pbchase has a point here. The validation_label specifies describes the regex used to validate the input from the user on the form. But when the data are exported, they always come out in POSIX format.

Anything that has a field type starting with date_ can be processed in a single manner. You might find this part of redcapAPI helpful for this effort.

wibeasley · 2020-02-03T16:32:02Z

@pbchase & @nutterb, thanks for pointing out that I didn't need to be as worried about the date formats. One less thing to worry about.

@thomasnwilson and I are still concerned about the scenario where (a) some values are entered initially, and (b) later validation is added. In that case, it's very easy that the metadata information will be too restrictive when determining data type.

library(magrittr)
uri   <- "https://bbmc.ouhsc.edu/redcap/api/"
token <- "14A41597332864D74460CBBF52EE49A6"
ds_metadata <- REDCapR::redcap_metadata_read(redcap_uri=uri, token=token)$data
ds          <- REDCapR::redcap_read(redcap_uri=uri, token=token)$data

ds_metadata %>% 
  dplyr::select(
    field_name
    ,text_validation_type_or_show_slider_number
  )

ds

results

Notice the text_validation_type_or_show_slider_number values look like good dates & integer (in the 1st data frame). But look at the 2nd data frame, the cells in date_before_validation & integer_before_validation would fail readr::col_date() and readr::col_integer().

# A tibble: 4 x 2
  field_name                text_validation_type_or_show_slider_number
  <chr>                     <chr>                                     
1 record_id                 NA                                        
2 time_1                    NA                                        
3 date_before_validation    date_ymd                                  
4 integer_before_validation integer                                   

  record_id     time_1 date_before_validation integer_before_validation form_1_complete
1         1 2010-01-02    before validation 1       before validation 1               2
2         2      55:02    before validation 2       before validation 1               2

@pbchase, @thomasnwilson, @nutterb, and anyone else. Do you agree this is a problem, or am I inventing something? If so, what are your overall goals with this desired capability (i.e., the ability for redcap_read() to leverage metadata?

nutterb · 2020-02-04T02:27:08Z

You're probably more kind to your users than I am. My function has a dates = TRUE argument that determines if dates are converted. If someone were to impose that validation after collecting data, I'd advise them to set dates = FALSE and deal with the fall out on their own.

There are too many ways things can go wrong, and I personally feel that trying to accommodate all of those problems is a drain on development time. But I'm a cranky old man, too.

pbchase · 2020-02-04T14:21:02Z

I think leverage_metadata might be good name for a new boolean argument to redcap_read() that allows one to try to use the metadata to set the col_types.

I think redcap_read() should use another new function, perhaps named get_col_types_from_metadata() that would generate the col_types vector from the metadata. The poor bastards that can't set leverage_metadata to true would then have the option to generate the col_types vector, mutate it and then use the mutated output.

My idea is to make a tool that simplifies some problems. By no means can we fix all the problems that might arise.

ref #294

VS Code regex & sub `(.*) \| (.*) +$` & `- validation_name : $1\n fx_export : "$2"` ref #294

ref #294

wibeasley · 2022-08-11T17:59:09Z

issue replaced by #405

wibeasley added the enhancement label Jan 31, 2020

wibeasley self-assigned this Jan 31, 2020

wibeasley mentioned this issue Nov 11, 2021

Unexpected Behavior: Field validation type number_comma_decimal breaks in export #366

Closed

wibeasley added a commit that referenced this issue Nov 28, 2021

proof of concept using metadata to change parser

cb499cf

ref #294

wibeasley added a commit that referenced this issue Nov 29, 2021

starting validation yaml

f9cdf61

VS Code regex & sub `(.*) \| (.*) +$` & `- validation_name : $1\n fx_export : "$2"` ref #294

wibeasley added a commit that referenced this issue Nov 30, 2021

project for testing validation types

8cab45d

ref #294

wibeasley added a commit that referenced this issue Nov 30, 2021

seed the test dataset

b820726

ref #294

wibeasley mentioned this issue Aug 11, 2022

new redcap_metadata_coltypes() function #405

Closed

wibeasley closed this as completed Aug 11, 2022

OuhscBbmc locked and limited conversation to collaborators Aug 11, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

leveraging metadata to determine data type in R. #294

leveraging metadata to determine data type in R. #294

wibeasley commented Jan 31, 2020

pbchase commented Feb 3, 2020

nutterb commented Feb 3, 2020

wibeasley commented Feb 3, 2020

nutterb commented Feb 4, 2020

pbchase commented Feb 4, 2020 •

edited

Loading

wibeasley commented Aug 11, 2022

leveraging metadata to determine data type in R. #294

leveraging metadata to determine data type in R. #294

Comments

wibeasley commented Jan 31, 2020

pbchase commented Feb 3, 2020

nutterb commented Feb 3, 2020

wibeasley commented Feb 3, 2020

nutterb commented Feb 4, 2020

pbchase commented Feb 4, 2020 • edited Loading

wibeasley commented Aug 11, 2022

pbchase commented Feb 4, 2020 •

edited

Loading