Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

leveraging metadata to determine data type in R. #294

Closed
wibeasley opened this issue Jan 31, 2020 · 6 comments
Closed

leveraging metadata to determine data type in R. #294

wibeasley opened this issue Jan 31, 2020 · 6 comments
Assignees

Comments

@wibeasley
Copy link
Member

@pbchase re-energized the conversation about redcap_read() using metadata to determine the data type. This may be less important that it was before October's enhancements with types (#257 & #258), bt I agree we should still think about it.

I pulled the stock validation options with SELECT * FROM redcap_validation_types; and displayed a few relevant columns. (I excluded a few huge and/or obvious ones that distort the table.) The 2nd column is my best guess how readr columns should be specified. The validation_name column is exposed through REDCapR::redcap_metadata_read().

I don't know how the dates can work reliably. Most of them may have to be read as character. The JavaScript regex for the 2nd row below is /^((29([-\/])02\3(\d{2}([13579][26]\|[2468][048]\|04\|08)\|(1600\|2[048]00)))\|((((0[1-9]\|1\d\|2[0-8])([-\/])(0[1-9]\|1[012]))\|((29\|30)([-\/])(0[13-9]\|1[012]))\|(31([-\/])(0[13578]\|1[02])))(\11\|\15\|\18)\d{4}))$/.

Notice the character class [-\/] that's sprinkled in, which accepts either - or / between date parts. I don't think readr can be that flexible. You must specify either - or / in readr, but from the REDCap regex, there's no way to determine what it's going to be. In fact, the same column can contain cells that mix & match. (And I think within a cell, you can mix & match, such as "4/3-2001", because the regex isn't using a back reference.)

A related option is that REDCap suggests the col_types, by printing them to the console. The user then adjusts them and specifies it in the code right before redcap_read(). I have had scucess with a very similar strategy with OuhscMunge::readr_spec_aligned().

@pbchase and others, what are your thoughts?

col_type validation_name validation_label regex_js
col_character() alpha_only Letters only
col_date("%d-%m-%Y") date_dmy Date (D-M-Y)
col_date("%m-%d-%Y") date_mdy Date (M-D-Y)
col_date() date_ymd Date (Y-M-D)
col_datetime("%d-%m-%Y %H:%M") datetime_dmy Datetime (D-M-Y H:M)
col_datetime("%m-%d-%Y %H:%M") datetime_mdy Datetime (M-D-Y H:M)
col_datetime("%d-%m-%Y %H:%M:%S") datetime_seconds_dmy Datetime w/ seconds (D-M-Y H:M:S)
col_datetime("%m-%d-%Y %H:%M:%S") datetime_seconds_mdy Datetime w/ seconds (M-D-Y H:M:S)
col_datetime("%Y-%m-%d %H:%M:%S") datetime_seconds_ymd Datetime w/ seconds (Y-M-D H:M:S)
col_datetime("%Y-%m-%d %H:%M") datetime_ymd Datetime (Y-M-D H:M)
col_character() email Email
col_integer() integer Integer /^[-+]?\b\d+\b$/
col_character() mrn_10d MRN (10 digits) /^\d{10}$/
col_character() mrn_generic MRN (generic) /^[a-z0-9-_]+$/i
col_double() number Number /^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$/
col_double() number_1dp Number (1 decimal place) /^-?\d+\.\d$/
col_double() number_1dp_comma_decimal Number (1 decimal place - comma as decimal) /^-?\d+,\d$/
col_double() number_2dp Number (2 decimal places) /^-?\d+\.\d{2}$/
col_double() number_2dp_comma_decimal Number (2 decimal places - comma as decimal) /^-?\d+,\d{2}$/
col_double() number_3dp Number (3 decimal places) /^-?\d+\.\d{3}$/
col_double() number_3dp_comma_decimal Number (3 decimal places - comma as decimal) /^-?\d+,\d{3}$/
col_double() number_4dp Number (4 decimal places) /^-?\d+\.\d{4}$/
col_double() number_4dp_comma_decimal Number (4 decimal places - comma as decimal) /^-?\d+,\d{4}$/
col_double() number_comma_decimal Number (comma as decimal) /^[-+]?[0-9]*,?[0-9]+([eE][-+]?[0-9]+)?$/
col_character() phone Phone (North America)
col_character() phone_australia Phone (Australia) `/^((0[2-8])
col_character() postalcode_australia Postal Code (Australia) /^\d{4}$/
col_character() postalcode_canada Postal Code (Canada) /^[ABCEGHJKLMNPRSTVXY]{1}\d{1}[A-Z]{1}\s*\d{1}[A-Z]{1}\d{1}$/i
col_character() postalcode_french Code Postal 5 caracteres (France) `/^((0?[1-9])
col_character() postalcode_germany Postal Code (Germany) `/^(0[1-9]
col_character() ssn Social Security Number (U.S.) /^\d{3}-\d\d-\d{4}$/
col_time("%H:%M") time Time (HH:MM) `/^([0-9]
col_time("%M:%H") time_mm_ss Time (MM:SS) /^[0-5]\d:[0-5]\d$/
col_character() vmrn Vanderbilt MRN /^[0-9]{4,9}$/
col_character() zipcode Zipcode (U.S.) /^\d{5}(-\d{4})?$/
@wibeasley wibeasley self-assigned this Jan 31, 2020
@pbchase
Copy link
Contributor

pbchase commented Feb 3, 2020

Are you sure your concerns about the variability in the date types is warranted? I know the backend storage for date-time types is always YMD. I've attached an export of a test project that uses all 9 date-time stamps. The CSV export is always ymd. I don't see a problem here.

date_test.zip

@nutterb
Copy link
Contributor

nutterb commented Feb 3, 2020

@pbchase has a point here. The validation_label specifies describes the regex used to validate the input from the user on the form. But when the data are exported, they always come out in POSIX format.

Anything that has a field type starting with date_ can be processed in a single manner. You might find this part of redcapAPI helpful for this effort.

@wibeasley
Copy link
Member Author

@pbchase & @nutterb, thanks for pointing out that I didn't need to be as worried about the date formats. One less thing to worry about.

@thomasnwilson and I are still concerned about the scenario where (a) some values are entered initially, and (b) later validation is added. In that case, it's very easy that the metadata information will be too restrictive when determining data type.

library(magrittr)
uri   <- "https://bbmc.ouhsc.edu/redcap/api/"
token <- "14A41597332864D74460CBBF52EE49A6"
ds_metadata <- REDCapR::redcap_metadata_read(redcap_uri=uri, token=token)$data
ds          <- REDCapR::redcap_read(redcap_uri=uri, token=token)$data

ds_metadata %>% 
  dplyr::select(
    field_name
    ,text_validation_type_or_show_slider_number
  )

ds

results

Notice the text_validation_type_or_show_slider_number values look like good dates & integer (in the 1st data frame). But look at the 2nd data frame, the cells in date_before_validation & integer_before_validation would fail readr::col_date() and readr::col_integer().

# A tibble: 4 x 2
  field_name                text_validation_type_or_show_slider_number
  <chr>                     <chr>                                     
1 record_id                 NA                                        
2 time_1                    NA                                        
3 date_before_validation    date_ymd                                  
4 integer_before_validation integer                                   

  record_id     time_1 date_before_validation integer_before_validation form_1_complete
1         1 2010-01-02    before validation 1       before validation 1               2
2         2      55:02    before validation 2       before validation 1               2

@pbchase, @thomasnwilson, @nutterb, and anyone else. Do you agree this is a problem, or am I inventing something? If so, what are your overall goals with this desired capability (i.e., the ability for redcap_read() to leverage metadata?

@nutterb
Copy link
Contributor

nutterb commented Feb 4, 2020

You're probably more kind to your users than I am. My function has a dates = TRUE argument that determines if dates are converted. If someone were to impose that validation after collecting data, I'd advise them to set dates = FALSE and deal with the fall out on their own.

There are too many ways things can go wrong, and I personally feel that trying to accommodate all of those problems is a drain on development time. But I'm a cranky old man, too.

@pbchase
Copy link
Contributor

pbchase commented Feb 4, 2020

I think leverage_metadata might be good name for a new boolean argument to redcap_read() that allows one to try to use the metadata to set the col_types.

I think redcap_read() should use another new function, perhaps named get_col_types_from_metadata() that would generate the col_types vector from the metadata. The poor bastards that can't set leverage_metadata to true would then have the option to generate the col_types vector, mutate it and then use the mutated output.

My idea is to make a tool that simplifies some problems. By no means can we fix all the problems that might arise.

wibeasley added a commit that referenced this issue Nov 29, 2021
VS Code regex & sub
`(.*) \| (.*) +$` & `- validation_name : $1\n  fx_export       : "$2"`

ref #294
wibeasley added a commit that referenced this issue Nov 30, 2021
wibeasley added a commit that referenced this issue Nov 30, 2021
@wibeasley
Copy link
Member Author

issue replaced by #405

@OuhscBbmc OuhscBbmc locked and limited conversation to collaborators Aug 11, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants