This R package can be used to create and/or load a database containing the UK Biobank Data Showcase schemas, which are data dictionaries describing the structure of the UK Biobank main dataset.
You can install the current version of ukbschemas from GitHub with:
# install.packages("devtools")
devtools::install_github("bjcairns/ukbschemas")
library(ukbschemas)
The package supports two workflows.
The recommended approach is to use ukbschemas_db() to download the schema tables and save them to an SQLite database, then use load_db() to load the tables from the database and store them as tibbles in a named list:
db <- ukbschemas_db(path = tempdir())
sch <- load_db(db = db)
By default, the database is named ukb-schemas-YYYY-MM-DD.sqlite (where YYYY-MM-DD is the current date) and placed in the current working directory. (path = tempdir() in the above example puts it in the current temporary directory instead.) At the most recent compilation of the database (03 August 2019), the size of the .sqlite database file produced by ukbschemas_db() was approximately 10.1 MB.
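The default file name follows a simple date pattern. As a minimal sketch (this is not the package's own code; default_db_name() is a hypothetical helper used only for illustration), a name of the same form can be built in base R:

```r
# Hypothetical helper, not part of ukbschemas: builds a file name that
# follows the ukb-schemas-YYYY-MM-DD.sqlite pattern described above.
default_db_name <- function(date = Sys.Date()) {
  paste0("ukb-schemas-", format(as.Date(date), "%Y-%m-%d"), ".sqlite")
}

# Full path for a database placed in the current temporary directory
db_path <- file.path(tempdir(), default_db_name())
```

This can be handy for checking whether a database for today's date already exists, for example with file.exists(db_path).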
Note that without further arguments, ukbschemas_db() tidies up the database to give it a more consistent relational structure (the changes are summarised in the output of the first example, above). Alternatively, the raw data can be loaded by setting the as_is argument to TRUE:
db <- ukbschemas_db(path = tempdir(), overwrite = TRUE, as_is = TRUE)
The overwrite option controls whether an existing database file may be overwritten: TRUE allows it and FALSE prevents it. If overwrite is not specified and the session is interactive (interactive() == TRUE), the user is prompted to decide.
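The three cases can be sketched as a small decision function. This is an illustration only, not the package's implementation: may_overwrite() is a hypothetical helper, and the behaviour when overwrite is unspecified in a non-interactive session (refusing) is an assumption, since it is not described here.

```r
# Hypothetical helper, not part of ukbschemas: decides whether an existing
# file may be overwritten, mirroring the cases described above.
may_overwrite <- function(file, overwrite = NULL) {
  if (!file.exists(file)) return(TRUE)    # nothing to overwrite
  if (isTRUE(overwrite)) return(TRUE)     # explicitly allowed
  if (isFALSE(overwrite)) return(FALSE)   # explicitly prevented
  if (interactive()) {
    # Not specified in an interactive session: ask the user
    answer <- readline(sprintf("Overwrite %s? [y/N] ", file))
    return(tolower(trimws(answer)) %in% c("y", "yes"))
  }
  # Assumption: with no explicit choice and no user to ask, refuse
  FALSE
}
```

In scripts and other non-interactive jobs it is probably safest to pass overwrite explicitly rather than rely on any default.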
Note: If you have created a schemas database with an earlier version of ukbschemas, it should be possible to load that database with the latest version of load_db(), which (currently) should load any SQLite database, regardless of contents.
The second approach is to download the schemas and store them in memory in a list, and save them to a database only as required.
This is not recommended, because it is better (for everyone) not to download the schema files every time they are needed, and because the database assumes a certain structure that should be guaranteed when the database is saved. If you still want to take this approach, use:
sch <- ukbschemas()
db <- save_db(sch, path = tempdir())
This package was originally written in bash (a Unix shell scripting language). However, R is more accessible and all dependencies are loaded when you install the package; there is no need to install any secondary software (not even SQLite).
- All the encoding value tables (esimpint, esimpstring, esimpreal, esimpdate, ehierint, ehierstring) have been harmonised and combined into a single table encvalues. The value column in encvalues has type TEXT, but a type column has been added in case the value is not clear from context. The original type-specific tables have been deleted.
- To avoid redundancy, category parent-child relationships have been moved to table categories, as column parent_id, from table catbrowse (which has been deleted).
- Reference to the category to which a field belongs is in the main_category column in the fields schema, but has been renamed to category_id for consistency with the categories schema.
- Details of several of the field properties (value_type, stability, item_type, strata and sexed) are available elsewhere on the Data Showcase. These have been added manually to tables valuetypes, stability, itemtypes, strata and sexed, and appropriate ID references have been renamed with the _id suffix in tables fields and encodings.
- There are several columns in the tables which are not well-documented (e.g. base_type in fields, availability in encodings and categories, and others). Additional tables documenting these encoded values may be included in future versions (and suggestions are welcome).
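To make the first change in the list above concrete, here is a minimal sketch of combining type-specific tables into one table with a text value column and a type column. It uses made-up rows and assumed column names (encoding_id, value, meaning) and assumed type labels; it is not the code that ukbschemas_db() runs, and harmonise() is a hypothetical helper.

```r
# Toy stand-ins for two of the type-specific tables (made-up rows and
# assumed column names; the real schemas may differ).
esimpint <- data.frame(
  encoding_id = c(7L, 7L), value = c(0L, 1L),
  meaning = c("No", "Yes"), stringsAsFactors = FALSE
)
esimpstring <- data.frame(
  encoding_id = c(2L, 2L), value = c("A", "B"),
  meaning = c("Group A", "Group B"), stringsAsFactors = FALSE
)

# Hypothetical helper: store the value as text and record its original type
harmonise <- function(tbl, type) {
  tbl$value <- as.character(tbl$value)
  tbl$type <- type
  tbl
}

# Single combined table, analogous in spirit to encvalues
encvalues <- rbind(
  harmonise(esimpint, "integer"),
  harmonise(esimpstring, "string")
)
```

Keeping the type alongside the text value means the original representation can be recovered later, for example with as.integer() on the rows whose type is "integer".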
- The UK Biobank data schemas are regularly updated as new data are added to the system. ukbschemas does not currently include a facility for updating the database; it is necessary to create a new database.
- Because readr::read_csv() reads whole numbers as type double, not integer (allowing 64-bit integers without loss of information), column types in schemas loaded in R will differ depending on whether the schemas are loaded directly to R or first saved to a database. This should make little or no difference for most applications.
- Any other issues.
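If the double-versus-integer difference noted above does matter for an application, one possible workaround is to convert whole-number double columns to integer after loading. The sketch below uses base R only; whole_doubles_to_integer() is a hypothetical helper, not part of ukbschemas, and it deliberately leaves values outside the 32-bit integer range (and classed columns such as dates) untouched.

```r
# Hypothetical helper: convert plain double columns that hold only whole
# numbers within the 32-bit integer range to integer; leave the rest alone.
whole_doubles_to_integer <- function(df) {
  is_whole <- function(x) {
    is.double(x) && !is.object(x) &&   # skip Date and other classed columns
      all(is.na(x) |
            (is.finite(x) & x == round(x) & abs(x) <= .Machine$integer.max))
  }
  for (nm in names(df)) {
    if (is_whole(df[[nm]])) df[[nm]] <- as.integer(df[[nm]])
  }
  df
}

# Toy example with made-up values
x <- data.frame(field_id = c(31, 34), score = c(0.5, 2))
y <- whole_doubles_to_integer(x)
str(y)
```

The same function should work on each tibble in a list of schemas, for example via lapply(), though that has not been checked against the real tables.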