+ Summary
+ An accurate statement of the provenance of data is essential in
+ bio-medical research. Powerful data manipulation tools available in
+ the tidyverse R package ecosystem
+ (Wickham
+ et al., 2019) provide the infrastructure to assemble, clean and
+ filter data prior to statistical analysis. Manual documentation of the
+ steps taken in the data pipeline and the provenance of data is a
+ cumbersome and error-prone task which may restrict reproducibility.
+ dtrackr is a wrapper around a subset of the
+ standard tidyverse data manipulation tools that
+ allows automatic tracking of the processing steps applied to a data
+ set, prior to statistical analysis. It allows early detection and
+ reporting of data quality problems, and automatically documents a
+ pipeline of data transformations as a flowchart in a format suitable
+ for scientific publication, including, but not limited to, CONSORT
+ diagrams (Schulz et al., 2010).
+ dtrackr is first and foremost a utility to
+ accelerate and improve research by facilitating documentation,
+ supporting extraction of knowledge from data sets, and the execution
+ of research by helping identify data quality issues. The general
+ capability, however, fits into a broader context of other provenance or
+ data pipeline research. This includes initiatives such as
+ C2Metadata (Alter et al., 2021), which focus on a language-independent
+ representation of a data pipeline, and R packages such as targets
+ (Landau, 2021), which focus on documenting pipeline code and managing
+ the execution of a pipeline, or RDataTracker, which focusses on
+ tracking the execution of an arbitrary R script (Lerner et al., 2018).
+ dtrackr takes a more data-oriented approach, which could be
+ complementary, in which we remain
+ agnostic to the detail of a data pipeline script or nature of its
+ execution, but capture a subset of the transformations applied to data
+ alongside the data itself, thereby documenting the data state as it is
+ being manipulated. This is achieved by overriding the execution of
+ dplyr pipeline functions and results in a
+ retrospective record of provenance
+ (Pimentel
+ et al., 2019). dtrackr also has the
+ ability to insert secondary analysis as annotations into the pipeline,
+ and allows control over what information is collected, ultimately with
+ a view to producing simple human-readable output. The approach of
+ dtrackr is analogous to a
+ git commit history for dataframes, and there is
+ potential synergy with emerging versioned databases such as
+ dolt
+ (Dolt
+ Is Git for Data!, 2019/2022;
+ Ross,
+ 2022).
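+
+ The idea of capturing transformations alongside the data itself can be
+ illustrated with a minimal base R sketch. This is not how dtrackr is
+ implemented; the tracked() helper, the "history" attribute name and the
+ example data are all hypothetical and exist only to show the principle
+ of wrapping a transformation so that the data carries its own record.
+
+ ```r
+ # Hypothetical helper, not part of dtrackr: wrap a data frame
+ # transformation so that each call appends one step to a "history"
+ # attribute stored on the result.
+ tracked <- function(fn, label) {
+   function(df, ...) {
+     history <- attr(df, "history")   # NULL when the input is untracked
+     rows_in <- nrow(df)
+     out <- fn(df, ...)
+     step <- data.frame(step = label, rows_in = rows_in, rows_out = nrow(out))
+     attr(out, "history") <- rbind(history, step)
+     out
+   }
+ }
+
+ # Two illustrative transformations wrapped with the helper.
+ drop_missing <- tracked(
+   function(df, col) df[!is.na(df[[col]]), , drop = FALSE],
+   "drop missing values"
+ )
+ keep_adults <- tracked(
+   function(df) df[df$age >= 18, , drop = FALSE],
+   "keep adults"
+ )
+
+ # Made-up example data.
+ patients <- data.frame(id = 1:5, age = c(34, NA, 12, 51, 67))
+ result <- keep_adults(drop_missing(patients, "age"))
+
+ # The provenance record travels with the data frame.
+ attr(result, "history")
+ ```
+
+ Because the record is stored on the data rather than inferred from the
+ script, it stays accurate however the pipeline was actually run.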
+
+
+ Statement of need
+ The collection of experimental or observational data for research
+ is often an iterative endeavour, involving curation of complex data
+ sets designed for multiple goals. Systematic data quality checking for
+ such sets is a major challenge, particularly when they are assembled
+ to identify emerging or rapidly evolving issues. Feedback from early
+ data analysis can identify specific data quality issues, resolution of
+ which can considerably improve data for the task at hand. However, this
+ requires a clear understanding of why and when individual data items
+ are excluded, which is potentially tedious and may be seen as lower
+ priority compared to statistical analysis.
+ Data analysis using tidyverse in R is a
+ rapid means of transforming raw data into a format suitable for
+ statistical analysis. The transformations involved can, however, affect
+ the results of statistical analysis, and meticulous care must be taken
+ to ensure that any assumptions made during data processing are well
+ documented. It is often too easy to inadvertently exclude data when
+ filtering on missing items, or joining linked data sets with
+ incomplete foreign key relationships.
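+
+ A short sketch of both pitfalls, assuming the dplyr package is
+ installed. The patients and sites tables are made up for illustration
+ and do not come from any study described here.
+
+ ```r
+ library(dplyr)
+
+ # Made-up example data: one missing age, one missing site, and one
+ # site ("C") that has no entry in the lookup table.
+ patients <- tibble(
+   id = 1:5,
+   age = c(34, NA, 12, 51, 67),
+   site_id = c("A", "B", "B", "C", NA)
+ )
+ sites <- tibble(site_id = c("A", "B"), site_name = c("North", "South"))
+
+ # filter() keeps only rows where the condition is TRUE, so the row
+ # with a missing age is dropped together with the under-18 row.
+ adults <- patients %>% filter(age >= 18)
+
+ # inner_join() drops rows whose key has no match in `sites`
+ # (site "C" and the missing site_id).
+ linked <- adults %>% inner_join(sites, by = "site_id")
+
+ # Counting rows by hand after each step is the manual bookkeeping
+ # that is easy to forget.
+ c(start = nrow(patients), adults = nrow(adults), linked = nrow(linked))
+ ```
+
+ Neither step raises a warning, so the exclusions are only visible if
+ the row counts are checked explicitly.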
+ In complex data analysis, the use of interactive programming
+ environments such as Read-Eval-Print Loops (REPL) in R markdown
+ documents, interim caching of results, or conditional branching data
+ pipelines, can result in the current state of a processed data set
+ becoming decoupled from the code that is designed to generate
+ it.
+ To surface these issues, bio-medical journal articles are usually
+ required to report data manipulation to an agreed standard. For
+ example, CONSORT diagrams are part of the requirements in reporting
+ parallel group clinical trials. They are described in the updated 2010
+ CONSORT statement
+ (Schulz
+ et al., 2010), and clarify how patients were recruited,
+ selected, randomized and followed up. For observational studies, such
+ as case control designs, an equivalent requirement is the STROBE
+ statement
+ (von
+ Elm et al., 2008). There are many other similar requirements
+ for other types of study, such as the TRIPOD statement for
+ multivariable prediction models (Collins et al., 2015). Maintaining
+ such diagrams over the course of a study, when data sets are being
+ actively collected and data quality issues are being addressed, is
+ time-consuming.
+ dtrackr addresses these issues by
+ instrumenting a commonly used subset of standard
+ tidyverse data manipulation pipeline functions
+ from dplyr and tidyr. It
+ can automatically record the steps taken, records excluded and a
+ summary of the result of each data processing step, as part of the
+ data set itself in a “history graph”. In this way data sets retain an
+ accurate history of their own provenance regardless of the actual
+ route taken to assemble them. This history includes a complete record
+ of any data quality issues that led to excluded records. The history
+ is a directed graph which can be expressed in the commonly used
+ GraphViz language
+ (Gansner & North, 2000) and may be visualised as a flowchart such as
+ in Figure 1; this uses the Chronic Granulomatous Disease dataset from
+ the survival package (Therneau & Grambsch, 2000; Therneau, 2022) as an
+ example of a parallel group study and produces a STROBE-like
+ flowchart.
+
+ Figure 1: An example flowchart derived directly from a simple
+ analysis of the Chronic Granulomatous Disease dataset demonstrating
+ use of dtrackr to generate the key parts of a
+ STROBE or CONSORT diagram.
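+
+ A pipeline of this kind might look like the following sketch, which
+ assumes the dplyr, dtrackr and survival packages are installed. The
+ exclusion criterion and message wording are arbitrary illustrations,
+ not the analysis behind Figure 1, and the dtrackr documentation should
+ be consulted for the full set of options.
+
+ ```r
+ library(dplyr)
+ library(dtrackr)
+
+ flow <- survival::cgd %>%
+   # start recording a history graph alongside the data
+   track() %>%
+   comment("CGD trial records") %>%
+   # illustrative criterion only: exclude records matching the condition
+   # and report how many were removed
+   exclude_all(
+     age < 5 ~ "{.excluded} records excluded: age under 5"
+   ) %>%
+   # split the flowchart by treatment arm
+   group_by(treat) %>%
+   comment("{.count} records in the {treat} arm") %>%
+   ungroup()
+
+ # Inspect the recorded steps, then render them as a flowchart.
+ flow %>% history()
+ flow %>% flowchart()
+ ```
+
+ The tracked object remains an ordinary data frame, so it can be passed
+ on to statistical analysis once the flowchart has been produced.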
+
+
+
+ dtrackr was originally conceptualized during
+ an analysis I undertook of the severity of the Alpha variant of
+ SARS-CoV-2
+ (Challen
+ et al., 2021), and has since been used for other
+ epidemiological studies including an analysis of the incidence of
+ hospitalization for acute lower respiratory tract disease in Bristol
+ (Hyams, Challen, Begier, et al., 2022), and a comparative analysis of
+ the severity of the SARS-CoV-2 Omicron variant versus the Delta
+ variant against a range of hospital outcomes
+ (Hyams, Challen, Marlow, et al., 2022).
+ Although the specific example presented here is in the bio-medical
+ domain, tracking the provenance of data is a much broader issue, and
+ we anticipate there are many other applications for
+ dtrackr.
+
+