diff --git a/joss.04707/10.21105.joss.04707.crossref.xml b/joss.04707/10.21105.joss.04707.crossref.xml new file mode 100644 index 0000000000..0185e98cd1 --- /dev/null +++ b/joss.04707/10.21105.joss.04707.crossref.xml @@ -0,0 +1,320 @@ + + + + 20221213T203705-b9e5e2b1dfe502ad5ac93ecef9b780f6da9c0bbd + 20221213203705 + + JOSS Admin + admin@theoj.org + + The Open Journal + + + + + Journal of Open Source Software + JOSS + 2475-9066 + + 10.21105/joss + https://joss.theoj.org/ + + + + + 12 + 2022 + + + 7 + + 80 + + + + dtrackr: An R package for tracking the provenance of +data + + + + Robert + Challen + https://orcid.org/0000-0002-5504-7768 + + + + 12 + 13 + 2022 + + + 4707 + + + 10.21105/joss.04707 + + + http://creativecommons.org/licenses/by/4.0/ + http://creativecommons.org/licenses/by/4.0/ + http://creativecommons.org/licenses/by/4.0/ + + + + Software archive + 10.5281/zenodo.7433514 + + + GitHub review issue + https://github.com/openjournals/joss-reviews/issues/4707 + + + + 10.21105/joss.04707 + https://joss.theoj.org/papers/10.21105/joss.04707 + + + https://joss.theoj.org/papers/10.21105/joss.04707.pdf + + + + + + CONSORT 2010 Statement: Updated guidelines +for reporting parallel group randomised trials + Schulz + BMJ + 340 + 10.1136/bmj.c332 + 0959-8138 + 2010 + Schulz, K. F., Altman, D. G., & +Moher, D. (2010). CONSORT 2010 Statement: Updated guidelines for +reporting parallel group randomised trials. BMJ, 340, c332. +https://doi.org/10.1136/bmj.c332 + + + The Strengthening the Reporting of +Observational Studies in Epidemiology (STROBE) statement: Guidelines for +reporting observational studies + Elm + Journal of Clinical +Epidemiology + 4 + 61 + 2008 + von Elm, E., Altman, D. G., Egger, +M., Pocock, S. J., Gøtzsche, P. C., Vandenbroucke, J. P., & STROBE +Initiative. (2008). The Strengthening the Reporting of Observational +Studies in Epidemiology (STROBE) statement: Guidelines for reporting +observational studies. 
Journal of Clinical Epidemiology, 61(4), +344–349. + + + Transparent reporting of a multivariable +prediction model for individual prognosis or diagnosis (TRIPOD): The +TRIPOD Statement + Collins + BMC Medicine + 1 + 13 + 10.1186/s12916-014-0241-z + 1741-7015 + 2015 + Collins, G. S., Reitsma, J. B., +Altman, D. G., & Moons, K. G. (2015). Transparent reporting of a +multivariable prediction model for individual prognosis or diagnosis +(TRIPOD): The TRIPOD Statement. BMC Medicine, 13(1), 1. +https://doi.org/10.1186/s12916-014-0241-z + + + Welcome to the Tidyverse + Wickham + Journal of Open Source +Software + 43 + 4 + 10.21105/joss.01686 + 2475-9066 + 2019 + Wickham, H., Averick, M., Bryan, J., +Chang, W., McGowan, L. D., François, R., Grolemund, G., Hayes, A., +Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. +M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … +Yutani, H. (2019). Welcome to the Tidyverse. Journal of Open Source +Software, 4(43), 1686. +https://doi.org/10.21105/joss.01686 + + + Risk of mortality in patients infected with +SARS-CoV-2 variant of concern 202012/1: Matched cohort +study + Challen + BMJ + 372 + 10.1136/bmj.n579 + 1756-1833 + 2021 + Challen, R., Brooks-Pollock, E., +Read, J. M., Dyson, L., Tsaneva-Atanasova, K., & Danon, L. (2021). +Risk of mortality in patients infected with SARS-CoV-2 variant of +concern 202012/1: Matched cohort study. BMJ, 372, n579. +https://doi.org/10.1136/bmj.n579 + + + Incidence of Community Acquired Lower +Respiratory Tract Disease in Bristol, UK During the COVID-19 +Pandemic + Hyams + 10.2139/ssrn.4087373 + 2022 + Hyams, C., Challen, R., Begier, E., +Southern, J., King, J., Morley, A., Szasz-Benczur, Z., Garcia Gonzalez, +M., Kinney, J., Campling, J., Gray, S., Oliver, J., Hubler, R., Valluri, +S. R., Vyse, A., McLaughlin, J. M., Ellsbury, G., Maskell, N., Gessner, +B., … Finn, A. (2022). 
Incidence of Community Acquired Lower Respiratory +Tract Disease in Bristol, UK During the COVID-19 Pandemic [SSRN +Scholarly Paper]. +https://doi.org/10.2139/ssrn.4087373 + + + An open graph visualization system and its +applications to software engineering + Gansner + Software - Practice and +Experience + 11 + 30 + 10.1002/1097-024X(200009)30:11<1203::AID-SPE338>3.0.CO;2-N + 2000 + Gansner, E. R., & North, S. C. +(2000). An open graph visualization system and its applications to +software engineering. Software - Practice and Experience, 30(11), +1203–1233. +https://doi.org/10.1002/1097-024X(200009)30:11<1203::AID-SPE338>3.0.CO;2-N + + + Capturing Data Provenance from Statistical +Software + Alter + International Journal of Digital +Curation + 1, 1 + 16 + 10.2218/ijdc.v16i1.763 + 1746-8256 + 2021 + Alter, G. C., Gager, J., Heus, P., +Hunter, C., Ionescu, S., Iverson, J., Jagadish, H. V., Lyle, J., +Mueller, A., Nordgaard, S., Risnes, O., Smith, D., & Song, J. +(2021). Capturing Data Provenance from Statistical Software. +International Journal of Digital Curation, 16(1, 1), 14–14. +https://doi.org/10.2218/ijdc.v16i1.763 + + + Dolt is Git for Data! + 2022 + Dolt is Git for Data! (2022). +[Computer software]. DoltHub. https://github.com/dolthub/dolt (Original +work published 2019) + + + Severity of Omicron (B.1.1.529) and Delta +(B.1.1.617.2) SARS-CoV-2 infection among hospitalised adults: A +prospective cohort study + Hyams + 10.1101/2022.06.29.22277044 + 2022 + Hyams, C., Challen, R., Marlow, R., +Nguyen, J., Begier, E., Southern, J., King, J., Morley, A., Kinney, J., +Clout, M., Oliver, J., Ellsbury, G., Maskell, N., Jodar, L., Gessner, +B., McLaughlin, J., Danon, L., Finn, A., & Group, T. A. C. R. +(2022). Severity of Omicron (B.1.1.529) and Delta (B.1.1.617.2) +SARS-CoV-2 infection among hospitalised adults: A prospective cohort +study (p. 2022.06.29.22277044). medRxiv. 
+https://doi.org/10.1101/2022.06.29.22277044 + + + The targets R package: A dynamic Make-like +function-oriented pipeline toolkit for reproducibility and +high-performance computing + Landau + Journal of Open Source +Software + 57 + 6 + 10.21105/joss.02959 + 2475-9066 + 2021 + Landau, W. M. (2021). The targets R +package: A dynamic Make-like function-oriented pipeline toolkit for +reproducibility and high-performance computing. Journal of Open Source +Software, 6(57), 2959. +https://doi.org/10.21105/joss.02959 + + + A Survey on Collecting, Managing, and +Analyzing Provenance from Scripts + Pimentel + ACM Computing Surveys + 3 + 52 + 10.1145/3311955 + 0360-0300 + 2019 + Pimentel, J. F., Freire, J., Murta, +L., & Braganholo, V. (2019). A Survey on Collecting, Managing, and +Analyzing Provenance from Scripts. ACM Computing Surveys, 52(3), +47:1–47:38. https://doi.org/10.1145/3311955 + + + Doltr: A client for the dolt +database + Ross + 2022 + Ross, N. (2022). Doltr: A client for +the dolt database [Manual]. + + + A package for survival analysis in +r + Therneau + 2022 + Therneau, T. M. (2022). A package for +survival analysis in r. +https://CRAN.R-project.org/package=survival + + + Modeling survival data: Extending the Cox +model + Terry M. Therneau + 0-387-98784-3 + 2000 + Terry M. Therneau, & Patricia M. +Grambsch. (2000). Modeling survival data: Extending the Cox model. +Springer. ISBN: 0-387-98784-3 + + + Using Introspection to Collect Provenance in +R + Lerner + Informatics + 1 + 5 + 10.3390/informatics5010012 + 2227-9709 + 2018 + Lerner, B., Boose, E., & Perez, +L. (2018). Using Introspection to Collect Provenance in R. Informatics, +5(1), 12. +https://doi.org/10.3390/informatics5010012 + + + + + + diff --git a/joss.04707/10.21105.joss.04707.jats b/joss.04707/10.21105.joss.04707.jats new file mode 100644 index 0000000000..a1876f1a6c --- /dev/null +++ b/joss.04707/10.21105.joss.04707.jats @@ -0,0 +1,559 @@ + + +
+ + + + +Journal of Open Source Software +JOSS + +2475-9066 + +Open Journals + + + +4707 +10.21105/joss.04707 + +dtrackr: An R package for tracking the provenance of +data + + + +0000-0002-5504-7768 + +Challen +Robert + + + + + + +Engineering Mathematics, University of Bristol, Bristol, +UK + + + + +College of Engineering, Mathematics and Physical Sciences, +University of Exeter, Devon, UK + + + + +4 +10 +2022 + +7 +80 +4707 + +Authors of papers retain copyright and release the +work under a Creative Commons Attribution 4.0 International License (CC +BY 4.0) +2022 +The article authors + +Authors of papers retain copyright and release the work under +a Creative Commons Attribution 4.0 International License (CC BY +4.0) + + + +R +data pipeline +consort diagram +strobe statement +data quality +reproducible research + + + + + + Summary +

An accurate statement of the provenance of data is essential in + bio-medical research. Powerful data manipulation tools available in + the tidyverse R package ecosystem + (Wickham + et al., 2019) provide the infrastructure to assemble, clean and + filter data prior to statistical analysis. Manual documentation of the + steps taken in the data pipeline and the provenance of data is a + cumbersome and error-prone task which may restrict reproducibility. + dtrackr is a wrapper around a subset of the + standard tidyverse data manipulation tools that + allows automatic tracking of the processing steps applied to a data + set, prior to statistical analysis. It allows early detection and + reporting of data quality problems, and automatically documents a + pipeline of data transformations as a flowchart in a format suitable + for scientific publication, including, but not limited to, CONSORT + diagrams + (Schulz + et al., 2010).

+

dtrackr is first and foremost a utility to + accelerate and improve research by facilitating documentation, + supporting extraction of knowledge from data sets, and the execution + of research by helping identify data quality issues. The general + capability, however, fits into a broader context of other provenance or + data pipeline research. This includes initiatives such as + C2Metadata + (Alter + et al., 2021), which focus on a language-independent + representation of a data pipeline, and R packages such as + targets + (Landau, + 2021) which focus on documenting pipeline code, and managing + the execution of a pipeline, or RDataTracker + which focusses on tracking the execution of an arbitrary R script + (Lerner + et al., 2018). dtrackr takes a more + data-oriented approach, which could be complementary, in which we remain + agnostic to the detail of a data pipeline script or nature of its + execution, but capture a subset of the transformations applied to data + alongside the data itself, thereby documenting the data state as it is + being manipulated. This is achieved by overriding the execution of + dplyr pipeline functions and results in a + retrospective record of provenance + (Pimentel + et al., 2019). dtrackr also has the + ability to insert secondary analysis as annotations into the pipeline, + and allows control over what information is collected, ultimately with + a view to producing simple human-readable output. The approach of + dtrackr is analogous to a + git commit history for dataframes, and there is + potential synergy with emerging versioned databases such as + dolt + (Dolt + Is Git for Data!, 2019/2022; + Ross, + 2022).

+
+ + Statement of need +

The collection of experimental or observational data for research + is often an iterative endeavour, involving curation of complex data + sets designed for multiple goals. Systematic data quality checking for + such sets is a major challenge, particularly when they are assembled + to identify emerging or rapidly evolving issues. Feedback from early + data analysis can identify specific data quality issues, resolution of + which can considerably improve data for the task at hand. However, this + requires a clear understanding of why and when individual data items + are excluded, which is potentially tedious and may be seen as lower + priority compared to statistical analysis.

+

Data analysis using tidyverse in R is a + rapid means of transforming raw data into a format suitable for + statistical analysis. The transformations involved can, however, affect + the results of statistical analysis, and meticulous care must be taken + to ensure that any assumptions made during data processing are well + documented. It is often too easy to inadvertently exclude data when + filtering on missing items, or joining linked data sets with + incomplete foreign key relationships.

+

In complex data analysis, the use of interactive programming + environments such as Read-Eval-Print Loops (REPL) in R markdown + documents, interim caching of results, or conditional branching data + pipelines, can result in the current state of a processed data set + becoming decoupled from the code that is designed to generate + it.

+

To surface these issues, bio-medical journal articles are usually + required to report data manipulation to an agreed standard. For + example, CONSORT diagrams are part of the requirements in reporting + parallel group clinical trials. They are described in the updated 2010 + CONSORT statement + (Schulz + et al., 2010), and clarify how patients were recruited, + selected, randomized and followed up. For observational studies, such + as case-control designs, an equivalent requirement is the STROBE + statement + (von + Elm et al., 2008). There are many other similar requirements + for other types of study, such as the TRIPOD statement for + multivariable prediction models + (Collins + et al., 2015). Maintaining such a CONSORT diagram over the course + of a study, when data sets are being actively collected and data + quality issues are being addressed, is time-consuming.

+

dtrackr addresses these issues by + instrumenting a commonly used subset of standard + tidyverse data manipulation pipeline functions + from dplyr and tidyr. It + can automatically record the steps taken, records excluded and a + summary of the result of each data processing step, as part of the + data set itself in a “history graph”. In this way data sets retain an + accurate history of their own provenance regardless of the actual + route taken to assemble them. This history includes a complete record + of any data quality issues that led to excluded records. The history + is a directed graph which can be expressed in the commonly used + GraphViz language + (Gansner + & North, 2000) and may be visualised as a flowchart such as + in Figure 1; this + uses the Chronic Granulomatous Disease dataset from the + survival package + (Terry + M. Therneau & Patricia M. Grambsch, 2000; + Therneau, + 2022) as an example of a parallel group study and produces a + STROBE-like flowchart.

+ +

An example flowchart derived directly from a simple + analysis of the Chronic Granulomatous Disease dataset demonstrating + use of dtrackr to generate the key parts of a + STROBE or CONSORT diagram. +

+ +
+

dtrackr was originally conceptualized during + an analysis I undertook of the severity of the Alpha variant of + SARS-CoV-2 + (Challen + et al., 2021), and has since been used for other + epidemiological studies including an analysis of the incidence of + hospitalization for acute lower respiratory tract disease in Bristol + (Hyams, + Challen, Begier, et al., 2022), and a comparative analysis of + the severity of the SARS-CoV-2 Omicron variant versus the Delta + variant, against a range of hospital outcomes + (Hyams, + Challen, Marlow, et al., 2022).

+

Although the specific example presented here is in the bio-medical + domain, tracking the provenance of data is a much broader issue, and + we anticipate there are many other applications for + dtrackr.

+
+ + Acknowledgements +

Thanks for contributions from TJ McKinley. I gratefully acknowledge + the financial support of the EPSRC via grants EP/N014391/1 and + EP/T017856/1, the MRC (MC/PC/19067), and the Somerset NHS + Foundation Trust Global Digital Exemplar programme.

+
+ + + + + + + SchulzKenneth F. + AltmanDouglas G. + MoherDavid + + CONSORT 2010 Statement: Updated guidelines for reporting parallel group randomised trials + BMJ + British Medical Journal Publishing Group + 20100324 + 340 + 0959-8138 + 10.1136/bmj.c332 + c332 + + + + + + + von ElmErik + AltmanDouglas G + EggerMatthias + PocockStuart J + GøtzschePeter C + VandenbrouckeJan P + STROBE Initiative + + The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: Guidelines for reporting observational studies + Journal of Clinical Epidemiology + 200804 + 61 + 4 + 344 + 349 + + + + + + CollinsGary S. + ReitsmaJohannes B. + AltmanDouglas G. + MoonsKarel GM + + Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): The TRIPOD Statement + BMC Medicine + 20150106 + 13 + 1 + 1741-7015 + 10.1186/s12916-014-0241-z + 1 + + + + + + + WickhamHadley + AverickMara + BryanJennifer + ChangWinston + McGowanLucy D’Agostino + FrançoisRomain + GrolemundGarrett + HayesAlex + HenryLionel + HesterJim + KuhnMax + PedersenThomas Lin + MillerEvan + BacheStephan Milton + MüllerKirill + OomsJeroen + RobinsonDavid + SeidelDana Paige + SpinuVitalie + TakahashiKohske + VaughanDavis + WilkeClaus + WooKara + YutaniHiroaki + + Welcome to the Tidyverse + Journal of Open Source Software + 20191121 + 4 + 43 + 2475-9066 + 10.21105/joss.01686 + 1686 + + + + + + + ChallenRobert + Brooks-PollockEllen + ReadJonathan M. + DysonLouise + Tsaneva-AtanasovaKrasimira + DanonLeon + + Risk of mortality in patients infected with SARS-CoV-2 variant of concern 202012/1: Matched cohort study + BMJ + British Medical Journal Publishing Group + 20210310 + 372 + 1756-1833 + 10.1136/bmj.n579 + n579 + + + + + + + HyamsCatherine + ChallenRobert + BegierElizabeth + SouthernJo + KingJade + MorleyAnna + Szasz-BenczurZsuzsa + Garcia GonzalezMaria + KinneyJane + CamplingJames + GraySharon + OliverJennifer + HublerRobin + ValluriSrinivas R. 
+ VyseAndrew + McLaughlinJohn M. + EllsburyGillian + MaskellNick + GessnerBradford + DanonLeon + FinnAdam + + Incidence of Community Acquired Lower Respiratory Tract Disease in Bristol, UK During the COVID-19 Pandemic + Rochester, NY + 20220419 + 10.2139/ssrn.4087373 + + + + + + GansnerEmden R. + NorthStephen C. + + An open graph visualization system and its applications to software engineering + Software - Practice and Experience + 2000 + 30 + 11 + 10.1002/1097-024X(200009)30:11<1203::AID-SPE338>3.0.CO;2-N + 1203 + 1233 + + + + + + AlterGeorge Charles + GagerJack + HeusPascal + HunterCarson + IonescuSanda + IversonJeremy + JagadishH. V. + LyleJared + MuellerAlexander + NordgaardSigve + RisnesOrnulf + SmithDan + SongJie + + Capturing Data Provenance from Statistical Software + International Journal of Digital Curation + 2021 + 16 + 1, 1 + 1746-8256 + 10.2218/ijdc.v16i1.763 + 14 + 14 + + + + + Dolt is Git for Data! + DoltHub + 20221004 + 20221004 + https://github.com/dolthub/dolt + + + + + + HyamsCatherine + ChallenRobert + MarlowRobin + NguyenJennifer + BegierElizabeth + SouthernJo + KingJade + MorleyAnna + KinneyJane + CloutMadeleine + OliverJennifer + EllsburyGillian + MaskellNick + JodarLuis + GessnerBradford + McLaughlinJohn + DanonLeon + FinnAdam + GroupThe Avon CAP Research + + Severity of Omicron (B.1.1.529) and Delta (B.1.1.617.2) SARS-CoV-2 infection among hospitalised adults: A prospective cohort study + medRxiv + 20220630 + 10.1101/2022.06.29.22277044 + 2022.06.29.22277044 + + + + + + + LandauWilliam Michael + + The targets R package: A dynamic Make-like function-oriented pipeline toolkit for reproducibility and high-performance computing + Journal of Open Source Software + 20210115 + 6 + 57 + 2475-9066 + 10.21105/joss.02959 + 2959 + + + + + + + PimentelJoão Felipe + FreireJuliana + MurtaLeonardo + BraganholoVanessa + + A Survey on Collecting, Managing, and Analyzing Provenance from Scripts + ACM Computing Surveys + 20190618 + 52 + 3 + 0360-0300 + 
10.1145/3311955 + 47:1 + 47:38 + + + + + + RossNoam + + Doltr: A client for the dolt database + 2022 + + + + + + TherneauTerry M + + A package for survival analysis in r + 2022 + https://CRAN.R-project.org/package=survival + + + + + + Terry M. Therneau + Patricia M. Grambsch + + Modeling survival data: Extending the Cox model + Springer + New York + 2000 + 0-387-98784-3 + + + + + + LernerBarbara + BooseEmery + PerezLuis + + Using Introspection to Collect Provenance in R + Informatics + 201803 + 20221103 + 5 + 1 + 2227-9709 + https://www.mdpi.com/2227-9709/5/1/12 + 10.3390/informatics5010012 + 12 + + + + + +
diff --git a/joss.04707/10.21105.joss.04707.pdf b/joss.04707/10.21105.joss.04707.pdf new file mode 100644 index 0000000000..79a6ccbad7 Binary files /dev/null and b/joss.04707/10.21105.joss.04707.pdf differ diff --git a/joss.04707/media/figure1-consort.pdf b/joss.04707/media/figure1-consort.pdf new file mode 100644 index 0000000000..28ac7c6f57 Binary files /dev/null and b/joss.04707/media/figure1-consort.pdf differ