JOSS review - Software paper #15

craig-willis · 2022-09-22T16:47:22Z

Review issue: openjournals/joss-reviews#4707

Below is my feedback on the "Software paper" section of the JOSS review checklist:

Summary
I question whether dtrackr can be described as a wrapper around the tidyverse collection of packages and think it may be better described as a wrapper around dplyr. There may be other opportunities to support data provenance in tidyverse that cannot be addressed by instrumenting dplyr functions (e.g., reading files).

State of the field
dtrackr is positioned as a provenance tool yet the paper includes no references to existing work in the area of computational provenance, transparency, and reproducibility. While I believe dtrackr functionality to be unique, the tool is not positioned relative to other packages in the R community or beyond. I'd suggest looking at the following articles and considering how dtrackr relates to other efforts.

Pimentel et al. 2019. A Survey on Collecting, Managing, and Analyzing Provenance from Scripts. ACM Comput. Surv. 52, 3, Article 47 (May 2020), 38 pages. https://doi.org/10.1145/3311955
Alter et al. 2021. Capturing Data Provenance from Statistical Software. International Journal of Digital Curation. https://doi.org/10.2218/ijdc.v16i1.763

The absence of comparisons to other provenance tools to some degree raises a question about the scholarly aspects of this work (i.e., is this just a utility, or is it intended to contribute to broader work in computational provenance, transparency, and reproducibility?). I can imagine how a widely used provenance-aware dplyr might fit into this bigger picture.


robchallen · 2022-10-04T23:36:28Z

Updated branch: joss-fixes-0.2.4.9000 (and main)

I've changed the wording to be more explicit that dtrackr is only wrapping a subset of tidyverse functions (in dplyr and tidyr). N.b. see also comments in issue #14 about the fact that it should be possible to interchange between dplyr, dtrackr and tidyr and other tidyverse functions within a pipeline.

To address your query about the state of the field, I've added a new paragraph to the summary section describing how I see dtrackr fitting into the existing landscape. I see dtrackr primarily as a useful tool for researchers, with a somewhat novel take on provenance rather than a completely new concept, and as a useful tool for research I think it does tick the box. I think future iterations could contribute more to the broader field, particularly if it can be combined with a versioned database as I describe. It was only possible to include a very high-level summary here, as I obviously can't cover the whole field in the relatively short format of the JOSS article.

craig-willis · 2022-10-10T14:30:29Z

Branch: joss-fixes-0.2.4.9000

The emphasis on dtrackr as a utility helps me better understand its positioning. This, together with simply signaling awareness of other provenance-related work, addresses my main concern.

A few comments:

I noticed a possible typo in the first sentence ("dtrackr if first and foremost" -> should be "dtrackr is..." ?)
Pimentel et al. is not related to C2Metadata. As a broad survey, I guess it's an example of "other provenance research". My main point in sharing the reference was to highlight that there's related work on computational provenance tools -- and even some existing tools that work with R (e.g., RDataTracker, YesWorkflow, recordr) -- that could be used to help frame dtrackr. In section 2.1, the authors discuss a few classifications: prospective (program/experiment structure) vs. retrospective (what actually happened) provenance. In 2.1.4 they discuss execution provenance approaches including passive monitoring, overriding, and instrumentation. In this light, I'd classify dtrackr as a retrospective provenance tool that relies on overriding -- although this probably isn't important for the JOSS paper. I do think the Pimentel reference is useful if only to signal awareness of related work in computational provenance research, but not in the context of C2Metadata.
C2Metadata is developing a language-independent representation of data transformations via SDTL, but relies on static parsing of scripts (prospective provenance), including R. It doesn't document a "pipeline" in the same sense as dtrackr. In hindsight, since it's a small/nascent project, it may not be something to highlight in the JOSS paper unless it's of interest to you.

A little wordy, but let me know if you have any questions.
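
To make the "retrospective provenance via overriding" classification above concrete, here is a minimal Python sketch. It is an illustration only and not how dtrackr is implemented (dtrackr is an R package); the names `TrackedRows` and `tracked_filter` are hypothetical. The idea is that a drop-in replacement for an ordinary filter step performs the same work but also records what actually happened to the data.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class TrackedRows:
    """Hypothetical container: the data plus a log of what was done to it."""
    rows: list
    history: list = field(default_factory=list)


def tracked_filter(data: TrackedRows,
                   predicate: Callable[[dict], bool],
                   label: str) -> TrackedRows:
    """Overriding-style wrapper: filters like a plain filter would,
    but also appends a retrospective record of the row counts."""
    kept = [row for row in data.rows if predicate(row)]
    step = f"{label}: {len(data.rows)} -> {len(kept)} rows"
    # Return a new object so earlier pipeline stages keep their own history.
    return TrackedRows(kept, data.history + [step])


patients = TrackedRows([{"age": 34}, {"age": 17}, {"age": 52}])
adults = tracked_filter(patients, lambda r: r["age"] >= 18, "exclude under 18")
print(adults.history)  # ['exclude under 18: 3 -> 2 rows']
```

Because the record is produced at run time from the actual data, this is retrospective rather than prospective provenance: a static parse of the script could tell you that a filter exists, but not how many rows it removed.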

robchallen · 2022-10-10T14:59:40Z

Thanks again. I'm waiting for the second reviewer's comments. Once I get them I'll take another stab at this in light of your additional suggestions. I'm trying not to go too far down the rabbit hole, as I suspect it's pretty deep.

craig-willis · 2022-10-10T15:28:11Z

Makes perfect sense. I don't think you need to go down the hole, but maybe just note that it's there.

robchallen · 2022-11-04T22:40:51Z

I have updated the paper (new version of paper at https://github.com/terminological/dtrackr/actions/runs/3397242636). I got the essence of your comments in there I think without going too far into the details. In this iteration dtrackr is primarily a pragmatic tool, and mainly focussed on the goal of producing a flowchart, but there are lots of ways it could evolve in the future.

This discussion, the Pimentel paper, and an emerging need to explain how a particular column in a dataset I'm working on has been derived have prompted me to do some prototyping of a column-level tracking feature, which I'll look to bring into a new release.

robchallen · 2022-11-04T23:47:05Z

N.b. please close this issue if you are happy with the updated paper.

craig-willis · 2022-11-05T19:21:51Z

The changes look good to me.

craig-willis closed this as completed Nov 5, 2022
