JOSS review - Software paper #15

Closed
craig-willis opened this issue Sep 22, 2022 · 7 comments

Comments

@craig-willis

craig-willis commented Sep 22, 2022

Review issue: openjournals/joss-reviews#4707

Below is my feedback on the "Software paper" section of the JOSS review checklist:

Summary
I question whether dtrackr can be described as a wrapper around the tidyverse collection of packages and think it may be better described as a wrapper around dplyr. There may be other opportunities to support data provenance in the tidyverse that cannot be realized by instrumenting dplyr functions (e.g., reading files).

State of the field
dtrackr is positioned as a provenance tool yet the paper includes no references to existing work in the area of computational provenance, transparency, and reproducibility. While I believe dtrackr functionality to be unique, the tool is not positioned relative to other packages in the R community or beyond. I'd suggest looking at the following articles and considering how dtrackr relates to other efforts.

The absence of comparisons to other provenance tools raises, to some degree, a question about the scholarly aspects of this work (i.e., is this just a utility, or is it intended to contribute to broader work in computational provenance, transparency, and reproducibility?). I can imagine how a widely used provenance-aware dplyr might fit into this bigger picture.

@robchallen
Contributor

Updated branch: joss-fixes-0.2.4.9000 (and main)

I've changed the wording to be more explicit that dtrackr is only wrapping a subset of tidyverse functions (in dplyr and tidyr). N.b. see also comments in issue #14 about the fact that it should be possible to interchange between dplyr, dtrackr and tidyr and other tidyverse functions within a pipeline.

To address your query about the state of the field, I've added a new paragraph to the summary section describing how I see dtrackr fitting into the existing landscape. I see dtrackr primarily as a useful tool for researchers, with a somewhat novel take on provenance, rather than a completely new concept; as a useful tool for research, though, I think it does tick the box. I think future iterations could contribute more to the broader field, particularly if it can be combined with a versioned database as I describe. It was only possible to include a very high-level summary here, as I obviously can't cover the whole field in the relatively short format of the JOSS article.

@craig-willis
Author

craig-willis commented Oct 10, 2022

Branch: joss-fixes-0.2.4.9000

The emphasis on dtrackr as a utility helps me better understand its positioning. This, together with simply signaling awareness of other provenance-related work, addresses my main concern.

A few comments:

  • I noticed a possible typo in the first sentence ("dtrackr if first and foremost" -> should be "dtrackr is..." ?)
  • Pimentel et al. is not related to C2Metadata. As a broad survey, I guess it's an example of "other provenance research". My main point in sharing the reference was to highlight that there's related work on computational provenance tools -- and even some existing tools that work with R (e.g., RDataTracker, YesWorkflow, recordr) -- that could be used to help frame dtrackr. In section 2.1, the authors discuss a few classifications: prospective (program/experiment structure) vs. retrospective (what actually happened) provenance. In 2.1.4 they discuss execution provenance approaches including passive monitoring, overriding, and instrumentation. In this light, I'd classify dtrackr as a retrospective provenance tool that relies on overriding, although this probably isn't important for the JOSS paper. I do think the Pimentel reference is useful if only to signal awareness of related work in computational provenance research, but not in the context of C2Metadata.
  • C2Metadata is developing a language-independent representation of data transformations via SDTL, but relies on static parsing of scripts (prospective provenance), including R. It doesn't document a "pipeline" in the same sense as dtrackr. In hindsight, since it's a small/nascent project, it may not be something to highlight in the JOSS paper unless it's of interest to you.
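
To make the "retrospective provenance via overriding" classification above a bit more concrete, here is a minimal sketch. It is written in Python rather than R and is not how dtrackr is implemented; every name in it (`TrackedTable`, `tracked`, `filter_rows`, `mutate`) is hypothetical and chosen only for illustration. The idea is that each data operation is replaced by a wrapped version that does the same work but also records what actually happened at run time.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class TrackedTable:
    """Rows of data plus a log of the steps that produced them (hypothetical)."""
    rows: List[Dict]
    history: List[str] = field(default_factory=list)


def tracked(name: str, op: Callable[..., List[Dict]]) -> Callable[..., TrackedTable]:
    """Override `op` with a version that also records what it did at run time."""
    def wrapper(table: TrackedTable, *args) -> TrackedTable:
        result = op(table.rows, *args)
        # Retrospective provenance: log observed row counts, not the script text.
        note = f"{name}: {len(table.rows)} -> {len(result)} rows"
        return TrackedTable(result, table.history + [note])
    return wrapper


def _filter(rows, predicate):
    # Plain, untracked operation: keep rows matching the predicate.
    return [r for r in rows if predicate(r)]


def _mutate(rows, column, fn):
    # Plain, untracked operation: add or replace one column.
    return [{**r, column: fn(r)} for r in rows]


# The "overriding" step: callers use these in place of the plain functions.
filter_rows = tracked("filter", _filter)
mutate = tracked("mutate", _mutate)

data = TrackedTable([{"age": 15}, {"age": 42}, {"age": 67}])
adults = filter_rows(data, lambda r: r["age"] >= 18)
flagged = mutate(adults, "senior", lambda r: r["age"] >= 65)

for step in flagged.history:
    print(step)
```

By contrast, a prospective approach such as static parsing would derive the steps from the script itself without running it, so it could not report the row counts seen during execution.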

A little wordy, but let me know if you have any questions.

@robchallen
Contributor

robchallen commented Oct 10, 2022 via email

@craig-willis
Author

Makes perfect sense. I don't think you need to go down the rabbit hole, but maybe just note that it's there.

@robchallen
Contributor

I have updated the paper (new version of paper at https://github.com/terminological/dtrackr/actions/runs/3397242636). I got the essence of your comments in there I think without going too far into the details. In this iteration dtrackr is primarily a pragmatic tool, and mainly focussed on the goal of producing a flowchart, but there are lots of ways it could evolve in the future.

This discussion, the Pimentel paper, and an emerging need to explain how a particular column in a dataset I'm working on has been derived have prompted me to prototype a column-level tracking feature, which I'll look to bring into a new release.

@robchallen
Contributor

N.b. please close this issue if you are happy with the updated paper.

@craig-willis
Author

The changes look good to me.
