git2rdata, a companion package for git2r #263
Comments
thanks for your submission @ThierryO - editors are discussing now |
Editor checks:
**Editor comments**

Thanks for your submission @ThierryO! This is an interesting package and it passes checks with nothing but this from
I'm seeking reviewers for it now. FYI, because of the thorny nature of the problem you're tackling - data versioning workflows - I anticipate a lot of discussion around high-level concepts and architecture. I'm going to look for reviewers with a lot of experience on these workflow issues, and encourage them to start that discussion before checking off all the low-level boxes so as to have an efficient and productive review. |
Reviewers assigned: @jennybc and @brodieG. At reviewers' request, the final review date is extended to deal with travel and holidays. However, I request that reviewers provide some comment on high-level workflow/architecture issues that I think are likely to be the main points of discussion sooner, before a more detailed look at the code. Roughly:
|
Thanks for assigning the reviewers. I'm looking forward to their comments. There were some minor changes to the package. The latest version is available at ropensci/git2rdata#5 |
Hi Thierry. This looks like an interesting package to review; looking forward to it. One small request: do you mind creating tags/releases that we can work off of so there is no issue of the code changing under our feet as we are reviewing it? Ideally this would also be accompanied by a unique package version in DESCRIPTION. This will ensure we're all exactly on the same page about what version of the code is being reviewed. Thank you. |
Good point. There is currently a single release tag v0.0.1.9000 which is the current HEAD of the master branch. I suggest that you use this for the review. I won't push to the master branch during the review process. The current changes in the develop branch are mainly cosmetic, except one hotfix. |
Sounds good, I will do all of my work off of v0.0.1.9000. |
**Overview**

I haven't reviewed the internals of the package very closely yet, but overall I like what I see. At the request of the editor I'm going to focus on the top level questions first.

**Editor Questions**

**Q1**
**Vignette**

The vignette provides end to end "toy" examples for both the versioned and unversioned workflows. There are a few areas of a versioned workflow that would benefit from additional coverage:
**Documentation**

**Q2**
I do not have extensive experience with "reproducible workflows" beyond package development. It strikes me that the package is more intertwined with a git workflow than it needs to be. For example, we could have:
instead of:
To me the semantics are much clearer in the first example, or at worst, just as clear.
It might even be reasonable to completely remove the git-facing side of the interface.

One possible example is that we have to resolve the question of how to link the data to the code that generated it.

In terms of integrating with other vcs, the expectation seems to be implicitly that git is the backend.

Aside: is there a particular value of the file hashes that I'm missing?

**Q3**
Yes, although there are some exceptions and possible improvements:

**rm_data**

The semantics of `rm_data` differ depending on how it is called. We would recommend that you split up the behavior across functions, possibly like so:
You could document the difference in behavior in the function itself as a note.

Another issue to consider is whether you should allow the removal of unversioned files.

**Versioned vs. Unversioned Modes**

As mentioned in Q2, I think it is detrimental to try to intertwine these more than necessary.

**Possible Improvements**

I would encourage you to create an intermediary class, e.g.:
Which would then simplify the workflow in the vignette:
Is now equivalent to:
**Additional Comments**

**More Workflow Comments**

I think it would be helpful to have more aggressive defaults to facilitate the common case, e.g.:
Then things like the following just work:
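The code for this example did not survive the page extraction; a base-R sketch of the idea (the `write_df` helper, its default root, and the name-from-object trick are assumptions for illustration, not the actual git2rdata API):

```r
# derive the file name from the symbol passed in; default the root to tempdir()
write_df <- function(x, name = deparse(substitute(x)), root = tempdir()) {
  stopifnot(is.data.frame(x))
  path <- file.path(root, paste0(name, ".tsv"))
  write.table(x, path, sep = "\t", row.names = FALSE)
  path
}

basename(write_df(iris))  # "iris.tsv"
```

With such defaults, `write_df(iris)` needs no further arguments for the common case.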
This would produce 'iris.tsv' and 'iris.yml' at the default location. Obviously you will need something more sophisticated than what I use above (e.g. to handle name collisions).

**Implications of Versioning Data in Git**

@jennybc probably has a good view on this already, so I'll defer to her input here, but: Git is not really meant to store data frames, so I think it would be really useful to document the implications of doing so.

You mention "medium sized data". I suspect different people will have different ideas of what that means. It may be that there is actually no problem with this at all, but it is unusual usage of git.

**Rationale**

I think the vignette would benefit from a more thorough analysis of the rationale for versioning data.
Based on my experience this is not a great reason to use version control. I feel that version control is most useful for data that is not supposed to change, at least not without review. I see two obvious applications:
Clearly you have these things in mind, but it might be worth being more explicit about them.

**Research Of Comparable Solutions**

I found:

More generally, it might be worth doing some level of literature review to see what comparable solutions exist. Thomas's paper covers some of this, and referencing it would be worthwhile.

Another idea raised in the same paper that I found interesting is to store data in a diff-friendly layout.

Also interesting:

I have not used any of these solutions, so I cannot vouch for their efficacy or suitability.

**Storage format**
**Dependencies**

Do you really need `readr`? |
Thank you @brodieG for a pre-review as detailed as many of our full reviews! All of this is in-scope for the review. I think the discussion of other frameworks is quite helpful. One does not have to adopt a totally different model as a result, but I think using that context to guide the API and documentation is valuable. I find it increasingly common and helpful to put a "Related Work" section in README/vignettes to compare a package's approach to other options. @ThierryO, I'd suggest waiting for @jennybc's input before making decisions and starting work on any major changes, but do respond to @brodieG's comments as you see fit. |
Thank you @brodieG for the elaborate review. Here are my ideas on the topics you raised.

**Editor's questions**

**Q1**

**Vignette**

Adding an example where the data is actually used is a good suggestion. I have presented our workflow as a poster at the International Statistical Ecology Conference ISEC 2018. Please note that this poster predates the conception of `git2rdata`.

Tracking analysis code is outside the scope of `git2rdata`. If we assume that the code to create and store the data.frames is all stored in a package, the package version documents the code that was used.

**Documentation**

If required, I can add some toy examples to the help files.

**Q2**

The initial idea of the package was to read and write data.frames into a git repository. Hence the strong link with Git and `git2r`. Currently:

```r
git2r::add(path = names(write_vc(df, 'df')))
svnr::commit(names(write_vc(df, 'df')))
```

The current version has only methods depending on the `git2r` classes.

**Q3**

**rm_data**

**Possible improvement**

I'm not sure if there is much benefit of creating a new class. We often just want to write the object to a file / vcs and don't bother about any intermediate formats.

**Additional comments**

**More Workflow Comments**

There already is a PR which implements this. Note, however, that deriving the file name from the object breaks inside a pipe:
```r
library(dplyr)
test <- function(x, file = deparse(substitute(x))) {
  message(file)
}
test(iris)  # prints "iris"
iris %>%
  test()    # prints "." - the object name is lost inside the pipe
```

**Implications of Versioning Data in Git**

A separate vignette with guidelines and caveats on using `git2rdata` with git would be useful.

**Rationale**

In my line of work, I don't have any control over the raw data source, and hence cannot guarantee that I can reconstruct the raw data. I've been repeating the same analyses every year on several data sources. At least two of them had a major overhaul of the database structure during the past five years. And there is no version control within the database itself.

In the past we published results which were questioned by one of our stakeholders. They disagreed with the results and feared that the results would lead to a detrimental change in policy. The simple fact that we were able to state that both the data and the code were placed under version control instantly increased our trustworthiness. After a walk-through of the data and the code, they now agree on the methodology and hence the results.

So the focus is not on "the analysis of what changed in the data", but rather on "what data was used in this version of the analysis".

**Research Of Comparable Solutions**

I'll have a look at these suggestions over the next weeks.

**Storage format**

Storing each object in a dedicated folder is an interesting suggestion. Another benefit of this is that we could allow the user to specify "file" extensions.

**Dependencies**

An earlier version of the package had fewer dependencies. Personally, I like the `readr` interface. However, I see several gains in using `readr`:
IMHO, these gains outweigh the extra dependencies. |
@jennybc A reminder that your review was due last week, please get it in as soon as you can. |
While waiting for the second review (which I think might take until after the holidays), some general comments:

**Examples**
There's a lot of flexibility in this, but generally, these should be enough to demonstrate minimal usage/syntax. I usually make sure to have one example with only required arguments and examples that together include all optional arguments, and that print/show examples of the output value. DependenciesWe outline some low-dependency options for tibbles in the development guide, but if the author can justify the trade-offs for dependencies we defer to their choice. |
@noamross: Our main arguments in favour of `readr` are:
Removing the `readr` dependency would require us to reimplement several of its features. We'd like to get feedback from @noamross, @jennybc and @brodieG on this topic. |
@ThierryO, I haven't responded as I wanted to give @jennybc a chance to comment, but since you're asking explicitly, some responses to the narrow questions regarding readr. Here is some simple code that deals with, I think, most of the pathological inputs you're describing:
You will need to generate a string that doesn't exist in the data to use as an NA string. A simple low-cost approach is to find the length of the longest string with `nchar` and build a longer one. This also quotes an entire column if any element contains either a tab or a quote. However, both these cases (and cases where there are columns that have strings that are INT_MAX long, which probably would be a terrible use case for this anyway) are likely to be rare. Undoubtedly `readr` handles more corner cases. If you really still feel that there is substantial added value in the `readr` dependency, that is your call. Re: returning tibbles, I don't see that as a positive. If someone likes tibbles and has the package installed, it is trivial for them to turn data frames into tibbles. Likewise if someone prefers data.tables. I think it makes more sense for a package to be data structure neutral unless the data structure is integral to the package itself. Again, not something that would prevent me from approving the package, but chaining the package to a specific non-base data structure when it isn't necessary seems sub-optimal in the general use case. If you do decide to leave readr as a dependency, then one thing you can do is record whether the input was a tibble or not, and then if not, return a normal data frame to the user. EDIT: Forgot to mention this: |
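A base-R sketch of the NA-string trick described above (the column name and round-trip check are illustrative, not git2rdata code):

```r
df <- data.frame(
  s = c("plain", NA, "has\ttab and \"quote\""),
  stringsAsFactors = FALSE
)
# an NA marker guaranteed absent: one character longer than the longest string
na_str <- strrep("N", max(nchar(df$s), 0, na.rm = TRUE) + 1)
tf <- tempfile(fileext = ".tsv")
# qmethod = "double" doubles embedded quotes, which read.table undoes
write.table(df, tf, sep = "\t", na = na_str, qmethod = "double",
            row.names = FALSE)
back <- read.table(tf, sep = "\t", na.strings = na_str, header = TRUE,
                   stringsAsFactors = FALSE)
identical(back$s, df$s)  # TRUE: quotes, tabs and NA survive the round trip
```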
New reviewer added: @jtr13, review due 2018-02-05 |
**Package Review**

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide.
**Documentation**

The package includes all the following forms of documentation:
For packages co-submitting to JOSS
The package contains a
Functionality
Final approval (post-review)
**Review Comments**

**Note**

In the interest of not delaying the review, I was not able to fully test all functionality claims.

**Documentation comments**

*Statement of need*

The first two paragraphs of the vignette provide a much more detailed explanation of the problem that the software is designed to solve than the README. I know there's been a lot of discussion about what should go where and the importance of not having to maintain duplicate text. That said, I am of the opinion that important information like this should be in the README since my sense is that vignettes are often overlooked.

*Examples*

I see that examples were added in the latest version.

*Community Guidelines*

There is no CONTRIBUTING file nor contributing guidelines in the README.

**Functionality comments**

I was not able to perform a thorough review to confirm functional and performance claims. All code in the vignette worked as expected as did additional toy examples.

**General comments**

The package claims to address three needs:
Taking these in reverse order: storing meta data in a separate file addresses a real problem, since attributes such as factor levels are otherwise lost when writing to plain text.

The benefits of optimization are not as clearly apparent. It would be helpful to be provided with more context, that is, answers to questions such as: in typical use cases, how much smaller are the text files created by `write_vc`?

Finally, with regard to version control, the contribution of the package is indirect. It should be noted though that the files could be versioned without it. Given this scenario, the package name, `git2rdata`, may overstate the git connection.

Some minor points: it's not clear what the benefit is of the file hashes. If the sorting parameter isn't essential it would be helpful to have a way to not use it.
Thank you for your feedback, @jtr13.

**Documentation comments**

I'll update the README (see ropensci/git2rdata#12). A CONTRIBUTING.md file is present in the repository. I'll add some comments to the examples in the help files.

**General comments**

I'll probably separate (and elaborate) the current vignette into several vignettes.
Making the sorting optional is OK when not storing the files under version control. However, then it is up to the user to apply the sorting when writing to a file path and doing the version control outside `git2rdata`.

**Import, suggest or remove**
|
Thanks @brodieG and @jtr13 for your reviews, and @ThierryO for your thoughtful comments. I think this discussion has helped clarify the workflows that the package does and does not try to address, and the extended documentation you discuss would go a long way in helping users understand this. A couple of observations: dependency weight and the return of tibbles seems to be one of the most contentious issues among our author and reviewer communities. We mostly defer to authors on this. In general, we find dependency weight to be a bigger issue for developers than users. We also find it helpful to give users choices as well as sensible dependencies. A package-level option such as The tightness of the recommended workflows with version control, and utility of Finally, while we won't require a package name change, I agree with both @ThierryO that having |
@ThierryO I see that you have made several changes. Do you have a new version we should work off of now (if you do, a tag of that version would be great)? Do you have other changes in the works we should wait for you to complete before we move to the detailed review? @noamross what's the time frame for the detailed review? One thing I'm wondering is whether it makes sense to wait for the vignettes outlined in @ThierryO's most recent response. |
I'm working on the package in my spare time. I'll notify when a new version is available for review. |
@noamross : what is the policy on suggested packages? I'm thinking about using ggplot2 for plots in the vignettes. |
Sorry for the late reply on this one, @ThierryO. Suggested packages are are fine, especially for the purpose of displaying outputs in vignettes. |
A new version (tagged v0.0.3) is ready for another round of review. |
Thanks @ThierryO! @brodieG and @jtr13, let us know if you are satisfied with these changes, including covering any areas you weren't able to in the last review. @ThierryO, can you give a brief summary of changes? I notice there isn't a NEWS entry for v0.0.3, or is it effectively the same summary as 0.0.2? |
I forgot to update the NEWS. The differences between 0.0.2.9000 and 0.0.3 are mainly cosmetic changes and fixed typos. I bumped the version number because I didn't want to change 0.0.2.9000 into 0.0.2. |
@noamross what is the new deadline for reviews? |
We normally aim for 2 weeks for the second round, but as the first round was partial/specialized here, let's go for 3 weeks, or April 10. |
**Package Review**

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide.
**Documentation**

The package includes all the following forms of documentation:
Note: one small change was needed to get the package to build locally (see patch).
Functionality
A few "Package Guidelines" comments:
Final approval (post-review)
Estimated hours spent reviewing:

**Review Comments**

**Overview**

This is a useful and well written package. Congratulations to Thierry for the work so far. The code is clean and clear. For the most part, parameters are validated and errors are informative. I see no obvious performance issues. The documentation is extensive, particularly the vignettes. I'm really impressed by them.

**Issues That Should Be Addressed**

These are important and my approval of the package is contingent on them being addressed:
Strongly recommended
Recommended:
Every other comment in the review is provided for the author to decide whether to act on it or not.

**File Storage**

In an earlier iteration I had suggested that files be stored in a subfolder, one per object.
You could have something like: iris.grx
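The listing itself did not survive the page extraction; based on the surrounding discussion (a per-object folder holding the data and its metadata), it presumably looked something like:

```
iris.grx/
├── iris.tsv
└── iris.yml
```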
This is a very cheap way to make your package more robust to file system mishaps.

**Typos**

See attached patch: git2rdata_typo_patch.txt

**Vignettes**

*Generally*

Title case in headers, particularly the top level vignette title? Not something I feel strongly about.

*Getting Started*

Another frequently used csv read/write set of functions is `data.table::fread` / `fwrite`. Re:
POSIXct is numeric, not integer, consider:
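The example itself was lost in extraction; a sketch of the point in base R only:

```r
x <- as.POSIXct("2019-01-01 00:00:00.25", tz = "UTC")
storage.mode(unclass(x))  # "double": POSIXct counts seconds as a numeric
as.numeric(x) %% 1        # 0.25: sub-second precision an integer would drop
```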
I'm not sure this had any practical significance, although you may need to check whether sub-second precision matters.
**Suggested Workflow**

*Introduction*
This paragraph might be confusing to some. You should either generalize it or simplify it.

*Storing Dataframes*

See comments about File Storage above.

*Automated Workflow*

I find this section more ambitious than it needs to be. Another drawback of the multi-instance example is that it interferes with the main point of the section. I think you could introduce the ideas more incrementally.

*Analysis Workflow*

Great example! One nit: maybe add newlines or otherwise try to better format the long lines.

**Other Thoughts**

*Example Data Structure*

Not a big deal at all, but it is very odd to me to see a database with tables of this kind.

*Associating Code With Data*

Your footnote about "code changing" reminded me that it is useful to be able to associate data with the code that generated it.

*Efficiency in Terms of Storage and Time*

This is a great vignette. Very informative. You should promote the data storage statistics to a more prominent part of the vignette.
The only additional bit of information that would be worthwhile to add is some
*Optimizing storage for ...*

Good vignette. See typos in patch.

**Generic Documentation Comments**

*Descriptions*

In general these are too terse. Several do not describe what the function actually does.
You are better off making the titles briefer and the descriptions more informative.

*Nit Picks*

Title Case. While this is not strictly requested in Writing R Extensions, it is the common convention. Almost all sections of the documentation should be written in full sentences; see base functions such as `mean` for reference.

*Common Parameters*

@param x: "the 'data.frame" — stray quote mark?

*Other*

Examples should clean up on exit, as per CRAN policies. Currently some create files that are not removed. Good use of "See Also" to link related functions. For the re-exported functions, the display link would be best if it showed the originating package.

**Functions**

*Parameter Validation*

It is great to see that the functions validate their parameters carefully. One additional thing you could do, although this isn't common practice, is to check that the dispatching parameter actually is what it is supposed to be.

*list_data*

In:
There is the risk that
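The example was elided by the page extraction; a base-R sketch of one way such a risk can bite (the file names and pattern are assumptions):

```r
files <- c("a.tsv", "a.tsv.bak", "data.tsv")
# unanchored pattern: "." is a regex wildcard, so all three files match
grep("a.tsv", files, value = TRUE)
# anchored and escaped: only the intended file matches
grep("^a\\.tsv$", files, value = TRUE)  # "a.tsv"
```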
I have not tested this, but you get the idea. Watch out for corner cases.

**meta**

*Docs*

The description should give some indication of what the function does, not just what it returns.

*Character Method*

Good job of handling corner cases (e.g. strings containing quotes).

*Factor Method*
Can cause problems for users that might pass a "truthy" value expecting this to work.

*Logical Method*

See Factor above.

*POSIXct Method*

UPDATE: I think this is actually okay, although it will change the display representation for users. Maybe record time zone and try to restore? I am not an expert on this, but assuming that the timezone is UTC causes the displayed times to change:
This risks being a headache to resolve.

*Date Method*

I don't think this method has the same issue.

*data.frame Method*

This function should check its parameters, sorting in particular. In:
You need to handle the case where
Further down, in:
Probably not of great consequence, but it feels "wrong" to have an undocumented side effect.

Finally, probably okay in this context because you overwrite:
But you should be aware if you were not that calling setter functions indirectly
**read_vc**

*Docs*

The description should describe what the function does, e.g. "Retrieves a data.frame previously stored with `write_vc`". It is not clear from the documentation that this is an S3 generic with methods.

*File integrity*
Some possible checks:
**write_vc**

*Documentation*
Performance
If you'd like you can take advantage of the radix sort that the data.table team contributed to base R (used by `order(..., method = "radix")` since R 3.3):
I have not thoroughly tested this or thought through corner cases. I happen to be writing a blog post about the newfound powers of base R's radix sorting.

*Interface*
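A base-R sketch of radix ordering on multiple columns (the data here is illustrative, not from the package):

```r
set.seed(42)
df <- data.frame(
  grp = sample(letters[1:3], 20, replace = TRUE),
  val = runif(20),
  stringsAsFactors = FALSE
)
# order by all sorting columns at once, using the radix method
ord <- do.call(order, c(unname(df[c("grp", "val")]), list(method = "radix")))
sorted <- df[ord, , drop = FALSE]
```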
The author brings up the
You'll need to think about the regex. I may be too particular about this type of thing, but as a user I really appreciate error messages that tell me how to fix the problem.

*Other*

For this type of error message that can be resolved with an argument, tell the user which argument to use,
e.g. as below or a variation thereof:
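The sample message did not survive extraction; a sketch of the idea (the `write_safe` helper and its CSV backend are hypothetical, not the git2rdata interface):

```r
write_safe <- function(df, path, overwrite = FALSE) {
  if (file.exists(path) && !overwrite) {
    # the error names the argument that resolves it
    stop("'", basename(path), "' already exists. ",
         "Use `overwrite = TRUE` to replace it.", call. = FALSE)
  }
  write.csv(df, path, row.names = FALSE)
  invisible(path)
}
```

The point is simply that the error message tells the user which argument fixes the problem.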
*Irreproducible Bug*

Somehow at some point the yml file got corrupted. I have no idea what I was doing when it happened.
**rm_data / prune_data**

*Unintended Deletions (PROBLEM)*
We just nuked our critical yml file. While the very careful user would realize what happened, many users would not.
Please see the top level File Storage comments.

*Documentation*

The caveats noted in the workflow vignette need to appear in the function documentation as well.
*Other*

More generally, what is the purpose of being able to remove the data files while keeping the metadata? You could simplify the user facing interface by replacing both functions with a single one.

**relabel**

*Docs*

Title should be clearer that it is factor levels that are being relabelled. The description stops short of actually describing what the function does. Add a reference to the "Relabelling a Factor" section of "Optimizing storage for ...".
(I have additional suggestions listed below, but my approval is not contingent on their completion.)

**Note**

A reminder that I was not able to fully test all functionality claims.

**Followup from previous comments**

*README*

I recommended updating the README to reflect the explanation of the problem that the software is designed to solve provided in the previous version of the vignette. The README has been updated. I would suggest further editing the Rationale section so that it is framed in terms of problems and solutions. Again, I see three main rationales:
Currently more attention is paid in the README to meta data (3) rather than version control (1) and optimization (2).

Finally, I recommend adding another section, perhaps called "Getting Started," to the README that provides a short summary of the contents of each vignette. As such, the theory/rationale for the package would be separate from the how-to.

*Building vignettes*

I mentioned that the vignettes did not build for me; this has been addressed.

*Examples*

I noted that more comments would be helpful to guide the user through the examples. Clear examples with comments are now provided.

*CONTRIBUTING file*

I noted that there was no CONTRIBUTING file nor contributing guidelines in the README. A CONTRIBUTING file was added to the repository.

*Rationale*

I asked about the benefits of the optimization that the package provides. An excellent vignette, Efficiency in terms of storage and time, was added that addresses this issue in depth.

*Package name*

I suggested changing the package name. The author considered the issue and holds that the name is appropriate.
|
Thanks to both @brodieG and @jtr13 for your detailed reviews. Your comments allow me to improve the quality of the package. I'll look into them and notify you when the next version is available. I'll be giving an oral presentation on `git2rdata` at useR!2019.

Below is a response to a few issues raised by @brodieG. Can I get some feedback on these, as some of them potentially require a major overhaul of the package?

**General issues**
**File format**

Using a custom file format has indeed the benefit that the user is less likely to tamper with the (meta)data. However, a comment on an earlier version explicitly asks to store the data in an open standard format. IMHO, the benefits of a clear open standard outweigh making it tamper proof. Imagine the case in which the data was sent to someone without knowledge of `git2rdata`.
Currently, I haven't tested how robust the package is against such manual changes. I'm not sure if we should prevent the user from making changes to the (information content of the) data itself using other software. Think of using the GitHub website to correct typos in the data, to remove observations from the data or to even add a few observations. If these changes corrupt the data, the user can revert them using the VCS or the backup.

**Metadata pruning**

I could make these functions safer by adding something like an explicit confirmation argument.

**Time zone issue**

IMHO, the issue is rather that the time zone is not stored with the data.
|
**Badge**

For the badge I'm going off of the guide:
**File Format**

It's not critical that you have a custom "format" for git2rdata. I will note though that the "format" I propose still stores the data in open format in a completely accessible way. Diffs will show it, CSV readers will be able to read it, etc. I'm pretty sure that the concern would only really exist in the case where the above isn't true, where you need specialized tools to access the data.

Fundamentally the git2rdata object is a new non-standard format as it is the combination of the data and the meta data. By putting the files together in a folder you are making that format explicit and giving the signal to users that if you mess with the data, you may well break the git2rdata object. You can even have a README.txt in the folder explaining this instead of using a scary extension. This is probably better anyway, and will allow recipients access to the data with the appropriate warning.

Finally, if you still prefer to have the yml and tsv files floating around, I won't stop you so long as you take measures to ensure that the file pruning and deletion only affects files that were created by `git2rdata`.

To recap, based on your feedback and further thought, this is what I would have the folder look like for e.g. `iris`:
The |
Upgrade to version 0.0.4. Ready for review in ropensci/software-review#263
Version 0.0.4 is ready for review.

**Changes as listed in NEWS.md**

BREAKING FEATURES
NEW FEATURES
Bugfixes
Other changes
Changes not explicitly mentioned in NEWS.md
**Suggestions not implemented**

File structure
Warning for unstaged data
Checking that the dispatching parameter actually is what it is supposed to be.
Storing timezone
Pattern in
|
From a quick skim this looks good. I'll review more thoroughly sometime next week and report back. |
Final approval (post-review)
Looks good to me. I did not test all the new functionality, but did confirm that the accidental yml removal which was the blocking issue is fixed. It also does look like all the author-accepted suggestions are indeed in the new version, and I have no concerns with the author-rejected suggestions. This was all tested against v0.0.4 tag. Nice job finding corner cases in the new duplicate function. Some minor comments, none of which gets in the way of approval:
@noamross next steps? |
Approved! Thanks @ThierryO for submitting and @brodieG and @jtr13 for your reviews! I know this has been a long road, but it's been very thorough and we're glad to have this useful and well-vetted package in our collection. @ThierryO, should you choose to address the remaining small comments, you may, but we are good to go. To-dos:
Should you want to acknowledge your reviewers in your package DESCRIPTION, you can do so by making them reviewers (`"rev"` role) in your `Authors@R`.

In our developer's guide, this chapter starts the 3rd section that's about guidance for maintenance after acceptance. Welcome aboard! We'd love to host a blog post about your package - either a short introduction to it with one example or a longer post with some narrative about its development or something you learned, and an example of its use. If you are interested, review the instructions, and tag @stefaniebutland in your reply. She will get in touch about timing and can answer any questions. |
Thanks for accepting the package. And thanks to @brodieG and @jtr13 for their reviews, which improved the package a lot. I'm happy to acknowledge this by adding you as reviewers in the DESCRIPTION.

The README on GitHub starts by referring to the pkgdown website of the package, where the same information is available.

I'll give a 15' talk on `git2rdata` at useR!2019. |
I did miss the message about the pkgdown site, or didn't make the connection in the next section. It's probably fine as it is. FWIW that link is broken right now. |
That sounds good @ThierryO. Since useR! is July 9-12, we could publish prior to your talk so that people could refer to it during the conference. In the talk you could invite people to submit their use cases to our discussion forum. That might get your work and the package some extra attention. After your talk, you could add any new insights as a comment on the post. To publish before useR!, let me know and submit a draft for review by Tuesday July 2. To publish after, submit a draft when you are ready. |
Summary
What does this package do? (explain in 50 words or less):
Store dataframes as plain text files along with metadata to ensure that attributes like factor levels are maintained. The dataframes are optimized in order to minimize both file size and diffs, making it useful in combination with version control.
Paste the full DESCRIPTION file inside a code block below:
https://github.com/inbo/git2rdata
reproducibility, because it helps to store dataframes as plain text files without loss of information, while minimizing file sizes and diffs.
Anyone who wants to work with medium sized dataframes and keep them under version control. This is useful in case of recurrent analyses on growing datasets, e.g. each year new data is added and a new report is created. A snapshot of the raw data can be stored as plain text files under version control.
yours differ or meet our criteria for best-in-category?
The initial idea was to add this functionality into `git2r`. After a discussion with the maintainer, we decided to create a separate package. We use functions from `readr` and improve them by storing the metadata.
Confirm each of the following by checking the box. This package:
Publication options
A `paper.md` matching JOSS's requirements with a high-level description in the package root or in `inst/`.
Does `R CMD check` (or `devtools::check()`) succeed? Paste and describe any errors or warnings:

Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:
We use `read_vc()` rather than `vc_read()` because it matches `read.table()`, `read_csv()`, ...
If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names:
@jennybc @stewid