Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding assertr to ROpenSci #23

Closed
8 of 10 tasks
tonyfischetti opened this issue Dec 24, 2015 · 36 comments
Closed
8 of 10 tasks

Adding assertr to ROpenSci #23

tonyfischetti opened this issue Dec 24, 2015 · 36 comments

Comments

@tonyfischetti
Copy link

    1. What does this package do? (explain in 50 words or less)
      The assertr package supplies a suite of functions designed to verify assumptions about data early in an analysis pipeline to protect against common data errors and instances of bad data.
    1. Paste the full DESCRIPTION file inside a code block (bounded by ``` on either end).
Package: assertr
Type: Package
Title: Assertive Programming for R Analysis Pipelines
Version: 1.0.0
Authors@R: person("Tony", "Fischetti", email="[email protected]",
  role = c("aut", "cre"))
Maintainer: Tony Fischetti <[email protected]>
Description: Provides functionality to assert conditions
    that have to be met so that errors in data used in
    analysis pipelines can fail quickly. Similar to
    'stopifnot()' but more powerful, friendly, and easier
    for use in pipelines.
URL: https://github.com/tonyfischetti/assertr
BugReports: https://github.com/tonyfischetti/assertr/issues
License: MIT + file LICENSE
LazyData: TRUE
Imports:
    dplyr,
    MASS,
    lazyeval
Suggests:
    knitr,
    testthat,
    magrittr
VignetteBuilder: knitr
    1. URL for the package (the development repository, not a stylized html page)
      https://github.com/tonyfischetti/assertr
    1. What data source(s) does it work with (if applicable)?
      Any. Mostly in the form of data.frames
    1. Who is the target audience?
      Anyone who has ever struggled with bad data
    1. Are there other R packages that accomplish the same thing? If so, what is different about yours?
      The ensurer package attempts to solve the very same problem. To a certain extent, the assertive package also offer some similar capabilities. The difference between assertr and these other packages is the grammar of usage and the way that assertions of different types can be easily combined to express arbitrarily complex assertions in a very readable way.
    1. Check the box next to each policy below, confirming that you agree. These are mandatory.
  • This package does not violate the Terms of Service of any service it interacts with.
  • The repository has continuous integration with Travis and/or another service
  • The package contains a vignette
  • The package contains a reasonably complete readme with devtools install instructions
  • The package contains unit tests
  • The package only exports functions to the NAMESPACE that are intended for end users
    1. Do you agree to follow the rOpenSci packaging guidelines? These aren't mandatory, but we strongly suggest you follow them. If you disagree with anything, please explain.
      Yes. All the user-facing functions are in snake case, but the internal functions sometimes use dots (.) as separators. I'm open to changing that. Also, the package doesn't have a code of conduct yet but I think it's a good idea to include.
  • Are there any package dependencies not on CRAN?
    No
  • Do you intend for this package to go on CRAN?
    It already is
  • Does the package have a CRAN accepted license?
    You bet
  • Did devtools::check() produce any errors or warnings? If so paste them below.
    No warnings
    1. Please add explanations below for any exceptions to the above:
    1. If this is a resubmission following rejection, please explain the change in cirucmstances.
@richfitz
Copy link
Member

I have a use case for this and have already looked through the code, so am happy to review if that is useful (at the same time I can't review until the 4th January at the earliest as I will be travelling over the break).

@sckott
Copy link
Contributor

sckott commented Dec 24, 2015

Reviewers: @richfitz @jennybc

@jennybc
Copy link
Member

jennybc commented Dec 24, 2015

I've been meaning to do one, so I can be the second.

@sckott
Copy link
Contributor

sckott commented Dec 24, 2015

thanks jenny, assigned

@karthik
Copy link
Member

karthik commented Dec 24, 2015

@tonyfischetti Excellent! Thanks for submitting! Looking forward to the reviews and adding this to the suite. 😃

@sckott
Copy link
Contributor

sckott commented Jan 22, 2016

@richfitz @jennybc - hey there, it's been 29.0 days, please get your review in soon, thanks 😺

@richfitz
Copy link
Member

Sorry - I have been meaning to (and also #25). Next week for both I hope.

@sckott
Copy link
Contributor

sckott commented Jan 22, 2016

cool - (p.s. that comment was your friendly heroku robot https://github.com/ropenscilabs/heythere)

@richfitz
Copy link
Member

The descent towards manuscript central begins... 😁

@sckott
Copy link
Contributor

sckott commented Jan 22, 2016

unless you're volunteering to remind everyone manually

@richfitz
Copy link
Member

Definitely not! I think it's great. I actually thought it was you, which you could never say for MS central.

@jennybc
Copy link
Member

jennybc commented Jan 22, 2016

Maybe we should send hand-written notes?

@jennybc
Copy link
Member

jennybc commented Jan 22, 2016

And yes, duly noted, that I need to bust a move on this.

@sckott
Copy link
Contributor

sckott commented Jan 22, 2016

Definitely not! I think it's great. I actually thought it was you, which you could never say for MS central.

heythere != MS central

Maybe we should send hand-written notes?

Yes!

@richfitz
Copy link
Member

General comments


The assertr package provides a generalised framework for defensive programming around data.frames. This sits somewhere between stopifnot and testthat in terms of flexibility and complexity and as such forms a useful building block for data analysis workflow, which I believe is under-tooled at the moment. I really like the idea of having packages that are primarily focussed on data workflows rather than restricting people to ideas that were developed for software engineering (such as formal unit tests).

The package is very tight -- it exports the minimum set of functionality and conforms to the "do one thing and do it well" school of thought. The functions are well documented, the vignette is readable and less dry than most. I appreciate the split into NSE and SE versions of all core functions.

Accordingly, most of my comments focus on design decisions and therefore may all be out of line because the author will have thought about this more than I have.

The main entrypoints are difficult to differentiate

My biggest concern is that I found the three main entry points very difficult to keep straight. And when I put the package down over Christmas I had to re-remember them again.

  • assert(data, predicate, ...)
  • verify(data, expr, ...)
  • insist(data, predicate_generator, ...)

    The difference is primarily in the properties of the second argument and I wonder if there's a way of specifying that some other way than three functions that have such similar names? The current approach is extremely elegant but at the cost of being a bit too opaque to the user -- especially because the three function names are essentially synonyms of each other there is nothing to jog your memory. I presume the problem is it is difficult to detect the difference between the three argument types before they are evaluated (and the correct evaluation depends on the type).

The custom handler routine is inflexible

The custom handler is a really great addition, but could be improved. testthat has a similar handler approach that allows storage of a bunch of repeated assertions (pass or fail). My use-case for this package is to replicate something like this, so I'd want to pass the same handler in to all the functions in a pipeline:

mtcars %>%
  verify(nrow(mtcars) > 10, error_fun=my_error_fun) %>%
  verify(mpg > 0, error_fun=my_error_fun) %>%
  insist(within_n_sds(4), mpg, error_fun=my_error_fun) %>%
  assert(in_set(0,1), am, vs, error_fun=my_error_fun) %>%
  group_by(cyl) %>%
  summarise(avg.mpg=mean(mpg))


What would be heaps nicer is if I could register a handler; change the assert function to something like:

assertr <- function(data, predicate, ..., error_fun=getOption("assertr.handler", assertr_stop)) {
}


(or eqivalently use the package-environment trick like testthat does). This would allow:

options(assertr.handler=my_error_fun)
mtcars %>%
  verify(nrow(mtcars) > 10) %>%
  verify(mpg > 0) %>%
  insist(within_n_sds(4), mpg) %>%
  assert(in_set(0,1), am, vs) %>%
  group_by(cyl) %>%
  summarise(avg.mpg=mean(mpg))


(As a related comment, the usage definitions reference the unexported function assertr_stop which some may find confusing. Additionally, is there a reason why verify uses error_fun=stop not assertr_stop?)

Minor comments

  • The dplyr dependency, which is used soley for dplyr::select_ seems a potentially heavy dependency for one function; if it is straightforward to swap out for independent imlementation that would decrease the package footprint (I can imagine using this in contexts where I do not have dplyr installed such as container-based workflows). But I can totally see the advantage of sticking with something that is known to work.
  • Think about the non-pipe people. All the examples and the vignette make heavy use of the %>%operator (which is fine), but as someone who uses this little or who imagines using this mostly in packages where I'd be avoiding so much weird evaluation, I would appreciate a few more pipe-less examples. This is potentially confusing in places like:
our.data %>% assert(within_bounds(0,Inf), mpg) # and so on


when ?assert says that usage is

assert(data, predicate, ..., error_fun = assertr_stop)


This confused me because in usage it looks like predicate, data, but of course the data argument comes from the pipe and the mpg is the column name is passed through to the ... argument. While the use with pipes is very elegant, I think the package has use outside of that scope too. The examples within the package are actually really good like this.

  • Classed exceptions would make the error handling more flexible. Related to the second major point above; it would be nice to distinguish between errors that were because the input to assertr was incorrect, and errors that are raised because the data failed the assertions. R's classed errors provide a nice framework for this. Then tryCatch and withCallingHandlers can dispatch appropriately based on the sort of error.
  • A reporting framework would be fantastic If you don't write this, I will -- but given you have written this package I figure you should get right of refusal. Related to the point above, I would like to use the underlying bits you have here in a package for automated testing of upstream data sources that tend to be misbehaved. I can imagine a testthat-like reporting framework where a bunch of tests are run and the failures reported.

@tonyfischetti
Copy link
Author

Thanks for the kind words and really great feedback @richfitz.

The main entrypoints are difficult to differentiate

As pointed out, this is somewhat a consequence of my making the API as elegant as possible but at the expense of some opaque-ness. The problem is (again, as you pointed out) there's no elegant way that I can see to programmatically detect the argument types. So we have verbs that are literally synonyms (I chose their names using a thesaurus). Unfortunately, I can't think of a better naming mechanism without really long function names like take_a_predicate_generator_and_apply_to_each_column(). I'm open to suggestions, but it may not be the end of the world since there are only three main verbs and the docs are good.

The custom handler routine is inflexible

I like the idea of using a testthat style options mechanism. I'll implement this

Additionally, is there a reason why verify uses error_fun=stop not assertr_stop?)

Nope, that's an error. Good catch!

The dplyr dependency

That was a difficult choice. It's a hell of a heavy dependency, but I was wary of implementing dplyr::select myself. Especially since I would have to reference dplyr's implementation so heavily that it would likely constitute code theft. I'd love to drop the dependency though, if anyone thinks they can implement it without copying code.

A reporting framework would be fantastic

I need this feature, and I have a few great ideas on how to implement it. I'd like to talk to you more (@richfitz) about your particular use case in case my solution only suits my use case. I think this feature can potentially be the most powerful and useful capability of assertr

@tonyfischetti
Copy link
Author

So I'm going to run into a little more free time in the near future and I'd like to get back from my learning hiatus back into improving assertr--particularly because people are telling me its really useful for them. Because of the learning hiatus, I have some fresh new ideas for improvement, but I'd like to run them past some of you for further input...

The main entrypoints are difficult to differentiate

As mentioned before, there is assert, insist, insist_rows, and assert_rows for representing a wide range of tasks. However, if I wanted to add the ability to, say, declare that the whole data set should have no more than 15% missing values (and I do want to add that), the semantics of that would require another specialized function... and I'm running out of synonyms for "assert"!

Principled though that solution was, I'm not sure its the correct thing to do going forward; it's always been one of R's strengths from a user's point of view to use a familiar generic function (mean, plot, etc...) with all sorts of input and have the object system dynamically dispatch the correct functionality.

So how about this... creating an S3S4 generic function (proclaim perhaps?) that will handle all of the different semantics for the user. Concretely, the function returned from within_n_mads can be labeled with class assertr_dynamic (because the predicate is dynamically generated). Then, proclaim would dispatch on the second argument (the first arg is the data frame) and call what is currently referred to as insist. In the same way, maha_dist would be classed something like assertr_dynamic_rows and a user calling proclaim(df, maha_dist, within_n_mads(10), ...) would transparently dispatch insist_rows(maha_dist, within_n_mads(10), ...)

This would improve assertr's extensibility greatly; for example, adding semantics to check the supplied data.frame as a whole would only require writing another S3S4 method of the proclaim generic and making sure that the predicate function was correctly classed.

This solution is perhaps a little unconventional, but it completely obviates the requirement that a useR remember all the verbs.

The custom handler routine is inflexible

The most common suggestion for assertr is to be able to warn (not error) on violation. Additionally, even if it does error on violation, there should be semantics in place for the entire chain of assertions to run so that the final error message will contain the complete report of the data errors.

The reason I needed to take my time with this one is that I need the warnings to be concatenated through a assertr chain in a principled manner. To do this without using dynamic variables (eww), it requires that--along with returned the data frame given--the assertr verbs need to return the warnings so that they can be concatenated with the warnings further in the assertion pipeline. Up until recently, I thought that I would have to implement something that would be tantamount to a Haskell monad in order to do this. Another possibility is to use S4 (hence the S3 strikeouts in the paragraphs above) in order to get proclaim (or whatever it is) to dispatch on both the second argument, and the first argument. If the first argument is a data.frame the current semantics stand... if the first argument is something else (some class that holds a data.frame and a running list of errors) then the wanted behavior can be dispatched.

The big complication is what happens at the end of the chain. There needs to be something that tells assertr that the chain is ending so that it can take the data.frame out of the composite data.frame/error_log object and finally display the error or warning. Any ideas?

I'd appreciate any feedback on these ideas for two reasons (a) it's now (or will hopefully be soon) ROpenSci's project not just mine, and (b) I'd like to get the input of some talented developers :)

@tonyfischetti
Copy link
Author

To review, none of the proposed additions have to break backwards compatibility :) It would just make everything much easier for the user.... the example in the README would go from this:

  mtcars %>%
      verify(nrow(.) > 10) %>%
      verify(mpg > 0) %>%
      insist(within_n_sds(4), mpg) %>%
      assert(in_set(0,1), am, vs) %>%
      assert_rows(num_row_NAs, within_bounds(0,2), everything()) %>%
      insist_rows(maha_dist, within_n_mads(10), everything()) %>%
      group_by(cyl) %>%
      summarise(avg.mpg=mean(mpg))

to this

  mtcars %>%
      verify(nrow(.) > 10) %>%
      verify(mpg > 0) %>%
      assert(within_n_sds(4), mpg) %>%
      assert(in_set(0,1), am, vs) %>%
      assert(num_row_NAs, within_bounds(0,2), everything()) %>%
      assert(maha_dist, within_n_mads(10), everything()) %>%
      group_by(cyl) %>%
      summarise(avg.mpg=mean(mpg))

(verify wouldn't be able to be replicated under a assert S4 generic)

@jennybc
Copy link
Member

jennybc commented Mar 12, 2016

Since I'm the one who is yet to review ... @tonyfischetti do you have recommendations of a good dataset to run through assertr? I.e. one that you think should show it off but ... there's enough uncertainty that it would be interesting to see how things go? Also to see how a new user manages with it. I have one idea that I'll fall back on if nothing immediately comes to mind.

@tonyfischetti
Copy link
Author

@jennybc Nothing immediately comes to mind but I'm sure I can dig up one of the examples that inspired me to get into this package in the first place :)

@jennybc
Copy link
Member

jennybc commented Mar 12, 2016

@aammd politely reminded me I have some really ugly data asserting/cleaning code in the private STAT 545 instructors repo, so that's my plan B 😬.

@aammd
Copy link

aammd commented Mar 13, 2016

@jennybc wellll i would not say ugly, but rather "pre-assertr". it represents "how we did this before assertr" and highlights improvements in the UI of this package (improvements I tried to show in my lesson about assertr)

@sckott
Copy link
Contributor

sckott commented Mar 14, 2016

@tonyfischetti some thoughts on:

The big complication is what happens at the end of the chain. There needs to be something that tells assertr that the chain is ending so that it can take the data.frame out of the composite data.frame/error_log object and finally display the error or warning. Any ideas?

with help of @smbache - we have a way to detect whether a piped command is the last one or not. If it is the last, do X (e.g., execute some other fxn, print data, etc.) instead of passing to the next command. You can see the helper fxns here https://github.com/ropensci/jqr/blob/master/R/pipe_helpers.R and usage here https://github.com/ropensci/jqr/blob/master/R/index.R#L42

@sckott
Copy link
Contributor

sckott commented Mar 22, 2016

@jennybc - hey there, it's been 89 days, please get your review in soon, thanks 😺

@jennybc
Copy link
Member

jennybc commented Mar 22, 2016

OK I promise you I will not be able to face you an unconf w/o this being totally done.

@sckott
Copy link
Contributor

sckott commented May 23, 2016

@jennybc - hey there, it's been 151 days, please get your review in soon, thanks 😺 (ropensci-bot)

@smbache
Copy link

smbache commented May 23, 2016

It must be the temptation of ensurer that is causing delay 😆 Hehe

@jennybc
Copy link
Member

jennybc commented May 23, 2016

I am, and have been, halfway done for ages. I have a PR ready for the vignette. It's the incredibly insightful overall comments that need to be written. 😳 Will do.

@tonyfischetti
Copy link
Author

@smbache I didn't know ensurer was being considered

@smbache
Copy link

smbache commented May 25, 2016

It's not. I was joking ;)

@sckott
Copy link
Contributor

sckott commented May 31, 2016

@jennybc - hey there, it's been 159 days, please get your review in soon, thanks 😺 (ropensci-bot)

@sckott
Copy link
Contributor

sckott commented Jun 8, 2016

@tonyfischetti approved!

  • Add the footer to your README:
[![ropensci\_footer](http://ropensci.org/public_images/github_footer.png)](http://ropensci.org)
  • Update installation of dev versions to ropenscilabs/assertr and any urls for the github repo to ropenscilabs instead of tonyfischetti
  • Update any links to the package from tonyfischetti/assertr to ropenscilabs/assertr (though even if they aren't github will redirect to the new location :) )
  • Go to the Repo Settings --> Transfer Ownership and transfer to ropenscilabs - Note that all our newer pkgs go to ropenscilabs first, then when more mature we'll move to ropensci

@tonyfischetti
Copy link
Author

I tried to do the last thing and it says I don't have admin rights to ropenscilabs :(

@sckott
Copy link
Contributor

sckott commented Jun 14, 2016

you should have received an invitation from ropenscilabs, did you get that email?

@tonyfischetti
Copy link
Author

Idiotic move on my part not checking the mail :) It's done

@sckott
Copy link
Contributor

sckott commented Jun 14, 2016

sweet!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants