Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

agent$validation_set$tbl_checked - why keep a copy of the dataframe for every rule? #542

Closed
alissiahrus opened this issue Jun 27, 2024 · 4 comments · Fixed by #554
Closed

Comments

@alissiahrus
Copy link

Hello,

I have the following question. I have a dataframe with ca. 136'000 rows and 11 columns . On this dataframe, I define 36 rules. I define the agent, add validation rules to it and then interrogate it. My final object is the interrogated agent. The final object consumes approx. 310 MB of memory. On further analysis (running object.size on list components at different levels), I realise that this is due to agent$validation_set$tbl_checked object (ca. 300 MB). There seems to be a copy of the dataframe stored for (almost) every rule we define, particularly the ones performed on single cells. This motivates two questions from my side:

  1. Why is there a need to keep the entire dataframe for each rule?
  2. Is there an option to prevent having these dataframe copies in the output, but also from saving them internally?

Providing a simple example with a simple dataframe:
`

# Example data frame
df <- data.frame(
  column1 = c("A", "B", "C"),
  column2 = c(1, 2, 3),
  column3 = c("X", "Y", "Z")
)

# Creating an agent to validate the data frame
agent <- create_agent(tbl = df) 


agent <- apply_col_checks(agent = agent, column="column1", 
                           .(col_is_character), 
                           .(col_vals_in_set(set = c("A", "B", "C"))),
                           .(col_vals_not_null)
                           )
agent <- agent %>% interrogate

obj <- agent$validation_set$tbl_checked

`
Screenshot 2024-06-27 180500

Thanks a lot,
Alissia

@yjunechoe
Copy link
Collaborator

Thanks - I think an option to drop validation_set$tbl_checked before returning from interrogate() is reasonable!

I'll just note two things:

  1. It won't always be a full copy of the input data - it's useful info for steps with a preconditions argument that temporary changes the data for a step.
  2. x_write_disk() already "drops" this when writing out to rds, so validation_set$tbl_checked doesn't seem critical for any advertised user-facing behaviors. So at least for the moment, I'd say feel free to just drop it to free up some memory:

pointblank/R/object_ops.R

Lines 324 to 330 in 362b706

if (inherits(x, "ptblank_agent")) {
x$validation_set$tbl_checked <- NULL
x$validation_set <-
x$validation_set %>%
dplyr::mutate(tbl_checked = list(NULL))

I can take some time to look into whether not having validation_set$tbl_checked would break anything.

@yjunechoe
Copy link
Collaborator

yjunechoe commented Jun 28, 2024

I can confirm that this is just dormant metadata!

I think the simplest implementation is to introduce an argument to interrogate() that just toggles whether it should return $tbl_checked. For example, we could expose a interrogate(..., extract_tbl_checked = TRUE) argument, similar to the already-existing extract_failed argument.

Two tangentially related thoughts:

  1. I think it'd be nice for future extensibility of interrogate() to have unused dots to separate the required argument from the rest (so, interrogate <- function(agent, ..., extract_failed, arg1, arg2)). This is soft-breaking for code using positional matching (interrogate(agent, FALSE) must now be interrogate(agent, extract_failed = FALSE) but it'd prompt better reproducible code practices in the long run. We could use rlang::check_dots_empty() as a gentle nudge to users to make the optional arguments to interrogate() explicit.

  2. I just discovered that $tbl_checked is actually not even part of what's returned by get_agent_x_list(), so I could also see an argument for dropping it from the output of interrogate() entirely, if we don't want to pursue the option of adding an argument to interrogate(). This actually raises a separate question of whether get_agent_x_list() should return $tbl_checked when available, but I'll just note it here for the record and mark that Q as out of scope for this issue.

@rich-iannone Let me know if you have thoughts on these!

@rich-iannone
Copy link
Member

Sorry for the extra late reply here :\ I think the only function that actually uses tbl_checked is get_sundered_data(). Right here:

dplyr::pull(tbl_checked)

and then the function iterates through the data frames / tibbles.

Now, this is just for that one function AFAICT so the idea of an argument to opt out of storing these dfs is still a good one. We could have a warning in get_sundered_data() that explains that the processed tables are needed (if they are missing). There are probably other strategies as well but this seems pretty safe and understandable.

@yjunechoe
Copy link
Collaborator

Ah you're right! I missed this by only checking devtools::test(filter = "interrogate"). A warning (actually, should we promote this to error?) for get_sundered_data() sounds good here.

I'll move this into a PR to test it more formally!

@yjunechoe yjunechoe linked a pull request Aug 4, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants