agent$validation_set$tbl_checked - why keep a copy of the dataframe for every rule? #542

alissiahrus · 2024-06-27T16:10:46Z

Hello,

I have the following question. I have a dataframe with ca. 136'000 rows and 11 columns . On this dataframe, I define 36 rules. I define the agent, add validation rules to it and then interrogate it. My final object is the interrogated agent. The final object consumes approx. 310 MB of memory. On further analysis (running object.size on list components at different levels), I realise that this is due to agent$validation_set$tbl_checked object (ca. 300 MB). There seems to be a copy of the dataframe stored for (almost) every rule we define, particularly the ones performed on single cells. This motivates two questions from my side:

Why is there a need to keep the entire dataframe for each rule?
Is there an option to prevent having these dataframe copies in the output, but also from saving them internally?

Providing a simple example with a simple dataframe:
`

# Example data frame
df <- data.frame(
  column1 = c("A", "B", "C"),
  column2 = c(1, 2, 3),
  column3 = c("X", "Y", "Z")
)

# Creating an agent to validate the data frame
agent <- create_agent(tbl = df) 


agent <- apply_col_checks(agent = agent, column="column1", 
                           .(col_is_character), 
                           .(col_vals_in_set(set = c("A", "B", "C"))),
                           .(col_vals_not_null)
                           )
agent <- agent %>% interrogate

obj <- agent$validation_set$tbl_checked

`

Thanks a lot,
Alissia

The text was updated successfully, but these errors were encountered:

yjunechoe · 2024-06-27T16:36:30Z

Thanks - I think an option to drop validation_set$tbl_checked before returning from interrogate() is reasonable!

I'll just note two things:

It won't always be a full copy of the input data - it's useful info for steps with a preconditions argument that temporary changes the data for a step.
x_write_disk() already "drops" this when writing out to rds, so validation_set$tbl_checked doesn't seem critical for any advertised user-facing behaviors. So at least for the moment, I'd say feel free to just drop it to free up some memory:

pointblank/R/object_ops.R

Lines 324 to 330 in 362b706

    
           if (inherits(x, "ptblank_agent")) { 
        
             x$validation_set$tbl_checked <- NULL 
        
             x$validation_set <-  
        
               x$validation_set %>% 
        
               dplyr::mutate(tbl_checked = list(NULL))

I can take some time to look into whether not having validation_set$tbl_checked would break anything.

yjunechoe · 2024-06-28T07:16:50Z

I can confirm that this is just dormant metadata!

I think the simplest implementation is to introduce an argument to interrogate() that just toggles whether it should return $tbl_checked. For example, we could expose a interrogate(..., extract_tbl_checked = TRUE) argument, similar to the already-existing extract_failed argument.

Two tangentially related thoughts:

I think it'd be nice for future extensibility of interrogate() to have unused dots to separate the required argument from the rest (so, interrogate <- function(agent, ..., extract_failed, arg1, arg2)). This is soft-breaking for code using positional matching (interrogate(agent, FALSE) must now be interrogate(agent, extract_failed = FALSE) but it'd prompt better reproducible code practices in the long run. We could use rlang::check_dots_empty() as a gentle nudge to users to make the optional arguments to interrogate() explicit.
I just discovered that $tbl_checked is actually not even part of what's returned by get_agent_x_list(), so I could also see an argument for dropping it from the output of interrogate() entirely, if we don't want to pursue the option of adding an argument to interrogate(). This actually raises a separate question of whether get_agent_x_list() should return $tbl_checked when available, but I'll just note it here for the record and mark that Q as out of scope for this issue.

@rich-iannone Let me know if you have thoughts on these!

rich-iannone · 2024-08-02T15:29:38Z

Sorry for the extra late reply here :\ I think the only function that actually uses tbl_checked is get_sundered_data(). Right here:

pointblank/R/get_sundered_data.R

Line 344 in 4605b30

dplyr::pull(tbl_checked)

and then the function iterates through the data frames / tibbles.

Now, this is just for that one function AFAICT so the idea of an argument to opt out of storing these dfs is still a good one. We could have a warning in get_sundered_data() that explains that the processed tables are needed (if they are missing). There are probably other strategies as well but this seems pretty safe and understandable.

yjunechoe · 2024-08-04T04:08:43Z

Ah you're right! I missed this by only checking devtools::test(filter = "interrogate"). A warning (actually, should we promote this to error?) for get_sundered_data() sounds good here.

I'll move this into a PR to test it more formally!

alissiahrus added the Type: ⁇ Question label Jun 27, 2024

yjunechoe linked a pull request Aug 4, 2024 that will close this issue

Option to drop tbl_checked information #554

Merged

rich-iannone closed this as completed in #554 Aug 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

agent$validation_set$tbl_checked - why keep a copy of the dataframe for every rule? #542

agent$validation_set$tbl_checked - why keep a copy of the dataframe for every rule? #542

alissiahrus commented Jun 27, 2024

yjunechoe commented Jun 27, 2024

yjunechoe commented Jun 28, 2024 •

edited

Loading

rich-iannone commented Aug 2, 2024

yjunechoe commented Aug 4, 2024

agent$validation_set$tbl_checked - why keep a copy of the dataframe for every rule? #542

agent$validation_set$tbl_checked - why keep a copy of the dataframe for every rule? #542

Comments

alissiahrus commented Jun 27, 2024

yjunechoe commented Jun 27, 2024

yjunechoe commented Jun 28, 2024 • edited Loading

rich-iannone commented Aug 2, 2024

yjunechoe commented Aug 4, 2024

yjunechoe commented Jun 28, 2024 •

edited

Loading