
add support for ascending CDF values with numeric output_type_id cast as character #105

Merged
merged 29 commits into main from znk/78/fix-ascending-check
Jan 8, 2025

Conversation

zkamvar
Member

@zkamvar zkamvar commented Aug 12, 2024

This PR creates a lookup table from the required values in the config file and performs an inner join to force the values in the submission to sort in config order.

NOTE: This was initially attempted before v4 of the hubs was released, and questions about required and optional output type ID values were raised (#105 (comment)). We had decided to put this PR on hold (#105 (comment)). Now with the release of v4, we can revisit this with the explicit assumption that hubs with pre-v4 configs will have concatenation of required and optional values.

I do want to note that @elray1 and I modified existing tests in 3148f40, and we would appreciate feedback from @annakrystalli

This will fix #78


@annakrystalli
Member

annakrystalli commented Aug 13, 2024

Thanks for working on this @zkamvar !

Unfortunately this does not solve the problem when cdf output types have genuinely character output type IDs (e.g. some form of epiweek encoding).

This was sadly introduced when we relaxed the formatting requirements for character cdf output type IDs in schema version v3.0.1, which is now coming back to bite us. In previous schemas, the requirement for character cdf output type IDs to conform to the regex pattern ^EW[0-9]{6} (e.g. "EW202240") meant the character sorting worked. However, a code such as EW1 - EW52 would now in theory be allowed and cause the same problems discussed in #78

See below:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(hubValidations)
#> 
#> Attaching package: 'hubValidations'
#> The following object is masked from 'package:dplyr':
#> 
#>     combine
library(testthat)
#> 
#> Attaching package: 'testthat'
#> The following object is masked from 'package:dplyr':
#> 
#>     matches

ex <- structure(
  list(
    location = c(
      "01", "01", "01", "01", "01", "01",
      "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01",
      "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01",
      "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01",
      "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01",
      "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01",
      "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01",
      "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01",
      "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01",
      "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01",
      "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01",
      "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01"
    ),
    reference_date = structure(c(
      19371, 19371, 19371, 19371, 19371,
      19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
      19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
      19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
      19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
      19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
      19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
      19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
      19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
      19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
      19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
      19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
      19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
      19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
      19371, 19371, 19371, 19371, 19371
    ), class = "Date"),
    horizon = c(
      0,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
    ),
    target_end_date = structure(c(
      19371, 19371, 19371, 19371,
      19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
      19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
      19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
      19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
      19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
      19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
      19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
      19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
      19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
      19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
      19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
      19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
      19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371, 19371,
      19371, 19371, 19371, 19371, 19371, 19371
    ), class = "Date"),
    target = c(
      "wk inc flu hosp",
      "wk inc flu hosp", "wk inc flu hosp", "wk inc flu hosp", "wk inc flu hosp",
      "wk inc flu hosp", "wk inc flu hosp", "wk inc flu hosp", "wk inc flu hosp",
      "wk inc flu hosp", "wk inc flu hosp", "wk inc flu hosp", "wk inc flu hosp",
      "wk inc flu hosp", "wk inc flu hosp", "wk inc flu hosp", "wk inc flu hosp",
      "wk inc flu hosp", "wk inc flu hosp", "wk inc flu hosp", "wk inc flu hosp",
      "wk inc flu hosp", "wk inc flu hosp", "wk flu hosp rate category",
      "wk flu hosp rate category", "wk flu hosp rate category", "wk flu hosp rate category",
      "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
      "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
      "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
      "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
      "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
      "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
      "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
      "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
      "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
      "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
      "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
      "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
      "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
      "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
      "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
      "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
      "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
      "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
      "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
      "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
      "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
      "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
      "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
      "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate",
      "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate", "wk flu hosp rate"
    ),
    output_type = c(
      "quantile", "quantile", "quantile", "quantile",
      "quantile", "quantile", "quantile", "quantile", "quantile", "quantile",
      "quantile", "quantile", "quantile", "quantile", "quantile", "quantile",
      "quantile", "quantile", "quantile", "quantile", "quantile", "quantile",
      "quantile", "pmf", "pmf", "pmf", "pmf", "cdf", "cdf", "cdf",
      "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf",
      "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf",
      "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf",
      "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf",
      "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf",
      "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf",
      "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf",
      "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf",
      "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf",
      "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf",
      "cdf", "cdf", "cdf", "cdf", "cdf", "cdf", "cdf"
    ),
    output_type_id = c(
      "0.01",
      "0.025", "0.05", "0.1", "0.15", "0.2", "0.25", "0.3", "0.35",
      "0.4", "0.45", "0.5", "0.55", "0.6", "0.65", "0.7", "0.75", "0.8",
      "0.85", "0.9", "0.95", "0.975", "0.99", "low", "moderate", "high",
      "very high", "0.25", "0.5", "0.75", "1", "1.25", "1.5", "1.75",
      "2", "2.25", "2.5", "2.75", "3", "3.25", "3.5", "3.75", "4",
      "4.25", "4.5", "4.75", "5", "5.25", "5.5", "5.75", "6", "6.25",
      "6.5", "6.75", "7", "7.25", "7.5", "7.75", "8", "8.25", "8.5",
      "8.75", "9", "9.25", "9.5", "9.75", "10", "10.25", "10.5", "10.75",
      "11", "11.25", "11.5", "11.75", "12", "12.25", "12.5", "12.75",
      "13", "13.25", "13.5", "13.75", "14", "14.25", "14.5", "14.75",
      "15", "15.25", "15.5", "15.75", "16", "16.25", "16.5", "16.75",
      "17", "17.25", "17.5", "17.75", "18", "18.25", "18.5", "18.75",
      "19", "19.25", "19.5", "19.75", "20", "20.25", "20.5", "20.75",
      "21", "21.25", "21.5", "21.75", "22", "22.25", "22.5", "22.75",
      "23", "23.25", "23.5", "23.75", "24", "24.25", "24.5", "24.75",
      "25"
    ),
    value = c(
      17, 44, 72, 105, 122, 125, 127, 128, 131, 132,
      133, 136, 139, 140, 141, 144, 145, 147, 150, 167, 200, 228, 255,
      0.220842781557067, 0.768398474282558, 0.0107282559276931, 3.04882326815914e-05,
      0.00853380042747561, 0.0135533534527697, 0.0208413454592117,
      0.0299015976961877, 0.0406341808051564, 0.0548617841719505, 0.0720793809659529,
      0.0929593384551091, 0.111686153095421, 0.236379680785012, 0.560349384758665,
      0.864271648664744, 0.89630163333021, 0.918320726070205, 0.937138052331475,
      0.952967021875378, 0.96520141962482, 0.974905656518415, 0.983169904293088,
      0.989315411865382, 0.993311235511738, 0.995919595802813, 0.997577247537977,
      0.998600149380724, 0.999213049267616, 0.999569631735671, 0.999771071520066,
      0.999881567193594, 0.999940419028641, 0.999970855156356, 0.999986139027956,
      0.999993591354829, 0.999997119649744, 0.999998741655025, 0.999999465680723,
      0.999999779493023, 0.999999911561791, 0.999999965530796, 0.999999986945038,
      0.999999995195438, 0.999999998281901, 0.999999999403044, 0.999999999798479,
      0.999999999933905, 0.999999999978939, 0.99999999999348, 0.999999999998039,
      0.999999999999427, 0.999999999999838, 0.999999999999955, 0.999999999999988,
      0.999999999999997, 0.999999999999999, 1, 1, 1, 1, 1, 1, 1, 1,
      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
    )
  ),
  row.names = c(
    NA,
    -127L
  ),
  class = c("tbl_df", "tbl", "data.frame")
)


res <- check_tbl_value_col_ascending(ex, file_path = "")
expect_s3_class(res, "check_success")
#> Error: `res` inherits from 'check_failure'/'hub_check'/'rlang_warning'/'warning'/'condition' not 'check_success'.
expect_null(res$error_tbl)
#> Error: res$error_tbl is not NULL
#> 
#> `actual` is an S3 object of class <tbl_df/tbl/data.frame>, a list
#> `expected` is NULL


# Test on character cdf values with consistent lengths. This should pass;
# this format was imposed by schema versions before v3.0.1, where the format
# of a character output_type_id was strict.
ex <- ex |>
  dplyr::filter(output_type == "cdf") |>
  dplyr::mutate(output_type_id = paste0(
    "EW2024",
    stringr::str_pad(seq_along(output_type),
      width = 2, pad = "0"
    )
  )) |>
  dplyr::slice(1:52)
ex[, c("output_type_id", "value")]
#> # A tibble: 52 × 2
#>    output_type_id   value
#>    <chr>            <dbl>
#>  1 EW202401       0.00853
#>  2 EW202402       0.0136 
#>  3 EW202403       0.0208 
#>  4 EW202404       0.0299 
#>  5 EW202405       0.0406 
#>  6 EW202406       0.0549 
#>  7 EW202407       0.0721 
#>  8 EW202408       0.0930 
#>  9 EW202409       0.112  
#> 10 EW202410       0.236  
#> # ℹ 42 more rows

res <- check_tbl_value_col_ascending(ex, file_path = "")
expect_s3_class(res, "check_success")
expect_null(res$error_tbl)


# Test on character cdf values with inconsistent lengths. After relaxation of
# the format requirements for cdf character output_type_ids, this sort of
# situation is now possible. We want this to pass, but at the minute it fails
# for the same reasons reported in #78.
ex <- ex |>
  dplyr::filter(output_type == "cdf") |>
  dplyr::mutate(output_type_id = paste0("EW", seq_along(output_type))) |>
  dplyr::slice(1:52)
ex[, c("output_type_id", "value")]
#> # A tibble: 52 × 2
#>    output_type_id   value
#>    <chr>            <dbl>
#>  1 EW1            0.00853
#>  2 EW2            0.0136 
#>  3 EW3            0.0208 
#>  4 EW4            0.0299 
#>  5 EW5            0.0406 
#>  6 EW6            0.0549 
#>  7 EW7            0.0721 
#>  8 EW8            0.0930 
#>  9 EW9            0.112  
#> 10 EW10           0.236  
#> # ℹ 42 more rows

res <- check_tbl_value_col_ascending(ex, file_path = "")
expect_s3_class(res, "check_success")
#> Error: `res` inherits from 'check_failure'/'hub_check'/'rlang_warning'/'warning'/'condition' not 'check_success'.
expect_null(res$error_tbl)
#> Error: res$error_tbl is not NULL
#> 
#> `actual` is an S3 object of class <tbl_df/tbl/data.frame>, a list
#> `expected` is NULL

Created on 2024-08-13 with reprex v2.1.0

@elray1 any thoughts on this?

@zkamvar
Member Author

zkamvar commented Aug 13, 2024

This was sadly introduced when we relaxed the formatting requirements for character cdf output type IDs in schema version v3.0.1, which is now coming back to bite us. In previous schemas, the requirement for character cdf output type IDs to conform to the regex pattern ^EW[0-9]{6} (e.g. "EW202240") meant the character sorting worked. However, a code such as EW1 - EW52 would now in theory be allowed and cause the same problems discussed in #78

(NOTE: adding hubverse-org/schemas#80 and the corresponding PR (hubverse-org/schemas#93) so that we have a more direct reference as the issue shows it being fixed by hubverse-org/schemas#88)

So, if I understand this correctly, we had a formatting requirement for cdf that was either a number (which this PR would fix) OR an epi week string, and then in hubverse-org/schemas#80 it was suggested to remove the restrictive requirement for an epi week string because "one could imagine collecting CDF predictions for an ordinal categorical variable other than epidemic weeks with strings naming the categories that don't match that format." The solution was to remove the requirement.

In order to solve #78, we need to enforce some format that we can sort numerically. Would a regex of [^0-9]{1,}[0-9]{1,} work? This would allow us to easily capture the numeric elements of things like EW42, epiweek42, PAPAT4, UB40, and 손50. I'm happy to open a new issue for this in the schemas repo (because if we end up with a hub that uses an epi week with a basis from when the 20th anniversary of the horror thriller franchise Halloween came out (HWEENH2OEW42), then our jobs become MUCH harder).
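For illustration, sorting under such a pattern could key on the numeric suffix — a rough sketch, not a concrete proposal for the implementation:

ids <- c("EW42", "EW1", "epiweek7", "UB40")
# strip the leading run of non-digits, then sort on what remains as a number
suffix <- as.numeric(sub("^[^0-9]+", "", ids))
ids[order(suffix)]
#> [1] "EW1"      "epiweek7" "UB40"     "EW42"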

@elray1
Contributor

elray1 commented Aug 13, 2024

I suggest that we address #78 by asserting/requiring that the hub has listed the values in their tasks.json file in the correct order, and do the sorting here in the same order as they were given in that config file. I think it's best to avoid relying on enforcing that the category labels/names can be sorted in a standard numeric or lexicographic order that aligns with the ordering of those categories. There are some examples of how a hub might need to specify this under point 1 in this discussion comment. The suggestion I made there is that even if the values are numeric (and so an ordering is implicit), we would just use the order that was given in that file.
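For example, a minimal sketch of that idea (config_order stands in for the output_type_id values as listed in tasks.json; this is not the actual implementation):

# sort the submission by each value's position in the config,
# not by any natural ordering of the labels themselves
config_order <- c("EW40", "EW52", "EW1", "EW12")  # hypothetical season order
tbl <- data.frame(
  output_type_id = c("EW1", "EW52", "EW40"),
  value = c(0.9, 0.7, 0.1)
)
tbl[order(match(tbl$output_type_id, config_order)), ]
#>   output_type_id value
#> 3           EW40   0.1
#> 2           EW52   0.7
#> 1            EW1   0.9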

@zkamvar
Member Author

zkamvar commented Aug 13, 2024

I suggest that we address #78 by asserting/requiring that the hub has listed the values in their tasks.json file in the correct order, and do the sorting here in the same order as they were given in that config file.

This makes sense. We cannot enforce the creation of this part of the config computationally, but it's a reasonable social expectation as long as it's documented well. If I am not mistaken, this is a field that hub administrators would set once and never touch again?

As I understand it from #70, modellers wanted to submit in any order but were met with this restriction; we worked around it by arranging by the output_type_id column, but that was the incorrect approach because of the sorting issue brought up in #78. For my own context, the configuration file has the quantiles in the correct order; it was just the model output that was unordered.

I will see what I can do to incorporate the order from the config file.

@annakrystalli
Member

Quick implementation question @elray1: how would we make use of any ordering split across optional & required properties?

@elray1
Contributor

elray1 commented Aug 13, 2024

As I understand it from #70, modellers want to submit in any order, but they were met with this restriction and we worked around it by arranging by the output_type_id column, but that was the incorrect approach because of the sorting issue brought up in #78. For my own context, the configuration file has the quantiles in the correct order, it was just the model output that was unordered.

This all sounds right to me. Just to be clear, I think we would still (potentially) need to re-order the submission by the output_type_id column. It's just that the order we impose on that would come from the config file rather than a "natural" ordering.

how would we make use of any ordering split across optional & required properties?

Good question that I hadn't thought of. It kind of seems intractable? I think I'd be inclined to say that for the cdf and pmf types (where this matters), it doesn't make sense to put some values in required and some in optional; they should either all be required or all be optional. Certainly this seems logical for the pmf type, where we would generally want the probabilities across all categories to sum to 1. It also feels reasonable to me for the cdf type. But I'm a little worried that I'm making up new rules here.

@zkamvar
Member Author

zkamvar commented Aug 13, 2024

Just to be clear, I think we would still (potentially) need to re-order the submission by the output_type_id column. It's just that the order we impose on that would come from the config file rather than a "natural" ordering.

Clear as an unmuddied lake under an expanse of endless blue sky 👍. My approach uses expand_model_out_grid to create a data frame from the config and then whittles that down to a two-column lookup table with quantile and cdf output_types. This is then used to do an inner join on the submission so that it automatically orders based on the lookup table.
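In sketch form (illustrative values; the real lookup is derived from expand_model_out_grid()):

lookup <- data.frame(
  output_type = "cdf",
  output_type_id = c("EW1", "EW2", "EW10")  # config order
)
tbl <- data.frame(
  output_type = "cdf",
  output_type_id = c("EW10", "EW1", "EW2"),
  value = c(0.9, 0.1, 0.5)
)
# the inner join keeps the lookup's row order, so the submission comes
# out in config order regardless of how it was submitted
dplyr::inner_join(lookup, tbl, by = c("output_type", "output_type_id"))
#>   output_type output_type_id value
#> 1         cdf            EW1   0.1
#> 2         cdf            EW2   0.5
#> 3         cdf           EW10   0.9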

@zkamvar
Member Author

zkamvar commented Aug 13, 2024

I think this is ready for review now. The hardest part was actually coming up with the test data due to the new requirement of a hub, but I was able to get it to work.

@annakrystalli
Member

@elray1 it does feel a little like making up rules on the fly, and I worry about how ad hoc this might look in the documentation. If there is definitely no reasonable sortable pattern we can enforce, can we at least make the requirement that output_type_id values be either required or optional for all output types? This is already the case for mean & median, and sample is its own thing altogether; we want to impose this for cdf and pmf, so can you see any reason for not doing the same for quantile too?

@zkamvar thanks for the speedy response to the feedback! The use of expand_model_out_grid(), while sensible, is sadly (at least currently) computationally quite expensive. I'm working on optimising the entire validation workflow, and modifications to expand_model_out_grid() are at the core of much of the work.

Would you mind holding back until that is all merged in shortly? This PR will likely require modification to make use of the coming optimisations, and I wouldn't want to merge it as is, as it would add to the current computational load.

@zkamvar
Member Author

zkamvar commented Aug 14, 2024

@zkamvar thanks for the speedy response to the feedback! The use of expand_model_out_grid(), while sensible, is sadly (at least currently) computationally quite expensive. I'm working on optimising the entire validation workflow, and modifications to expand_model_out_grid() are at the core of much of the work.

Would you mind holding back until that is all merged in shortly? This PR will likely require modification to make use of the coming optimisations, and I wouldn't want to merge it as is, as it would add to the current computational load.

Of course! I also realized this is solvable by directly pulling cdf and quantile values from the config without going through expand_model_out_grid(), but I can definitely hold off on this and I'll put a note in the initial comment to reflect that it's dependent on the other optimization task.

@annakrystalli
Member

annakrystalli commented Aug 14, 2024

I also realized this is solvable by directly pulling cdf and quantile values from the config without going through expand_model_out_grid()

In fact you were on the right track with the expand grid approach. To be safe when we're comparing submitted data to config settings, because a single round might have multiple model tasks, we need to split our tbl data across modeling tasks to make sure we're comparing data to the correct model task config. The only way to do that is by inner joining to expanded grids of each modeling task.

See for example what this new function is doing in terms of separating tbl data into separate model tasks:

match_tbl_to_model_task <- function(tbl, config_tasks, round_id,
                                    output_types = NULL, derived_task_ids = NULL,
                                    all_character = TRUE) {
  join_cols <- names(tbl)[names(tbl) != "value"]
  if (hubUtils::is_v3_config(config_tasks)) {
    tbl[tbl$output_type == "sample", "output_type_id"] <- NA
  }
  expand_model_out_grid(
    config_tasks,
    round_id = round_id,
    required_vals_only = FALSE,
    all_character = TRUE,
    as_arrow_table = FALSE,
    bind_model_tasks = FALSE,
    output_types = output_types,
    derived_task_ids = derived_task_ids
  ) %>%
    purrr::map(\(.x) {
      if (nrow(.x) == 0L) {
        return(NULL)
      }
      dplyr::inner_join(.x, tbl, by = join_cols)
    })
}

which is then used to compare the value column values to the config for each model task:
purrr::map2(
  .x = match_tbl_to_model_task(tbl, config_tasks,
    round_id, output_type,
    derived_task_ids = derived_task_ids
  ),
  .y = get_round_output_types(config_tasks, round_id),
  \(.x, .y) compare_values_to_config(
    tbl = .x, output_type_config = .y, output_type = output_type
  )
) %>%
  unlist(use.names = TRUE)

But the new functions will allow us to just subset for the output types of interest and ignore derived task IDs, which makes everything much faster and more efficient.

@elray1
Contributor

elray1 commented Aug 14, 2024

I think the optional/required issue is a real one that we should settle before merging this in. I haven't actually looked at the solution, but based on the description above I think it may not work as desired if, for example, the optional and required values are "interleaved", like what if a hub said "you must submit cdf predictions at every even integer from 0 to 100, but the remaining odd integers are optional". I believe the optional and required values would simply be concatenated out of order?

I thought about bringing this up on the hubverse devteam call today, but it felt like there were too few people on the call. If it seems reasonable to you two, I suggest that we hold off on this discussion for a couple of weeks till Nick is back, settle that piece in a devteam call, and let that inform how we deal with this PR and determine the ordering.

@zkamvar
Member Author

zkamvar commented Aug 14, 2024

I think it may not work as desired if for example the optional and required values have "interleaved" values, like what if a hub said "you must submit cdf predictions at every even integer from 0 to 100, but the remaining odd integers are optional". I believe the optional and required values would simply be concatenated out of order?

That's a correct assessment since the default expand_model_out_grid() sets required_vals_only = FALSE, which does indeed concatenate:

process_grid_inputs <- function(x, required_vals_only = FALSE) {
  if (required_vals_only) {
    purrr::map(x, ~ .x %>% purrr::map(~ .x[["required"]]))
  } else {
    purrr::modify_depth(x, .depth = 2, ~ unlist(.x, use.names = FALSE))
  }
}
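So for the interleaved example above, the required and optional values come out grouped rather than in numeric order (illustrative values):

x <- list(required = c("0", "2", "4"), optional = c("1", "3"))
# required values first, then optional — not numeric order
unlist(x, use.names = FALSE)
#> [1] "0" "2" "4" "1" "3"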

That being said, I think this can work if we do not assert any relationship between required and optional values (which, yes, is kinda silly since they are literally coming from the same model, but it would still be an improvement over not having the ability to validate ascending values for character output_type_id at all). It would be the following process (sketched in code after this list):

  1. create lookup table of required values via expand_model_out_grid(required_vals_only = TRUE)
  2. do an inner join and validate required values
  3. do an anti-join to check for any output_type_ids that do not match (indicating optional values)
    i. if there are any rows, then create another lookup table that includes the optional values,
    ii. perform an inner join and validate
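A rough sketch of those steps (req_lookup and all_lookup are stand-ins for lookup tables built with expand_model_out_grid(), and check_ascending() is a hypothetical stand-in for the actual check):

join_cols <- c("output_type", "output_type_id")

# 1. validate required values against the required-only lookup
required <- dplyr::inner_join(req_lookup, tbl, by = join_cols)
check_ascending(required)

# 2. any submitted rows not matched by the required lookup are optional
optional <- dplyr::anti_join(tbl, req_lookup, by = join_cols)

# 3. if optional values were submitted, re-join against the full lookup
#    (required + optional) and validate again
if (nrow(optional) > 0L) {
  check_ascending(dplyr::inner_join(all_lookup, tbl, by = join_cols))
}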

@annakrystalli
Member

I agree @elray1 , let's discuss a bit more with others before pushing ahead.

@zkamvar
Member Author

zkamvar commented Oct 31, 2024

Note that this should be revisited for schemas v4

Ensuring a mix of numeric and character values are sorted correctly
involves creating a separate column for the numerics and using that as
the first sort key. This then allows the original method of validating
ascent to work.

Another solution is to split the data frame into a list and then perform a
numeric sort if the column can be transformed, but the coercion would
probably have an equivalent amount of overhead.

This will fix #78
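A minimal sketch of that sort-key idea (illustrative values only, not the committed implementation):

ids <- c("10", "2", "EW1", "1")
key <- suppressWarnings(as.numeric(ids))  # NA for genuinely character IDs
ids[order(key)]  # numeric values sort numerically; character IDs sort last
#> [1] "1"   "2"   "10"  "EW1"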
This is a good example of the need to add both positive and negative
checks.

I made the following mistakes initially:

1. I had accidentally added a `cfg` output type instead of `cdf` (dang
   my dyslexia)
2. I did not convert the model output tbl output type to use `cdf` instead
   of `quantile`

Both of these were discovered when I added the failure condition test.
Now that I understand expand_model_out_grid better, I know now that I
can subset it explicitly to cdf or quantile output to reduce both time
and memory pressure.
@zkamvar zkamvar removed the on-hold label Jan 3, 2025
Contributor

@elray1 elray1 left a comment


Had a couple of minor suggestions and a question about what you meant in one comment.

elray1 and others added 13 commits January 7, 2025 09:20
This aligns with how the data would be presented to the function during
validation
add failing test for check of ascending columns
My initial implementation was incorrect because I was sorting purely by
output_type and output_type_id, but Evan brought up the fact that it was
possible for two targets to have separate sets of output type IDs, which
would end up being sorted incorrectly.

I had initially avoided creating a lookup table with the full column
because of the type errors I was getting when trying to compare a date
column with a character column. This is solved by setting `all_character
= FALSE` in `expand_model_out_grid()` and coercing both output_type_id
columns to character.
After the previous commit, this particular test began to fail because
the check started to pass.

During a coworking session with @elray1, we found that the reason the
check was suddenly passing was that the target "wk ahead inc covid hosp"
was not a valid target according to the hub, so
`order_output_type_ids()` returned an empty table. The target's presence
was not adding anything important when testing for ascending values with
the output type. Because of this, we determined that this particular
aspect of the test was not necessary for testing the validity of this
function.

I have added a TODO in case this was intentional.
@zkamvar zkamvar requested a review from elray1 January 7, 2025 18:34
round_id = round_id,
all_character = FALSE,
force_output_types = TRUE,
output_types = only_cdf_or_quantile
Member

@annakrystalli annakrystalli Jan 8, 2025


Here it would be much cleaner to use bind_model_tasks = FALSE. This returns a list of tibbles, one for each model task, keeping values associated with different model tasks separate. You can then map over any model tasks, processing them separately. This would cleanly get round the issue @elray1 described of different quantile levels in different model tasks.
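For example, roughly (a sketch; the surrounding variables, including join_cols, are illustrative):

grids <- expand_model_out_grid(
  config_tasks,
  round_id = round_id,
  all_character = FALSE,
  force_output_types = TRUE,
  output_types = only_cdf_or_quantile,
  bind_model_tasks = FALSE  # returns one tibble per model task
)
# map over model tasks, joining the submission onto each grid separately
purrr::map(grids, \(grid) dplyr::inner_join(grid, tbl, by = join_cols))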

Member Author


TIL about this feature 😮

I'll give it a go!

Member


Honestly, given the fact that this already works, I don't think it's strictly necessary to implement. Good to know of the functionality though as it's really useful for separating submission files into model tasks by joining them onto an expanded grid created with bind_model_tasks = FALSE!

@annakrystalli
Member

It's the end of the day, so I apologise that I haven't looked very deeply at this. In general it looks sound to me, as you both seem to have thrashed out all the issues in good detail. I made a minor suggestion but don't feel it's required, as the functionality already works.

Changes to tests all look good to me. Excellent work both!!

Contributor

@elray1 elray1 left a comment


this all looks good to me -- thanks!

@zkamvar
Member Author

zkamvar commented Jan 8, 2025

Thank you both for being patient with me on this! I've learned a lot with this process.

@zkamvar zkamvar merged commit 81efa09 into main Jan 8, 2025
8 checks passed
@zkamvar zkamvar deleted the znk/78/fix-ascending-check branch January 8, 2025 19:14
@annakrystalli
Member

Sorry! My brain was fried by arrow earlier and I didn't think through the performance implications of creating one expanded grid for both output types and all model tasks in one go. If we find this now slows some validation workflows down, we could just perform the check one output type at a time. Something to monitor.

More importantly we should definitely ignore derived task IDs in both the expanded grid and tbl asap as this can have a real impact on performance.

@zkamvar would you be able to open a couple of issues for these? And ideally address the derived IDs issue asap? There are a few validation functions where something similar is happening (e.g. checking that all values are valid, off the top of my head), so it should be pretty straightforward to implement.

@zkamvar
Member Author

zkamvar commented Jan 9, 2025

would you be able to open a couple of issues for these? And ideally address the derived IDs issue asap?

I can do that. Luckily we haven't released this yet, so it should not impact any hubs.


Successfully merging this pull request may close these issues.

Check on ascending order of cdf values incorrect when output_type_id data type is character