Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate coarse-grained samples #88

Closed
annakrystalli opened this issue Jun 17, 2024 · 15 comments · Fixed by #99
Closed

Validate coarse-grained samples #88

annakrystalli opened this issue Jun 17, 2024 · 15 comments · Fixed by #99

Comments

@annakrystalli
Copy link
Member

annakrystalli commented Jun 17, 2024

Currently, samples must strictly match the compound task ID set expectations and cannot handle coarser-grained compound task ID sets. However, in our documentation we state that this should be allowed.

This will need more work to implement but it would be good to understand the purpose of this and details of implamentation.

Questions:

  1. Can different teams be submitting completely different coarser grained samples (i.e. completely different compound idxs?)
  2. Must the compound idx be consistent at least through an individual sample?
  3. Does this affect the way we infer sample dependence (e.g. Infer dependence structure for samples hubData#6)? i.e. will different samples be allowed to have different dependence structure? Will it affect plotting in any way? add ability to plot sample output_type data hubVis#31
  4. If target is a compound task id, would it ever make sense to have a sample associated with more than one target? Would that be a multivariate output of a model that e.g. predicts hospitalisation + deaths together?
@annakrystalli
Copy link
Member Author

@elray1 , @nickreich and @LucieContamin would appreciate your thoughts!

@LucieContamin
Copy link
Contributor

Please find here my answers to your questions:

  1. My understanding is that in a same round, teams can submit files with different compound ids but they should at least include the minimal level expected from the hub. For example, if a hub has a round defined with 6 task id columns: "origin_date", "scenario_id", "location", "target", "age_group" and "horizon" with "compound_taskid_set": ["origin_date", "scenario_id", "location", "target"]. A team can submit a file with "compound_taskid_set": ["origin_date"] and another with "compound_taskid_set": ["origin_date", "scenario_id", "target"]. However, a file with "compound_taskid_set": ["origin_date", "age_group"] should not be accepted.

  2. I am not sure I totally understand the question here. For me, an individual sample should have a consistent compound id that match the minimal level expected from the hub. And all the samples should follow the same compound id level.

  3. In my experience, to have different model output file for the same round with different level of "sample dependence" does impact the visualization but it's easy to adapt to it but I guess it depends what you want to do with the model output (same for data processing). Once you infer the structure of the samples of the files, it's easy to adapt as necessary, especially as all the model output should at least follow the same minimal structure.

  4. I don't understand that one, so will wait the answers from the others!

@annakrystalli
Copy link
Member Author

I would like to start work on this and have come up with a mechanism to implement it.
However, there are still things I am confused about.

Specifically, I can see problems with:

My understanding is that in a same round, teams can submit files with different compound ids but they should at least include the minimal level expected from the hub. For example, if a hub has a round defined with 6 task id columns: "origin_date", "scenario_id", "location", "target", "age_group" and "horizon" with "compound_taskid_set": ["origin_date", "scenario_id", "location", "target"]. A team can submit a file with "compound_taskid_set": ["origin_date"] and another with "compound_taskid_set": ["origin_date", "scenario_id", "target"]. However, a file with "compound_taskid_set": ["origin_date", "age_group"] should not be accepted.

My worry with this is that if a team submits two files for the same round_id (ignoring issues of file naming for now!)
with different compound_taskid_sets, when it comes to accessing data or determining the compound_taskid_set from samples, it if they have used integers (say from 1:300) in both files, it could be difficult to ensure that rows with the same output type id but different "compound_taskid_set" are not mixed up as some of the finer grained samples will inevitably match up to the values of some of the coarser grained samples so if, by change, the output type id also coincides, then the compound_taskid_set will be impossible to tell apart.

We could specify that such hubs should require a character output type id but I'm not sure how we would enforce that to avoid the above situation.

@elray1
Copy link
Contributor

elray1 commented Jul 2, 2024

I agree with Lucie's answers to 1 through 3. Clarifying the discussion about 1, I think Lucie was talking about submissions from 2 different teams, or submissions from 1 team in 2 rounds, rather than 2 submissions from a single team for the same round.

For 4, I think it's fair to treat target as just another task id variable for our purposes here. It could make sense to include target in the compound_taskid_set specification, in which case we're allowing for models to produce separate predictions for each target level (e.g., for hospitalizations and deaths separately, or for flu, covid, and rsv separately), or to leave target out of the compound_taskid_set, in which case the hub would be asking for joint predictions of hospitalizations and deaths, or of flu, covid and rsv.

@annakrystalli
Copy link
Member Author

annakrystalli commented Jul 2, 2024

I agree with Lucie's answers to 1 through 3. Clarifying the discussion about 1, I think Lucie was talking about submissions from 2 different teams, or submissions from 1 team in 2 rounds, rather than 2 submissions from a single team for the same round.

Thanks @elray1!

OK great. So just to confirm, you feel samples with different compound_taskid_sets will be clearly distinguishable from each other correct? i.e. once all data is accessed as an arrow dataset

@elray1
Copy link
Contributor

elray1 commented Jul 2, 2024

One more thought about the third question, "Does this affect the way we infer sample dependence (e.g. hubverse-org/hubData#6)? i.e. will different samples be allowed to have different dependence structure?"

I think that for a single submission and target, we should expect all samples to have the same dependence structure. The only exception to this I can think of would be an ensemble that combines samples from models that had different approaches to dependence. But even in a case like that, I think it would be reasonable to say the ensemble should update the sample indices to get a consistent representation of dependence.

@LucieContamin
Copy link
Contributor

I agree with Lucie's answers to 1 through 3. Clarifying the discussion about 1, I think Lucie was talking about submissions from 2 different teams, or submissions from 1 team in 2 rounds, rather than 2 submissions from a single team for the same round.

That is what I was trying to say, thanks for clarifiying.

For 4, Now I think I understand, I agree with Evan's answer.

@elray1
Copy link
Contributor

elray1 commented Jul 2, 2024

samples with different compound_taskid_sets will be clearly distinguishable from each other

Yeah, I think that's right. Those samples will have to come either from different models, or different rounds, or different task groups within a round. Any time we're working with samples, we'll have to be aware that the sample indices are only distinct within combinations of model_id, round (, and task group/target?? actually maybe we are guaranteeing that within a submisison file, the sample indices are different across different task groups?). But as long as we keep track of that, we should be able to distinguish between samples that are submitted with the same sample index. Does that seem right?

@annakrystalli
Copy link
Member Author

Trying to push through with the implementation of this, I'm having difficulty thinking through the logic in certain situations.

I'm trying to determine the compound_taskid_set of each sample from the data, first to check that the compound_task_id set is the same or coarser than what is allowed an dthen to use the determined compound_taskid_set for further validations.

The only way I see to determine this from the data itself is to look at the unique values in each task id in each sample and wherever it's 1, consider that task id as one of the compound task ids.

Ηowever, I'm coming up against issues with approach.

Say for example we have the following submission:

# A tibble: 500 × 8
   location reference_date horizon target_end_date target          output_type output_type_id value
   <chr>    <chr>          <chr>   <chr>           <chr>           <chr>                <int> <chr>
 1 01       2022-10-22     1       2022-10-29      wk inc flu hosp sample                   1 95   
 2 01       2022-10-22     1       2022-10-29      wk inc flu hosp sample                   2 83   
 3 01       2022-10-22     1       2022-10-29      wk inc flu hosp sample                   3 89   
 4 01       2022-10-22     1       2022-10-29      wk inc flu hosp sample                   4 96   
 5 01       2022-10-22     1       2022-10-29      wk inc flu hosp sample                   5 68   
 6 01       2022-10-22     1       2022-10-29      wk inc flu hosp sample                   6 46   
 7 01       2022-10-22     1       2022-10-29      wk inc flu hosp sample                   7 85   
 8 01       2022-10-22     1       2022-10-29      wk inc flu hosp sample                   8 85   
 9 01       2022-10-22     1       2022-10-29      wk inc flu hosp sample                   9 95   
10 01       2022-10-22     1       2022-10-29      wk inc flu hosp sample                  10 53   
# ℹ 490 more rows
# ℹ Use `print(n = ...)` to see more rows

to a hub configured with the this tasks.json, specifically this section, which specifies a compound_taskid_set of ["reference_date", "location"].

Using the afforementioned rule, all task IDs (i.e. location, reference_date, horizon, target_end_date and target) appear to be part of the compound_tasksid_set) and the file should fail validations with respect to compound_tasksid_set and supplying finer grained samples.

However, in this example, there is only a single target and all horizons are optional. Hence both target only has a single possible value per sample while a single unique value for horizon (and by extension target_end_date) is technically allowed by the hub.

Any ideas how to get round this? Am I missing something really basic here?

@elray1
Copy link
Contributor

elray1 commented Jul 8, 2024

I think your proposed rule is essentially right (if you're doing this check within groups defined by the output_type_id; it'd be helpful to see the subset of this example submission with output_type_id == 1 to get more clarity on what we expect for the results of the validation), but you're right that the issues with target and horizon are tricky. I don't think you're missing anything basic. That said, here are some thoughts about how we might handle the cases you pointed out here:

  1. For a case like target where there is a single possible value: we could either say
    1. the hub should put variables like this in its compound_taskid_set ("joint across" a single level of a variable is not very meaningful).
    2. it's valid for the hub to put this in the compound_taskid_set or not, as both setups are mathematically equivalent (a distribution that is specific to a level of a variable is the same as one that is "joint across" the one level of that variable). When making a determination about the compound taskid set for a submission, for variables like this we'll look up what the hub did and replicate that.
  2. For a case like horizon where there are multiple optional values, I think we would need to check whether there are any other output_type_ids (i.e., sample indices) with values of the horizon other than the one included in this sample. If so, then the sample has horizon in its compound_taskid_set and it should fail validation. If not, then I think with logic similar to that from point 1.ii, we could check what the hub listed and use that.

For item 1, I think I have a mild preference for option ii since it's more forgiving to hub admins setting up their files and there will not be any downstream functionality where the inclusion or not of the variable in question in the compound_taskid_set impacts how things work.

@annakrystalli
Copy link
Member Author

annakrystalli commented Jul 11, 2024

Thanks for the feedback and suggestions @elray1 . After some though and processing, on top of counting the unique values of each task ID within a sample , I've implemented the following additional check:

  • Count the number of unique values for each task ID column across the entire tbl (not just at sample level) and return TRUE if greater than 1. The aim of this check is to distinguish between false compound task id positives, resulting from task ids that can only take a single value or where only 1 value is required or all are optional and only one has been provided, so here we also check whether the value is consistent across other samples. If a single unique value is consistent across all samples for a task id, it cannot be considered a compound_taskid and therefore FALSE will be returned for that task id. Note that, because this check is looking to weed out false positives, it only applies to task ids that are not members of the compound_taskids set. compound_taskids set columns will always be set to TRUE in this check.
  • Then I combine these two checks at sample level to determine which task ids are true compound task IDs, i.e. task IDs that have a unique value within samples but non unique values across samples.

This means I've effectively implemented 1.ii, i.e. admins are not forced to add single value task_ids to the compound_taskid_set for the check to work.

The specifics of the implementation can be found here:

# Count number of unique values per sample (output_type_id) for each task ID column and
# return TRUE if 1 which would indicate a task ID can be considered a compound_taskid
spl_uniq <- tbl[, c(task_ids, out_tid)] |>
dplyr::group_by(.data[[out_tid]]) |>
dplyr::summarise(
dplyr::across(
dplyr::everything(),
function(.x) {
dplyr::n_distinct(.x) == 1L
}
)
)
# Count number of unique values for each task ID column across tbl and
# return TRUE if greater than 1.
# The aim of this check is to distinguish between false positives, resulting
# from task ids that can only take a single value or where only 1 value
# is required or all are optional and only one has been provided, so here we also check
# whether the value is consistent across other samples. If a single unique value
# is consistent across all samples for a task id, it cannot be considered a
# compound_taskid and therefore FALSE will be returned. Note that this check
# only applies to task ids that are not members of the compound_taskids set.
# compound_taskids set vars will always be set to TRUE in this check.
tbl_non_uniq <- tbl[, task_ids] |>
purrr::map(~ dplyr::n_distinct(.x) > 1L) |>
tibble::as_tibble()
tbl_non_uniq[, config_comp_tids] <- TRUE
# Create a sample (output_type_id) vs compound_taskid membership boolean matrix.
# For a task id to be considered a compound_taskid, a unique value must be TRUE
# at the sample level and we need to ensure it's not a false positive.
# Here we combine the two checks.
is_compound_taskid <- apply(spl_uniq[, task_ids], 1,
function(x) x & tbl_non_uniq[, task_ids],
simplify = FALSE
) |> purrr::reduce(rbind)

@elray1
Copy link
Contributor

elray1 commented Jul 11, 2024

Thanks Anna, I think what you described makes sense. This raised one question for me more generally: what are we doing about checking or enforcing consistency of compound task id sets across different task groups within the same round? Here's an example config to make the question concrete. In the first task group there are multiple levels for age_group and they are not listed in the compound_taskid_set, but in the second task group there is one (different) level for age_group and it is listed in the compount_taskid_set. Is this kind of thing allowed? How is it handled?

    "rounds": [
        {
            "round_id_from_variable": true,
            "round_id": "reference_date",
            "model_tasks": [
                {
                    "task_ids": {
                        "reference_date": {
                            "required": null,
                            "optional": ["2022-10-22", "2022-10-29"]
                        },
                        "target": {
                            "required": null,
                            "optional": ["wk flu hosp"]
                        },
                        "age_group": {
                            "required": ["0-5", "6-18", "19-40", "41-65", "66+"],
                            "optional": null
                        },
                        "location": {
                            "required": null,
                            "optional": [
                                "US",
                                "01",
                                "02"
                            ]
                        },
                    },
                    "output_type": {
                        "sample": {
                            "output_type_id_params": {
                                "is_required": true,
                                "type": "integer",
                                "min_samples_per_task": 100,
                                "max_samples_per_task": 100,
                                "compound_taskid_set" : ["reference_date", "location"]
                            },
                            "value": {
                                "type": "integer",
                                "minimum": 0
                            }
                        }
                    }
                },
                {
                    "task_ids": {
                        "reference_date": {
                            "required": null,
                            "optional": ["2022-10-22", "2022-10-29"]
                        },
                        "target": {
                            "required": null,
                            "optional": ["wk rsv hosp"]
                        },
                        "age_group": {
                            "required": ["NA"],
                            "optional": null
                        },
                        "location": {
                            "required": null,
                            "optional": [
                                "US",
                                "01",
                                "02",
                            ]
                        },
                    },
                    "output_type": {
                        "sample": {
                            "output_type_id_params": {
                                "is_required": true,
                                "type": "integer",
                                "min_samples_per_task": 100,
                                "max_samples_per_task": 100,
                                "compound_taskid_set" : ["reference_date", "location",  "age_group"]
                            },
                            "value": {
                                "type": "integer",
                                "minimum": 0
                            }
                        }
                    }
                }


@annakrystalli
Copy link
Member Author

Good question! and yes, this is allowed. How it's handled is each round is evaluatated separately.

The higher level function (and really all checks evaluating samples) split the table between data being submitted to different modeling tasks and determine modeling task level compound_taskid_sets (see

mt_compound_taskids <- get_round_compound_task_ids(
config_tasks,
round_id
)
mt_tbls <- purrr::map(
.x = expand_model_out_grid(
config_tasks = config_tasks,
round_id = round_id,
all_character = TRUE,
include_sample_ids = FALSE,
bind_model_tasks = FALSE
),
function(.x) {
if (!has_spls_tbl(.x)) {
return(NULL)
}
dplyr::inner_join(tbl, .x[, names(.x) != out_tid],
by = setdiff(names(tbl), out_tid)
)
}
)
)

and then perform each check at the modeling task level:

tbl_compound_taskids <- purrr::map2(
mt_tbls, mt_compound_taskids, function(.x, .y) {
get_mt_compound_taskid_set(.x, .y, config_tasks,
error = error, call = call
)
}

@annakrystalli annakrystalli self-assigned this Jul 17, 2024
@annakrystalli annakrystalli moved this from Todo to In Progress in hubverse Development overview Jul 17, 2024
@annakrystalli
Copy link
Member Author

I've hit a problem in completing this work which i've detailed here: https://github.com/orgs/hubverse-org/discussions/26

Even if we solved the issues described above, I generally worry that auto-detecting compound_taskid_set during validation might cause further unforseen errors down the line.

As such, I was wondering if this is something submitting teams should actually record and communicate, perhaps in their model metadata file? Of course this assumes a single model would have a single compound task ID structure.

Anyways, thoughts on the above welcome.

@annakrystalli
Copy link
Member Author

Adding that whether we can implement the above suggestion largely depends on how stable within a model sample structures are or whether any variation can be captured in the model metadata effectively to be able to be used by validation functions. If this was possible, I would consider this declarative approach much more robust.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

Successfully merging a pull request may close this issue.

3 participants