#299 - Sample list ease of use for cohort extracts #7272
Conversation
```diff
 # query_labels is string that looks like 'key1=val1, key2=val2'
-if len(query_labels) != 0:
+if query_labels is not None and len(query_labels) != 0:
```
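For context, a minimal sketch of how a `'key1=val1, key2=val2'` string might be parsed behind that `None` guard. The function name and exact parsing are my assumptions for illustration, not the PR's code:

```python
def parse_query_labels(query_labels):
    """Parse a 'key1=val1, key2=val2' string into a dict.

    Returns an empty dict when query_labels is None or empty,
    mirroring the guard added in the diff above.
    """
    if query_labels is None or len(query_labels) == 0:
        return {}
    labels = {}
    for pair in query_labels.split(","):
        # partition tolerates a missing '=' (value becomes "")
        key, _, value = pair.strip().partition("=")
        labels[key] = value
    return labels
```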
ooooh thank you!
```diff
 global client
 # this is where a set of labels are being created for the cohort extract
 query_labels_map = {}
-query_labels_map["id"]= {output_table_prefix}
 query_labels_map["gvs_tool_name"]= f"gvs_prepare_callset"
+query_labels_map["id"]= output_table_prefix
```
thank you!!!
```python
if not (bool(re.match(r"[a-z0-9_-]+$", key)) & bool(re.match(r"[a-z0-9_-]+$", value))):
    raise ValueError(f"label key or value did not pass validation--format should be 'key1=val1, key2=val2'")
```
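As a side note for that eventual testing: `re.fullmatch` with a logical `and` is an equivalent but clearer form of the `re.match(...)$` / bitwise-`&` combination above. A sketch (function and constant names are mine, not the PR's):

```python
import re

# Allowed characters for BigQuery-style label keys/values (per the regex above)
LABEL_RE = re.compile(r"[a-z0-9_-]+")

def validate_label(key, value):
    """Raise ValueError if key or value contains characters outside [a-z0-9_-].

    re.fullmatch anchors at both ends, so no trailing '$' is needed,
    and `and` short-circuits instead of evaluating both sides like `&`.
    """
    if not (LABEL_RE.fullmatch(key) and LABEL_RE.fullmatch(value)):
        raise ValueError(
            f"label key or value {key!r}={value!r} did not pass validation"
            " -- format should be 'key1=val1, key2=val2'"
        )
```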
I want to add testing for all of this, but am procrastinating it because it seems like this will become Java at some point soon, but maybe that's not true....
```python
table_id = f"{fq_temp_table_dataset}.{EXTRACT_SAMPLE_TABLE}"

job_labels = client._default_query_job_config.labels
job_labels["gvs_query_name"] = "load-sample-names"
```
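One thing worth flagging: `client._default_query_job_config` is a private attribute of the BigQuery client, and mutating its shared `labels` dict means one query's name can leak into the next. A hedged pure-Python sketch of an alternative (names are my own, and the label key is pulled into a constant):

```python
# Assumed constant for the label key used repeatedly in this script.
GVS_QUERY_NAME_LABEL = "gvs_query_name"

def labels_for_query(default_labels, query_name):
    """Return a fresh labels dict for one query.

    Copies rather than mutates, so setting one query's name does not
    alter the shared default job config. The result could be passed to
    a per-query job config instead of editing the client's default.
    """
    labels = dict(default_labels or {})
    labels[GVS_QUERY_NAME_LABEL] = query_name
    return labels
```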
note to self: "gvs_query_name" should prob be a const
I had some comments around the prefix but they're non blocking; the changes look good and def work for AoU 👍 .
```python
sql = f"select m.sample_id from `{fq_cohort_sample_names}` c JOIN `{fq_sample_mapping_table}` m ON (m.sample_name = c.sample_name)"

def load_sample_names(sample_names_to_extract, fq_temp_table_dataset):
    schema = [ bigquery.SchemaField("sample_name", "STRING", mode="REQUIRED") ]
    table_id = f"{fq_temp_table_dataset}.{EXTRACT_SAMPLE_TABLE}"
```
nit: `table_id` -> `fq_sample_table`?
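A sketch of that suggested rename as a small helper. The value of `EXTRACT_SAMPLE_TABLE` here is assumed for illustration; the real constant is defined elsewhere in the script:

```python
# Assumed placeholder value; the actual constant lives elsewhere in the script.
EXTRACT_SAMPLE_TABLE = "sample_names"

def fq_sample_table(fq_temp_table_dataset):
    """Build the fully-qualified sample table id from a 'project.dataset' string.

    'fq_sample_table' is the name proposed in the nit above; the fq_ prefix
    signals that the returned id includes project and dataset.
    """
    return f"{fq_temp_table_dataset}.{EXTRACT_SAMPLE_TABLE}"
```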
```diff
@@ -13,8 +13,7 @@ workflow GvsExtractCallset {
     File reference_index
     File reference_dict

-    String fq_samples_to_extract_table
-    String fq_cohort_extract_table
+    String fq_cohort_extract_table_prefix
```
hm, I'm wondering if the previous way of providing the fq_tables explicitly is more flexible, since it ties the implementation a little less to PrepareCallset. Now, the `fq_samples_to_extract_table` and `fq_cohort_extract_table` would need to be part of the same dataset even though they technically don't have to be.
That said, this works just fine for AoU. Just thinking about some other use cases.
```python
if __name__ == '__main__':
    parser = argparse.ArgumentParser(allow_abbrev=False, description='Extract a cohort from BigQuery Variant Store ')

    parser.add_argument('--fq_petvet_dataset',type=str, help='project.dataset location of pet/vet data', required=True)
    parser.add_argument('--fq_temp_table_dataset',type=str, help='project.dataset location where results should be stored', required=True)
    parser.add_argument('--fq_destination_dataset',type=str, help='project.dataset location where results should be stored', required=True)
    parser.add_argument('--destination_table',type=str, help='destination table', required=True)
    parser.add_argument('--fq_cohort_sample_names',type=str, help='FQN of cohort table to extract, contains "sample_name" column', required=True)
    parser.add_argument('--destination_cohort_table_prefix',type=str, help='prefix for destination tables', required=True)
```
similar comment to above - passing around a prefix works well if PrepareCallset and ExtractCallset are only meant to be used with each other, but from the perspective of someone who just wants to run PrepareCallset, they wouldn't know the fq names of the tables generated without reading the implementation and seeing that `__DATA` and `__SAMPLE` suffixes are being added.

In that regard, I think the explicit fq table names are nice, but I think a comment in the argument's help section will also suffice. It could also be good to give an example of a valid prefix. It's not immediately clear if the prefix should contain just the table name prefix or the project and dataset as well.
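To make the reviewer's point concrete: under the naming scheme described in this PR (the `__SAMPLES` and `__DATA` suffixes come from the PR description; the prefix value and function name below are hypothetical), the derived table names look like this:

```python
# Suffixes taken from the PR description; the table naming convention
# is what PrepareCallset applies to the supplied prefix.
SAMPLES_SUFFIX = "__SAMPLES"
DATA_SUFFIX = "__DATA"

def fq_table_names(fq_prefix):
    """Derive both fully-qualified table names from one prefix.

    fq_prefix is assumed to carry project and dataset as well as the
    table-name prefix, e.g. 'my-project.my_dataset.my_cohort'
    (hypothetical value).
    """
    return fq_prefix + SAMPLES_SUFFIX, fq_prefix + DATA_SUFFIX
```

So a prefix of `my-project.my_dataset.my_cohort` would yield `my-project.my_dataset.my_cohort__SAMPLES` and `my-project.my_dataset.my_cohort__DATA`, which could serve as the example in the argument's help text.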
Good idea about the comments, I'll take a look at that. I wish there were some other way to create collections of related tables, and we could give that name instead. Datasets would work for that, but they're much heavier weight, and permissions-wise are a higher bar than tables.
LGTM
A collection of changes to enhance reliability and ease of use. Users no longer have to make a table containing the sample names to extract in GvsPrepareCallset (which was painful), and they don't have to re-supply that same table when rendering the VCF in GvsExtractCallset (which was error prone).

GvsPrepareCallset now takes a file of sample names as a parameter, as well as an export table prefix. The main `sample_info` table is then subset to the sample names in the supplied file and stored in the table `{export_prefix}__SAMPLES`. The export table is created and is now named `{export_prefix}__DATA`. GvsExtractCallset now only needs to take this export prefix, and is able to get the sample list and data it needs from these tables.
@ericsong -- does this fit the AoU use case well?