updates to ImportGenomes and LoadBigQueryData #7112
closing - refactoring to one WDL
I've got a bunch of questions, but I think they mostly stemmed from the previous structure of Load and Create tables... We should clean that up, but we can also split that out of this PR and make it a different ticket (or part of the productionization ticket).
String numbered
String partitioned
String uuid
Array[String] tsv_creation_done
what does this do?
Requiring this as an input ensures that CreateImportTSVs runs before CreateTables can start. (Removed here but added into LoadTables.)
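For context, a minimal sketch of that dummy-dependency pattern (workflow and task bodies here are hypothetical stand-ins, not the real ImportGenomes code): Cromwell orders calls by data flow, so threading an otherwise-unused output of the scattered TSV step into the loading step forces it to wait for every shard.

```wdl
version 1.0

workflow OrderingSketch {
  input {
    Array[String] table_names
  }

  scatter (name in table_names) {
    call CreateImportTSVs { input: table_name = name }
  }

  call LoadTables {
    input:
      # Unused inside the task; gathering every shard's `done` output means
      # LoadTables cannot start until all CreateImportTSVs calls finish.
      tsv_creation_done = CreateImportTSVs.done
  }
}

task CreateImportTSVs {
  input {
    String table_name
  }
  command <<<
    echo "writing ~{table_name}.tsv"
  >>>
  output {
    String done = "true"
  }
}

task LoadTables {
  input {
    Array[String] tsv_creation_done
  }
  command <<<
    echo "loading tables"
  >>>
}
```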
run CreateTables concurrently, clean up old code, LoadTable not preemptible, rename numbered to superpartitioned
schema = metadata_schema,
superpartitioned = "false",
partitioned = "false",
uuid = "",
I haven't used the UUID piece before, I think it was from earlier testing but now I would just create a new dataset instead of tables with a prefix. Remove it? (@ahaessly wdyt?)
This was definitely used for automated integration testing. I think Megan added it. If we wanted to add a uuid to the dataset, I think we would need to create that dataset outside of this wdl. But we should be able to do that in the test itself. Assuming we are not running that integration test, I would say let's go ahead and remove it.
ok let's keep it for now until we decide what we're doing with integration testing.
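For reference, a rough bash sketch of the two approaches being weighed here (all dataset and table names are hypothetical):

```bash
# Approach 1 (the current uuid input): share one dataset and namespace
# each test run's tables with a uuid prefix.
uuid="$(uuidgen | tr 'A-Z-' 'a-z_')"   # BigQuery names disallow hyphens
bq mk --table "shared_dataset.${uuid}_sample_info" schema.json

# Approach 2 (suggested above): have the integration test create a
# throwaway dataset outside the WDL, so no prefix is needed on the tables.
bq mk --dataset "my_project:test_${uuid}"
bq mk --table "test_${uuid}.sample_info" schema.json
```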
input {
String project_id
String dataset_name
String storage_location
I don't think this is used anymore?
nice catch!
👍
* revert input_vcfs to array[file], add this to sample inputs json
* add this branch to dockstore
* remove this branch from dockstore
* add LoadBigQueryData to dockstore, modify check for existing tables, load from github
* exit with error if bq load fails
* use relative path to import LoadBigQueryData.wdl
* refactor ImportGenomes to contain BQ table creation and loading
* remove for_testing_only
* docker -> docker_final
* last wdl fix please
* remove #done
* add back done - end of for loop
* remove LoadBigQueryData wdl
* ensure tsv creation before making bq tables
* run CreateTables concurrently, clean up old code, LoadTable not preemptible, rename numbered to superpartitioned
* pad table id to 3 digits
* fix padded table id
* fix padded logic again
* fix range for table_id
* remove unused import
* remove feature branch from dockstore.yml
changes in this PR:
- exit with error if the `bq load` step fails (the workflow was silently succeeding when this step failed)
- check for existing tables with `bq show` rather than the csv file - this should still be safe against a race condition because of @ericsong's refactoring to prevent the `CreateTables` step from being scattered
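A minimal bash sketch of what the task's command block now does, assuming hypothetical dataset, table, and file names (the real script lives in the WDL):

```bash
#!/usr/bin/env bash

dataset="my_project:my_dataset"        # hypothetical
table_id=$(printf "%03d" 7)            # table ids are padded to 3 digits
table="${dataset}.pet_${table_id}"     # hypothetical naming scheme

# Check for an existing table with `bq show`, which exits non-zero when
# the table does not exist, instead of consulting a csv of created tables.
if bq show "${table}" > /dev/null 2>&1; then
  echo "table ${table} already exists, skipping creation"
else
  bq mk --table "${table}" schema.json
fi

# Propagate a bq load failure so the workflow fails instead of silently
# succeeding as it did before.
if ! bq load --source_format=CSV --field_delimiter=tab "${table}" "data_${table_id}.tsv"; then
  echo "bq load of ${table} failed" >&2
  exit 1
fi
```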
testing: