Add BigQuery dependency, BigQueryUtils, move to unshaded google-cloud-java, and newer Spark/Hadoop/Guava #5928

Closed
droazen wants to merge 9 commits into master from dr_add_bigquery_unshade_google-cloud-java

Conversation

@droazen (Contributor) commented May 8, 2019

Add BigQuery as a GATK dependency. Adding it requires moving to an unshaded version of
google-cloud-java, since the shaded version causes breakage in BigQuery, as well as moving
to newer versions of Spark, Hadoop, and Guava.

This PR also includes basic utilities for working with BigQuery (BigQueryUtils).
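
For context, the call pattern such utilities wrap looks roughly like the following minimal sketch against the stock google-cloud-bigquery client (illustrative only, not the PR's BigQueryUtils code; the fully-qualified table name is a placeholder):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class BigQueryQueryExample {
    public static void main(final String[] args) throws InterruptedException {
        // Client built from application-default credentials (GOOGLE_APPLICATION_CREDENTIALS).
        final BigQuery bigQuery = BigQueryOptions.getDefaultInstance().getService();

        // Standard-SQL query; the table path here is a placeholder.
        final QueryJobConfiguration queryConfig = QueryJobConfiguration
                .newBuilder("SELECT * FROM `my-project.my_dataset.my_table` LIMIT 10")
                .setUseLegacySql(false)
                .build();

        // Run the query and block until results are available.
        final TableResult result = bigQuery.query(queryConfig);
        for ( final FieldValueList row : result.iterateAll() ) {
            System.out.println(row);
        }
    }
}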

@droazen self-assigned this May 8, 2019
@droazen (Contributor, Author) commented May 8, 2019

@tomwhite, @jean-philippe-martin, and/or @lbergelson please review.

@tomwhite If you could try running with this branch on a Spark cluster and let me know if anything appears broken to you, that would be helpful! The few Spark tools I tested ran fine, but my testing was very basic.

@codecov (bot) commented May 8, 2019

Codecov Report

Merging #5928 into master will decrease coverage by 70.112%.
The diff coverage is 1.258%.

@@               Coverage Diff               @@
##             master     #5928        +/-   ##
===============================================
- Coverage     86.84%   16.728%   -70.112%     
+ Complexity    32326      8207     -24119     
===============================================
  Files          1991      1988         -3     
  Lines        149342    148952       -390     
  Branches      16482     16022       -460     
===============================================
- Hits         129689     24917    -104772     
- Misses        13646    121673    +108027     
+ Partials       6007      2362      -3645
Impacted Files Coverage Δ Complexity Δ
...oadinstitute/hellbender/utils/gcs/BucketUtils.java 26.351% <ø> (-52.315%) 13 <0> (-27)
...er/tools/walkers/mutect/Mutect2EngineUnitTest.java 4.545% <ø> (-95.455%) 1 <0> (-4)
...ender/utils/nio/SeekableByteChannelPrefetcher.java 0% <ø> (-78.443%) 0 <0> (-27)
...icationsAndLocationAndAltSeqInferenceUnitTest.java 0.833% <0%> (-74.167%) 1 <0> (-11)
...te/hellbender/utils/nio/GcsNioIntegrationTest.java 8.696% <0%> (+0.362%) 1 <0> (ø) ⬇️
...itute/hellbender/utils/bigquery/BigQueryUtils.java 0% <0%> (ø) 0 <0> (?)
...llbender/utils/bigquery/BigQueryUtilsUnitTest.java 7.692% <7.692%> (ø) 2 <2> (?)
...ls/variant/writers/GVCFBlockCombiningIterator.java 0% <0%> (-100%) 0% <0%> (-1%)
...nder/tools/copynumber/utils/TagGermlineEvents.java 0% <0%> (-100%) 0% <0%> (-3%)
...r/tools/spark/pathseq/PSBwaArgumentCollection.java 0% <0%> (-100%) 0% <0%> (-1%)
... and 1709 more

@droazen (Contributor, Author) commented May 8, 2019

It seems like the only thing that broke was the MiniClusterUtils, which is not a huge deal. I'll see if I can push a fix for that.

@tomwhite (Contributor) commented May 8, 2019

@droazen I assume that this fixes the BigQuery error you were seeing. I think this may fail on a cluster due to not using a shaded version of google-cloud-java, but I'll give it a go.

@droazen (Contributor, Author) commented May 8, 2019

@tomwhite I found that CountReadsSpark succeeds on a Dataproc 1.3 cluster (which uses Hadoop 2.9 and Spark 2.3), but it did fail on a Dataproc 1.2 cluster (which uses Hadoop 2.8 and Spark 2.2). Is there anything that would currently prevent us from upgrading to Spark 2.3 / Hadoop 2.9?

@tomwhite (Contributor) commented May 9, 2019

I successfully ran ReadsPipelineSpark on a small dataset on a Dataproc cluster with this branch. I tried Dataproc 1.3 and 1.4 and both worked.

I don't think there's a problem with upgrading to Spark 2.3 (or even Spark 2.4).

There is a problem with the tests that run a mini HDFS cluster (i.e. a cluster running in the same JVM as everything else). I tried upgrading to Spark 2.3 and Hadoop 2.9, but there are Guava conflicts (with Hadoop), which is not surprising. I'm not sure of the best way to fix these tests.

@droazen changed the title from "Add BigQuery dependency, move to unshaded google-cloud-java, and newer guava version" to "Add BigQuery dependency, BigQueryUtils, Move to unshaded google-cloud-java, and newer Spark/Hadoop/Guava" on May 9, 2019
@droazen changed the title again to "Add BigQuery dependency, BigQueryUtils, move to unshaded google-cloud-java, and newer Spark/Hadoop/Guava" on May 9, 2019
@droazen (Contributor, Author) commented May 9, 2019

@tomwhite I've updated again to Hadoop 3.2.0 and Spark 2.4.3 -- we'll see if that resolves the MiniCluster issues.

@droazen force-pushed the dr_add_bigquery_unshade_google-cloud-java branch from 4db491d to ed0613c on May 9, 2019
@lbergelson (Member) left a comment:

@droazen Some comments: I noticed a few minor weird things, but overall it seems sane to me.

* @param bigQuery The {@link BigQuery} instance against which to execute the given {@code queryString}. Must contain the table name in the `FROM` clause for the table from which to retrieve data.
* @param projectID The BigQuery {@code project ID} containing the {@code dataSet} and table from which to query data.
* @param dataSet The BigQuery {@code dataSet} containing the table from which to query data.
* @param queryString The {@link BigQuery} query string to execute. Must use standard SQL syntax. Must contain the project ID, data set, and table ID in the `FROM` clause for the table from which to retrieve data.
@lbergelson (Member):

Is this comment about needing to include the project ID / dataSet still true? If so, what's the point of this overload?

.build();

final TableResult result = submitQueryAndWaitForResults( bigQuery, queryConfig );
logger.info( "Query returned " + result.getTotalRows() + " results." );
@lbergelson (Member):

It's weird that this method logs on completion, but the version of executeQuery below doesn't.


final List<Integer> columnWidths = calculateColumnWidths( result );
final boolean rowsAllPrimitive =
StreamSupport.stream(result.iterateAll().spliterator(), false)
@lbergelson (Member):

I would use Utils.stream() instead.
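
For reference, the suggested swap would look something like this (assuming GATK's org.broadinstitute.hellbender.utils.Utils.stream(Iterable) helper, which wraps the spliterator boilerplate):

// Before:
StreamSupport.stream(result.iterateAll().spliterator(), false)

// After:
Utils.stream(result.iterateAll())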

package org.broadinstitute.hellbender.utils.bigquery;

import com.google.cloud.bigquery.*;
import org.apache.ivy.util.StringUtils;
@lbergelson (Member):

This is indeed weird (pulling StringUtils from Ivy looks accidental).


// Get the results.
logger.info("Retrieving query results...");
final QueryResponse response = bigQuery.getQueryResults(jobId);
@lbergelson (Member):

Nothing ever looks at the `response` variable. Also, `getQueryResults` is an internal method whose documentation recommends using `job.getQueryResults` instead; this line seems fishy, since that's done directly below.
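
A minimal sketch of the simplified flow this suggests, with the unused getQueryResults call dropped (the method shape is assumed from the quoted fragments):

// Submit the job, wait for completion, then fetch results through the Job API only:
final Job queryJob = bigQuery.create(
        JobInfo.newBuilder(queryJobConfiguration).setJobId(jobId).build());
final Job completedJob = queryJob.waitFor();          // blocks until the job finishes
final TableResult result = completedJob.getQueryResults();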

throw new GATKException("Interrupted while waiting for query job to complete", ex);
}

return result;
@lbergelson (Member):

You could move the return into the try and save a line.
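
I.e., something like (the surrounding method shape is assumed from the quoted fragments):

try {
    logger.info("Waiting for query to complete...");
    queryJob = queryJob.waitFor();
    return queryJob.getQueryResults();   // return directly from the try block
}
catch ( final InterruptedException ex ) {
    throw new GATKException("Interrupted while waiting for query job to complete", ex);
}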

*/
public class BigQueryUtilsUnitTest extends GATKBaseTest {

private static final String BIGQUERY_TEST_PROJECT = "broad-dsde-dev";
@lbergelson (Member):

You should use BaseTest.getGCPTestProject() instead of redeclaring it here.

/**
* A class to test the functionality of {@link BigQueryUtils}.
*/
public class BigQueryUtilsUnitTest extends GATKBaseTest {
@lbergelson (Member):

Could you add a comment about how to view the data for this test?

// Wait for the query to complete.
try {
logger.info("Waiting for query to complete...");
queryJob = queryJob.waitFor();
@lbergelson (Member):

You might want to configure timeouts for this; although it has defaults, they may not be what we want.
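
For example, Job.waitFor accepts RetryOptions, so explicit limits could be passed (the values here are placeholders, not recommendations):

import com.google.cloud.RetryOption;
import org.threeten.bp.Duration;

// Cap the polling backoff and the total wait rather than relying on defaults:
queryJob = queryJob.waitFor(
        RetryOption.initialRetryDelay(Duration.ofSeconds(1)),
        RetryOption.totalTimeout(Duration.ofMinutes(30)));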

private static TableResult submitQueryAndWaitForResults( final BigQuery bigQuery,
final QueryJobConfiguration queryJobConfiguration ) {
// Create a job ID so that we can safely retry:
final JobId jobId = JobId.of(UUID.randomUUID().toString());
@lbergelson (Member):

You might want a way to pass in a human-readable job name that this gets appended to. It's always horrible to have no idea what jobs are when you look at the web UI. At the very least, add "GATK" in front of it.
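
E.g., something along these lines (description is a hypothetical caller-supplied parameter):

// Hypothetical: prefix a caller-supplied description, keeping a UUID suffix for retry safety.
final JobId jobId = JobId.of("GATK-" + description + "-" + UUID.randomUUID());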

@tomwhite (Contributor):

@droazen I investigated working in the other direction: i.e. refining the shading in google-cloud-java to remove the conflict. Unfortunately, I can't get it to work either.

What I did was to shade less in google-cloud-java (see this branch). With this change I could successfully run ExampleBigQueryReader from this GATK branch:

$ ./gradlew clean localJar
$ export GOOGLE_APPLICATION_CREDENTIALS=...
$ ./gatk ExampleBigQueryReader
...
14:16:43.468 INFO  BigQueryUtils - Query returned 10 results.
...

However, the mini cluster for testing doesn't work any more:

$ ./gradlew test -Dtest.single=ReadsSparkSinkUnitTest
org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSinkUnitTest.setupMiniCluster FAILED
    java.lang.NoSuchMethodError: com.google.common.base.Objects.toStringHelper(Ljava/lang/Object;)Lcom/google/common/base/Objects$ToStringHelper;

It seems that the Guava conflict can't be resolved either way: the fundamental problem is that the internals of Hadoop (used for the mini cluster) depend on an older version of Guava that is incompatible with the one BigQuery requires. (The NoSuchMethodError above is characteristic: Objects.toStringHelper was deprecated in favor of MoreObjects.toStringHelper in Guava 18 and removed in Guava 21.)

@droazen (Contributor, Author) commented May 21, 2019

@tomwhite What if we created a special, shaded version of Hadoop just for the MiniCluster, and used it as a test dependency in GATK? Or perhaps we could start the MiniCluster using the command line client instead of directly from GATK? Could either of those approaches work?

Tagging @lbergelson as well for an opinion.

@tomwhite (Contributor):

Both options would be quite involved. I'll investigate.

@jean-philippe-martin (Contributor):

I see a new option suggested at googleapis/google-cloud-java#5789.

@jean-philippe-martin (Contributor):

The PR at googleapis/google-cloud-java#5789 makes it possible to add a BigQuery dependency without having to move to the unshaded version. This should make our lives simpler.

@lbergelson (Member):

@jean-philippe-martin That's great news. Unfortunately, we can't easily update the NIO dependency until we have some solution to https://github.com/googleapis/google-cloud-java/issues/5884.

@jean-philippe-martin (Contributor):

@lbergelson That'll be tricky since there is no repro I can run.

@lbergelson (Member):

@jean-philippe-martin I think we can set up a repro by creating a new GitHub project with a simple Travis build that just does an NIO access. I don't think we can reproduce it locally, since I'm pretty sure it's a bad interaction with the environment.
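
The repro itself could be as small as a single NIO access, along these lines (a sketch, assuming google-cloud-nio is on the classpath; the bucket and object names are placeholders):

import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class NioRepro {
    public static void main(final String[] args) throws Exception {
        // google-cloud-nio registers a FileSystemProvider for the gs:// scheme,
        // so a plain java.nio.file read exercises the dependency end-to-end.
        final Path path = Paths.get(URI.create("gs://my-test-bucket/some-object.txt"));
        System.out.println(Files.readAllBytes(path).length + " bytes read");
    }
}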

@droazen (Contributor, Author) commented Nov 21, 2019

Closing in favor of #6011, where the dependency conflict from this branch has been resolved.

@droazen closed this Nov 21, 2019