Add BigQuery dependency, BigQueryUtils, move to unshaded google-cloud-java, and newer Spark/Hadoop/Guava #5928
Conversation
Add BigQuery as a GATK dependency. In order to add this dependency, we have to move to an unshaded version of google-cloud-java, as the shaded version causes breakage in BigQuery, as well as to a newer version of guava.
@tomwhite, @jean-philippe-martin, and/or @lbergelson please review. @tomwhite If you could try running with this branch on a Spark cluster and let me know if anything appears broken to you, that would be helpful! The few Spark tools I tested ran fine, but my testing was very basic.
Codecov Report
@@ Coverage Diff @@
## master #5928 +/- ##
===============================================
- Coverage 86.84% 16.728% -70.112%
+ Complexity 32326 8207 -24119
===============================================
Files 1991 1988 -3
Lines 149342 148952 -390
Branches 16482 16022 -460
===============================================
- Hits 129689 24917 -104772
- Misses 13646 121673 +108027
+ Partials 6007 2362 -3645
It seems like the only thing that broke was the
@droazen I assume that this fixes the BigQuery error you were seeing. I think this may fail on a cluster due to not using a shaded version of google-cloud-java, but I'll give it a go.
@tomwhite I found that
I successfully ran
I don't think there's a problem with upgrading to Spark 2.3 (or even Spark 2.4). There is a problem with the tests that run a mini HDFS cluster (i.e. a cluster running in the same JVM as everything else). I tried upgrading to Spark 2.3 and Hadoop 2.9, but there are Guava conflicts (with Hadoop), which is not surprising. I'm not sure of the best way to fix these tests.
@tomwhite I've updated again to Hadoop 3.2.0 and Spark 2.4.3 -- we'll see if that resolves the MiniCluster issues.
@droazen Some comments: I noticed a few minor weird things, but it seems sane to me.
* @param bigQuery The {@link BigQuery} instance against which to execute the given {@code queryString}. Must contain the table name in the `FROM` clause for the table from which to retrieve data.
* @param projectID The BigQuery {@code project ID} containing the {@code dataSet} and table from which to query data.
* @param dataSet The BigQuery {@code dataSet} containing the table from which to query data.
* @param queryString The {@link BigQuery} query string to execute. Must use standard SQL syntax. Must contain the project ID, data set, and table ID in the `FROM` clause for the table from which to retrieve data.
Is this comment about needing to include the project ID / dataSet still true? If so, what's the point of this overload?
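For reference, standard SQL lets the query string fully qualify the table itself, which is what makes the extra projectID/dataSet parameters look redundant. A minimal sketch of building such a query string (the helper name is hypothetical, not part of BigQueryUtils):

```java
// Hypothetical sketch: with standard SQL, the FROM clause can carry the fully
// qualified `project.dataset.table` name itself, so the query string alone
// identifies the table.
public class QualifiedQuerySketch {
    public static String fullyQualifiedQuery(final String projectId,
                                             final String dataSet,
                                             final String tableId) {
        // Backticks quote the qualified table name in standard SQL:
        return String.format("SELECT * FROM `%s.%s.%s`", projectId, dataSet, tableId);
    }
}
```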
        .build();

final TableResult result = submitQueryAndWaitForResults( bigQuery, queryConfig );
logger.info( "Query returned " + result.getTotalRows() + " results." );
it's weird that this method logs on completion but the version of execute query below doesn't.
final List<Integer> columnWidths = calculateColumnWidths( result );
final boolean rowsAllPrimitive =
        StreamSupport.stream(result.iterateAll().spliterator(), false)
I would use Utils.stream() instead.
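For context, a helper like Utils.stream() presumably wraps the same StreamSupport call; a sketch of such a helper (assuming that is what GATK's Utils.stream does):

```java
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Sketch of an Iterable-to-Stream helper like the suggested Utils.stream()
// (assumption: it wraps StreamSupport.stream over the iterable's spliterator).
public class StreamSketch {
    public static <T> Stream<T> stream(final Iterable<T> iterable) {
        // false => a sequential (non-parallel) stream
        return StreamSupport.stream(iterable.spliterator(), false);
    }
}
```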
package org.broadinstitute.hellbender.utils.bigquery;

import com.google.cloud.bigquery.*;
import org.apache.ivy.util.StringUtils;
this is indeed weird
// Get the results.
logger.info("Retrieving query results...");
final QueryResponse response = bigQuery.getQueryResults(jobId);
Nothing ever looks at the response variable. Also, this method is an internal method and recommends that you use job.getQueryResults instead... this line seems fishy since that's done directly below here.
    throw new GATKException("Interrupted while waiting for query job to complete", ex);
}

return result;
You could move the return into the try and save a line.
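The suggested restructuring, sketched generically (the wrapper and exception type here are stand-ins, not GATK's actual code):

```java
import java.util.concurrent.Callable;

// Sketch of the "return inside the try" shape being suggested: no separate
// local result and trailing return line after the try/catch block.
public class ReturnInTrySketch {
    public static <T> T callOrWrap(final Callable<T> work) {
        try {
            return work.call();
        } catch (final Exception ex) {
            // Stand-in for rethrowing as GATKException:
            throw new RuntimeException("Interrupted while waiting for query job to complete", ex);
        }
    }
}
```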
 */
public class BigQueryUtilsUnitTest extends GATKBaseTest {

    private static final String BIGQUERY_TEST_PROJECT = "broad-dsde-dev";
You should use BaseTest.getGCPTestProject() instead of redeclaring it here.
/**
 * A class to test the functionality of {@link BigQueryUtils}.
 */
public class BigQueryUtilsUnitTest extends GATKBaseTest {
Could you add a comment about how to view the data for this test?
// Wait for the query to complete.
try {
    logger.info("Waiting for query to complete...");
    queryJob = queryJob.waitFor();
You might want to configure timeouts for this; although it has defaults, they may not be what we want.
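As a sketch of what an explicit timeout could look like, here is a generic deadline-based poll using only the standard library (names hypothetical; the real fix would pass timeout options to the client's waitFor rather than hand-rolling a loop):

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Generic sketch: poll until a non-null result or an explicit deadline,
// instead of relying on a client library's default timeouts.
public class DeadlineSketch {
    public static <T> T pollWithDeadline(final Supplier<T> poll,
                                         final long timeoutMillis,
                                         final long intervalMillis) {
        final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (System.nanoTime() < deadline) {
            final T result = poll.get();
            if (result != null) {
                return result;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (final InterruptedException ex) {
                Thread.currentThread().interrupt();
                throw new RuntimeException("Interrupted while polling", ex);
            }
        }
        throw new RuntimeException("Job did not complete within " + timeoutMillis + " ms");
    }
}
```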
private static TableResult submitQueryAndWaitForResults( final BigQuery bigQuery,
                                                         final QueryJobConfiguration queryJobConfiguration ) {
    // Create a job ID so that we can safely retry:
    final JobId jobId = JobId.of(UUID.randomUUID().toString());
You might want a way to pass in a human-readable job name that this gets appended to. It's always horrible to have no idea what jobs are when you look at the web UI. At the very least, add "GATK" in front of it.
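A sketch of that suggestion (the helper name is hypothetical): prefix the random UUID with a readable tag before handing the string to JobId.of:

```java
import java.util.UUID;

// Hypothetical sketch: prepend "GATK" plus an optional human-readable tag to
// the random UUID, so jobs are identifiable in the BigQuery web UI. The
// resulting string would then be passed to JobId.of(...).
public class JobIdSketch {
    public static String readableJobId(final String tag) {
        final String prefix = (tag == null || tag.isEmpty()) ? "GATK" : "GATK_" + tag;
        return prefix + "_" + UUID.randomUUID();
    }
}
```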
@droazen I investigated working in the other direction: i.e. refining the shading in google-cloud-java to remove the conflict. Unfortunately, I can't get it to work either. What I did was to shade less in google-cloud-java (see this branch). With this change I could successfully run
However, the mini cluster for testing doesn't work any more:
It seems that the Guava conflict can't be resolved either way, since the fundamental problem is that the internals of Hadoop (used for the mini cluster) depend on an older, incompatible version of Guava than BigQuery does.
@tomwhite What if we created a special, shaded version of Hadoop just for the MiniCluster, and used it as a test dependency in GATK? Or perhaps we could start the MiniCluster using the command line client instead of directly from GATK? Could either of those approaches work? Tagging @lbergelson as well for an opinion.
Both options would be quite involved. I'll investigate.
I see a new option suggested at
The PR at googleapis/google-cloud-java#5789 makes it possible to add a BigQuery dependency without having to move to the unshaded version. This should make our lives simpler.
@jean-philippe-martin That's great news. Unfortunately we can't easily update the NIO dependency until we have some solution to https://github.com/googleapis/google-cloud-java/issues/5884
@lbergelson That'll be tricky since there is no repro I can run.
@jean-philippe-martin I think we can set up a repro by creating a new github project with a simple travis build that just does an NIO access. I don't think we can reproduce it locally since I'm pretty sure it's a bad interaction with the environment.
Closing in favor of #6011, where the dependency conflict from this branch has been resolved.
Add BigQuery as a GATK dependency. In order to add this dependency, we have to move
to an unshaded version of google-cloud-java, as the shaded version causes breakage
in BigQuery, as well as newer versions of Spark/Hadoop/Guava.
This also includes basic utilities for working with BigQuery (BigQueryUtils).