Add BigQuery dependency, BigQueryUtils, move to unshaded google-cloud-java, and newer Spark/Hadoop/Guava #5928

Closed
droazen wants to merge 9 commits into master from dr_add_bigquery_unshade_google-cloud-java

Conversation

@droazen (Contributor) commented May 8, 2019

Add BigQuery as a GATK dependency. Adding it requires moving to an unshaded version of
google-cloud-java, since the shaded version causes breakage in BigQuery, as well as moving
to newer versions of Spark, Hadoop, and Guava.

This PR also includes basic utilities for working with BigQuery (BigQueryUtils).
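
For context, the call pattern such utilities wrap looks roughly like the following minimal sketch against the stock google-cloud-bigquery client (illustrative only, not the PR's BigQueryUtils code; the fully-qualified table name is a placeholder):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class BigQueryQueryExample {
    public static void main(final String[] args) throws InterruptedException {
        // Client built from application-default credentials (GOOGLE_APPLICATION_CREDENTIALS).
        final BigQuery bigQuery = BigQueryOptions.getDefaultInstance().getService();

        // Standard-SQL query; the table path here is a placeholder.
        final QueryJobConfiguration queryConfig = QueryJobConfiguration
                .newBuilder("SELECT * FROM `my-project.my_dataset.my_table` LIMIT 10")
                .setUseLegacySql(false)
                .build();

        // Run the query and block until results are available.
        final TableResult result = bigQuery.query(queryConfig);
        for ( final FieldValueList row : result.iterateAll() ) {
            System.out.println(row);
        }
    }
}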

@droazen self-assigned this May 8, 2019
@droazen (Contributor, Author) commented May 8, 2019

@tomwhite, @jean-philippe-martin, and/or @lbergelson please review.

@tomwhite If you could try running with this branch on a Spark cluster and let me know if anything appears broken to you, that would be helpful! The few Spark tools I tested ran fine, but my testing was very basic.

@codecov (bot) commented May 8, 2019

Codecov Report

Merging #5928 into master will decrease coverage by 70.112%.
The diff coverage is 1.258%.

@@               Coverage Diff               @@
##             master     #5928        +/-   ##
===============================================
- Coverage     86.84%   16.728%   -70.112%     
+ Complexity    32326      8207     -24119     
===============================================
  Files          1991      1988         -3     
  Lines        149342    148952       -390     
  Branches      16482     16022       -460     
===============================================
- Hits         129689     24917    -104772     
- Misses        13646    121673    +108027     
+ Partials       6007      2362      -3645
Impacted Files Coverage Δ Complexity Δ
...oadinstitute/hellbender/utils/gcs/BucketUtils.java 26.351% <ø> (-52.315%) 13 <0> (-27)
...er/tools/walkers/mutect/Mutect2EngineUnitTest.java 4.545% <ø> (-95.455%) 1 <0> (-4)
...ender/utils/nio/SeekableByteChannelPrefetcher.java 0% <ø> (-78.443%) 0 <0> (-27)
...icationsAndLocationAndAltSeqInferenceUnitTest.java 0.833% <0%> (-74.167%) 1 <0> (-11)
...te/hellbender/utils/nio/GcsNioIntegrationTest.java 8.696% <0%> (+0.362%) 1 <0> (ø) ⬇️
...itute/hellbender/utils/bigquery/BigQueryUtils.java 0% <0%> (ø) 0 <0> (?)
...llbender/utils/bigquery/BigQueryUtilsUnitTest.java 7.692% <7.692%> (ø) 2 <2> (?)
...ls/variant/writers/GVCFBlockCombiningIterator.java 0% <0%> (-100%) 0% <0%> (-1%)
...nder/tools/copynumber/utils/TagGermlineEvents.java 0% <0%> (-100%) 0% <0%> (-3%)
...r/tools/spark/pathseq/PSBwaArgumentCollection.java 0% <0%> (-100%) 0% <0%> (-1%)
... and 1709 more

@droazen (Contributor, Author) commented May 8, 2019

It seems like the only thing that broke was the MiniClusterUtils, which is not a huge deal. I'll see if I can push a fix for that.

@tomwhite (Contributor) commented May 8, 2019

@droazen I assume that this fixes the BigQuery error you were seeing. I think this may fail on a cluster due to not using a shaded version of google-cloud-java, but I'll give it a go.

@droazen (Contributor, Author) commented May 8, 2019

@tomwhite I found that CountReadsSpark succeeds on a Dataproc 1.3 cluster (which uses Hadoop 2.9 and Spark 2.3), but it did fail on a Dataproc 1.2 cluster (which uses Hadoop 2.8 and Spark 2.2). Is there anything that would currently prevent us from upgrading to Spark 2.3 / Hadoop 2.9?

@tomwhite (Contributor) commented May 9, 2019

I successfully ran ReadsPipelineSpark on a small dataset on a Dataproc cluster with this branch. I tried Dataproc 1.3 and 1.4 and both worked.

I don't think there's a problem with upgrading to Spark 2.3 (or even Spark 2.4).

There is a problem with the tests that run a mini HDFS cluster (i.e. a cluster running in the same JVM as everything else). I tried upgrading to Spark 2.3 and Hadoop 2.9, but there are Guava conflicts (with Hadoop), which is not surprising. I'm not sure of the best way to fix these tests.

@droazen changed the title from "Add BigQuery dependency, move to unshaded google-cloud-java, and newer guava version" to "Add BigQuery dependency, BigQueryUtils, Move to unshaded google-cloud-java, and newer Spark/Hadoop/Guava" on May 9, 2019
@droazen changed the title again to "Add BigQuery dependency, BigQueryUtils, move to unshaded google-cloud-java, and newer Spark/Hadoop/Guava" on May 9, 2019
@droazen (Contributor, Author) commented May 9, 2019

@tomwhite I've updated again to Hadoop 3.2.0 and Spark 2.4.3 -- we'll see if that resolves the MiniCluster issues.

@droazen force-pushed the dr_add_bigquery_unshade_google-cloud-java branch from 4db491d to ed0613c on May 9, 2019
@lbergelson (Member) left a comment:

@droazen Some comments: I noticed a few minor weird things, but overall it seems sane to me.

* @param bigQuery The {@link BigQuery} instance against which to execute the given {@code queryString}. Must contain the table name in the `FROM` clause for the table from which to retrieve data.
* @param projectID The BigQuery {@code project ID} containing the {@code dataSet} and table from which to query data.
* @param dataSet The BigQuery {@code dataSet} containing the table from which to query data.
* @param queryString The {@link BigQuery} query string to execute. Must use standard SQL syntax. Must contain the project ID, data set, and table ID in the `FROM` clause for the table from which to retrieve data.
@lbergelson (Member):

Is this comment about needing to include the project ID / dataSet still true? If so, what's the point of this overload?

.build();

final TableResult result = submitQueryAndWaitForResults( bigQuery, queryConfig );
logger.info( "Query returned " + result.getTotalRows() + " results." );
@lbergelson (Member):

It's weird that this method logs on completion, but the version of executeQuery below doesn't.


final List<Integer> columnWidths = calculateColumnWidths( result );
final boolean rowsAllPrimitive =
StreamSupport.stream(result.iterateAll().spliterator(), false)
@lbergelson (Member):

I would use Utils.stream() instead.
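
For reference, the suggested swap would look something like this (assuming GATK's org.broadinstitute.hellbender.utils.Utils.stream(Iterable) helper, which wraps the spliterator boilerplate):

// Before:
StreamSupport.stream(result.iterateAll().spliterator(), false)

// After:
Utils.stream(result.iterateAll())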

package org.broadinstitute.hellbender.utils.bigquery;

import com.google.cloud.bigquery.*;
import org.apache.ivy.util.StringUtils;
@lbergelson (Member):

This is indeed weird (pulling StringUtils from Ivy looks accidental).


// Get the results.
logger.info("Retrieving query results...");
final QueryResponse response = bigQuery.getQueryResults(jobId);
@lbergelson (Member):

Nothing ever looks at the `response` variable. Also, `getQueryResults` is an internal method whose documentation recommends using `job.getQueryResults` instead; this line seems fishy, since that's done directly below.
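
A minimal sketch of the simplified flow this suggests, with the unused getQueryResults call dropped (the method shape is assumed from the quoted fragments):

// Submit the job, wait for completion, then fetch results through the Job API only:
final Job queryJob = bigQuery.create(
        JobInfo.newBuilder(queryJobConfiguration).setJobId(jobId).build());
final Job completedJob = queryJob.waitFor();          // blocks until the job finishes
final TableResult result = completedJob.getQueryResults();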

throw new GATKException("Interrupted while waiting for query job to complete", ex);
}

return result;
@lbergelson (Member):

You could move the return into the try and save a line.
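
I.e., something like (the surrounding method shape is assumed from the quoted fragments):

try {
    logger.info("Waiting for query to complete...");
    queryJob = queryJob.waitFor();
    return queryJob.getQueryResults();   // return directly from the try block
}
catch ( final InterruptedException ex ) {
    throw new GATKException("Interrupted while waiting for query job to complete", ex);
}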

*/
public class BigQueryUtilsUnitTest extends GATKBaseTest {

private static final String BIGQUERY_TEST_PROJECT = "broad-dsde-dev";
@lbergelson (Member):

You should use BaseTest.getGCPTestProject() instead of redeclaring it here.

/**
* A class to test the functionality of {@link BigQueryUtils}.
*/
public class BigQueryUtilsUnitTest extends GATKBaseTest {
@lbergelson (Member):

Could you add a comment about how to view the data for this test?

// Wait for the query to complete.
try {
logger.info("Waiting for query to complete...");
queryJob = queryJob.waitFor();
@lbergelson (Member):

You might want to configure timeouts for this; although it has defaults, they may not be what we want.
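
For example, Job.waitFor accepts RetryOptions, so explicit limits could be passed (the values here are placeholders, not recommendations):

import com.google.cloud.RetryOption;
import org.threeten.bp.Duration;

// Cap the polling backoff and the total wait rather than relying on defaults:
queryJob = queryJob.waitFor(
        RetryOption.initialRetryDelay(Duration.ofSeconds(1)),
        RetryOption.totalTimeout(Duration.ofMinutes(30)));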

private static TableResult submitQueryAndWaitForResults( final BigQuery bigQuery,
final QueryJobConfiguration queryJobConfiguration ) {
// Create a job ID so that we can safely retry:
final JobId jobId = JobId.of(UUID.randomUUID().toString());
@lbergelson (Member):

You might want a way to pass in a human-readable job name that this gets appended to. It's always horrible to have no idea what jobs are when you look at the web UI. At the very least, add "GATK" in front of it.
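
E.g., something along these lines (description is a hypothetical caller-supplied parameter):

// Hypothetical: prefix a caller-supplied description, keeping a UUID suffix for retry safety.
final JobId jobId = JobId.of("GATK-" + description + "-" + UUID.randomUUID());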

@tomwhite (Contributor):

@droazen I investigated working in the other direction: i.e. refining the shading in google-cloud-java to remove the conflict. Unfortunately, I can't get it to work either.

What I did was to shade less in google-cloud-java (see this branch). With this change I could successfully run ExampleBigQueryReader from this GATK branch:

$ ./gradlew clean localJar
$ export GOOGLE_APPLICATION_CREDENTIALS=...
$ ./gatk ExampleBigQueryReader
...
14:16:43.468 INFO  BigQueryUtils - Query returned 10 results.
...

However, the mini cluster for testing doesn't work any more:

$ ./gradlew test -Dtest.single=ReadsSparkSinkUnitTest
org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSinkUnitTest.setupMiniCluster FAILED
    java.lang.NoSuchMethodError: com.google.common.base.Objects.toStringHelper(Ljava/lang/Object;)Lcom/google/common/base/Objects$ToStringHelper;

It seems that the Guava conflict can't be resolved either way: the fundamental problem is that the internals of Hadoop (used for the mini cluster) depend on an older version of Guava that is incompatible with the one BigQuery requires. (The NoSuchMethodError above is characteristic: Objects.toStringHelper was deprecated in favor of MoreObjects.toStringHelper in Guava 18 and removed in Guava 21.)

@droazen (Contributor, Author) commented May 21, 2019

@tomwhite What if we created a special, shaded version of Hadoop just for the MiniCluster, and used it as a test dependency in GATK? Or perhaps we could start the MiniCluster using the command line client instead of directly from GATK? Could either of those approaches work?

Tagging @lbergelson as well for an opinion.

@tomwhite (Contributor):

Both options would be quite involved. I'll investigate.

@jean-philippe-martin (Contributor):

I see a new option suggested at googleapis/google-cloud-java#5789.

@jean-philippe-martin (Contributor):

The PR at googleapis/google-cloud-java#5789 makes it possible to add a BigQuery dependency without having to move to the unshaded version. This should make our lives simpler.

@lbergelson (Member):

@jean-philippe-martin That's great news. Unfortunately, we can't easily update the NIO dependency until we have some solution to https://github.com/googleapis/google-cloud-java/issues/5884.

@jean-philippe-martin (Contributor):

@lbergelson That'll be tricky since there is no repro I can run.

@lbergelson (Member):

@jean-philippe-martin I think we can set up a repro by creating a new GitHub project with a simple Travis build that just does an NIO access. I don't think we can reproduce it locally, since I'm pretty sure it's a bad interaction with the environment.
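
The repro itself could be as small as a single NIO access, along these lines (a sketch, assuming google-cloud-nio is on the classpath; the bucket and object names are placeholders):

import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class NioRepro {
    public static void main(final String[] args) throws Exception {
        // google-cloud-nio registers a FileSystemProvider for the gs:// scheme,
        // so a plain java.nio.file read exercises the dependency end-to-end.
        final Path path = Paths.get(URI.create("gs://my-test-bucket/some-object.txt"));
        System.out.println(Files.readAllBytes(path).length + " bytes read");
    }
}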

@droazen (Contributor, Author) commented Nov 21, 2019

Closing in favor of #6011, where the dependency conflict from this branch has been resolved.

@droazen closed this Nov 21, 2019