Upgrade to Spark 3.1.1 with testing #349

Merged: 24 commits into main from ns/spark3 on Jun 28, 2022

Conversation

nssalian
Contributor

@nssalian nssalian commented May 8, 2022

This PR does the following:

  1. Upgrades the docker-compose image version to Spark 3.1.1
  2. Upgrades the integration-spark-thrift and integration-spark-session test images to Spark 3.1.1.
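For illustration only (not part of the original PR text): a rough sketch of what the bumped local setup might look like in docker-compose.yml. The service names, ports, and Postgres settings below are assumptions made for the sketch, as is the idea that the image's entrypoint wraps spark-submit; only the Spark 3.1.1 image tag reflects the change described above.

```yaml
# Hypothetical docker-compose.yml sketch -- service names, ports, and metastore
# settings are assumptions; only the Spark 3.1.1 image tag mirrors this PR.
version: "3.7"

services:
  spark-thrift:
    image: godatadriven/spark:3.1.1          # bumped from the Spark 2 image
    ports:
      - "10000:10000"                        # Thrift JDBC/ODBC endpoint
    depends_on:
      - hive-metastore-db
    # Assumes the image's entrypoint wraps spark-submit, so these are
    # spark-submit arguments that start the Thrift server.
    command: >
      --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
      --name Thrift-JDBC-ODBC-Server

  hive-metastore-db:
    image: postgres:9.6                      # Postgres as the Hive Metastore backend
    environment:
      POSTGRES_USER: dbt
      POSTGRES_PASSWORD: dbt
      POSTGRES_DB: metastore
```

Running `docker-compose up -d` against something like this is what brings up the local Thrift endpoint that the README's "Running locally" section refers to.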

Description

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt-spark next" section.

@cla-bot

cla-bot bot commented May 8, 2022

Thanks for your pull request, and welcome to our community! We require contributors to sign our Contributor License Agreement and we don't seem to have your signature on file. Check out this article for more information on why we have a CLA.

In order for us to review and merge your code, please submit the Individual Contributor License Agreement form attached above. If you have questions about the CLA, or if you believe you've received this message in error, don't hesitate to ping @drewbanin.

CLA has not been signed by users: @nssalian

@cla-bot cla-bot bot added the cla:yes label May 8, 2022
@nssalian nssalian self-assigned this May 8, 2022
@nssalian nssalian marked this pull request as draft May 8, 2022 16:32
README.md Outdated
@@ -26,7 +26,7 @@ more information, consult [the docs](https://docs.getdbt.com/docs/profile-spark)

## Running locally
A `docker-compose` environment starts a Spark Thrift server and a Postgres database as a Hive Metastore backend.
-Note that this is spark 2 not spark 3 so some functionalities might not be available.
+Note: Spark has moved to Spark 3 (formerly on Spark 2).

You mean dbt / dbt-spark moved to Spark 3, right?

Contributor Author


I thought about this statement and I'm a bit confused.
If dbt-spark moved to 3, the testing module is still on 2, so this line in the docs doesn't add up.
Since this is a Draft PR, I'll finish up the testing and clean this up in either this PR or a separate one.
But @rvacaru, do you know the context behind why the compose file has spark:3 but the testing didn't move over?


When I read it initially, the impression I got was more that the compose file was moved to Spark 3. So maybe the same warning applies given the library has been / is being updated for spark 3 functionality?

.circleci/config.yml
@@ -33,20 +33,22 @@ jobs:
DBT_INVOCATION_ENV: circle
docker:
- image: fishtownanalytics/test-container:10
- - image: godatadriven/spark:2
+ - image: godatadriven/spark:3.0
Contributor


If we move to the apache image in docker-compose, I would suggest doing that here as well 👍🏻

Contributor Author

@nssalian nssalian May 13, 2022


Yes, that's the part being tested at the moment, since the tests are failing in Thrift. I was trying to eliminate possible reasons for failure. I'm trying with 3.0 right now.


You might consider trying with 3.1 or 3.2. There were absolutely some bugs in Spark 3.0.0, but I'm admittedly somewhat distrustful of the whole of Spark 3.0.x.

Just throwing that out there in case you're still stuck.
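(An illustrative aside, not part of the review thread: taking the 3.1 suggestion would just mean bumping the image tag in the CircleCI job's docker list, roughly as sketched below. The job name is a guess, the test-container line is carried over from the hunk above, and the container's startup command is omitted.)

```yaml
# Hypothetical .circleci/config.yml fragment -- only the Spark image tag changes;
# the job name and other details are assumptions carried over from the hunk above.
jobs:
  integration-spark-thrift:
    environment:
      DBT_INVOCATION_ENV: circle
    docker:
      - image: fishtownanalytics/test-container:10
      - image: godatadriven/spark:3.1.1      # was :2, then :3.0; 3.1.x per the suggestion above
        # (the Thrift server startup command for this container is omitted here)
```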


@kbendick kbendick left a comment


Hey @nssalian! Glad to see that this is being worked on. Left a comment, will be reviewing further (accidentally hit submit), and as always feel free to reach out if I can be of any assistance in some way. Specifically the eventual Iceberg support is what drew me over here - as Iceberg works on Spark 2.4 but really shines on 3+ given its rich support for SQL operations.

And for dbt specifically, 3.2+, given the ability to declare an ordering / clustering on write in a CREATE TABLE DDL that will be respected by Iceberg. We like to think of it as more declarative data engineering, which I believe is very in line with dbt in general. 👍

@nssalian
Contributor Author

> Hey @nssalian! Glad to see that this is being worked on. Left a comment, will be reviewing further (accidentally hit submit), and as always feel free to reach out if I can be of any assistance in some way. Specifically the eventual Iceberg support is what drew me over here (as Iceberg works on 2.4 but really shines on 3+, specifically 3.2+). 👍

Thanks for the input, @kbendick. I'm trying to get this to work and will open up a clean PR to get this in. There's a lot of value to unlock for users in Spark 3.

@nssalian
Contributor Author

nssalian commented May 13, 2022

I wanted to update the folks involved in the PR on the goal and the current status. I've seen questions come up here, and I thought it's best to address them and bring folks onto the same page so there's some clarity.

2 main goals:

  1. Move completely over to Spark 3. This means the testing and documentation. The current dbt-spark master points to Spark 3.0, but the testing infra hasn't successfully moved over.
  2. godatadriven/spark has been serving as the base image for the containers. Ideally, we should be in parity with open source Spark (https://hub.docker.com/r/apache/spark/tags) rather than other images. Open source Spark is already on version 3.2.1, while the godatadriven/spark image is still on 3.1.1 (not even 3.1.3).
    Spark 3.2.1 would be great to unlock features in Apache Iceberg and keep us on the latest Spark version.

We can accomplish either 1 or 2 at the moment, as long as the testing infrastructure works.

Current status:

  1. To solve goal 1, the only current blocker as it stands is moving integration-spark-thrift over.

    1. Attempts at this include testing godatadriven/spark versions 3.0, 3.1, and latest. None of them seem to succeed.
    2. Test failures by versions of images: 3.0, 3.1
    3. At this point my theory is some permissions issue for the user, since SHOW queries seem to work fine. The docker-compose command works and the image comes up, but the Thrift test is the blocker here.
    4. I looked at the godatadriven/spark images from Spark 2 to 3, and the only large changes look like
      Java 8 -> 11, Spark 2.4.5 -> 3.1.1, Hadoop 2.7 -> 3.2.0.
      These are big changes, but I doubt they would impact Thrift in particular.
  2. With regard to solving goal 2, I hit a few roadblocks (see the sketch at the end of this comment):

    1. With the regular Spark image, there's a failure in integration-spark-thrift coming from the image itself. I think this is a bug in the entrypoint.sh in Spark.
    2. With the pyspark image, I see failures like the above, and in integration-spark-session as well. It looks like a permissions issue.

Other possible issues:

  1. The fishtownanalytics/test-container is also pretty old - but looking at its Dockerfile, it might not be the root cause.

Any help or input is appreciated here.
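(Not part of the original comment, but to make the goal-2 roadblock concrete: a hedged sketch of pointing the upstream apache/spark image straight at the Thrift server, bypassing its default entrypoint.sh. The tag name, install path, and user override are assumptions about that image, not a verified fix.)

```yaml
# Hypothetical docker-compose service using the upstream apache/spark image (goal 2).
# Tag, paths, and the user override are assumptions; the idea is to skip the image's
# default entrypoint.sh and launch the Thrift server in the foreground instead.
services:
  spark-thrift:
    image: apache/spark:v3.2.1               # exact tag name is an assumption
    user: root                               # untested guess at the suspected permissions issue
    entrypoint: ["/opt/spark/bin/spark-submit"]
    command:
      - "--class"
      - "org.apache.spark.sql.hive.thriftserver.HiveThriftServer2"
      - "--name"
      - "Thrift JDBC/ODBC Server"
    ports:
      - "10000:10000"
```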

@kbendick

Is there possibly any help I can provide with this? I noticed the godatadriven image was at Spark 3.0; personally, I don't think there's a good reason to use the Spark 3.0.x series now.

Do we know what comes from the godatadriven spark image that we aren't getting elsewhere? Is it just convenience (e.g. to have a Spark image) or is it more than that?

@nssalian nssalian changed the title [Draft PR]: Upgrade to Spark 3. [Draft PR]: Upgrade to Spark 3.1.1 with testing Jun 18, 2022
@nssalian nssalian changed the title [Draft PR]: Upgrade to Spark 3.1.1 with testing Upgrade to Spark 3.1.1 with testing Jun 18, 2022
@nssalian nssalian marked this pull request as ready for review June 18, 2022 00:25
@nssalian nssalian requested a review from jtcohen6 June 18, 2022 00:25
Contributor

@jtcohen6 jtcohen6 left a comment


@nssalian Thanks for the hard work getting to the bottom of those errors, and somehow finding a way to simplify our setup at the same time!

@nssalian
Contributor Author

Thanks for the help and context along the way, and also the review, @jtcohen6.

@jtcohen6
Contributor

Pressing merge! 🤞

@jtcohen6 jtcohen6 merged commit 0082e73 into main Jun 28, 2022
@jtcohen6 jtcohen6 deleted the ns/spark3 branch June 28, 2022 15:46