Upgrade to Spark 3.1.1 with testing #349
Conversation
Thanks for your pull request, and welcome to our community! We require contributors to sign our Contributor License Agreement, and we don't seem to have your signature on file. Check out this article for more information on why we have a CLA. In order for us to review and merge your code, please submit the Individual Contributor License Agreement form attached above. If you have questions about the CLA, or if you believe you've received this message in error, don't hesitate to ping @drewbanin. CLA has not been signed by users: @nssalian
README.md (outdated)

@@ -26,7 +26,7 @@ more information, consult [the docs](https://docs.getdbt.com/docs/profile-spark)

 ## Running locally
 A `docker-compose` environment starts a Spark Thrift server and a Postgres database as a Hive Metastore backend.
-Note that this is spark 2 not spark 3 so some functionalities might not be available.
+Note: Spark has moved to Spark 3 (formerly on Spark 2).
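For reference, the local environment the README describes can be sketched as a minimal `docker-compose.yml`. This is an illustrative assumption, not the repository's actual file: the image tags, service names, Postgres credentials, and command flags are all hypothetical.

```yaml
version: "3.7"

services:
  # Spark Thrift server on a Spark 3.x image (tag is an assumption for illustration)
  dbt-spark3-thrift:
    image: godatadriven/spark:3.1.1
    ports:
      - "10000:10000"   # HiveServer2 / Thrift JDBC port
    depends_on:
      - dbt-hive-metastore
    command: >
      --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
      --name "Thrift JDBC/ODBC Server"

  # Postgres database backing the Hive Metastore
  dbt-hive-metastore:
    image: postgres:9.6
    environment:
      POSTGRES_USER: dbt
      POSTGRES_PASSWORD: dbt
      POSTGRES_DB: metastore
```

With something along these lines, `docker-compose up` would bring up both services, and dbt could target the Thrift server on port 10000.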
you mean dbt / dbt-spark moved to spark-3 right?
I thought about this statement and I'm a bit confused. If dbt-spark moved to Spark 3 but the testing module is still on Spark 2, then this line in the docs doesn't add up.
Since this is a draft PR, I'll finish up the testing and clean this up in either this PR or a separate one.
But @rvacaru, do you know the context behind why the compose file has spark:3 but the testing didn't move over?
When I read it initially, the impression I got was more that the compose file was moved to Spark 3. So maybe the same warning applies given the library has been / is being updated for spark 3 functionality?
.circleci/config.yml (outdated)

@@ -33,20 +33,22 @@ jobs:
       DBT_INVOCATION_ENV: circle
     docker:
       - image: fishtownanalytics/test-container:10
-      - image: godatadriven/spark:2
+      - image: godatadriven/spark:3.0
If we move to the apache image in docker-compose, I would suggest doing that here as well 👍🏻
Yes, that's the part being tested at the moment, since the tests are failing in Thrift. I was trying to eliminate possible reasons for failure. I'm trying with 3.0 right now.
You might consider trying with 3.1 or 3.2. There were absolutely some bugs in Spark 3.0.0, but I'm admittedly somewhat distrustful of the whole of Spark 3.0.x.
Just throwing that out there in case you're still stuck.
Hey @nssalian! Glad to see that this is being worked on. Left a comment, will be reviewing further (accidentally hit submit), and as always feel free to reach out if I can be of any assistance in some way. Specifically the eventual Iceberg support is what drew me over here - as Iceberg works on Spark 2.4 but really shines on 3+ given its rich support for SQL operations.
And for dbt specifically, 3.2+, given the ability to declare an ordering / clustering on write in a create table DDL that will be respected by Iceberg. We like to think of it as more declarative data engineering, which I believe is very in line with dbt in general. 👍
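As a rough illustration of the write-ordering DDL mentioned above (catalog, table, and column names are hypothetical; the syntax assumes Iceberg's Spark SQL extensions on Spark 3.2+ are enabled):

```sql
-- Create an Iceberg table, then declare a sort order that Iceberg
-- applies to subsequent writes (names here are illustrative only).
CREATE TABLE demo.db.events (
    event_time TIMESTAMP,
    category   STRING,
    payload    STRING
) USING iceberg;

ALTER TABLE demo.db.events WRITE ORDERED BY category, event_time;
```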
Thanks for the input @kbendick. Trying to get this to work, and I'll open up a clean PR to get this in. There's a lot of value to unlock for users in Spark 3.
I wanted to update the folks involved in the PR on the goal and the current status. I've seen questions come up here, and I thought it's best to address them and bring everyone to the same page so there's some clarity. 2 main goals:
We can accomplish 1 or 2 at the moment as long as the testing infrastructure works. Current status:
Other possible issues:
Any help or input is appreciated here.
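Since the failures above center on the Thrift server, one low-tech way to rule out basic connectivity problems before digging into the test suite is a plain TCP probe of the Thrift JDBC port. A minimal sketch using only the Python standard library; the default host, port, and timeout are assumptions, not values taken from this repo:

```python
import socket

def thrift_port_open(host="localhost", port=10000, timeout=2.0):
    """Return True if the given host:port accepts a TCP connection.

    Spark's Thrift JDBC/ODBC server conventionally listens on port 10000;
    connect_ex returns 0 on success and an errno otherwise.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        return s.connect_ex((host, port)) == 0
```

If this returns False while the compose stack is "up", the problem is likely container startup or port mapping rather than anything dbt-specific.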
Is there possibly any help I can provide with this? I noticed the ... Do we know what comes from the ...?
@nssalian Thanks for the hard work getting to the bottom of those errors, and somehow finding a way to simplify our setup at the same time!
Thanks for the help and context along the way, and also the review, @jtcohen6.
Pressing merge! 🤞
This PR does the following:
- Upgrades the `docker-compose` image version to Spark 3.1.1
- Upgrades the `integration-spark-thrift` and `integration-spark-session` test images to Spark 3.1.1

Description

Checklist
- I have updated `CHANGELOG.md` and added information about my change to the "dbt-spark next" section.