SPARK-1556: bump jets3t version to 0.9.0 #468
Merged build triggered. |
Merged build started. |
Merged build finished. All automated tests passed. |
Unfortunately this will not work in older Hadoop versions as far as I know. Can you still build Spark against Hadoop 1.0.4 and run it with this change? It might be better to receive jets3t from Hadoop instead of depending on it ourselves. I'm not sure if hadoop-client depends on it... |
@mateiz I thought the same thing: I agree with updating the dependency, but it should match the Hadoop version. So the 0.9.0 version belongs in the Hadoop 2 profiles. (Also, it should be a runtime-scope dependency in Maven.) |
In that case let's see exactly which Hadoop 2.x version bumped up the dependency, because I don't think 2.0 and 2.1 did it (could be wrong though). |
@mateiz It looks like it went to 0.8.1 in (the unreleased) Hadoop 1.3.0 (https://issues.apache.org/jira/browse/HADOOP-8136) and 0.9.0 in 2.3.0 (https://issues.apache.org/jira/browse/HADOOP-9623) |
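The version cut-over described above can be sketched as a small numeric comparison. This is only an illustration, not the logic Spark's build uses: the class name `Jets3tVersionChooser` and both method names are hypothetical, 0.7.1 is taken from the jar name mentioned later in this thread, and treating every version at or above 2.3 as wanting 0.9.0 is an assumption.

```java
// Hypothetical helper, not part of Spark or Hadoop.
final class Jets3tVersionChooser {

    // Parse the leading "major.minor" out of strings such as "2.3.0" or "2.3.0-cdh5.0.0".
    static int[] majorMinor(String hadoopVersion) {
        String[] parts = hadoopVersion.split("[.-]");
        return new int[] { Integer.parseInt(parts[0]), Integer.parseInt(parts[1]) };
    }

    // Assumption: Hadoop 2.3 and later want jets3t 0.9.0, everything older stays on 0.7.1.
    static String jets3tVersionFor(String hadoopVersion) {
        int[] v = majorMinor(hadoopVersion);
        boolean atLeast23 = v[0] > 2 || (v[0] == 2 && v[1] >= 3);
        return atLeast23 ? "0.9.0" : "0.7.1";
    }

    public static void main(String[] args) {
        for (String v : new String[] { "1.0.4", "2.2.0", "2.3.0", "2.4.0" }) {
            System.out.println("Hadoop " + v + " -> jets3t " + jets3tVersionFor(v));
        }
    }
}
```

Comparing parsed integers rather than matching the version string against a pattern keeps the rule valid for versions nobody listed explicitly.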
Great, so there's no easy way to set it based on profiles and support all Hadoop versions :). Maybe for Hadoop 2.3+ users, we can just tell them to add a new version of jets3t to their own project's build? We can certainly have our pre-built binaries include the right one too. |
Sure, that would work. Please try it. Unfortunately I remember it having problems, but I could be wrong. |
@mateiz you are right, I got the exception `java.lang.NoSuchMethodError: org.jets3t.service.impl.rest.httpclient.RestS3Service.<init>(Lorg/jets3t/service/security/AWSCredentials;)V` in both |
Merged build triggered. |
Merged build started. |
I restored the build files and updated the docs to describe this situation for users
Merged build finished. All automated tests passed. |
Is there any way to apply this fix without a rebuild of spark? E.g., to just replace jets3t-0.7.1.jar with jets3t-0.9.0.jar in a deployed spark package? I'm running into this issue on a machine where I have the CDH5 hadoop and spark packages installed. |
I think one possible way to do that is to compile a jets3t 0.9.0-enabled version yourself, then compile your application against that version .... I think that to access an HDFS-compatible filesystem, we eventually call the code in the application jar |
You can try adding jets3t 0.9 as a Maven dependency in your application, but unfortunately I think that goes after the Spark assembly JAR when running an app. In 1.0 there will be a setting to put the user's classpath first. It sounds like the Spark bundle for CDH needs to be updated with this; CCing @srowen. For this patch, we probably want to create a new Maven profile to use a new Jets3t when that's enabled. |
BTW the right way to do it would be to make hadoop-client have a Maven dependency on the right version of Jets3t. Then Spark would just build with the right version out of the box when it linked to the right Hadoop version. |
Definitely worth a shot! Will give that a try and report back. |
Hi @srowen, do you want to take over the patch? I'm concerned that I cannot fix it in the coming days, given my schedule and my level of knowledge of mvn and sbt. |
Sigh. Was a promising idea, but no dice. Even with the 0.7 jars out of the way, I'm still getting java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException |
@CodingCat I can make a patch, but it will mean introducing a new profile like "hadoop230" that one has to enable when building for Hadoop 2.3.0. I always hate to add that complexity and hope someone has a better idea. But I'll propose the PR if a committer nods and says it's worth changing. I imagine it won't be the last time the dependencies have to be fudged by Hadoop version -- isn't this already an existing issue with Avro anyway? |
FYI - I think I might have figured out why deleting the jets3t jar didn't fix the issue. It looks like the Spark build process bundles the jets3t classes into the Spark assembly jar. So I'm guessing that whacking the stand-alone jar file wouldn't fix the issue if there are still 0.7 classes bundled in another jar. |
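One way to confirm a suspicion like this is to ask the JVM where it actually loaded a class from. The following is a minimal sketch using only standard JDK calls; the class name `WhichJar` and the method `locate` are made up for illustration, and the jets3t class name in `main` is simply the one from the error reported earlier in this thread.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper, not part of Spark.
final class WhichJar {

    // Returns the jar or directory a class was loaded from, or a note if it cannot be loaded.
    static String locate(String className) {
        try {
            // Load without running static initializers; only the origin is of interest.
            Class<?> cls = Class.forName(className, false, WhichJar.class.getClassLoader());
            CodeSource src = cls.getProtectionDomain().getCodeSource();
            return src == null ? "(bootstrap or unknown source)" : src.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            // NoClassDefFoundError is a LinkageError, so it is reported here as well.
            return "(not loadable: " + e + ")";
        }
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "org.jets3t.service.S3ServiceException";
        System.out.println(name + " <- " + locate(name));
    }
}
```

Run with the same classpath as the failing application, this should print the assembly jar's path if the classes are being picked up from there rather than from a stand-alone jets3t jar.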
Man oh man, I cannot get this to work no way no how. I tried rebuilding spark using the jets3t 0.9 jar, then tried rebuilding shark doing the same. I keep getting a verify error - presumably because something in the call stack isn't compatible with the new jets3t version. Anyone have any ideas/suggestions? I'm at my wits' end on this. Spent days, and still unable to get a working version of spark/shark running with CDH5. Output below.
|
I think I'm going to have to give up on getting Shark working on my existing CDH5 cluster right now. I've tried everything I can think of (various binary releases, building both spark and shark myself against jets3t 0.9, various config tweaks, etc.) but I'm stuck at either the class not found error in https://issues.apache.org/jira/browse/SPARK-1556, or the verify error above. I'll have to either wait until there's a new binary release, or look for an alternative. |
@srowen I'd prefer not to remove it from the dependency graph if possible because it will break local builds. The best solution I see is to add a profile for Hadoop 2.3 and 2.4. For now I'd be fine to just require users to manually trigger it and document this in I don't quite understand why the hadoop-client library doesn't advertise jets3t specifically... if I write a Java application that opens an S3 FileSystem and reads and writes data, don't I need jets3t to do that (i.e. if this is outside a MapReduce job)? Is this just a bug in Hadoop's dependencies? |
@srowen if you'd like to take a crack at this by the way, please do. I'll probably look at it on Sunday if no one else has. |
@pwendell Before I begin can I propose a refactoring of profiles that will make this and similar issues easy to deal with? Probably it's for a different PR, but will probably make this and similar changes easy. We need profiles to deal with this. Profiles can be triggered explicitly (e.g. So it seems necessary to use a series of named profiles. Those profiles can set default version values, and those versions can be overridden. For example, it's nice to have a (The SBT build can shadow these changes.) After reading over the build and docs, I propose the following:
... and then the fix for this issue is trivial. All of the build permutations listed in the documentation work under this new arrangement. Anyone want to see a PR or have objections? |
@srowen Not everyone uses the same version of HDFS vs YARN. |
@witgo Hm, is there an example that comes up repeatedly? Is it ever intentional, or just some accident of someone's legacy deployment? I don't know of a case of this, and it wouldn't come up with a distro or any semi-recent release of Hadoop, but maybe someone will say this comes up with the 1.x / 0.23.x lines somehow? |
I think in general it is an edge case, but there are folks still using hdfs. I like what you suggested in another PR where you reused the variable value <yarn.version>${hadoop.version}</yarn.version>. Let me know if I should associate the small commits to specific PRs. Thanks. On Saturday, May 3, 2014, Guoqiang Li [email protected] wrote:
|
@srowen YARN version does need to be separate from hadoop version. Downstream consumers of our build sometimes do this. For instance, if they want to build against a custom HDFS distro (e.g. pivotal, IBM, or something) but want to link against the upstream apache YARN repo. It's not something we do in binaries we distribute but it would be good to support it. Think it's fine to remove hadoop.major.version - it seems unused. Adding fancy profile activation would also be nice, but I think that it's not necessary as an immediate fix. We can just say in the build doc that "you need special profiles for the following hadoop versions" and give a small table or list explaining which profiles to activate. |
…ions

See related discussion at #468

This PR may still overstep what you have in mind, but let me put it on the table to start. Besides fixing the issue, it has one substantive change, and that is to manage Hadoop-specific things only in Hadoop-related profiles. This does _not_ remove `yarn.version`.

- Moves the YARN and Hadoop profiles together in pom.xml. Sorry that this makes the diff a little hard to grok but the changes are only as follows.
- Removes `hadoop.major.version`
- Introduce `hadoop-2.2` and `hadoop-2.3` profiles to control Hadoop-specific changes:
  - like the protobuf version issue - this was only 'solved' now by enabling YARN for 2.2+, which is really an orthogonal issue
  - like the jets3t version issue now
- Hadoop profiles set an appropriate default `hadoop.version`, that can be overridden
- _(YARN profiles in the parent now only exist to add the sub-module)_
- Fixes the jets3t dependency issue
  - and makes it a runtime dependency
  - and centralizes config of this guy in the parent pom
- Updates build docs
- Updates SBT build too
  - and fixes a regex problem along the way

Author: Sean Owen <[email protected]>

Closes #629 from srowen/SPARK-1556 and squashes the following commits:

c3fa967 [Sean Owen] Fix hadoop-2.4 profile typo in doc
a2105fd [Sean Owen] Add hadoop-2.4 profile and don't set hadoop.version in profiles
274f4f9 [Sean Owen] Make jets3t a runtime dependency, and bring its exclusion up into parent config
bbed826 [Sean Owen] Use jets3t 0.9.0 for Hadoop 2.3+ (and correct similar regex issue in SBT build)
f21f356 [Sean Owen] Build changes to set up for jets3t fix

(cherry picked from commit 73b0cbc)
Signed-off-by: Patrick Wendell <[email protected]>
fixed in #629 |
@darose I am facing the VerifyError you mentioned in one of the comments. Can you tell me how you solved that error? |
Are you aware that all these regexp hacks will break when Hadoop's version changes to 3.0.0? |
@mag- if you're talking about what I think you are, it was a temporary thing that's long since gone already https://github.com/apache/spark/pull/629/files |
Well: |
Agree but that doesn't exist in |
On 04/27/2015 07:11 AM, Sean Owen wrote:
I think @srowen is correct. A while back I upgraded to use a newer DR |
Allow passing env variables for conda so that we can enable instrumentation/other flags when required.
Hadoop 2.3.0 and newer use Jets3t 0.9.0, which defines S3ServiceException/ServiceException. However, Spark still relies on Jets3t 0.7.x, which has no definition of these classes.
What I ran into (when trying to load data from S3) is as