
SPARK-1556: bump jets3t version to 0.9.0 #468

Closed
wants to merge 3 commits

Conversation

CodingCat
Contributor

Hadoop 2.3.0 and newer depend on Jets3t 0.9.0, which defines S3ServiceException/ServiceException; however, Spark still relies on Jets3t 0.7.x, which does not define these classes.

What I hit (when trying to load data from S3) is the following:

14/04/21 19:30:53 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
14/04/21 19:30:53 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
14/04/21 19:30:53 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
14/04/21 19:30:53 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
14/04/21 19:30:53 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:280)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:270)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2316)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2350)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2332)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:369)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:221)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:140)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:891)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:741)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:692)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:574)
at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:900)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:15)
at $iwC$$iwC$$iwC.<init>(<console>:20)
at $iwC$$iwC.<init>(<console>:22)
at $iwC.<init>(<console>:24)
at <init>(<console>:26)
at .<init>(<console>:30)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:772)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1040)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:609)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:640)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:604)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:793)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:838)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:750)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:598)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:605)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:608)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:931)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:881)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:881)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:881)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:973)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
Caused by: java.lang.ClassNotFoundException: org.jets3t.service.S3ServiceException
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 63 more
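
For context, the change this PR proposes is essentially a bump of the jets3t dependency in the build. A minimal sketch of what that looks like in pom.xml terms, assuming the usual net.java.dev.jets3t:jets3t coordinates (the actual diff may differ in details such as exclusions):

```xml
<!-- Sketch only: bump jets3t so the classes Hadoop 2.3+ expects
     (S3ServiceException/ServiceException) are on the classpath. -->
<dependency>
  <groupId>net.java.dev.jets3t</groupId>
  <artifactId>jets3t</artifactId>
  <version>0.9.0</version> <!-- previously 0.7.x -->
</dependency>
```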

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14297/

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14299/

@CodingCat CodingCat changed the title SPARK-1556: bump jet3st version to 0.9.0 SPARK-1556: bump jets3t version to 0.9.0 Apr 21, 2014
@mateiz
Contributor

mateiz commented Apr 22, 2014

Unfortunately this will not work in older Hadoop versions as far as I know. Can you still build Spark against Hadoop 1.0.4 and run it with this change?

It might be better to receive jets3t from Hadoop instead of depending on it ourselves. I'm not sure if hadoop-client depends on it...

@srowen
Member

srowen commented Apr 22, 2014

@mateiz I thought the same thing, that hadoop-client pulls this in, but it does not. Only things like hadoop-hdfs.

I agree with updating the dependency, but to match the Hadoop version. So the 0.9.0 version belongs in the Hadoop 2 profiles.

(Also it should be a runtime scope dependency in Maven.)
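
To make that suggestion concrete, here is a hedged sketch of a profile-scoped, runtime-scope jets3t dependency in the parent pom (the profile id and property name are illustrative, not the final patch; the profile belongs under `<profiles>` and the dependency under `<dependencies>`):

```xml
<!-- Illustrative only: a Hadoop-2-era profile overrides the jets3t version,
     and the dependency itself gets runtime scope. -->
<profile>
  <id>hadoop-2.3</id>  <!-- hypothetical profile name -->
  <properties>
    <jets3t.version>0.9.0</jets3t.version>
  </properties>
</profile>

<dependency>
  <groupId>net.java.dev.jets3t</groupId>
  <artifactId>jets3t</artifactId>
  <version>${jets3t.version}</version>
  <scope>runtime</scope>
</dependency>
```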

@mateiz
Contributor

mateiz commented Apr 22, 2014

In that case let's see exactly which Hadoop 2.x version bumped up the dependency, because I don't think 2.0 and 2.1 did it (could be wrong though).

@srowen
Member

srowen commented Apr 22, 2014

@mateiz It looks like it went to 0.8.1 in (the unreleased) Hadoop 1.3.0 (https://issues.apache.org/jira/browse/HADOOP-8136) and 0.9.0 in 2.3.0 (https://issues.apache.org/jira/browse/HADOOP-9623)

@mateiz
Contributor

mateiz commented Apr 22, 2014

Great, so there's no easy way to set it based on profiles and support all Hadoop versions :). Maybe for Hadoop 2.3+ users, we can just tell them to add a new version of jets3t to their own project's build? We can certainly have our pre-built binaries include the right one too.

@CodingCat
Contributor Author

Hi @mateiz @srowen, if Spark built with Hadoop 1.0.4/2.x (x < 3) and jets3t 0.9.0 can access S3 smoothly, does that also mean bumping to 0.9.0 is safe?

I'm going to run a manual test tonight or tomorrow.

@mateiz
Contributor

mateiz commented Apr 22, 2014

Sure, that would work. Please try it. Unfortunately I remember it having problems, but I could be wrong.

@CodingCat
Contributor Author

@mateiz you are right, I got `java.lang.NoSuchMethodError: org.jets3t.service.impl.rest.httpclient.RestS3Service.<init>(Lorg/jets3t/service/security/AWSCredentials;)V` in both cases.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@CodingCat
Contributor Author

I reverted the build files and updated the docs to describe this situation for users.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14551/

@darose

darose commented Apr 29, 2014

Is there any way to apply this fix without a rebuild of spark? E.g., to just replace jets3t-0.7.1.jar with jets3t-0.9.0.jar in a deployed spark package? I'm running into this issue on a machine where I have the CDH5 hadoop and spark packages installed.

@CodingCat
Contributor Author

I think the possible way to do that is to compile a jets3t-0.9.0-enabled version yourself,

then compile your application against that version... To access an HDFS-compatible fs, I think we eventually call the code in the application jar.

@mateiz
Contributor

mateiz commented Apr 30, 2014

You can try adding jets3t 0.9 as a Maven dependency in your application, but unfortunately I think that goes after the Spark assembly JAR when running an app. In 1.0 there will be a setting to put the user's classpath first.

It sounds like the Spark bundle for CDH needs to be updated with this; CCing @srowen.

For this patch, we probably want to create a new Maven profile that uses the newer Jets3t when it's enabled.

@CodingCat
Contributor Author

@mateiz for @darose's question, how about compiling the application against a customized Spark jar (with the newer jets3t)? In that case, I think he wouldn't need to restart the cluster?

@mateiz
Contributor

mateiz commented Apr 30, 2014

BTW the right way to do it would be to make hadoop-client have a Maven dependency on the right version of Jets3t. Then Spark would just build with the right version out of the box when it linked to the right Hadoop version.

@darose

darose commented Apr 30, 2014

Definitely worth a shot! Will give that a try and report back.

@CodingCat
Contributor Author

Hi @srowen, do you want to take over the patch? I'm concerned I can't fix it in the coming days, given my schedule and my level of knowledge of mvn and sbt.

@darose

darose commented Apr 30, 2014

Sigh. Was a promising idea, but no dice. Even with the 0.7 jars out of the way, I'm still getting java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:280)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:270)
...
at shark.SharkCliDriver.main(SharkCliDriver.scala)

@srowen
Member

srowen commented Apr 30, 2014

@CodingCat I can make a patch, but it will mean introducing a new profile like "hadoop230" that one has to enable when building for Hadoop 2.3.0. I always hate to add that complexity and hope someone has a better idea. But I'll propose the PR if a committer nods and says it's worth changing.

I imagine it won't be the last time the dependencies have to be fudged by Hadoop version -- isn't this already an existing issue with Avro anyway?

@darose

darose commented Apr 30, 2014

FYI - I think I might have figured out why deleting the jets3t jar didn't fix the issue. It looks like the Spark build process bundles the jets3t classes into the Spark assembly jar. So I'm guessing that whacking the stand-alone jar file wouldn't fix the issue if there are still 0.7 classes bundled in another jar.

@darose

darose commented May 2, 2014

Man oh man, I cannot get this to work no way no how. I tried rebuilding spark using the jets3t 0.9 jar, then tried rebuilding shark doing the same. I keep getting a verify error - presumably because something in the call stack isn't compatible with the new jets3t version. Anyone have any ideas/suggestions? I'm at my wits' end on this. Spent days, and still unable to get a working version of spark/shark running with CDH5. Output below.

14/05/02 06:34:14 WARN scheduler.TaskSetManager: Loss was due to java.lang.VerifyError
java.lang.VerifyError: Bad type on operand stack
Exception Details:
  Location:
    org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore.initialize(Ljava/net/URI;Lorg/apache/hadoop/conf/Configuration;)V @38: invokespecial
  Reason:
    Type 'org/jets3t/service/security/AWSCredentials' (current frame, stack[3]) is not assignable to 'org/jets3t/service/security/ProviderCredentials'
  Current Frame:
    bci: @38
    flags: { }
    locals: { 'org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore', 'java/net/URI', 'org/apache/hadoop/conf/Configuration', 'org/apache/hadoop/fs/s3/S3Credentials', 'org/jets3t/service/security/AWSCredentials' }
    stack: { 'org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore', uninitialized 32, uninitialized 32, 'org/jets3t/service/security/AWSCredentials' }
  Bytecode:
    0000000: bb00 0259 b700 034e 2d2b 2cb6 0004 bb00
    0000010: 0559 2db6 0006 2db6 0007 b700 083a 042a
    0000020: bb00 0959 1904 b700 0ab5 000b a700 0b3a
    0000030: 042a 1904 b700 0d2a 2c12 0e03 b600 0fb5
    0000040: 0010 2a2c 1211 1400 12b6 0014 1400 15b8
    0000050: 0017 b500 182a 2c12 1914 0015 b600 1414
    0000060: 0015 b800 17b5 001a 2abb 001b 592b b600
    0000070: 1cb7 001d b500 1eb1                    
  Exception Handler Table:
    bci [14, 44] => handler: 47
  Stackmap Table:
    full_frame(@47,{Object[#176],Object[#177],Object[#178],Object[#179]},{Object[#180]})
    same_frame(@55)

        at org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:280)
        at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:270)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2397)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
        at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:107)
        at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
        at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:156)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
        at org.apache.spark.scheduler.Task.run(Task.scala:53)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
        at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
        at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
        at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

@darose

darose commented May 2, 2014

I think I'm going to have to give up on getting Shark working on my existing CDH5 cluster right now. I've tried everything I can think of (various binary releases, building both spark and shark myself against jets3t 0.9, various config tweaks, etc.) but I'm stuck at either the class not found error in https://issues.apache.org/jira/browse/SPARK-1556, or the verify error above. I'll have to either wait until there's a new binary release, or look for an alternative.

@pwendell
Contributor

pwendell commented May 3, 2014

@srowen I'd prefer not to remove it from the dependency graph if possible because it will break local builds. The best solution I see is to add a profile for Hadoop 2.3 and 2.4. For now I'd be fine to just require users to manually trigger it and document this in building-with-maven. In SBT we can actually just insert logic in the build based on the Hadoop profile. I'm guessing we'll have to get into the habit of doing this, since it seems like Spark is good at finding bugs in Hadoop's dependency graph. We should probably start testing Spark against Hadoop RC's if they publish them to maven so we can give feedback.

I don't quite understand why the hadoop-client library doesn't advertise jets3t specifically... if I write a Java application that opens an S3 FileSystem and reads and writes data, don't I need jets3t to do that (i.e. if this is outside a MapReduce job)? Is this just a bug in Hadoop's dependencies?

@pwendell
Contributor

pwendell commented May 3, 2014

@srowen if you'd like to take a crack at this by the way, please do. I'll probably look at it on Sunday if no one else has.

@srowen
Member

srowen commented May 3, 2014

@pwendell Before I begin, can I propose a refactoring of the profiles that will make this and similar issues easier to deal with? It's probably for a different PR, but it should make changes like this one easy.

We need profiles to deal with this. Profiles can be triggered explicitly (e.g. -Phadoop-2.3) or by property values (-Dhadoop.version=2.3.0). It's necessary to have things like hadoop.version be customizable, so it would be nice to also trigger the needed profiles from that. However, Maven lacks the ability to trigger on a range of property values; you can trigger on a particular value like "2.3.0" but not on "2.3.*" or "[2.3.0,2.4.0)" syntax.

So it seems necessary to use a series of named profiles. Those profiles can set default version values, and those versions can be overridden. For example, it's nice to have a hadoop-2.3 profile set hadoop.version=2.3.0 for you, even though that can still be overridden.

(The SBT build can shadow these changes.)

After reading over the build and docs, I propose the following:

  • Introduce a hadoop-2.3 profile, similar to hadoop-0.23, to encompass 2.3+-specific build changes, and one for hadoop-2.2 as well (see later)
  • hadoop.major.version appears to be unused -- remove it?
  • I believe yarn.version can be removed; use hadoop.version in its place. Ideally these are always synced, no? All doc examples show yarn.version matching hadoop.version and the distribution script uses SPARK_HADOOP_VERSION for yarn.version. Now, the default Hadoop version is 1.0.4 and there is no such YARN version. But the yarn-alpha profile sets hadoop.version=0.23.7 to match the default yarn.version=0.23.7 anyway. It seems like Hadoop 1.x + YARN is not intended anyway, which seems corroborated by the build documentation.
  • So, YARN-related profiles should not set hadoop.version, and in fact only serve to add the yarn child module

... and then the fix for this issue is trivial.

All of the build permutations listed in the documentation work under this new arrangement. Anyone want to see a PR or have objections?
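
As a rough illustration of the proposal (the profile name and versions are examples, not the final change), a named profile under `<profiles>` could set an overridable default like this:

```xml
<!-- Sketch: -Phadoop-2.3 sets a default hadoop.version, which a command-line
     -Dhadoop.version=2.4.0 can still override, since user properties win. -->
<profile>
  <id>hadoop-2.3</id>
  <properties>
    <hadoop.version>2.3.0</hadoop.version>
    <jets3t.version>0.9.0</jets3t.version>
  </properties>
</profile>
```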

@witgo
Contributor

witgo commented May 3, 2014

@srowen Not everyone uses the same version of HDFS and YARN.

@srowen
Member

srowen commented May 3, 2014

@witgo Hm, is there an example that comes up repeatedly? Is it ever intentional, or just some accident of someone's legacy deployment? I don't know of a case of this, and it wouldn't come up with a distro or any semi-recent release of Hadoop, but maybe someone will say this comes up with the 1.x / 0.23.x lines somehow?

@witgo
Contributor

witgo commented May 3, 2014

@srowen Related discussion in PR 502.
@berngp Can you explain the reason for not using the same version of HDFS and YARN?

@berngp
Contributor

berngp commented May 3, 2014

I think in general this is an edge case, but there are folks still using HDFS 1.0.x with a different version of YARN; that said, it is not my case.

I like what you suggested in another PR, where you reused the value of hadoop.version to specify yarn.version, e.g.

<yarn.version>${hadoop.version}</yarn.version>

Let me know if I should associate the small commits with specific PRs. Thanks again for following up on those commits.


@pwendell
Contributor

pwendell commented May 3, 2014

@srowen The YARN version does need to be separate from the Hadoop version. Downstream consumers of our build sometimes do this, for instance if they want to build against a custom HDFS distro (e.g. Pivotal, IBM, or something) but link against the upstream Apache YARN artifacts. It's not something we do in the binaries we distribute, but it would be good to support it.

Think it's fine to remove hadoop.major.version - it seems unused.

Adding fancy profile activation would also be nice, but I think that it's not necessary as an immediate fix. We can just say in the build doc that "you need special profiles for the following hadoop versions" and give a small table or list explaining which profiles to activate.
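
For what it's worth, keeping the two properties separate while defaulting them together could look roughly like this (a sketch based on the suggestion earlier in the thread; values are illustrative):

```xml
<properties>
  <hadoop.version>1.0.4</hadoop.version>
  <!-- Defaults to the Hadoop version, but a downstream build can still pass
       e.g. -Dyarn.version=2.2.0 to link a vendor HDFS with upstream YARN. -->
  <yarn.version>${hadoop.version}</yarn.version>
</properties>
```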

asfgit pushed a commit that referenced this pull request May 5, 2014
…ions

See related discussion at #468

This PR may still overstep what you have in mind, but let me put it on the table to start. Besides fixing the issue, it has one substantive change, and that is to manage Hadoop-specific things only in Hadoop-related profiles. This does _not_ remove `yarn.version`.

- Moves the YARN and Hadoop profiles together in pom.xml. Sorry that this makes the diff a little hard to grok but the changes are only as follows.
- Removes `hadoop.major.version`
- Introduce `hadoop-2.2` and `hadoop-2.3` profiles to control Hadoop-specific changes:
  - like the protobuf version issue - this was only 'solved' now by enabling YARN for 2.2+, which is really an orthogonal issue
  - like the jets3t version issue now
- Hadoop profiles set an appropriate default `hadoop.version`, that can be overridden
- _(YARN profiles in the parent now only exist to add the sub-module)_
- Fixes the jets3t dependency issue
 - and makes it a runtime dependency
 - and centralizes config of this guy in the parent pom
- Updates build docs
- Updates SBT build too
  - and fixes a regex problem along the way

Author: Sean Owen <[email protected]>

Closes #629 from srowen/SPARK-1556 and squashes the following commits:

c3fa967 [Sean Owen] Fix hadoop-2.4 profile typo in doc
a2105fd [Sean Owen] Add hadoop-2.4 profile and don't set hadoop.version in profiles
274f4f9 [Sean Owen] Make jets3t a runtime dependency, and bring its exclusion up into parent config
bbed826 [Sean Owen] Use jets3t 0.9.0 for Hadoop 2.3+ (and correct similar regex issue in SBT build)
f21f356 [Sean Owen] Build changes to set up for jets3t fix
(cherry picked from commit 73b0cbc)

Signed-off-by: Patrick Wendell <[email protected]>
asfgit pushed a commit that referenced this pull request May 5, 2014
@CodingCat
Contributor Author

fixed in #629

@CodingCat CodingCat closed this May 5, 2014
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
@LuqmanSahaf

@darose I am facing the VerifyError you mentioned in one of the comments. Can you tell me how you solved that error?

@mag-

mag- commented Apr 27, 2015

Are you aware that all these regexp hacks will break when Hadoop changes its version to 3.0.0?

@srowen
Member

srowen commented Apr 27, 2015

@mag- if you're talking about what I think you are, it was a temporary thing that's long since gone already: https://github.com/apache/spark/pull/629/files

@mag-

mag- commented Apr 27, 2015

Well:
val jets3tVersion = if ("^2\\.[3-9]+".r.findFirstIn(hadoopVersion).isDefined) "0.9.0" else "0.7.1"
It probably should be the other way round: if the Hadoop version is lower than 2.3, we use 0.7.1.
Also, someone needs to test it with Hadoop 2.6/2.7, where S3 support was split out into hadoop-aws.
(I'm thinking the mvn profile approach was maybe cleaner than this if/else...)

@srowen
Member

srowen commented Apr 27, 2015

Agree but that doesn't exist in master anyway. Now the SBT build drives off the Maven build.

@darose

darose commented Apr 28, 2015

I think @srowen is correct. A while back I upgraded to use a newer version of Spark (and built it using the correct -Dhadoop.version= and -Phadoop- flags) and the problem went away.


j-esse pushed a commit to j-esse/spark that referenced this pull request Jan 24, 2019
Allow passing env variables for conda so that we can enable instrumentation/other flags when required.
arjunshroff pushed a commit to arjunshroff/spark that referenced this pull request Nov 24, 2020