
[WIP] SPARK-1699: Python relative independence from the core, becomes subprojects #624

Closed
wants to merge 28 commits

Conversation

witgo
Contributor

@witgo witgo commented May 3, 2014

No description provided.

berngp and others added 22 commits April 15, 2014 14:03
The change adds `./yarn/stable/target/<scala-version>/classes` to the
_classpath_ when a _dependencies_ assembly is available in the
assembly directory.

Why is this change necessary?
It eases developing features and bug fixes for Spark on YARN.
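
As a rough, hypothetical sketch of the intended classpath logic (the `FWDIR`, `ASSEMBLY_DIR`, and `SCALA_VERSION` variables are assumed to come from the existing compute-classpath script, not shown here):

```
# Hypothetical sketch: if only the dependencies assembly (…-deps.jar) is
# present, also put the YARN "stable" classes on the classpath.
if ls "$ASSEMBLY_DIR"/spark-assembly*-deps.jar >/dev/null 2>&1; then
  CLASSPATH="$CLASSPATH:$FWDIR/yarn/stable/target/scala-$SCALA_VERSION/classes"
fi
```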

[ticket: X] : NA

Author      : [email protected]
Reviewer    : ?
Testing     : ?
…ectory.

Why is this change necessary?

While developing in Spark I found myself rebuilding either the
dependencies assembly or the full spark assembly. I kept running into
the case of having both the dep-assembly and full-assembly in the same
directory and getting an error when I called either `spark-shell` or
`spark-submit`.

Quick fix: rename one of them to a .bkp file, depending on the
development workflow you are executing at the moment, and let
`spark-class` ignore non-jar files. Another option would be to move
the "offending" jar to a different directory, but in my opinion
keeping it in place is a bit tidier.

e.g.

```
ll ./assembly/target/scala-2.10
spark-assembly-1.0.0-SNAPSHOT-hadoop2.3.0-deps.jar
spark-assembly-1.0.0-SNAPSHOT-hadoop2.3.0.jar.bkp
```
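
A non-authoritative sketch of the `spark-class` side (the `ASSEMBLY_DIR` variable is an assumption): only `*.jar` files are considered, so a renamed `.bkp` file is ignored.

```
# Hypothetical sketch: pick the assembly jar while ignoring non-jar files
# such as spark-assembly-*.jar.bkp left around during development.
ASSEMBLY_JAR=$(ls "$ASSEMBLY_DIR"/spark-assembly*hadoop*.jar 2>/dev/null | head -n 1)
if [ -z "$ASSEMBLY_JAR" ]; then
  echo "No assembly jar found in $ASSEMBLY_DIR" >&2
  exit 1
fi
```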

[ticket: X] : ?
…UNCH_COMMAND .

Why is this change necessary?
Most likely, when enabling `--log-conf` through `spark-shell` you are
also interested in the full invocation of the java command, including the
_classpath_ and extended options, e.g.

```
INFO: Base Directory set to /Users/bernardo/work/github/berngp/spark
INFO: Spark Master is yarn-client
INFO: Spark REPL options   -Dspark.logConf=true
Spark Command: /Library/Java/JavaVirtualMachines/jdk1.8.0.jdk/Contents/Home/bin/java -cp :/Users/bernardo/work/github/berngp/spark/conf:/Users/bernardo/work/github/berngp/spark/core/target/scala-2.10/classes:/Users/bernardo/work/github/berngp/spark/repl/target/scala-2.10/classes:/Users/bernardo/work/github/berngp/spark/mllib/target/scala-2.10/classes:/Users/bernardo/work/github/berngp/spark/bagel/target/scala-2.10/classes:/Users/bernardo/work/github/berngp/spark/graphx/target/scala-2.10/classes:/Users/bernardo/work/github/berngp/spark/streaming/target/scala-2.10/classes:/Users/bernardo/work/github/berngp/spark/tools/target/scala-2.10/classes:/Users/bernardo/work/github/berngp/spark/sql/catalyst/target/scala-2.10/classes:/Users/bernardo/work/github/berngp/spark/sql/core/target/scala-2.10/classes:/Users/bernardo/work/github/berngp/spark/sql/hive/target/scala-2.10/classes:/Users/bernardo/work/github/berngp/spark/yarn/stable/target/scala-2.10/classes:/Users/bernardo/work/github/berngp/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.3.0-deps.jar:/usr/local/Cellar/hadoop/2.2.0/libexec/etc/hadoop -XX:ErrorFile=/tmp/spark-shell-hs_err_pid.log -XX:HeapDumpPath=/tmp/spark-shell-java_pid.hprof -XX:-HeapDumpOnOutOfMemoryError -XX:-PrintGC -XX:-PrintGCDetails -XX:-PrintGCTimeStamps -XX:-PrintTenuringDistribution -XX:-PrintAdaptiveSizePolicy -XX:GCLogFileSize=1024K -XX:-UseGCLogFileRotation -Xloggc:/tmp/spark-shell-gc.log -XX:+UseConcMarkSweepGC -Dspark.cleaner.ttl=10000 -Dspark.driver.host=33.33.33.1 -Dspark.logConf=true -Djava.library.path= -Xms400M -Xmx400M org.apache.spark.repl.Main
```
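
For comparison, a minimal way to request that output explicitly, assuming the `SPARK_PRINT_LAUNCH_COMMAND` switch honored by `spark-class` and the `SPARK_REPL_OPTS` variable read by `spark-shell`:

```
# Echo the full java launch command (classpath and JVM options) before starting the REPL.
SPARK_PRINT_LAUNCH_COMMAND=1 SPARK_REPL_OPTS="-Dspark.logConf=true" ./bin/spark-shell
```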

[ticket: X] : ?
Why is this change necessary?
Renamed the SBT "root" project to "spark" to enhance readability.

Currently the assembly name is qualified with the Hadoop version, but
not with whether YARN is enabled. This change qualifies the assembly so
that it is easy to tell whether YARN was enabled.

e.g.

```
./make-distribution.sh --hadoop 2.3.0 --with-yarn

ls -l ./assembly/target/scala-2.10
    spark-assembly-1.0.0-SNAPSHOT-hadoop2.3.0-yarn.jar
```

vs

```
./make-distribution.sh --hadoop 2.3.0

ls -l ./assembly/target/scala-2.10
    spark-assembly-1.0.0-SNAPSHOT-hadoop2.3.0.jar
```
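
A rough sketch of how the qualifier could be derived inside `make-distribution.sh` (the `SPARK_YARN`, `SPARK_HADOOP_VERSION`, and `VERSION` variable names are illustrative assumptions):

```
# Hypothetical sketch: add a "-yarn" qualifier to the assembly name when YARN is enabled.
if [ "$SPARK_YARN" = "true" ]; then
  ASSEMBLY_QUALIFIER="hadoop${SPARK_HADOOP_VERSION}-yarn"
else
  ASSEMBLY_QUALIFIER="hadoop${SPARK_HADOOP_VERSION}"
fi
echo "spark-assembly-${VERSION}-${ASSEMBLY_QUALIFIER}.jar"
```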

[ticket: X] : ?
Upgraded to YARN 2.3.0, removed unnecessary `relativePath` values and
removed the incorrect version for the "org.apache.hadoop:hadoop-client"
dependency in yarn/pom.xml.
…ad to throw a SecurityException when Spark is built for Hadoop 2.3.0 or 2.4.0
@AmplabJenkins

Can one of the admins verify this patch?

@witgo witgo changed the title SPARK-1699: Python relative independence from the core, becomes subprojects [WIP]SPARK-1699: Python relative independence from the core, becomes subprojects May 3, 2014
@witgo witgo changed the title [WIP]SPARK-1699: Python relative independence from the core, becomes subprojects [WIP] SPARK-1699: Python relative independence from the core, becomes subprojects May 3, 2014
@witgo
Contributor Author

witgo commented May 3, 2014

The branch is wrong; closing this temporarily.

@witgo witgo closed this May 3, 2014
gzm55 pushed a commit to MediaV/spark that referenced this pull request Jul 17, 2014
The original poster of this bug is @guojc, who opened a PR that preceded this one at https://github.com/apache/incubator-spark/pull/612.

ExternalAppendOnlyMap uses key hash code to order the buffer streams from which spilled files are read back into memory. When a buffer stream is empty, the default hash code for that stream is equal to Int.MaxValue. This is, however, a perfectly legitimate candidate for a key hash code. When reading from a spilled map containing such a key, a hash collision may occur, in which case we attempt to read from an empty stream and throw NoSuchElementException.

The fix is to maintain the invariant that empty buffer streams are never added back to the merge queue to be considered. This guarantees that we never read from an empty buffer stream, ever again.

This PR also includes two new tests for hash collisions.

Author: Andrew Or <[email protected]>

Closes apache#624 from andrewor14/spilling-bug and squashes the following commits:

9e7263d [Andrew Or] Slightly optimize next()
2037ae2 [Andrew Or] Move a few comments around...
cf95942 [Andrew Or] Remove default value of Int.MaxValue for minKeyHash
c11f03b [Andrew Or] Fix Int.MaxValue hash collision bug in ExternalAppendOnlyMap
21c1a39 [Andrew Or] Add hash collision tests to ExternalAppendOnlyMapSuite
(cherry picked from commit fefd22f)

Signed-off-by: Patrick Wendell <[email protected]>
andrewor14 added a commit to andrewor14/spark that referenced this pull request Jan 8, 2015
@witgo witgo deleted the python-api branch March 13, 2015 09:02
RolatZhang pushed a commit to RolatZhang/spark that referenced this pull request Aug 18, 2023
* KE-40433 add page index filter log

* KE-40433 update parquet version
RolatZhang pushed a commit to RolatZhang/spark that referenced this pull request Dec 8, 2023
KE-40433 add page index filter log (apache#619) (apache#624)

* KE-40433 add page index filter log

* KE-40433 update parquet version