Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Qualification tool: Error handling while processing large event logs #3714

Merged
merged 9 commits into from
Oct 8, 2021

Conversation

nartal1
Copy link
Collaborator

@nartal1 nartal1 commented Sep 30, 2021

This fixes #3430 .
In this PR, we catch OOM Error and log it with a hint to increase heap size to the user. Earlier it would complete without any log.
Not sure how to add tests for this. I ran it locally with small heap size and got the logError on the console.

 java -Xmx512M -cp tools/target/rapids-4-spark-tools_2.12-21.12.0-SNAPSHOT.jar:/home/nartal/spark-3.1.1/spark-3.1.1-bin-hadoop3.2/jars/* com.nvidia.spark.rapids.tool.profiling.ProfileMain --csv /home/nartal/CPU_GPU_eventLogs/CPU_runs/


21/09/29 17:44:53 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
21/09/29 17:44:56 ERROR ApplicationInfo: OOM error while processing large files. Increase heap size to process file:/home/nartal/CPU_GPU_eventLogs/CPU_runs/application_1630450374626_0001_1

@nartal1 nartal1 added the bug Something isn't working label Sep 30, 2021
@nartal1 nartal1 added this to the Sep 27 - Oct 1 milestone Sep 30, 2021
@nartal1 nartal1 self-assigned this Sep 30, 2021
@tgravescs
Copy link
Collaborator

so this should not just be for OOM's, i've seen us swallow other exceptions as well, we should at least print out when an exception happens to let the user know.

Signed-off-by: Niranjan Artal <[email protected]>
val event = JsonProtocol.sparkEventFromJson(parse(line))
processEvent(event)
try {
val lines = Source.fromInputStream(in)(Codec.UTF8).getLines().toList
Copy link
Collaborator

@gerashegalov gerashegalov Sep 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the root cause of OOM is likely the fact that we materialize the whole file on Heap. Remove toList to keep it a line iterator.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gerashegalov for taking a look. I was still seeing OOM error even after removing toList as we are reading one event log per thread. I am wrapping the checks at thread level now.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are multiple things that can cause OOM, Gera is just saying this is potentially one of those, it depends on the file sizes. we should file a separate followup if we want to optimize it more.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you post the stack trace, it might give us a hint what to look for?

And I'd get a heapdump with -XX:+HeapDumpOnOutOfMemoryError

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Below is the stack trace. It looks like while reading the file from inputStream.

Dumping heap to java_pid31797.hprof ...
Heap dump file created [427465673 bytes in 1.749 secs]
21/09/30 15:15:32 ERROR Profiler: OOM error while processing large file file:/home/nartal/CPU_runs/application_1630450374626_0001_1.Increase heap size.
java.lang.OutOfMemoryError: GC overhead limit exceeded
	at java.util.Arrays.copyOf(Arrays.java:3332)
	at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
	at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:596)
	at java.lang.StringBuilder.append(StringBuilder.java:190)
	at java.io.BufferedReader.readLine(BufferedReader.java:358)
	at java.io.BufferedReader.readLine(BufferedReader.java:389)
	at scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:74)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
	at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
	at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
	at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:189)
	at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:47)
	at scala.collection.TraversableOnce.to(TraversableOnce.scala:313)
	at scala.collection.TraversableOnce.to$(TraversableOnce.scala:311)
	at scala.collection.AbstractIterator.to(Iterator.scala:1429)
	at scala.collection.TraversableOnce.toList(TraversableOnce.scala:297)
	at scala.collection.TraversableOnce.toList$(TraversableOnce.scala:297)
	at scala.collection.AbstractIterator.toList(Iterator.scala:1429)
	at org.apache.spark.sql.rapids.tool.AppBase.$anonfun$processEvents$4(AppBase.scala:87)
	at org.apache.spark.sql.rapids.tool.AppBase$$Lambda$203/903990770.apply(Unknown Source)
	at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2611)
	at org.apache.spark.sql.rapids.tool.AppBase.$anonfun$processEvents$2(AppBase.scala:86)
	at org.apache.spark.sql.rapids.tool.AppBase$$Lambda$199/136377256.apply(Unknown Source)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.sql.rapids.tool.AppBase.processEvents(AppBase.scala:85)
	at org.apache.spark.sql.rapids.tool.profiling.ApplicationInfo.<init>(ApplicationInfo.scala:240)
	at com.nvidia.spark.rapids.tool.profiling.Profiler.com$nvidia$spark$rapids$tool$profiling$Profiler$$createApp(Profiler.scala:248)
	at com.nvidia.spark.rapids.tool.profiling.Profiler$ProfileProcessThread$1.run(Profiler.scala:205)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)

I got the heapdump by including the argument you had specified. I opened it with jhat. It's too much info in there. Could you please let me pointers about where I should be looking at?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like you still have toList in there.

VisualVM (part of JDK) and Eclipse MAT are more user friendly for analyzing the heap dump. You want to look for objects with large "retained" heap size.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Gera. Filed follow on issue to improve memory consumption: https://github.com/NVIDIA/spark-rapids/issues/3727
This PR is mostly to identify OOM and throw a meaningful error so that users can increase the heap size.

@nartal1
Copy link
Collaborator Author

nartal1 commented Oct 1, 2021

build

Signed-off-by: Niranjan Artal <[email protected]>
@tgravescs
Copy link
Collaborator

build

@sameerz sameerz modified the milestones: Sep 27 - Oct 1, Oct 4 - Oct 15 Oct 4, 2021
Signed-off-by: Niranjan Artal <[email protected]>
@nartal1
Copy link
Collaborator Author

nartal1 commented Oct 4, 2021

build

gerashegalov
gerashegalov previously approved these changes Oct 4, 2021
Copy link
Collaborator

@gerashegalov gerashegalov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

sys.exit(1)
case NonFatal(e) =>
logWarning(s"Exception occurred processing file: ${path.eventLog.getName}", e)
case o =>
Copy link
Collaborator

@gerashegalov gerashegalov Oct 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: there is no warning in the function but if we ever move it back inside a catch will get a warning like #3743.

Suggested change
case o =>
case o: Throwable =>

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! updated it.

Signed-off-by: Niranjan Artal <[email protected]>
@nartal1
Copy link
Collaborator Author

nartal1 commented Oct 4, 2021

build

@nartal1
Copy link
Collaborator Author

nartal1 commented Oct 5, 2021

build

1 similar comment
@tgravescs
Copy link
Collaborator

build

@nartal1
Copy link
Collaborator Author

nartal1 commented Oct 5, 2021

build

@nartal1
Copy link
Collaborator Author

nartal1 commented Oct 5, 2021

It's failing due to timeout error.

@tgravescs
Copy link
Collaborator

the tests are failing with known issue #3742, need to wait for a fix there

@gerashegalov
Copy link
Collaborator

build

@nartal1
Copy link
Collaborator Author

nartal1 commented Oct 8, 2021

build

@gerashegalov gerashegalov merged commit 23a8a99 into NVIDIA:branch-21.12 Oct 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Profiling tool silently stops without producing any output on a Synapse Spark event log
4 participants