java.lang.AssertionError: assertion failed: Byte array does not have correct length #702
Comments
Hi @devoplib, thanks for your interest in Cobrix! I've looked at the stack trace. It seems like the issue is with the reshuffling or the writer since there are no Could you elaborate on how you run tasks in parallel? Do you execute multiple threads within executors, or do you run multiple Spark actions in a multi-threaded fashion? Could you share some example code? |
Initially we created one task that runs a notebook which reads the EBCDIC files, converts them to ASCII, and creates a DataFrame. We save the DataFrame to Databricks Unity Catalog. We have many files, so to increase throughput we created multiple tasks, each running the same notebook on its own subset of the files. When we run a single task with all the files we don't see any issues, but when we run the notebook in multiple tasks we get the error. The only difference between the two runs is that in the first we send all the files to one task, and in the second we split the files across multiple tasks. Say we have 10 files: we run 5 tasks and send 2 randomly chosen files to each of them. With multiple tasks, the job fails while running transformations on the DataFrame, for example df.count(). |
Actually, Cobrix already runs the conversion in parallel. You don't need to process each file separately. You can just use '*' in the path name: spark.read.format("cobol").load("s3://bucket/path/path/*") or just point at the directory: spark.read.format("cobol").load("s3://bucket/path/path") |
The EBCDIC-to-ASCII conversion already runs in parallel within each table, since the layout is the same, as you mentioned above. We have a couple of tables to process and their schemas differ, so we have to read them one by one, and those per-table jobs run in parallel. |
I see, makes sense. Thanks for the explanation. From what I can see, you are processing files properly, and there should not be any issues. I suspect this is related to the environment or an issue on the writer's side, not |
Thank you very much. When we retry the task, it either completes or fails a couple of times and then finally completes. Databricks says the issue is with Cobrix since we are using it. I believe it is not the Cobrix module, but I wanted to make sure. I will check with them again. I really appreciate your quick response. |
Hi devoplib, let me know the solution if you hear anything back from Databricks. We have similar issues and we think it's Databricks, but it doesn't throw any proper error. |
Definitely, I will share if we find anything related to this error. |
If there is the same assertion error on @yruslan, during the read flow of the shuffle map task execution in the executor, assertion failure is from I see the ie .
Is there any chance that @devoplib devoplib @pinakigit
|
We have tried wit |
@pinakigit, I didn't mean that Photon is causing this. Disabling Photon is meant to isolate the issue to the executor JVM and the CobolScanner path. |
Thank you very much for looking into it. Note: to complete the job, we retry it multiple times. Since we have many streams running in parallel, they all finish after a couple of retries. The problem is that the retries consume resources and ultimately cost more. If the job ran without failures, it would finish faster and help us overall. Please note that I tried the following 2 options and it still fails on my end with the same error. Here is the error: Driver stacktrace: |
@devoplib , Thanks for the stack trace.
|
Please find below the code that creates the DataFrame: code_page_name = 'cp037_extended'. We run count(), rdd.isEmpty(), etc. as part of validation. Please let me know if you need additional information. |
@devoplib, @pinakigit, thanks for sharing the details. I tested the fix locally and then raised PR #714. |
@vinodkc, Thank you very much for the fix! 🚀 |
Merged. @devoplib, @pinakigit, please test. I will release a new version tomorrow. |
Sure. We are testing and will let you know the results after the batch is complete. |
Our testing was successful. @pinakigit: How did your batch testing go? |
Our testing was successful over two days of batch runs. The fix seems to be working. Let us know when the new version is available. |
@pinakigit: Thank you very much for the testing. |
Cobrox |
Question
First of all, thank you very much for all your work and support.
We are using Cobrix to convert mainframe files from EBCDIC to ASCII, and it is working perfectly fine in Databricks.
To increase throughput and process the data faster, we run the same job in parallel: instead of sending all the files to one task, we run the same program several times and pass different files to each run. For example, where a single file-load task used to load all 42 files, we now run the same Cobrix conversion module 6 times with 7 files each.
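The split described above can be sketched in plain Python. This is only an illustration, not the poster's actual code: the helper name `split_into_batches` and the file names are hypothetical, and it assumes the files are divided into contiguous groups of near-equal size (elsewhere the thread mentions assigning files to tasks randomly).

```python
from typing import List


def split_into_batches(files: List[str], num_tasks: int) -> List[List[str]]:
    """Split `files` into `num_tasks` contiguous batches of near-equal size."""
    if num_tasks <= 0:
        raise ValueError("num_tasks must be positive")
    base, extra = divmod(len(files), num_tasks)
    batches = []
    start = 0
    for i in range(num_tasks):
        # The first `extra` batches take one additional file each.
        size = base + (1 if i < extra else 0)
        batches.append(files[start:start + size])
        start += size
    return batches


# Hypothetical file names standing in for the 42 mainframe files.
files = [f"file_{i:02d}.dat" for i in range(42)]
batches = split_into_batches(files, 6)
print([len(b) for b in batches])  # [7, 7, 7, 7, 7, 7]
```

Each batch would then be passed to one run of the conversion notebook.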
We get the error "java.lang.AssertionError: assertion failed: Byte array does not have correct length" when we write the data from the DataFrame to a Databricks UC table (i.e. saveAsTable), or when we run actions on the DataFrame such as df.count() or df.rdd.isEmpty().
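For context, the stack trace below shows the assertion being raised inside Spark's SparkContext.binaryRecords, which checks that each record it reads has exactly the configured record length. The plain-Python sketch below mimics that kind of check to show when it trips. It is an analogy only: `read_fixed_length_records` is a hypothetical helper, not a Spark or Cobrix API, the 4-byte record length is an arbitrary example, and the sketch does not reproduce how Spark splits input files.

```python
from typing import Iterator


def read_fixed_length_records(data: bytes, record_length: int) -> Iterator[bytes]:
    """Yield consecutive records, failing if one has an unexpected length."""
    for offset in range(0, len(data), record_length):
        record = data[offset:offset + record_length]
        # Analogue of the failing assertion in the stack trace.
        assert len(record) == record_length, "Byte array does not have correct length"
        yield record


payload = bytes(range(12))  # 12 bytes of sample data

# Reading with the matching record length (4) succeeds.
print(len(list(read_fixed_length_records(payload, 4))))  # 3

# Reading the same data with a mismatched length (5) trips the assertion
# on the final, short record.
try:
    list(read_fixed_length_records(payload, 5))
except AssertionError as err:
    print(err)  # Byte array does not have correct length
```

In other words, the error indicates a disagreement between the record length the reader expects and the bytes it actually receives, rather than a failure in the write itself.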
Note: When the failed task is resubmitted, it completes. It looks like some kind of memory contention on the Databricks driver, since the job never fails when it runs as a single task.
df.write.mode("overWrite").format("delta").saveAsTable(f"{table_name}")
We are looking for any advice on where to look and how to debug the error. Any help is appreciated.
Please find the complete error below.
Py4JJavaError: An error occurred while calling o2411.saveAsTable.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 22 in stage 3238.0 failed 4 times, most recent failure: Lost task 22.3 in stage 3238.0 (TID 50850) (100.126.48.51 executor 27): java.lang.AssertionError: assertion failed: Byte array does not have correct length
at scala.Predef$.assert(Predef.scala:223)
at org.apache.spark.SparkContext.$anonfun$binaryRecords$2(SparkContext.scala:1603)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage24.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:32)
at com.google.common.collect.Iterators$PeekingImpl.hasNext(Iterators.java:1139)
at com.databricks.photon.NativeRowBatchIterator.hasNext(NativeRowBatchIterator.java:44)
at 0xa37b947 .HasNext(external/workspace_spark_3_5/photon/jni-wrappers/jni-row-batch-iterator.cc:50)
at com.databricks.photon.JniApiImpl.hasNext(Native Method)
at com.databricks.photon.JniApi.hasNext(JniApi.scala)
at com.databricks.photon.JniExecNode.hasNext(JniExecNode.java:76)
at com.databricks.photon.BasePhotonResultHandler$$anon$1.hasNext(PhotonExec.scala:862)
at com.databricks.photon.PhotonBasicEvaluatorFactory$PhotonBasicEvaluator$$anon$1.$anonfun$hasNext$1(PhotonBasicEvaluatorFactory.scala:211)
at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
at com.databricks.photon.PhotonResultHandler.timeit(PhotonResultHandler.scala:30)
at com.databricks.photon.PhotonResultHandler.timeit$(PhotonResultHandler.scala:28)
at com.databricks.photon.BasePhotonResultHandler.timeit(PhotonExec.scala:849)
at com.databricks.photon.PhotonBasicEvaluatorFactory$PhotonBasicEvaluator$$anon$1.hasNext(PhotonBasicEvaluatorFactory.scala:211)
at com.databricks.photon.CloseableIterator$$anon$10.hasNext(CloseableIterator.scala:211)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage46.columnartorow_nextBatch_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage46.hashAgg_doAggregateWithKeys_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage46.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:195)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:56)
at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$3(ShuffleMapTask.scala:92)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$1(ShuffleMapTask.scala:87)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:58)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:39)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:201)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:186)
at org.apache.spark.scheduler.Task.$anonfun$run$5(Task.scala:151)
at com.databricks.unity.UCSEphemeralState$Handle.runWith(UCSEphemeralState.scala:45)
at com.databricks.unity.HandleImpl.runWith(UCSHandle.scala:103)
at com.databricks.unity.HandleImpl.$anonfun$runWithAndClose$1(UCSHandle.scala:108)
at scala.util.Using$.resource(Using.scala:269)
at com.databricks.unity.HandleImpl.runWithAndClose(UCSHandle.scala:107)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:145)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:958)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:105)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:961)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:853)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:3908)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:3830)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:3817)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:3817)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1695)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1680)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1680)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:4154)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:4066)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:4054)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:54)
Caused by: java.lang.AssertionError: assertion failed: Byte array does not have correct length
at scala.Predef$.assert(Predef.scala:223)
at org.apache.spark.SparkContext.$anonfun$binaryRecords$2(SparkContext.scala:1603)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage24.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:32)
at com.google.common.collect.Iterators$PeekingImpl.hasNext(Iterators.java:1139)
at com.databricks.photon.NativeRowBatchIterator.hasNext(NativeRowBatchIterator.java:44)
at 0xa37b947 .HasNext(external/workspace_spark_3_5/photon/jni-wrappers/jni-row-batch-iterator.cc:50)
at com.databricks.photon.JniApiImpl.hasNext(Native Method)
at com.databricks.photon.JniApi.hasNext(JniApi.scala)
at com.databricks.photon.JniExecNode.hasNext(JniExecNode.java:76)
at com.databricks.photon.BasePhotonResultHandler$$anon$1.hasNext(PhotonExec.scala:862)
at com.databricks.photon.PhotonBasicEvaluatorFactory$PhotonBasicEvaluator$$anon$1.$anonfun$hasNext$1(PhotonBasicEvaluatorFactory.scala:211)
at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
at com.databricks.photon.PhotonResultHandler.timeit(PhotonResultHandler.scala:30)
at com.databricks.photon.PhotonResultHandler.timeit$(PhotonResultHandler.scala:28)
at com.databricks.photon.BasePhotonResultHandler.timeit(PhotonExec.scala:849)
at com.databricks.photon.PhotonBasicEvaluatorFactory$PhotonBasicEvaluator$$anon$1.hasNext(PhotonBasicEvaluatorFactory.scala:211)
at com.databricks.photon.CloseableIterator$$anon$10.hasNext(CloseableIterator.scala:211)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage46.columnartorow_nextBatch_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage46.hashAgg_doAggregateWithKeys_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage46.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:195)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:56)
at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$3(ShuffleMapTask.scala:92)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$1(ShuffleMapTask.scala:87)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:58)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:39)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:201)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:186)
at org.apache.spark.scheduler.Task.$anonfun$run$5(Task.scala:151)
at com.databricks.unity.UCSEphemeralState$Handle.runWith(UCSEphemeralState.scala:45)
at com.databricks.unity.HandleImpl.runWith(UCSHandle.scala:103)
at com.databricks.unity.HandleImpl.$anonfun$runWithAndClose$1(UCSHandle.scala:108)
at scala.util.Using$.resource(Using.scala:269)
at com.databricks.unity.HandleImpl.runWithAndClose(UCSHandle.scala:107)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:145)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:958)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:105)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:961)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:853)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)