[Question]: Use of memory #587
-
I would like to know if this library can be used to execute Spark batch jobs in a loop. Imagine the following scenario: a single SparkSession is reused to process many JSON files one after another, reading each file into a DataFrame and appending the result to an output.
The scenario I'm describing works fine except that the memory usage keeps growing without bound. Is there a way to dispose of the DataFrame from memory? Is it a problem related to reusing the Spark session? I will investigate this memory leak further, but I thought it was worth asking here in case you have any valuable insight. As a clarification, I'm compiling the code and running the DLL with spark-submit using the DotnetRunner class. Thank you very much for your time.
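A minimal sketch of the scenario described above, assuming the usual Microsoft.Spark API (file names and paths here are illustrative, not from the original post): one SparkSession reused across many small batch jobs in a loop.

```csharp
using Microsoft.Spark.Sql;

class BatchLoop
{
    static void Main(string[] args)
    {
        // One session reused for every iteration of the loop.
        SparkSession spark = SparkSession.Builder().AppName("batch-loop").GetOrCreate();

        foreach (string path in args)
        {
            DataFrame df = spark.Read().Json(path);              // read one JSON input
            df.Write().Mode(SaveMode.Append).Parquet("output/"); // append the results
            // df goes out of scope here, but the JVM-side object it proxies is
            // only released once the CLR wrapper is finalized.
        }

        spark.Stop();
    }
}
```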
Replies: 9 comments
-
@JavierLight Can you try calling `Unpersist()` on the DataFrame?
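A minimal sketch of that suggestion, assuming the DataFrame was cached at some point (the path is illustrative). Passing `true` asks Spark to block until the cached storage is actually deleted.

```csharp
using Microsoft.Spark.Sql;

class UnpersistExample
{
    static void Main()
    {
        SparkSession spark = SparkSession.Builder().GetOrCreate();
        DataFrame df = spark.Read().Json("input.json"); // illustrative path
        df.Persist();

        // ... use df in transformations/actions ...

        df.Unpersist(true); // blocking: wait until cached blocks are freed on the JVM side
        spark.Stop();
    }
}
```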
-
Thanks for the quick reply. I tried your suggestion but it didn't work. The problem is related to RAM consumption: it keeps growing even though I called Unpersist() on the DataFrame and saw in the logs that resources were being freed up. Just to clarify, it is not related to disk usage, only RAM. Maybe the issue is that reusing the SparkSession is not a good idea?
-
Is the memory increase from the CLR or the JVM? Are you bringing any data into the driver? Also, what do you mean by …
-
@imback82 sorry, I didn't explain myself properly. What I meant by …
Since the reader, writer, and DataFrame are local variables scoped to this method (which returns true), I thought it wasn't necessary to dispose of anything. I'm calling this method several times (for different JSON files), and after each call the memory grows by several MB. I'm going to try to diagnose where the memory increase is coming from (JVM vs CLR) and I will let you know. Regarding the driver, this code is the only thing sending data to it. I thought I didn't need to do anything else, but maybe my problem is that I should be disposing of the data on the driver side? Edit: I see there's a …
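A hypothetical reconstruction of the method described above (the names, signature, and paths are mine, since the original code was not preserved in this thread): reader, writer, and DataFrame are all locals, so nothing is disposed of explicitly.

```csharp
using Microsoft.Spark.Sql;

static class Processor
{
    public static bool ProcessFile(SparkSession spark, string jsonPath, string outputPath)
    {
        // Reader, DataFrame, and writer are all method-local.
        DataFrame df = spark.Read().Json(jsonPath);
        df.Write().Mode(SaveMode.Append).Parquet(outputPath);
        return true;
        // The CLR locals become unreachable here, but each one wraps a
        // JvmObjectReference whose JVM-side object is released only when the
        // wrapper's finalizer eventually runs.
    }
}
```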
-
@JavierLight I have a couple of questions: …
-
I will try to clarify some of the doubts, but I still couldn't find where the memory increase is coming from. I'm working on it now.
Thank you for your interest; I will give you more details as soon as I am certain.
-
Well, it seems that the problem is on the JVM side (using top inside the pod):
These are the values after processing 50 elements and disposing of the Spark session. Obviously the CPU usage changes while the application performs certain operations, but then it drops to zero. Even though I dispose of the SparkSession (which also stops the SparkContext), the pod's memory usage (kubectl top pod <pod_id>) remains unchanged. So I guess my problem originates in the driver that handles the …
-
What's the raw value of the JVM memory usage? Also, how does it change as you run the function in a loop? Can you run the following on the C# side to see if it helps:

```csharp
{
    var spark = ...
    spark.Stop();
}
GC.Collect();
GC.WaitForPendingFinalizers();
```

This will force the release of references to JVM objects:
spark/src/csharp/Microsoft.Spark/Interop/Ipc/JvmObjectReference.cs Lines 33 to 39 in f803aa8
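An expanded, self-contained version of that snippet as a sketch (method name and paths are mine): the session lives in an inner scope so the CLR references are dead before the forced collection, letting the JvmObjectReference finalizers release the corresponding JVM-side objects.

```csharp
using System;
using Microsoft.Spark.Sql;

static class BatchRunner
{
    public static void RunOneBatch(string inputPath)
    {
        {
            SparkSession spark = SparkSession.Builder().GetOrCreate();
            DataFrame df = spark.Read().Json(inputPath);
            df.Write().Mode(SaveMode.Append).Parquet("output/"); // illustrative path
            spark.Stop();
        } // spark and df are out of scope here

        GC.Collect();                  // force a full collection
        GC.WaitForPendingFinalizers(); // run finalizers that drop JVM references
    }
}
```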
-
Hello again, thank you all very much for trying to help, for being patient, and for asking the right questions. I found a "solution" to my problem; I'm leaving some details here in case anyone finds them useful in the future (searching the issues by the memory keyword).
The raw value of the JVM memory usage is over 1 GB, and it keeps increasing until the limit is reached. The dotnet process always stays under 50 MB. I have 72 MB in the input of the driver (also increasing, so it's never disposed).
I'm using the following workaround: the entrypoint of my Docker container is a bash script that executes the DLL via spark-submit (using local as master). This script is a loop, so when a spark-submit job is finished a n…
Originally my pod was limited to 500 MB. I increased the memory limit to 1.5 GB and saw a better evolution of the data frame appending process. Let's say these were the values before reaching the memory limit: …
The evolution is not linear, so maybe the limit I set was simply too low. I tried to understand why this happens, but unfortunately I can't provide more information about this behavior. Maybe it is related to partitions, or to the way I'm creating the DataFrameWriter... no idea. Maybe my custom data streaming approach is wrong, so I will leave this issue open in case anyone wants to provide a better solution. If you don't find this useful, feel free to close it. I will leave below an explanation of the feature I wanted to implement.