[Question]: Use of memory #587
-
I would like to know if this library can be used to execute Spark batch jobs in a loop. Imagine the following scenario: a single SparkSession is reused to process many JSON files one after another, reading each file into a DataFrame and appending the result to an output.
The scenario I'm describing works fine except that the memory usage keeps growing without bound. Is there a way to dispose of the DataFrame from memory? Is it a problem related to reusing the Spark session? I will investigate this memory leak further, but I thought it was worth asking here in case you have any valuable insight. As a clarification, I'm compiling the code and running the DLL with spark-submit using the DotnetRunner class. Thank you very much for your time.
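A minimal sketch of the scenario described above, assuming the usual Microsoft.Spark API (file names and paths here are illustrative, not from the original post): one SparkSession reused across many small batch jobs in a loop.

```csharp
using Microsoft.Spark.Sql;

class BatchLoop
{
    static void Main(string[] args)
    {
        // One session reused for every iteration of the loop.
        SparkSession spark = SparkSession.Builder().AppName("batch-loop").GetOrCreate();

        foreach (string path in args)
        {
            DataFrame df = spark.Read().Json(path);              // read one JSON input
            df.Write().Mode(SaveMode.Append).Parquet("output/"); // append the results
            // df goes out of scope here, but the JVM-side object it proxies is
            // only released once the CLR wrapper is finalized.
        }

        spark.Stop();
    }
}
```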
Replies: 9 comments
-
@JavierLight Can you try calling `Unpersist()` on the DataFrame?
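A minimal sketch of that suggestion, assuming the DataFrame was cached at some point (the path is illustrative). Passing `true` asks Spark to block until the cached storage is actually deleted.

```csharp
using Microsoft.Spark.Sql;

class UnpersistExample
{
    static void Main()
    {
        SparkSession spark = SparkSession.Builder().GetOrCreate();
        DataFrame df = spark.Read().Json("input.json"); // illustrative path
        df.Persist();

        // ... use df in transformations/actions ...

        df.Unpersist(true); // blocking: wait until cached blocks are freed on the JVM side
        spark.Stop();
    }
}
```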
-
Thanks for the quick reply. I tried your suggestion but it didn't work. The problem is related to RAM consumption: it keeps growing even though I called Unpersist() on the DataFrame and saw in the logs that resources were being freed up. Just to clarify, it is not related to disk usage, only RAM. Maybe the issue is that reusing the SparkSession is not a good idea?
-
Is the memory increase from the CLR or the JVM? Are you bringing any data into the driver? Also, what do you mean by …
-
@imback82 sorry, I didn't explain myself properly. What I meant by …
Since the reader, writer, and DataFrame are local variables scoped to this method (which returns true), I thought it wasn't necessary to dispose of anything. I'm calling this method several times (for different JSON files), and after each call the memory grows by several MB. I'm going to try to diagnose where the memory increase is coming from (JVM vs CLR) and I will let you know. Regarding the driver, this code is the only thing sending data to it. I thought I didn't need to do anything else, but maybe my problem is that I should be disposing of the data on the driver side? Edit: I see there's a …
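A hypothetical reconstruction of the method described above (the names, signature, and paths are mine, since the original code was not preserved in this thread): reader, writer, and DataFrame are all locals, so nothing is disposed of explicitly.

```csharp
using Microsoft.Spark.Sql;

static class Processor
{
    public static bool ProcessFile(SparkSession spark, string jsonPath, string outputPath)
    {
        // Reader, DataFrame, and writer are all method-local.
        DataFrame df = spark.Read().Json(jsonPath);
        df.Write().Mode(SaveMode.Append).Parquet(outputPath);
        return true;
        // The CLR locals become unreachable here, but each one wraps a
        // JvmObjectReference whose JVM-side object is released only when the
        // wrapper's finalizer eventually runs.
    }
}
```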
-
@JavierLight I have a couple of questions: …
-
I will try to clarify some of the doubts, but I still couldn't find where the memory increase is coming from. I'm working on it now.
Thank you for your interest; I will give you more details as soon as I am certain.
-
Well, it seems that the problem is on the JVM side (using top inside the pod):
These are the values after processing 50 elements and disposing of the Spark session. Obviously the CPU usage changes while the application performs certain operations, but then it drops to zero. Even though I dispose of the SparkSession (which also stops the SparkContext), the pod's memory usage (kubectl top pod <pod_id>) remains unchanged. So I guess my problem originates in the driver that handles the …
-
What's the raw value of the JVM memory usage? Also, how does it change as you run the function in a loop? Can you run the following on the C# side to see if it helps:

```csharp
{
    var spark = ...
    spark.Stop();
}
GC.Collect();
GC.WaitForPendingFinalizers();
```

This will force the release of references to JVM objects:
spark/src/csharp/Microsoft.Spark/Interop/Ipc/JvmObjectReference.cs Lines 33 to 39 in f803aa8
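An expanded, self-contained version of that snippet as a sketch (method name and paths are mine): the session lives in an inner scope so the CLR references are dead before the forced collection, letting the JvmObjectReference finalizers release the corresponding JVM-side objects.

```csharp
using System;
using Microsoft.Spark.Sql;

static class BatchRunner
{
    public static void RunOneBatch(string inputPath)
    {
        {
            SparkSession spark = SparkSession.Builder().GetOrCreate();
            DataFrame df = spark.Read().Json(inputPath);
            df.Write().Mode(SaveMode.Append).Parquet("output/"); // illustrative path
            spark.Stop();
        } // spark and df are out of scope here

        GC.Collect();                  // force a full collection
        GC.WaitForPendingFinalizers(); // run finalizers that drop JVM references
    }
}
```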
-
Hello again, thank you all very much for trying to help, for being patient, and for asking the right questions. I found a "solution" to my problem; I'm leaving some details here in case anyone finds them useful in the future (searching the issues by the memory keyword).
The raw value of the JVM memory usage is over 1 GB, and it keeps increasing until the limit is reached. The dotnet process always stays under 50 MB. I have 72 MB in the input of the driver (also increasing, so it's never disposed).
I'm using the following workaround: the entrypoint of my Docker container is a bash script that executes the DLL via spark-submit (using local as master). This script is a loop, so when a spark-submit job is finished a n…
Originally my pod was limited to 500 MB. I increased the memory limit to 1.5 GB and saw a better evolution of the data frame appending process. Let's say these were the values before reaching the memory limit: …
The evolution is not linear, so maybe the limit I set was simply too low. I tried to understand why this happens, but unfortunately I can't provide more information about this behavior. Maybe it is related to partitions, or to the way I'm creating the DataFrameWriter... no idea. Maybe my custom data streaming approach is wrong, so I will leave this issue open in case anyone wants to provide a better solution. If you don't find this useful, feel free to close it. I will leave below an explanation of the feature I wanted to implement.