Addressing Context memory leakage in Java #245
The status is as described in #205. We may need to implement our own finalization queue to keep track of which Contexts are ready for disposal and which ones aren't. You can try adding some output here to see whether your Context objects do indeed not get cleaned up because Java attempts to finalize them before all dependent objects are finalized. If this doesn't get triggered, then your problem is a different one. Does -XX:+UseSerialGC only imply a single-threaded GC, or does it also imply finalization in reverse order of creation? If it's the latter, then you're seeing a different, possibly unrelated issue/leak.
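For illustration, here is a minimal sketch of what such a self-managed finalization queue could look like. The names (`NativeObject`, `refCount()`, `disposeNative()`) are assumed hooks for this sketch, not Z3's actual internals:

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch of the "own finalization queue" idea: finalize() never frees native
// memory directly (finalization order across dependent objects is not
// guaranteed); it only enqueues. A separate drain pass disposes objects whose
// dependents are gone and re-queues the rest.
abstract class NativeObject {
    private static final ConcurrentLinkedQueue<NativeObject> PENDING =
            new ConcurrentLinkedQueue<>();

    /** Assumed accessor: number of dependent objects still alive. */
    protected abstract int refCount();

    /** Assumed hook: frees the underlying native memory. */
    protected abstract void disposeNative();

    @Override
    protected void finalize() throws Throwable {
        try {
            PENDING.add(this); // defer disposal instead of doing it here
        } finally {
            super.finalize();
        }
    }

    /** Dispose everything that is ready; keep the rest for a later pass. */
    static void drainPending() {
        int n = PENDING.size();
        for (int i = 0; i < n; i++) {
            NativeObject o = PENDING.poll();
            if (o == null) {
                break;
            }
            if (o.refCount() == 0) {
                o.disposeNative();
            } else {
                PENDING.add(o); // dependents still alive; retry later
            }
        }
    }
}
```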
To answer the question at the end of your last comment, here is the diff of the JVM flags with and without -XX:+UseSerialGC:
```
arifogel@thedatamatrix:~$ diff java-flags.withoutserialgc.txt java-flags.withserialgc.txt
287c287
<     uintx MarkSweepDeadRatio         = 1        {product}
---
>     uintx MarkSweepDeadRatio         = 5        {product}
294c294
<     uintx MaxHeapFreeRatio           = 100      {manageable}
---
>     uintx MaxHeapFreeRatio           = 70       {manageable}
316c316
<     uintx MinHeapFreeRatio           = 0        {manageable}
---
>     uintx MinHeapFreeRatio           = 40       {manageable}
365c365
<     uintx ParallelGCThreads          = 10       {product}
---
>     uintx ParallelGCThreads          = 0        {product}
388c388
<     uintx PermMarkSweepDeadRatio     = 5        {product}
---
>     uintx PermMarkSweepDeadRatio     = 20       {product}
632,633c632,633
<      bool UseParallelGC             := true     {product}
<      bool UseParallelOldGC           = true     {product}
---
>      bool UseParallelGC              = false    {product}
>      bool UseParallelOldGC           = false    {product}
640c640
<      bool UseSerialGC                = false    {product}
---
>      bool UseSerialGC               := true     {product}
```
OK, I added instrumentation to that function: I print m_refCount before and after the call to dispose, and I added a print statement saying m_refCount != 0 in the else branch you indicated. I am not manually calling Context.dispose() anymore. For reference, the Context object should be out of scope after a job terminates. Here are my results with 8 jobs executing in parallel with concurrent GC (I terminated it this time before it ran out of memory):

```
*** EXECUTING NOD JOBS ***
^C^C^CJob terminated successfully with result: org.batfish.z3.NodJobResult@573757ec after elapsed time: 00:01:2.23 - 24/75 (32.0%) complete
```
Alright, I committed fixes for the non-Interlocked inc/dec problem and the potential super.finalize() problems (fixes are in master now). Also, I (finally!) created a test program that predictably triggers some of the remaining issues; see Gh245.java. It triggers the m_refCount != 0 upon finalization problem, and it also triggers ast_manager memory leaks, rarely and at least on Windows (we can ignore those for the purposes of this issue). Another "typical" concurrency problem may be that the creation of Context objects is done in parallel; in the test program I serialized those.
At the moment this is definitely required, but I'm thinking about adding a similar lock directly to the Context constructor, as there is basically no impact for single-context users, and multi-context users need to do that anyway.
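In the meantime, a sketch of that serialization on the caller side, assuming a hypothetical `ContextFactory` helper (the class and lock names are illustrative, not part of the Z3 API):

```java
import com.microsoft.z3.Context;

public final class ContextFactory {
    // Global lock so that no two threads ever construct a Context at the
    // same time, mirroring the serialization done in the test program.
    private static final Object CREATION_LOCK = new Object();

    private ContextFactory() {}

    public static Context newContext() {
        synchronized (CREATION_LOCK) {
            return new Context();
        }
    }
}
```

Once the lock lives inside the Context constructor itself, callers can drop a wrapper like this.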
Instead of just talking about it, I figured I should write that fix down, so concurrent Context creation should now be safe in Java and .NET.
Is this still an issue, or is the memory being managed well enough for now? I remember that the last few changes for this problem were defensive, i.e., safe but potentially slow. Do you see any of those negative effects?
I'm afraid I haven't been using Z3 lately, but to my recollection, performance is fine.
I'd like to reopen this issue. To work around it, I partitioned the jobs into 13 more or less equally-sized chunks, to be run separately. Each chunk was run in a separate process, also with 12 parallel Z3 worker threads, and the chunks were run one after the other. All chunks were completed after only 20 minutes. I assume that the crashing case took much longer for two reasons. We can potentially address (1) by reducing the complexity of garbage collection, e.g. by using PhantomReferences instead of finalizers, as alluded to in #442 by cheshire.
@arifogel In my applications, ironically, I've found the best combination (by far) is not to do garbage collection at all, as the overhead associated with calling "finalize" is so ENORMOUS that it far outweighs any benefit provided by garbage-collecting the formulas. If you need better performance now and you don't mind using a wrapper, you can try using our JavaSMT library, which requires closing contexts manually and gives an option on whether to use phantom references or nothing for keeping track of formulas.
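If you want to try JavaSMT, the entry point looks roughly like this. This is a sketch from memory; package and factory names have shifted between JavaSMT versions, so check the project's README for the current API:

```java
import org.sosy_lab.common.ShutdownNotifier;
import org.sosy_lab.common.configuration.Configuration;
import org.sosy_lab.common.configuration.InvalidConfigurationException;
import org.sosy_lab.common.log.LogManager;
import org.sosy_lab.java_smt.SolverContextFactory;
import org.sosy_lab.java_smt.SolverContextFactory.Solvers;
import org.sosy_lab.java_smt.api.SolverContext;

public class JavaSmtExample {
    public static void main(String[] args) throws InvalidConfigurationException {
        // Contexts are closed manually; try-with-resources makes that explicit.
        try (SolverContext context = SolverContextFactory.createSolverContext(
                Configuration.defaultConfiguration(),
                LogManager.createNullLogManager(),
                ShutdownNotifier.createDummy(),
                Solvers.Z3)) {
            System.out.println(context.getVersion());
        }
    }
}
```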
I would be curious to test this now, after I've changed the memory handling in the Java bindings.
@cheshire: Do you mean you changed it in your JavaSMT library? If so, I'm not sure that will work for my use case. I need to be able to use the Network-Optimized Datalog (NoD) plugin of Z3. Do you support that in your library?
It applies to all Z3 functionality, including NoD.
I see. But to clarify, does this mean I should use Z3, or JavaSMT (sosy-lab/java-smt)?
3e96a79
I rebuilt the latest Z3, and batfish no longer compiled because of references to Context.dispose, which no longer exists. So I changed all uses of new Context() to try (Context ctx = new Context()) { ... } and removed the calls to dispose. Unfortunately, I see no difference in any of the behavior I have described in this thread; memory usage steadily rises until the system runs out of memory and kills the batfish java process.
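For reference, the pattern in isolation; a minimal sketch (the solver and formula calls are just placeholders), assuming a Z3 build in which Context implements AutoCloseable:

```java
import com.microsoft.z3.BoolExpr;
import com.microsoft.z3.Context;
import com.microsoft.z3.Solver;
import com.microsoft.z3.Status;

public class TryWithResourcesExample {
    public static void main(String[] args) {
        // close() releases the native context when the try block exits,
        // replacing the removed Context.dispose().
        try (Context ctx = new Context()) {
            Solver solver = ctx.mkSolver();
            BoolExpr p = ctx.mkBoolConst("p");
            solver.add(ctx.mkOr(p, ctx.mkNot(p))); // trivially satisfiable
            System.out.println(solver.check() == Status.SATISFIABLE);
        }
    }
}
```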
Perhaps there is something wrong with the way I am invoking Z3? For reference, you can look at: https://github.com/arifogel/batfish/blob/master/projects/batfish/src/org/batfish/z3/NodSatJob.java
@arifogel =( Do you think you could use JProfiler (https://www.ej-technologies.com/products/jprofiler/overview.html; it has a 30-day evaluation period) or some other profiling tool (VisualVM is free, but the results are often suboptimal) and see where the memory is actually spent? [Of course you won't see what's happening in the native code, but we could get some hint.]
I used JProfiler. The memory use goes up a little bit as time goes on, but stays more or less constant: it never uses more than about 11.5GB, and periodically drops down to about 3GB. However, the memory used by the java process, as reported by htop, goes up at a constant rate until it uses all 64GB of my memory and crashes. I have a very strong feeling that the native code has a memory leak. What is the next step?
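One way to quantify that gap from inside the process is to log JVM heap usage next to the OS-reported resident set size. A Linux-only sketch (the /proc parsing here is ad hoc):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class MemoryGap {
    public static void main(String[] args) throws IOException {
        Runtime rt = Runtime.getRuntime();
        long heapUsed = rt.totalMemory() - rt.freeMemory();
        // VmRSS is what tools like htop report for the process.
        long rssKb;
        try (Stream<String> lines = Files.lines(Paths.get("/proc/self/status"))) {
            rssKb = lines.filter(l -> l.startsWith("VmRSS:"))
                    .mapToLong(l -> Long.parseLong(l.replaceAll("\\D+", "")))
                    .findFirst()
                    .orElse(-1);
        }
        System.out.printf("heap used: %d MB, resident set: %d MB%n",
                heapUsed / (1024 * 1024), rssKb / 1024);
    }
}
```

A large, steadily growing difference between the two numbers points at native allocations rather than Java objects.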
Here is some of the profiling data for reference:
So you have about 14M java.lang.ref.PhantomReference objects around. |
@NikolajBjorner I'm sorry, I'm not sure I understand your comment fully. There is indeed a leak in the current Java bindings: phantom references accumulate when closing contexts. This is easily fixable: when closing the context, instead of going through the reference queue, we can simply remove all objects stored internally.
However, in case I'm wrong about (2), let us try the fixed version first, and then investigate other options.
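To make the fix concrete, here is a generic sketch of phantom-reference tracking with the close-time shortcut described above; `NativeTracker` and `freeNative` are illustrative names for this sketch, not the bindings' actual internals:

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class NativeTracker<T> {
    private final ReferenceQueue<T> queue = new ReferenceQueue<>();
    // Keeps each phantom reference strongly reachable until we free its
    // native pointer; this map is also what must be emptied on close.
    private final Map<PhantomReference<T>, Long> livePointers =
            new ConcurrentHashMap<>();

    void track(T obj, long nativePtr) {
        livePointers.put(new PhantomReference<>(obj, queue), nativePtr);
    }

    /** Normal path: free native memory for objects the GC has reclaimed. */
    void drainQueue() {
        Reference<? extends T> ref;
        while ((ref = queue.poll()) != null) {
            Long ptr = livePointers.remove(ref);
            if (ptr != null) {
                freeNative(ptr);
            }
        }
    }

    /** Close-time path: free everything at once instead of waiting on GC. */
    void closeAll() {
        for (Long ptr : livePointers.values()) {
            freeNative(ptr);
        }
        livePointers.clear();
    }

    private void freeNative(long ptr) {
        // A real implementation would call into the native library here.
    }
}
```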
Thanks, this would be rather good to try.
@arifogel Can you try my Z3 branch? (https://github.com/cheshire/z3/tree/fix_java_leak)
You want me to run everything serially with a single context? Or just avoid
@arifogel My branch is the same as master now, so re-testing with master is sufficient.
OK, so now it actually terminates. I am attaching profiling information. At about 90 minutes in (marked at the top of ps.out), I started periodically recording the total virtual and
I am currently running without profiling, and I estimate it will take about 3 hours and 20 minutes to complete. I should note that in both the profiling and non-profiling cases, memory usage appears to steadily increase well beyond what is used for Java objects. I wonder whether this still constitutes a memory leak, and/or whether it is related to the terrible performance relative to running the partitions sequentially, with each run conducted in parallel (which takes about 20 minutes for everything).
So your profile indicates that a lot of uncollected objects are of the type org.batfish.z3.node.VarIntExpr. These are of course batfish objects. The NodJob.java code does not return any such objects as far as I can see, but there may be some container objects that don't get collected sufficiently eagerly. Did you try forcing a GC before returning from NodJob? Alternatively, dispose of objects that may hold references to these objects.
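A sketch of that suggestion in isolation; the job body is a stand-in, and note that System.gc() is only a hint to the collector, so this is a diagnostic aid rather than a guaranteed fix:

```java
import java.util.ArrayList;
import java.util.List;

public class EagerCleanupJob {
    private List<int[]> scratch = new ArrayList<>();

    public long call() {
        for (int i = 0; i < 1000; i++) {
            scratch.add(new int[1024]); // stand-in for per-job expression trees
        }
        long result = scratch.size();
        scratch = null;            // make the job's own data unreachable
        System.gc();               // request (not force) a collection
        System.runFinalization();  // give pending finalizers a chance to run
        return result;
    }

    public static void main(String[] args) {
        System.out.println(new EagerCleanupJob().call());
    }
}
```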
The VarIntExpr objects' total size is bounded throughout the computation.
@arifogel So I assume we have achieved the goal of not crashing, as 64GB is now sufficient for your application? As for the leak, I would do the following sanity check: after all Context objects are closed and disposed of, and GC has run, [based on my understanding of Z3] there should be almost no memory occupied by native code. That should be relatively easy to check.
Based on the latest results I've provided, the JRE does not allocate more than about 18GB at any given time (including both used and free heap), yet the total memory used by the java process running batfish now peaks at about 43GB. This means we have 25GB of memory unaccounted for. Even if we are generous and say that 2GB of that is JRE process overhead, that still leaves 23GB of native memory unaccounted for.
As an aside, I realize that I have brought up both space and performance issues in this thread. Let us address just the memory issue here. When we are finished, if there is still a significant slowdown compared to running the same jobs partitioned into multiple executions, I will open a new bug.
@arifogel Why are those 25GB unaccounted for? After all, Z3 has to use some memory. That's why I asked about closing all contexts first. Creating a core dump is easy: just force Z3 to segfault. I think creating a formula object, then closing the parent context, then trying to print that formula object (or performing any operation on a closed context in general) will do.
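A sketch of that crash recipe; this is deliberately broken code (a use-after-free on the native side) and will typically bring down the JVM, leaving a core dump if ulimit permits one:

```java
import com.microsoft.z3.BoolExpr;
import com.microsoft.z3.Context;

public class ForceCoreDump {
    public static void main(String[] args) {
        Context ctx = new Context();
        BoolExpr p = ctx.mkBoolConst("p");
        ctx.close();           // frees the native context memory
        System.out.println(p); // toString() walks freed native memory: crash
    }
}
```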
Update: While I still believe there is a memory leak on the native side, I'm afraid I don't have a compelling test case anymore. For now I think we can close this. I'll let you know if I run into this bug again as I do different analyses.
That's fantastic, thanks for reporting back @arifogel. I'll close this issue for now, but feel free to open a new issue or ask us to reopen this one if you run into new problems.
What is the status of Context cleanup in Java? Batfish (github/arifogel/batfish) uses JNI to spawn threads in parallel for an unlimited number of jobs (75 in my most recent use case), and it looks like these Contexts are not cleaning up after themselves after calling Context.dispose(). I even tried using -XX:+UseSerialGC, since you seemed to imply in issue #205 that it was necessary for soundness, but this doesn't seem to affect the garbage that never gets cleaned up when each job finishes. The process just ends up crashing after all available memory is allocated (or earlier, if too much time is spent in GC). As such, the only way to run Batfish's Z3 analysis in a single process for all these jobs is to have memory that scales with the total number of jobs, which is prohibitively expensive and unnecessary. I could rewrite my program to use multiple processes rather than threads so the memory is guaranteed to be freed, but that is a terrible hack and would require IPC I'd rather not implement.