Addressing Context memory leakage in Java #245

Closed
arifogel opened this issue Oct 13, 2015 · 35 comments

@arifogel

What is the status of Context cleanup in Java? Batfish (github/arifogel/batfish) spawns parallel Java threads that call into Z3 via JNI for an unbounded number of jobs (75 in my most recent use case), and it looks like these Contexts do not clean up after themselves even after calling Context.dispose(). I even tried using -XX:+UseSerialGC, since you seemed to imply in issue #205 that it was necessary for soundness, but this doesn't seem to affect the garbage that never gets collected when each job finishes. The process just ends up crashing once all available memory is allocated (or earlier, if too much time is spent in GC). As such, the only way to run Batfish's Z3 analysis in a single process for all these jobs is to have memory that scales with the total number of jobs, which is prohibitively expensive and unnecessary. I could rewrite my program to use multiple processes rather than threads so the memory is guaranteed to be freed, but that is a terrible hack and would require IPC I'd rather not implement.
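[Editor's note: a minimal sketch of the per-job pattern described above, not Batfish's actual code; it assumes the Java API of the time, where Context still exposed dispose(). Each worker thread creates its own Context, solves, and disposes it.]

import com.microsoft.z3.BoolExpr;
import com.microsoft.z3.Context;
import com.microsoft.z3.Solver;
import com.microsoft.z3.Status;

public class PerJobContext implements Runnable {
    @Override
    public void run() {
        Context ctx = new Context();
        try {
            Solver solver = ctx.mkSolver();
            BoolExpr p = ctx.mkBoolConst("p");
            solver.add(ctx.mkOr(p, ctx.mkNot(p)));
            Status status = solver.check();
            System.out.println("job finished: " + status);
        } finally {
            // Intended to release the native resources of this job's Context,
            // but as reported above the memory is not actually reclaimed.
            ctx.dispose();
        }
    }
}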

@wintersteiger (Contributor)

The status is as described in #205. We may need to implement our own finalization queue to keep track of which Contexts are ready for disposal and which are not. You can try adding some output here to see whether your Context objects do indeed not get cleaned up because Java attempts to finalize them before all dependent objects are finalized. If this doesn't get triggered, then your problem is a different one.

Does -XX:+UseSerialGC only imply a single-threaded GC, or does it also imply finalization in reverse order of creation? If it's the latter, then you're seeing a different, possibly unrelated, issue/leak.

@arifogel (Author)

To answer the question at the end of your last comment, here is the diff of the JVM flags with and without -XX:+UseSerialGC:

arifogel@thedatamatrix:~$diff java-flags.withoutserialgc.txt java-flags.withserialgc.txt
287c287
< uintx MarkSweepDeadRatio = 1 {product}
---
> uintx MarkSweepDeadRatio = 5 {product}
294c294
< uintx MaxHeapFreeRatio = 100 {manageable}
---
> uintx MaxHeapFreeRatio = 70 {manageable}
316c316
< uintx MinHeapFreeRatio = 0 {manageable}
---
> uintx MinHeapFreeRatio = 40 {manageable}
365c365
< uintx ParallelGCThreads = 10 {product}
---
> uintx ParallelGCThreads = 0 {product}
388c388
< uintx PermMarkSweepDeadRatio = 5 {product}
---
> uintx PermMarkSweepDeadRatio = 20 {product}
632,633c632,633
< bool UseParallelGC := true {product}
< bool UseParallelOldGC = true {product}
---
> bool UseParallelGC = false {product}
> bool UseParallelOldGC = false {product}
640c640
< bool UseSerialGC = false {product}
---
> bool UseSerialGC := true {product}

@arifogel (Author)

OK, I added instrumentation to that function. I print m_refCount before and after the call to dispose, and I have a print statement saying m_refCount != 0 in the else branch you indicated. I am not manually calling Context.dispose() anymore. For reference, the Context object should be out of scope after a job terminates. Here are my results with 8 jobs executing in parallel with concurrent GC (I terminated the run this time before it ran out of memory):

*** EXECUTING NOD JOBS ***
Job terminated successfully with result: org.batfish.z3.NodJobResult@73e20f87 after elapsed time: 00:01:1.58 - 1/75 (1.3%) complete
before dispose, m_refCount == 323507
Job terminated successfully with result: org.batfish.z3.NodJobResult@616a37dc after elapsed time: 00:01:2.03 - 2/75 (2.7%) complete
Job terminated successfully with result: org.batfish.z3.NodJobResult@547d573 after elapsed time: 00:01:1.51 - 3/75 (4.0%) complete
after dispose, m_refCount == 323507
m_refCount != 0
before dispose, m_refCount == 120
after dispose, m_refCount == 120
m_refCount != 0
Job terminated successfully with result: org.batfish.z3.NodJobResult@4260904e after elapsed time: 00:01:9.30 - 4/75 (5.3%) complete
before dispose, m_refCount == -491
after dispose, m_refCount == -491
m_refCount != 0
Job terminated successfully with result: org.batfish.z3.NodJobResult@16ecb69e after elapsed time: 00:01:12.83 - 5/75 (6.7%) complete
Job terminated successfully with result: org.batfish.z3.NodJobResult@1cb2d75c after elapsed time: 00:01:13.50 - 6/75 (8.0%) complete
before dispose, m_refCount == 132528
after dispose, m_refCount == 132528
m_refCount != 0
Job terminated successfully with result: org.batfish.z3.NodJobResult@18893021 after elapsed time: 00:01:24.12 - 7/75 (9.3%) complete
before dispose, m_refCount == 192730
after dispose, m_refCount == 192730
m_refCount != 0
before dispose, m_refCount == 82584
after dispose, m_refCount == 82584
m_refCount != 0
before dispose, m_refCount == -105
after dispose, m_refCount == -105
m_refCount != 0
Job terminated successfully with result: org.batfish.z3.NodJobResult@163fa023 after elapsed time: 00:00:52.68 - 8/75 (10.7%) complete
Job terminated successfully with result: org.batfish.z3.NodJobResult@7805cbba after elapsed time: 00:01:4.24 - 9/75 (12.0%) complete
Job terminated successfully with result: org.batfish.z3.NodJobResult@290163e1 after elapsed time: 00:00:51.84 - 10/75 (13.3%) complete
before dispose, m_refCount == 96220
after dispose, m_refCount == 96220
m_refCount != 0
before dispose, m_refCount == 680
after dispose, m_refCount == 680
m_refCount != 0
Job terminated successfully with result: org.batfish.z3.NodJobResult@65409285 after elapsed time: 00:01:14.32 - 11/75 (14.7%) complete
Job terminated successfully with result: org.batfish.z3.NodJobResult@a1193a9 after elapsed time: 00:01:13.34 - 12/75 (16.0%) complete
Job terminated successfully with result: org.batfish.z3.NodJobResult@75baec8e after elapsed time: 00:01:11.30 - 13/75 (17.3%) complete
Job terminated successfully with result: org.batfish.z3.NodJobResult@b9f370 after elapsed time: 00:01:0.26 - 14/75 (18.7%) complete
before dispose, m_refCount == -274
after dispose, m_refCount == -274
m_refCount != 0
Job terminated successfully with result: org.batfish.z3.NodJobResult@2707f992 after elapsed time: 00:01:2.22 - 15/75 (20.0%) complete
before dispose, m_refCount == 697
after dispose, m_refCount == 697
m_refCount != 0
Job terminated successfully with result: org.batfish.z3.NodJobResult@137a9034 after elapsed time: 00:01:9.51 - 16/75 (21.3%) complete
Job terminated successfully with result: org.batfish.z3.NodJobResult@322c2f49 after elapsed time: 00:00:55.44 - 17/75 (22.7%) complete
Job terminated successfully with result: org.batfish.z3.NodJobResult@757b7533 after elapsed time: 00:00:46.89 - 18/75 (24.0%) complete
before dispose, m_refCount == 390
after dispose, m_refCount == 390
m_refCount != 0
Job terminated successfully with result: org.batfish.z3.NodJobResult@65c95a57 after elapsed time: 00:01:17.95 - 19/75 (25.3%) complete
Job terminated successfully with result: org.batfish.z3.NodJobResult@50bbaeee after elapsed time: 00:01:7.80 - 20/75 (26.7%) complete
before dispose, m_refCount == 2027
after dispose, m_refCount == 2027
m_refCount != 0
Job terminated successfully with result: org.batfish.z3.NodJobResult@1943fd4d after elapsed time: 00:03:37.47 - 21/75 (28.0%) complete
before dispose, m_refCount == 564
after dispose, m_refCount == 564
m_refCount != 0
before dispose, m_refCount == -420
after dispose, m_refCount == -420
m_refCount != 0
before dispose, m_refCount == 1448
after dispose, m_refCount == 1448
m_refCount != 0
before dispose, m_refCount == 1182
after dispose, m_refCount == 1182
m_refCount != 0
Job terminated successfully with result: org.batfish.z3.NodJobResult@ce82ec9 after elapsed time: 00:00:58.29 - 22/75 (29.3%) complete
before dispose, m_refCount == 163382
after dispose, m_refCount == 163382
m_refCount != 0
before dispose, m_refCount == 3800
^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^Cafter dispose, m_refCount == 3800
m_refCount != 0
Job terminated successfully with result: org.batfish.z3.NodJobResult@7c5cc270 after elapsed time: 00:01:2.99 - 23/75 (30.7%) complete

^C^C^CJob terminated successfully with result: org.batfish.z3.NodJobResult@573757ec after elapsed time: 00:01:2.23 - 24/75 (32.0%) complete
Job terminated successfully with result: org.batfish.z3.NodJobResult@74e3391b after elapsed time: 00:00:36.32 - 25/75 (33.3%) complete

wintersteiger pushed commits to wintersteiger/z3 that referenced this issue Oct 14, 2015
@wintersteiger (Contributor)

Alright, I committed fixes for the non-Interlocked inc/dec problem and potential super.finalize() problems (fixes are in master now).
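[Editor's note: a minimal sketch, not the actual bindings code, of why non-atomic inc/dec is a problem: a plain ++/-- on a shared counter is not atomic, so concurrent threads and finalizers can lose updates, which is consistent with the negative and non-monotonic m_refCount values in the log above. An AtomicInteger makes the updates thread-safe; releaseNativeContext() is a hypothetical stand-in.]

import java.util.concurrent.atomic.AtomicInteger;

public class RefCountSketch {
    private final AtomicInteger m_refCount = new AtomicInteger(0);

    public void incRef() {
        m_refCount.incrementAndGet(); // atomic, unlike a plain m_refCount++
    }

    public void decRef() {
        if (m_refCount.decrementAndGet() == 0) {
            releaseNativeContext(); // only release once all dependent objects are gone
        }
    }

    private void releaseNativeContext() {
        // placeholder for the native disposal call
    }
}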

Also, I (finally!) created a test program that predictably triggers some of the remaining issues; see Gh245.java. It triggers the m_refCount != 0 upon finalization problem, and, rarely and at least on Windows, it also triggers ast_manager memory leaks (we can ignore those for the purposes of this issue). It also reproduces on OS X.

Another "typical" concurrency problem may be that the creation of Context objects is done in parallel; in the test program I serialized those:

synchronized (m_lock) {
    ctx = new Context();
}

At the moment this is definitely required, but I'm thinking about adding a similar lock directly to the Context constructor, as there is basically no impact for single-context users and multi-context users need to do this anyway.
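[Editor's note: a minimal sketch of how a worker thread might use this pattern; the shared m_lock object and the worker class are assumptions for illustration, not part of the actual test program. Only construction is serialized, and each thread then works on its own Context.]

import com.microsoft.z3.Context;

public class SerializedContextWorker implements Runnable {
    private static final Object m_lock = new Object();

    @Override
    public void run() {
        Context ctx;
        synchronized (m_lock) {
            ctx = new Context(); // creation serialized across threads
        }
        try {
            // ... per-thread solving with ctx ...
        } finally {
            ctx.dispose(); // dispose() in the API of the time; later replaced by close()
        }
    }
}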

wintersteiger pushed a commit to wintersteiger/z3 that referenced this issue Oct 14, 2015
@wintersteiger (Contributor)

Instead of just talking about it, I figured I should just write that fix down, so now concurrent Context creation should be safe in Java and .NET.

wintersteiger pushed commits to wintersteiger/z3 that referenced this issue Oct 14, 2015
wintersteiger self-assigned this Oct 19, 2015
@wintersteiger (Contributor)

Is this still an issue or is the memory being managed well enough for now? I remember that the last few changes for this problem were defensive, i.e., safe but potentially slow. Do you see any of those negative effects?

@arifogel (Author)

arifogel commented Dec 3, 2015

I'm afraid I haven't been using z3 lately, but to my recollection, performance is fine.


@arifogel (Author)

I'd like to reopen this issue.
Here is my use case:
I need to run ~5000 jobs in batfish using 12 parallel Z3 worker threads at a time.
I have 64GiB of RAM on my machine, and have set the -Xmx60g jvm flag.
The output of each job is stored in memory and not garbage collected, but the output is relatively small. I believe the saved output from each job is <1KiB.
When I run the whole thing at once as described, batfish crashes with an out-of-memory exception after several hours with full CPU utilization on all 12 cores.

To work around this, I partitioned the jobs into 13 more or less equally sized chunks, to be run separately. Each chunk was run in a separate process, also with 12 parallel Z3 worker threads. The chunks were run one after the other. All chunks were completed after only 20 minutes.

I assume that the crashing case took much longer for 2 reasons:
(1) Instead of just instantly freeing the memory associated with a chunk of several hundred jobs when the process ended, the memory had to be meticulously freed by the garbage collector.
(2) Old contexts leaving uncollectable garbage eventually reduced the maximum available free memory for new jobs, forcing more and more frequent garbage collection.

We can potentially address (1) by reducing the complexity of garbage collection, e.g. by using PhantomReferences instead of finalizers as alluded to in #442 by cheshire.
I believe that such a solution may also alleviate (2) if engineered properly. However, I do not know enough about the difference between finalizers and PhantomReferences to say so authoritatively.
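[Editor's note: to illustrate the idea, here is a minimal sketch, not the actual Z3 bindings; the long native pointer and the releaseNative() call are hypothetical stand-ins. Each wrapper registers a PhantomReference, and draining the reference queue releases the native handle instead of relying on finalize().]

import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PhantomCleaner {
    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();
    // Keep the references strongly reachable until they are drained.
    private final Map<Reference<?>, Long> pending = new ConcurrentHashMap<>();

    /** Register a wrapper object and the native pointer it owns. */
    public void register(Object wrapper, long nativePtr) {
        pending.put(new PhantomReference<>(wrapper, queue), nativePtr);
    }

    /** Drain the queue and release the handles of wrappers the GC has cleared. */
    public void drain() {
        Reference<?> ref;
        while ((ref = queue.poll()) != null) {
            Long nativePtr = pending.remove(ref);
            if (nativePtr != null) {
                releaseNative(nativePtr); // hypothetical JNI call, e.g. a Z3 dec-ref
            }
        }
    }

    private void releaseNative(long nativePtr) {
        // placeholder for the native decrement/free call
    }
}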

@cheshire (Contributor)

@arifogel Ironically, in my applications I've found that the best option (by far) is not to do garbage collection at all, as the overhead associated with calling "finalize" is so enormous that it far outweighs any benefit of garbage collecting the formulas.

If you need better performance now and you don't mind using a wrapper, you can try our JavaSMT library, which requires closing contexts manually and lets you choose between phantom references and no tracking at all for formulas.

wintersteiger reopened this May 11, 2016
@cheshire (Contributor)

I would be curious to test this now, after I've changed the memory handling in Java bindings.

@arifogel (Author)

@cheshire: Do you mean you changed it in your JavaSMT library? If so, I'm not sure that will work for my use case. I need to be able to use the Network-Optimized Datalog (NoD) plugin of Z3. Do you support that in your library?

@NikolajBjorner (Contributor)

It applies to all Z3 functionality, NoD included.

@arifogel (Author)

I see. But to clarify, does this mean I should use Z3, or JavaSMT (sosy-lab/java-smt?)?

@cheshire (Contributor)

Commit 3e96a79 aims to address the memory leak. Be aware that the context now has to be closed manually (e.g., using try-with-resources).

@arifogel (Author)

arifogel commented Jun 27, 2016

I rebuilt the latest Z3, and batfish no longer compiled because of references to Context.dispose, which no longer exists. So I changed all uses of new Context() to try (Context ctx = new Context()) {...} and removed the calls to dispose. Unfortunately, I see no difference in any of the behavior I have described in this thread; memory usage steadily rises until the system runs out of memory and kills the batfish Java process.
EDIT:
I should note that I tested this both with and without the -Xmx60g java flag, and that I am now using JDK 8 in Ubuntu 16.04, whereas I was previously using JDK 7 in Ubuntu 14.04. For all current runs, it does not appear that the -Xmx60g flag is respected or necessary; the program uses well more than 60GB of memory (all available memory) and then crashes.

@arifogel (Author)

Perhaps there is something wrong with the way I am invoking Z3? For reference, you can look at: https://github.com/arifogel/batfish/blob/master/projects/batfish/src/org/batfish/z3/NodSatJob.java

@cheshire (Contributor)

@arifogel =( Do you think you could use JProfiler (https://www.ej-technologies.com/products/jprofiler/overview.html, it has a 30-day evaluation period) or some other profiling tool (VisualVM is free, but the results are often suboptimal) and see where the memory is actually spent? [Of course you won't see what's happening in the native code, but we could get some hint.]

@arifogel (Author)

I used JProfiler. The memory use goes up a little bit as time goes on, but stays more or less constant. It never exceeds about 11.5GB and periodically drops down to about 3GB. However, the memory used by the Java process as reported by htop goes up at a constant rate, until it uses all 64GB of my memory and crashes. I have a very strong feeling that the native code has a memory leak. What is the next step?

@arifogel (Author)

Here is some of the profiling data for reference:
profiling.zip

@NikolajBjorner (Contributor)

So you have about 14M java.lang.ref.PhantomReference objects around.
As far as I understand these are references to Z3 expressions (6M+ of which are Boolean expressions).
"Someone" would have to tell the GC to collect these as the context object would not go away unless the expressions that are owned by the context go away.
To my understanding there are two possible approaches: (1) Batfish keeps track of the expressions it has created and ensures the dispose/finalize methods are called at scope exit; (2) the Java bindings use a double indirection to the context object instead of the current single indirection. When the context is disposed, the set of allocated reference counts to ast objects is decremented at the time the context is finalized. Dangling expression objects are then no longer valid, as the underlying context pointer is set to null. @cheshire what is your take?

@cheshire (Contributor)

@NikolajBjorner I'm sorry I'm not sure I understand your comment fully.

There is indeed a leak in the current Java bindings: when closing a context c, we clean the reference queues belonging to c, yet not all objects in c might have been GC'd at that point (in that case, their references do not yet appear in the reference queue and so are not cleaned). We could call System.gc(), but I'm not sure it is guaranteed to actually do what we want.

This is easily fixable: when closing the context, instead of going through the reference queue, we can simply remove all objects stored in referenceMap. I will submit a pull request shortly.
However,

  1. In that case the client should be responsible for not storing references to objects belonging to context outside of that context, as any operation on them after the context was closed (e.g. converting to a string or equality checking) would segfault.
  2. I think this is a very small, local leak which only affects objects that were already unreachable but not yet GC'd when the context was closed.
    Unless thousands of contexts are created and closed all the time, I would not expect this to cause any actual problems.
    This hypothesis is consistent with those 14M references occupying only 470MB out of the overall 64GB.
    So currently I agree with @arifogel that there is probably a leak in the native code.

However, in case I'm wrong about (2), let us try the fixed version first and then investigate other options.
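[Editor's note: a minimal sketch of the fix described above; the referenceMap and decRefNative() names are hypothetical stand-ins, not the actual bindings code. On close(), every tracked native object is released directly from the map, whether or not the GC has already enqueued its reference.]

import java.lang.ref.PhantomReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SketchContext implements AutoCloseable {
    // wrapper reference -> native ast pointer it still holds a count on
    private final Map<PhantomReference<?>, Long> referenceMap = new ConcurrentHashMap<>();

    @Override
    public void close() {
        // Release everything still tracked, instead of only draining the reference queue.
        for (Long nativePtr : referenceMap.values()) {
            decRefNative(nativePtr); // hypothetical JNI dec-ref call
        }
        referenceMap.clear();
        // After this point any surviving wrapper object is dangling; using it
        // would touch freed native memory (see caveat (1) above).
    }

    private void decRefNative(long nativePtr) {
        // placeholder for the native decrement call
    }
}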

@NikolajBjorner (Contributor)

Thanks, this would be rather good to try.

@cheshire (Contributor)

@arifogel Can you try my Z3 branch? (https://github.com/cheshire/z3/tree/fix_java_leak)
Also, when load-testing multiple contexts I've found that Z3 slows down to a crawl, and running the same queries serially on a single thread is often much faster (at least on my benchmarks).
What happens to your performance if you force only one thread to execute at a time?

@arifogel (Author)

You want me to run everything serially with a single context? Or just avoid having multiple contexts open at once? The former case would almost certainly use too much memory. The latter case will definitely run more slowly than running in parallel in my experience. However, I can test this with your branch and the current master if you wish.

@cheshire (Contributor)

@arifogel My branch is the same as master now, so re-testing with master is sufficient.

@arifogel (Author)

arifogel commented Jun 30, 2016

OK so now it actually terminates. I am attaching profiling information. At about 90 minutes in (marked at the top of ps.out), I started periodically recording the total virtual and shared resident memory allocated to the process. For the reference/memory count per-class, I'm not actually sure when that snapshot was taken.
profile-new.zip

@arifogel (Author)

I am currently running without profiling, and I estimate it will take about 3 hours and 20 minutes to complete. I should note that in both the profiling and non-profiling cases, memory usage appears to increase steadily, significantly beyond what is used for Java objects. I wonder whether this still constitutes a memory leak and/or is related to the terrible performance relative to running partitions sequentially, with each partition run in parallel (which takes about 20 minutes for everything).

@NikolajBjorner (Contributor)

So your profile indicates that a lot of objects of the type org.batfish.z3.node.VarIntExpr are left uncollected. These are of course Batfish objects. The NodJob.java code does not return any such objects as far as I can see, but maybe there are some container objects that don't get collected sufficiently eagerly. Did you try to force a GC before returning from NodJob? Alternatively, dispose of objects that possibly contain references to these objects.

@arifogel (Author)

The VarIntExpr objects' total size is bounded throughout the computation. It is negligible compared to the unaccounted-for memory usage that is non-JVM-allocated. As I said in an earlier post, I cannot account for the time of the snapshot you are referencing. It is definitely not the final situation, since the process terminates successfully in this case.

@cheshire (Contributor)

@arifogel So I assume we have achieved the goal of not crashing, as now 64GB is sufficient for your application?

As for the leak, I would do the following sanity check: after all context objects are closed and disposed of, and GC is called, [based on my understanding of Z3] there should be almost no memory occupied by native code. That should be relatively easy to check.
If after all context objects are closed there is still a significant amount of memory used, I would try to either use Z3 diagnostic tools (I don't know whether Z3 provides any functions to see where the memory is going?) or generate a core dump and inspect it manually.
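[Editor's note: a minimal sketch of that sanity check, Linux-specific and assuming /proc/self/status is readable. After all Contexts are closed, force a GC and compare the JVM heap usage against the process resident set size; a large persistent gap would point at native memory.]

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class NativeMemoryCheck {
    public static void main(String[] args) throws IOException, InterruptedException {
        // ... create, use, and close all Z3 Contexts here ...
        System.gc();
        Thread.sleep(1000); // give reference processing a moment to run

        long heapUsed = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
        long rssKb = 0L;
        for (String line : Files.readAllLines(Paths.get("/proc/self/status"))) {
            if (line.startsWith("VmRSS:")) {
                rssKb = Long.parseLong(line.replaceAll("\\D+", ""));
                break;
            }
        }
        System.out.printf("JVM heap used: %d MB, process RSS: %d MB%n",
                heapUsed / (1024 * 1024), rssKb / 1024);
    }
}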

@arifogel (Author)

Based on the latest results I've provided, we see that the JRE does not allocate more than about 18GB at any given time (including both used and free heap), yet the total memory used by the Java process running batfish now peaks at about 43GB. This means that we have 25GB of memory unaccounted for. If we are generous and say that 2GB of that 25GB is JRE process overhead, that still leaves 23GB of native memory unaccounted for.
So @cheshire, can you guide me through creating and inspecting a core dump? I know how to do this when a Z3 process is started separately, but I'm not sure what to do when it is called through JNI.

@arifogel (Author)

As an aside, I realize that I brought up issues of space and performance in this thread. Let us just address the memory issue. When we are finished, if there is still a significant slowdown compared to running the same jobs partitioned into multiple executions, I will open a new bug.

@cheshire (Contributor)

@arifogel Why are those 25GB unaccounted for? After all, Z3 has to use some memory. That's why I asked about closing all contexts first.

Creating a core dump is easy: just force Z3 to segfault. I think that creating a formula object, then closing the parent context, then trying to print that formula object (or performing any operation on a closed context in general) will do.
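[Editor's note: a minimal sketch of that crash recipe, assuming the current API where Context implements AutoCloseable and that core dumps are enabled, e.g. via ulimit -c unlimited. Build an expression, close its parent Context, then touch the expression; the toString() call is expected to hit freed native memory and segfault.]

import com.microsoft.z3.BoolExpr;
import com.microsoft.z3.Context;

public class ForceCoreDump {
    public static void main(String[] args) {
        Context ctx = new Context();
        BoolExpr p = ctx.mkBoolConst("p");
        ctx.close();
        // Use-after-close: the underlying ast and context are already freed.
        System.out.println(p.toString());
    }
}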

@arifogel (Author)

arifogel commented Aug 4, 2016

Update:
The reason why partitioning the jobs improved performance was that I was accidentally sending an amount of information (in terms of rules) to each parallel Z3 job that incidentally scaled with the total number of jobs. After changing this so that only the necessary information is provided to each job, performance increased by about 1000x and memory usage is now insubstantial.

While I still believe there is a memory leak on the native side, I'm afraid I don't have a compelling test case anymore. For now I think we can close this. I'll let you know if I run into this bug again as I do different analyses.

@wintersteiger (Contributor)

That's fantastic, thanks for reporting back @arifogel. I'll close this issue for now, but feel free to open a new issue or ask us to reopen this one if you run into new problems.
