This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Scala inference memory leak fix #11204

Merged
merged 3 commits into from
Jun 14, 2018

Conversation

andrewfayres
Contributor

@andrewfayres andrewfayres commented Jun 8, 2018

Description

This fixes a memory leak that results from the FeedForward.predict method not properly disposing of the results from NDArray.slice().

Testing

Verifying leak exists

To verify that there was a leak in the existing code, I created and trained a basic model on MNIST data, called the predict method inside a while(true) loop, and monitored the process's memory usage.

Verifying leak fix

After making the code changes, I repeated the process I had used to verify there was a leak. Memory consumption was much improved and looks to be stable.

More testing

In addition to the tests I ran to monitor memory usage, I also ran the existing tests with 'make scalatest'. All tests pass.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, the expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

@andrewfayres andrewfayres requested a review from yzhliu as a code owner June 8, 2018 17:54
@andrewfayres
Contributor Author

@nswamy @lanking520

- list += nd.slice(0, realSize).copy()
+ val ndSliced = nd.slice(0, realSize)
+ list += ndSliced.copy()
+ ndSliced.dispose()
Member

A slight concern here: we cannot dispose of the whole nd, only a slice of it. But since this tested as stable, it's better than letting customers run with a memory leak. LGTM for now.

Member

because predExec.outputs is reused and has the same lifecycle as predExec

Member

@yzhliu yzhliu Jun 9, 2018

But how about not calling copy(), and simply doing list += nd.slice(0, realSize)?

Member

We tried that and it just breaks for unknown reasons. The first inference goes well and the second just breaks.


@lanking520 Can you show the exception information?


@lanking520 @yzhliu Is it because the JNI doesn't free the memory?

Member

@liuzx32 The way the Scala package works is more like playing with shared pointers. When we create an NDArray, we get a pointer to a memory space on the C++ side. If we want to release that memory, we need to call dispose somewhere in the Scala code. If we forget to do that, the JVM will eventually discard the pointer itself rather than releasing the native memory for us, and that causes the memory leak.
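The handle-vs-GC distinction described above can be sketched with a toy model. Everything here is illustrative, not the real MXNet API: NativeHeap simulates C++-side memory with a counter, and FakeNDArray stands in for NDArray, so the leak in the old copy-of-slice pattern becomes directly observable from Scala.

```scala
// Simulated native (C++-side) heap: the JVM GC knows nothing about this counter.
object NativeHeap {
  var live: Int = 0                                  // simulated native bytes in use
  def alloc(bytes: Int): Long = { live += bytes; bytes.toLong }
  def free(handle: Long): Unit = { live -= handle.toInt }
}

// Stand-in for NDArray: a thin JVM wrapper around a native handle.
class FakeNDArray(bytes: Int) {
  private val handle: Long = NativeHeap.alloc(bytes)
  private var disposed = false
  def slice(): FakeNDArray = new FakeNDArray(bytes / 2) // a slice allocates its own handle
  def copy(): FakeNDArray = new FakeNDArray(bytes)
  def dispose(): Unit = if (!disposed) { NativeHeap.free(handle); disposed = true }
}

object Demo {
  // Old pattern: the intermediate slice's handle is dropped but never freed.
  def leakyPredict(nd: FakeNDArray): FakeNDArray =
    nd.slice().copy()

  // Fixed pattern from this PR: dispose the slice once the copy is taken.
  def fixedPredict(nd: FakeNDArray): FakeNDArray = {
    val ndSliced = nd.slice()
    val out = ndSliced.copy()
    ndSliced.dispose()
    out
  }
}
```

Even after the caller disposes of the returned array, leakyPredict leaves the slice's native allocation behind, while fixedPredict returns the simulated heap to its starting level.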

Contributor

@lupesko lupesko left a comment

Some comments.
On a wider note: how can we verify this with automated testing? (unit tests or integration tests)

- list += nd.slice(0, realSize).copy()
+ val ndSliced = nd.slice(0, realSize)
+ list += ndSliced.copy()
+ ndSliced.dispose()
Contributor

Shouldn't we do a try/finally to make sure dispose always happens, even in case of an exception?

Contributor Author

If we're getting an exception here, something has gone very wrong. Most likely the problem would be memory access/allocation, and I'm not confident dispose would work correctly under those conditions.

I'll make the change anyway; although I think it's unlikely to ever help, it definitely won't hurt and is good practice to follow.

Member

Adding try/finally every time we dispose of something feels like a bit of overkill.
Eventually that should be handled by GC.

}
batch.dispose()
Contributor

Same comment as above about using try/finally to make sure dispose always happens, even in case of an exception.

@@ -230,8 +230,11 @@ class FeedForward private(
      val padded = batch.pad
      val realSize = batchSize - padded
      for ((list, nd) <- outputs zip predExec.outputs) {
-       list += nd.slice(0, realSize).copy()
+       val ndSliced = nd.slice(0, realSize)
Contributor

I think we should add a comment explaining why it has to be done this way, and not the more compact way used before. Otherwise, someone may easily revert this change unknowingly in the future.

@lanking520
Member

Hi @lupesko, a single inference is covered by the existing unit test with the MNIST example. The initial bug came from inference serving (a crash after an hour). We tested offline, checked the memory usage, and saw that it stays stable for a long time. That will be hard to track in an automated test. I think we can place this test as a part of Scala Benchmark AI if they can do it in the nightly test.

}
}
} finally {
batch.dispose()
Member

Shouldn't it be the dataArrays that need to be disposed? See loadData and loadDatageneral, where it does a copy from batch to dataArrays.

I think we shouldn't dispose of the original input from the Iterator; for example, the user may want to use the same input on another model.

Contributor Author

We aren't disposing of the original input. When data.next() is called, a slice of the data is made here. This slice is what should be getting disposed of when we dispose of batch.

Member

Thanks Andrew, what about dataArrays?

Contributor Author

Summarizing the conversation we had offline: it appears that the copyTo in loadData copies into memory owned by the Executor predExec. The memory in the executor seems to get reused across predict calls. I think the proper place to handle this would be to add a dispose method to FeedForward and have it dispose of its predExec.
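The proposal above can be sketched in miniature. ExecutorStub and FeedForwardStub are hypothetical stand-ins, not the real MXNet Executor and FeedForward classes; the point is only the ownership relationship being suggested.

```scala
// Stand-in for Executor: owns native output arrays reused across predict calls.
trait ExecutorStub { def dispose(): Unit }

// Stand-in for FeedForward: a model-level dispose delegates to the executor
// it owns, releasing the native memory that individual predict calls reuse.
class FeedForwardStub(predExec: ExecutorStub) {
  def dispose(): Unit = predExec.dispose()
}
```

The design choice here is that the executor's memory has the model's lifecycle (as noted earlier in this thread), so the user frees it once when done with the model rather than per prediction.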

@nswamy nswamy merged commit bbc7a22 into apache:master Jun 14, 2018
anirudh2290 pushed a commit that referenced this pull request Jun 14, 2018
* Fixes Scala memory leak (#10436)

* Replaced the copy and disposed of sliced ndArray to resolve memory leak

* Wrapped disposes in a finally to ensure they are called.
marcoabreu pushed a commit to marcoabreu/incubator-mxnet that referenced this pull request Jun 14, 2018
* Fixes Scala memory leak (apache#10436)

* Replaced the copy and disposed of sliced ndArray to resolve memory leak

* Wrapped disposes in a finally to ensure they are called.
marcoabreu pushed a commit that referenced this pull request Jun 14, 2018
* Fixes Scala memory leak (#10436)

* Replaced the copy and disposed of sliced ndArray to resolve memory leak

* Wrapped disposes in a finally to ensure they are called.
anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this pull request Jun 15, 2018
* Fixes Scala memory leak (apache#10436)

* Replaced the copy and disposed of sliced ndArray to resolve memory leak

* Wrapped disposes in a finally to ensure they are called.
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018
XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018