This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Scala inference memory leak fix #11204

Merged
merged 3 commits into from
Jun 14, 2018

Conversation

andrewfayres
Contributor

@andrewfayres andrewfayres commented Jun 8, 2018

Description

This fixes a memory leak that results from the FeedForward.predict method not properly disposing of the results from NDArray.slice().

Testing

Verifying leak exists

To verify that there was a leak in the existing code, I created and trained a basic model on MNIST data, called the predict method inside a while(true) loop, and monitored the process's memory usage.

Verifying leak fix

After making the code changes, I repeated the process I had used to verify there was a leak. Memory consumption was much improved and looks to be stable.

More testing

In addition to the tests I ran to monitor memory usage, I also ran the existing tests with 'make scalatest'. All tests pass.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, the expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

@andrewfayres andrewfayres requested a review from yzhliu as a code owner June 8, 2018 17:54
@andrewfayres
Contributor Author

@nswamy @lanking520

- list += nd.slice(0, realSize).copy()
+ val ndSliced = nd.slice(0, realSize)
+ list += ndSliced.copy()
+ ndSliced.dispose()
Member

A slight concern here: we cannot dispose of the whole nd, only a slice of it. But since this tested as stable, it's better than letting customers run with a memory leak. LGTM for now.

Member

because predExec.outputs is reused and has the same lifecycle as predExec

Member

@yzhliu yzhliu Jun 9, 2018

But how about not calling copy(), and simply doing list += nd.slice(0, realSize)?

Member

We tried that and it just breaks for unknown reasons. The first inference goes well and the second just breaks.


@lanking520 Can you show the exception information?


@lanking520 @yzhliu Is it because the JNI doesn't free the memory?

Member

@liuzx32 The way the Scala package works is more like playing with shared pointers. When we create an NDArray, we get a pointer to a memory space on the C++ side. If we want to release that memory, we need to call dispose somewhere in the Scala code. If we forget to do that, the JVM will eventually discard the pointer itself rather than releasing the native memory for us, and that causes the memory leak.
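The handle-vs-GC distinction described above can be sketched with a toy model. Everything here is illustrative, not the real MXNet API: NativeHeap simulates C++-side memory with a counter, and FakeNDArray stands in for NDArray, so the leak in the old copy-of-slice pattern becomes directly observable from Scala.

```scala
// Simulated native (C++-side) heap: the JVM GC knows nothing about this counter.
object NativeHeap {
  var live: Int = 0                                  // simulated native bytes in use
  def alloc(bytes: Int): Long = { live += bytes; bytes.toLong }
  def free(handle: Long): Unit = { live -= handle.toInt }
}

// Stand-in for NDArray: a thin JVM wrapper around a native handle.
class FakeNDArray(bytes: Int) {
  private val handle: Long = NativeHeap.alloc(bytes)
  private var disposed = false
  def slice(): FakeNDArray = new FakeNDArray(bytes / 2) // a slice allocates its own handle
  def copy(): FakeNDArray = new FakeNDArray(bytes)
  def dispose(): Unit = if (!disposed) { NativeHeap.free(handle); disposed = true }
}

object Demo {
  // Old pattern: the intermediate slice's handle is dropped but never freed.
  def leakyPredict(nd: FakeNDArray): FakeNDArray =
    nd.slice().copy()

  // Fixed pattern from this PR: dispose the slice once the copy is taken.
  def fixedPredict(nd: FakeNDArray): FakeNDArray = {
    val ndSliced = nd.slice()
    val out = ndSliced.copy()
    ndSliced.dispose()
    out
  }
}
```

Even after the caller disposes of the returned array, leakyPredict leaves the slice's native allocation behind, while fixedPredict returns the simulated heap to its starting level.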

Contributor

@lupesko lupesko left a comment

Some comments.
On a wider note: how can we verify this with automated testing? (unit tests or integration tests)

- list += nd.slice(0, realSize).copy()
+ val ndSliced = nd.slice(0, realSize)
+ list += ndSliced.copy()
+ ndSliced.dispose()
Contributor

Shouldn't we do a try/finally to make sure dispose always happens, even in case of an exception?

Contributor Author

If we're getting an exception here, something has gone very wrong. Most likely the problem would be memory access/allocation, and I'm not confident dispose would work correctly under those conditions.

I'll make the change anyway; although I think it's unlikely to ever help, it definitely won't hurt and is good practice to follow.

Member

Adding try/finally every time we dispose of something feels like a bit of overkill.
Eventually that should be handled by GC.

}
batch.dispose()
Contributor

Same comment as above about using try/finally to make sure dispose always happens, even in case of an exception.

@@ -230,8 +230,11 @@ class FeedForward private(
      val padded = batch.pad
      val realSize = batchSize - padded
      for ((list, nd) <- outputs zip predExec.outputs) {
-       list += nd.slice(0, realSize).copy()
+       val ndSliced = nd.slice(0, realSize)
Contributor

I think we should add a comment explaining why it has to be done this way, and not the more compact way used before. Otherwise, someone may easily revert this change unknowingly in the future.

@lanking520
Member

Hi @lupesko, a single inference is covered by the existing unit test with the MNIST example. The initial bug came from inference serving (a crash after an hour). We tested offline, checked the memory usage, and saw that it stays stable for a long time. That will be hard to track in an automated test. I think we can place this test as a part of Scala Benchmark AI if they can do it in the nightly test.

}
}
} finally {
batch.dispose()
Member

Shouldn't it be the dataArrays that need to be disposed? See loadData and loadDatageneral, where it does a copy from batch to dataArrays.

I think we shouldn't dispose of the original input from the Iterator; for example, the user may want to use the same input on another model.

Contributor Author

We aren't disposing of the original input. When data.next() is called, a slice of the data is made here. This slice is what should be getting disposed of when we dispose of batch.

Member

Thanks Andrew, what about dataArrays?

Contributor Author

Summarizing the conversation we had offline: it appears that the copyTo in loadData copies into memory owned by the Executor predExec. The memory in the executor seems to get reused across predict calls. I think the proper place to handle this would be to add a dispose method to FeedForward and have it dispose of its predExec.
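The proposal above can be sketched in miniature. ExecutorStub and FeedForwardStub are hypothetical stand-ins, not the real MXNet Executor and FeedForward classes; the point is only the ownership relationship being suggested.

```scala
// Stand-in for Executor: owns native output arrays reused across predict calls.
trait ExecutorStub { def dispose(): Unit }

// Stand-in for FeedForward: a model-level dispose delegates to the executor
// it owns, releasing the native memory that individual predict calls reuse.
class FeedForwardStub(predExec: ExecutorStub) {
  def dispose(): Unit = predExec.dispose()
}
```

The design choice here is that the executor's memory has the model's lifecycle (as noted earlier in this thread), so the user frees it once when done with the model rather than per prediction.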

@nswamy nswamy merged commit bbc7a22 into apache:master Jun 14, 2018
anirudh2290 pushed a commit that referenced this pull request Jun 14, 2018
* Fixes Scala memory leak (#10436)

* Replaced the copy and disposed of sliced ndArray to resolve memory leak

* Wrapped disposes in a finally to ensure they are called.
marcoabreu pushed a commit to marcoabreu/incubator-mxnet that referenced this pull request Jun 14, 2018
* Fixes Scala memory leak (apache#10436)

* Replaced the copy and disposed of sliced ndArray to resolve memory leak

* Wrapped disposes in a finally to ensure they are called.
marcoabreu pushed a commit that referenced this pull request Jun 14, 2018
* Fixes Scala memory leak (#10436)

* Replaced the copy and disposed of sliced ndArray to resolve memory leak

* Wrapped disposes in a finally to ensure they are called.
anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this pull request Jun 15, 2018
* Fixes Scala memory leak (apache#10436)

* Replaced the copy and disposed of sliced ndArray to resolve memory leak

* Wrapped disposes in a finally to ensure they are called.
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018
XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018