NDArray.set() fails on linux with " Inplace update to inference tensor outside InferenceMode" #1774

Closed
demq opened this issue Jul 6, 2022 · 7 comments
Labels
bug Something isn't working

Comments

@demq
Contributor

demq commented Jul 6, 2022

Description

NDArray.set() fails with the following message when updating tensors in the post-processing stage:

Inplace update to inference tensor outside InferenceMode is not allowed.You can make a clone to get a normal tensor before doing inplace update.See pytorch/rfcs#17 for more details.

This behavior is observed when running the code with the PyTorch engine on a Linux machine; the same code runs without any errors on a Mac M1 Pro. The workaround is to first duplicate the tensor by calling NDArray.duplicate() and then perform the .set() on the new tensor.

Expected Behavior

The PyTorch implementation of NDArray should either duplicate the tensor when it is modified outside of InferenceMode, or make these tensors explicitly immutable.

Error Message

Exception in thread "main" ai.djl.translate.TranslateException: ai.djl.engine.EngineException: Inplace update to inference tensor outside InferenceMode is not allowed.You can make a clone to get a normal tensor before doing inplace update.See pytorch/rfcs#17 for more details.
Caused by: ai.djl.engine.EngineException: Inplace update to inference tensor outside InferenceMode is not allowed.You can make a clone to get a normal tensor before doing inplace update.See pytorch/rfcs#17 for more details.
at ai.djl.pytorch.jni.PyTorchLibrary.torchMaskedPut(Native Method)
at ai.djl.pytorch.jni.JniUtils.booleanMaskSet(JniUtils.java:416)
at ai.djl.pytorch.engine.PtNDArrayIndexer.set(PtNDArrayIndexer.java:82)
at ai.djl.ndarray.index.NDArrayIndexer.set(NDArrayIndexer.java:157)
at ai.djl.ndarray.NDArray.set(NDArray.java:469)
at ai.djl.ndarray.NDArray.set(NDArray.java:490)
at processOutput(PtBertQATranslator.java:116)

How to Reproduce?

Create a custom QATranslator and override the processOutput() method, for example:

    public List<QAResult> processOutput(TranslatorContext ctx, NDList list) {
        NDManager manager = ctx.getNDManager();
        NDArray start_logits = list.get(0);
        // Mask of token positions to suppress (all false by default)
        boolean[] bad_tokens_mask = new boolean[128];
        NDArray nd_bad_tokens_mask = manager.create(bad_tokens_mask);
        // This in-place update throws the EngineException on Linux
        start_logits.set(nd_bad_tokens_mask, -10000.);

Steps to reproduce


Create a QA predictor using a model based on the custom translator and the "PyTorch" engine. Run predictor.predict() on a Linux machine.

What have you tried to solve it?

Making a duplicate of the "output" tensors resolves the issue: NDArray startLogits = list.get(0).duplicate();
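
The duplicate-then-set pattern can be sketched without any engine at all. The block below is a toy model, not DJL or PyTorch code: ToyTensor, its inferenceTensor flag and the maskLogits helper are hypothetical names invented for illustration, and the exception only imitates the engine error quoted above.

```java
// Toy stand-in for an engine tensor; NOT the DJL or PyTorch API.
final class ToyTensor {
    private final double[] data;
    // true models a tensor that was "created under InferenceMode"
    private final boolean inferenceTensor;

    ToyTensor(double[] data, boolean inferenceTensor) {
        this.data = data.clone();
        this.inferenceTensor = inferenceTensor;
    }

    // In-place masked update; rejected for inference tensors, as in the report.
    void set(boolean[] mask, double value) {
        if (inferenceTensor) {
            throw new IllegalStateException(
                    "Inplace update to inference tensor outside InferenceMode is not allowed.");
        }
        if (mask.length != data.length) {
            throw new IllegalArgumentException("mask length must match tensor length");
        }
        for (int i = 0; i < data.length; i++) {
            if (mask[i]) {
                data[i] = value;
            }
        }
    }

    // Returns a copy that is a normal (mutable) tensor.
    ToyTensor duplicate() {
        return new ToyTensor(data, false);
    }

    double get(int i) {
        return data[i];
    }
}

final class DuplicateBeforeSetDemo {
    // Hypothetical post-processing helper: masks "bad" positions on a copy.
    static ToyTensor maskLogits(ToyTensor modelOutput, boolean[] badTokens) {
        ToyTensor logits = modelOutput.duplicate(); // the workaround from the report
        logits.set(badTokens, -10000.0);
        return logits;
    }

    public static void main(String[] args) {
        ToyTensor output = new ToyTensor(new double[] {0.5, 1.5, 2.5}, true);
        boolean[] bad = {true, false, true};
        ToyTensor masked = maskLogits(output, bad);
        System.out.println(masked.get(0) + " " + masked.get(1) + " " + masked.get(2));
    }
}
```

Calling set() directly on the model output in this sketch throws, while the duplicated tensor accepts the update and the original output stays untouched.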

Environment Info

Please run the command ./gradlew debugEnv from the root directory of DJL (if necessary, clone DJL first). It will output information about your system, environment, and installation that can help us debug your issue. Paste the output of the command below:

./gradlew debugEnv
Starting a Gradle Daemon (subsequent builds will be faster)

> Task :integration:debugEnv
[DEBUG] - Registering EngineProvider: XGBoost
[DEBUG] - Registering EngineProvider: MXNet
[DEBUG] - Registering EngineProvider: PyTorch
[DEBUG] - Registering EngineProvider: TensorFlow
[DEBUG] - Found default engine: MXNet
----------- System Properties -----------
java.specification.version: 17
sun.jnu.encoding: UTF-8
java.class.path: /mnt/ssd4tb/user/Software/djl/integration/build/classes/java/main:/mnt/ssd4tb/user/Software/djl/integration/build/resources/main:/home/user/.gradle/caches/modules-2/files-2.1/commons-cli/commons-cli/1.5.0/dc98be5d5390230684a092589d70ea76a147925c/commons-cli-1.5.0.jar:/home/user/.gradle/caches/modules-2/files-2.1/org.apache.logging.log4j/log4j-slf4j-impl/2.17.2/183f7c95fc981f3e97d008b363341343508848e/log4j-slf4j-impl-2.17.2.jar:/mnt/ssd4tb/user/Software/djl/basicdataset/build/libs/basicdataset-0.18.0-SNAPSHOT.jar:/mnt/ssd4tb/user/Software/djl/model-zoo/build/libs/model-zoo-0.18.0-SNAPSHOT.jar:/mnt/ssd4tb/user/Software/djl/testing/build/libs/testing-0.18.0-SNAPSHOT.jar:/home/user/.gradle/caches/modules-2/files-2.1/org.testng/testng/7.5/1416a607fae667c14e390b484e8d02b5824c0674/testng-7.5.jar:/mnt/ssd4tb/user/Software/djl/engines/mxnet/mxnet-model-zoo/build/libs/mxnet-model-zoo-0.18.0-SNAPSHOT.jar:/mnt/ssd4tb/user/Software/djl/engines/pytorch/pytorch-model-zoo/build/libs/pytorch-model-zoo-0.18.0-SNAPSHOT.jar:/mnt/ssd4tb/user/Software/djl/engines/pytorch/pytorch-jni/build/libs/pytorch-jni-1.11.0-0.18.0-SNAPSHOT.jar:/mnt/ssd4tb/user/Software/djl/engines/tensorflow/tensorflow-model-zoo/build/libs/tensorflow-model-zoo-0.18.0-SNAPSHOT.jar:/mnt/ssd4tb/user/Software/djl/engines/ml/xgboost/build/libs/xgboost-0.18.0-SNAPSHOT.jar:/mnt/ssd4tb/user/Software/djl/engines/mxnet/mxnet-engine/build/libs/mxnet-engine-0.18.0-SNAPSHOT.jar:/mnt/ssd4tb/user/Software/djl/engines/pytorch/pytorch-engine/build/libs/pytorch-engine-0.18.0-SNAPSHOT.jar:/mnt/ssd4tb/user/Software/djl/engines/tensorflow/tensorflow-engine/build/libs/tensorflow-engine-0.18.0-SNAPSHOT.jar:/mnt/ssd4tb/user/Software/djl/api/build/libs/api-0.18.0-SNAPSHOT.jar:/home/user/.gradle/caches/modules-2/files-2.1/org.slf4j/slf4j-api/1.7.36/6c62681a2f655b49963a5983b8b0950a6120ae14/slf4j-api-1.7.36.jar:/home/user/.gradle/caches/modules-2/files-2.1/org.apache.logging.log4j/log4j-core/2.17.2/fa43ba4467f5300b16d1e0742934
149bfc5ac564/log4j-core-2.17.2.jar:/home/user/.gradle/caches/modules-2/files-2.1/org.apache.logging.log4j/log4j-api/2.17.2/f42d6afa111b4dec5d2aea0fe2197240749a4ea6/log4j-api-2.17.2.jar:/home/user/.gradle/caches/modules-2/files-2.1/org.apache.commons/commons-csv/1.9.0/b59d8f64cd0b83ee1c04ff1748de2504457018c1/commons-csv-1.9.0.jar:/home/user/.gradle/caches/modules-2/files-2.1/com.google.code.findbugs/jsr305/3.0.1/f7be08ec23c21485b9b5a1cf1654c2ec8c58168d/jsr305-3.0.1.jar:/home/user/.gradle/caches/modules-2/files-2.1/com.beust/jcommander/1.78/a3927de9bd6f351429bcf763712c9890629d8f51/jcommander-1.78.jar:/home/user/.gradle/caches/modules-2/files-2.1/org.webjars/jquery/3.5.1/2392938e374f561c27c53872bdc9b6b351b6ba34/jquery-3.5.1.jar:/home/user/.gradle/caches/modules-2/files-2.1/ml.dmlc/xgboost4j_2.12/1.6.0/4623e78f614c998b4600c1cc58441ce06d80ba49/xgboost4j_2.12-1.6.0.jar:/home/user/.gradle/caches/modules-2/files-2.1/commons-logging/commons-logging/1.2/4bfc12adfe4842bf07b657f0369c4cb522955686/commons-logging-1.2.jar:/home/user/.gradle/caches/modules-2/files-2.1/com.google.code.gson/gson/2.9.0/8a1167e089096758b49f9b34066ef98b2f4b37aa/gson-2.9.0.jar:/home/user/.gradle/caches/modules-2/files-2.1/net.java.dev.jna/jna/5.11.0/27770efb6329f092f895c7329662d1aa8ee8c0ac/jna-5.11.0.jar:/home/user/.gradle/caches/modules-2/files-2.1/org.apache.commons/commons-compress/1.21/4ec95b60d4e86b5c95a0e919cb172a0af98011ef/commons-compress-1.21.jar:/mnt/ssd4tb/user/Software/djl/engines/tensorflow/tensorflow-api/build/libs/tensorflow-api-0.18.0-SNAPSHOT.jar:/home/user/.gradle/caches/modules-2/files-2.1/org.tensorflow/tensorflow-core-api/0.4.0/2ac35ca087607cce0e5419953cc1ef0c3a5edaea/tensorflow-core-api-0.4.0.jar:/home/user/.gradle/caches/modules-2/files-2.1/org.bytedeco/javacpp/1.5.6/1f18a820aadd943577b0b372554f9e35e1232e25/javacpp-1.5.6.jar:/home/user/.gradle/caches/modules-2/files-2.1/com.google.protobuf/protobuf-java/3.19.2/e958ce38f96b612d3819ff1c753d4d70609aea74/protobuf-java-3.19.2.jar:/home/
user/.gradle/caches/modules-2/files-2.1/org.tensorflow/ndarray/0.3.3/1b6d8cc3e3762f6e465b884580d9fc17ab7aeb4/ndarray-0.3.3.jar
java.vm.vendor: Red Hat, Inc.
sun.arch.data.model: 64
user.variant: 
java.vendor.url: https://www.redhat.com/
java.vm.specification.version: 17
os.name: Linux
sun.java.launcher: SUN_STANDARD
sun.boot.library.path: /usr/lib/jvm/java-17-openjdk-17.0.3.0.7-1.fc36.x86_64/lib:/usr/lib/jvm/java-17-openjdk-17.0.3.0.7-1.fc36.x86_64/lib
sun.java.command: ai.djl.integration.util.DebugEnvironment
jdk.debug: release
sun.cpu.endian: little
org.gradle.appname: gradlew
user.language: en
java.specification.vendor: Oracle Corporation
java.version.date: 2022-04-19
java.home: /usr/lib/jvm/java-17-openjdk-17.0.3.0.7-1.fc36.x86_64
ai.djl.logging.level: debug
org.gradle.internal.http.connectionTimeout: 60000
file.separator: /
java.vm.compressedOopsMode: Zero based
line.separator: 

java.vm.specification.vendor: Oracle Corporation
java.specification.name: Java Platform API Specification
sun.management.compiler: HotSpot 64-Bit Tiered Compilers
java.runtime.version: 17.0.3+7
path.separator: :
os.version: 5.17.12-300.fc36.x86_64
java.runtime.name: OpenJDK Runtime Environment
file.encoding: UTF-8
java.vm.name: OpenJDK 64-Bit Server VM
java.vendor.version: 21.9
java.vendor.url.bug: https://bugzilla.redhat.com/enter_bug.cgi?product=Fedora&component=java-17-openjdk&version=36
java.io.tmpdir: /tmp
org.gradle.internal.http.socketTimeout: 120000
java.version: 17.0.3
user.dir: /mnt/ssd4tb/user/Software/djl/integration
os.arch: amd64
java.vm.specification.name: Java Virtual Machine Specification
native.encoding: UTF-8
java.library.path: /usr/local/cuda-10.2/lib64:/usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
java.vm.info: mixed mode, sharing
java.vendor: Red Hat, Inc.
java.vm.version: 17.0.3+7
sun.io.unicode.encoding: UnicodeLittle
library.jansi.path: /home/user/.gradle/native/jansi/1.18/linux64
java.class.version: 61.0
org.gradle.internal.publish.checksums.insecure: true

--------- Environment Variables ---------
PATH: /usr/local/cuda-10.2/bin:/usr/lib64/ccache:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin
LD_LIBRARY_PATH: /usr/local/cuda-10.2/lib64
-------------- Directories --------------
temp directory: /tmp
DJL cache directory: /home/user/.djl.ai
Engine cache directory: /home/user/.djl.ai

------------------ CUDA -----------------
GPU Count: 1
CUDA: 102
ARCH: 52
GPU(0) memory used: 295370752 bytes

----------------- Engines ---------------
DJL version: 0.18.0
Default Engine: MXNet
[WARN ] - No matching cuda flavor for linux found: cu102mkl/sm_52.
[DEBUG] - Loading mxnet library from: /home/user/.djl.ai/mxnet/1.9.0-mkl-linux-x86_64/libmxnet.so
Default Device: cpu()
PyTorch: 2
MXNet: 0
XGBoost: 10
TensorFlow: 3

--------------- Hardware --------------
Available processors (cores): 48
Byte Order: LITTLE_ENDIAN
Free memory (bytes): 2084036752
Maximum memory (bytes): 32178700288
Total memory available to JVM (bytes): 2147483648
Heap committed: 2147483648
Heap nonCommitted: 31391744
GCC: 
gcc (GCC) 12.1.1 20220507 (Red Hat 12.1.1-1)
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


BUILD SUCCESSFUL in 8s
44 actionable tasks: 1 executed, 43 up-to-date
@demq demq added the bug Something isn't working label Jul 6, 2022
@KexinFeng
Contributor

KexinFeng commented Jul 12, 2022

@demq
The stack trace in the error message shows that you are using the old setter feature.

at ai.djl.pytorch.jni.PyTorchLibrary.torchMaskedPut(Native Method)
at ai.djl.pytorch.jni.JniUtils.booleanMaskSet(JniUtils.java:416)
at ai.djl.pytorch.engine.PtNDArrayIndexer.set(PtNDArrayIndexer.java:82)
at ai.djl.ndarray.index.NDArrayIndexer.set(NDArrayIndexer.java:157)
at ai.djl.ndarray.NDArray.set(NDArray.java:469)
at ai.djl.ndarray.NDArray.set(NDArray.java:490)
at processOutput(PtBertQATranslator.java:116)

Could you try the setter that takes an NDIndex? It uses the new feature.

start_logits.set(new NDIndex("{}", nd_bad_tokens_mask), -10000.);

This is available in version 0.18.0, which was just released.

KexinFeng added a commit that referenced this issue Jul 13, 2022
1. Fix issue 1773 and issue 1774
In NDArray.set(NDArray index, Number value) add the feature of setting array values with integer indices, as shown in the use case #1773 as well as #1774.

2. Fix an issue in the document of built-from-source
@demq
Contributor Author

demq commented Jul 14, 2022

The behavior does not change when I use NDArray.set(NDIndex, Number). The error comes from the way PyTorch restricts the modification of tensors after inference:
Caused by: ai.djl.engine.EngineException: Inplace update to inference tensor outside InferenceMode is not allowed.You can make a clone to get a normal tensor before doing inplace update.See pytorch/rfcs#17 for more details.

@KexinFeng
Contributor

KexinFeng commented Jul 14, 2022

I see. OK, if this is a bug in PyTorch, can you report an issue to PyTorch to see whether they can solve it on their side?

@frankfliu
Contributor

@demq
Are you able to reproduce the issue in python?

@demq
Contributor Author

demq commented Jul 15, 2022

I need to clarify the reason I think this issue is a bug in DJL.

  1. PyTorch behaves as expected, since tensors created while in "c10::InferenceMode" are made immutable outside of "InferenceMode": https://pytorch.org/cppdocs/notes/inference_mode.html

DJL appears to be invoking the "InferenceMode" for the "newer" version of torch: https://github.com/deepjavalibrary/djl/blob/master/engines/pytorch/pytorch-native/src/main/native/ai_djl_pytorch_jni_PyTorchLibrary_inference.cc

struct JITCallGuard {
#ifdef V1_10_X
  torch::autograd::AutoGradMode no_autograd_guard{false};
  torch::NoGradGuard no_grad;
#else
  c10::InferenceMode guard;
  torch::jit::GraphOptimizerEnabledGuard no_optimizer_guard{false};
#endif
};

V1_10_X is defined when PT_OLD_VERSION is set, which in turn is set in
https://github.com/deepjavalibrary/djl/blob/master/engines/pytorch/pytorch-native/build.cmd

if "%VERSION%" == "1.10.0" (
    set PT_OLD_VERSION=1
)
if "%VERSION%" == "1.9.1" (
    set PT_OLD_VERSION=1
)

I suppose the M1 build of DJL is compiled with V1_10_X defined, so it does not use the c10::InferenceMode guard, while the Linux build does.
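
The version gate described above can be restated as a small truth table. The sketch below is plain Java that only mirrors the quoted build.cmd and JITCallGuard snippets. It assumes the build scripts for other platforms gate on the same two versions, which the thread does not confirm. GuardSelection and its methods are hypothetical names, and the strings merely label the guards; nothing here calls libtorch.

```java
final class GuardSelection {
    // Mirrors the quoted build.cmd: only these versions set PT_OLD_VERSION.
    static boolean isOldVersion(String ptVersion) {
        return "1.10.0".equals(ptVersion) || "1.9.1".equals(ptVersion);
    }

    // Mirrors the quoted JITCallGuard: old builds take the NoGradGuard branch,
    // all other builds take the c10::InferenceMode branch.
    static String guardFor(String ptVersion) {
        return isOldVersion(ptVersion) ? "torch::NoGradGuard" : "c10::InferenceMode";
    }

    // Outputs produced under InferenceMode reject in-place updates afterwards;
    // outputs produced under NoGradGuard are ordinary tensors.
    static boolean outputsMutableAfterInference(String ptVersion) {
        return isOldVersion(ptVersion);
    }

    public static void main(String[] args) {
        for (String v : new String[] {"1.9.1", "1.10.0", "1.11.0"}) {
            System.out.println(v + " -> " + guardFor(v)
                    + ", mutable outputs: " + outputsMutableAfterInference(v));
        }
    }
}
```

The environment dump above lists pytorch-jni-1.11.0, which under this reading falls on the c10::InferenceMode branch.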

  2. The NDArray.set() documentation does not mention this issue: https://javadoc.io/static/ai.djl/api/0.18.0/ai/djl/ndarray/NDArray.html#set(ai.djl.ndarray.index.NDIndex,ai.djl.ndarray.NDArray)

DJL needs to either document this behavior or make sure that tensors can be modified after inference in the PyTorch implementation, so that the function behaves the same for all engines.
This can be done, for example, by:

  • Sub-optimal: Switching back from c10::InferenceMode guard; to torch::NoGradGuard no_grad;
  • Returning duplicates of the tensors, with the duplicates created outside of InferenceMode
  • Doing something better than what I have proposed.
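
The second option, duplicating on the library side, might look roughly like the sketch below. It is a toy model with hypothetical names (ToyOutput, PostProcessor, DuplicatingRunner) and is not how DJL is implemented. It only illustrates handing mutable copies to user post-processing so that user code needs no change.

```java
// Toy stand-in for an engine output tensor; not the DJL or PyTorch API.
final class ToyOutput {
    final double[] data;
    final boolean inferenceTensor;

    ToyOutput(double[] data, boolean inferenceTensor) {
        this.data = data.clone();
        this.inferenceTensor = inferenceTensor;
    }

    // Copy that behaves like a normal, mutable tensor.
    ToyOutput duplicate() {
        return new ToyOutput(data, false);
    }

    // In-place update; rejected for inference tensors.
    void fill(double value) {
        if (inferenceTensor) {
            throw new IllegalStateException("inference tensor is read-only");
        }
        for (int i = 0; i < data.length; i++) {
            data[i] = value;
        }
    }
}

// Hypothetical user hook, loosely analogous to a translator's processOutput().
interface PostProcessor {
    double[] apply(ToyOutput[] outputs);
}

final class DuplicatingRunner {
    // Library-side option: pass mutable copies to post-processing
    // instead of the raw inference outputs.
    static double[] run(ToyOutput[] rawOutputs, PostProcessor post) {
        ToyOutput[] copies = new ToyOutput[rawOutputs.length];
        for (int i = 0; i < rawOutputs.length; i++) {
            copies[i] = rawOutputs[i].duplicate();
        }
        return post.apply(copies);
    }

    public static void main(String[] args) {
        ToyOutput[] raw = {new ToyOutput(new double[] {0.25, 0.75}, true)};
        double[] result = run(raw, outs -> {
            outs[0].fill(-1.0); // would throw on the raw output
            return outs[0].data;
        });
        System.out.println(result[0] + " " + result[1]);
    }
}
```

The trade-off in this sketch is one extra copy per output for every caller, including those that never modify the outputs.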

@KexinFeng
Contributor

KexinFeng commented Jul 17, 2022

@demq Thank you so much for this detailed investigation! The purpose of the InferenceMode guard is to keep the array from being changed by autograd during inference, and we should try to stay consistent with that. In your case, however, where you want to modify the inference array, we will need to think about how to resolve it.

@KexinFeng
Contributor

KexinFeng commented Jul 21, 2022

I have just updated the documentation of processOutput to note this behaviour, so users implementing a post-processor will see this link.

I didn't add duplication inside DJL but left it to users, since it is good to keep the default behaviour the same as the engine's.
