
Generator methods not working on custom generators in tensorflow 2.2 #1073

Closed

maju116 opened this issue Jun 25, 2020 · 1 comment
maju116 commented Jun 25, 2020

Hi,

I'm trying to fit some models with custom generators, but the fit_generator()/predict_generator()/evaluate_generator() and generator_next() functions don't seem to work for me.
I'm using:
tensorflow: 2.2.0-rc2 (GPU)
tensorflow R pkg: 2.2.0 (CRAN)
keras R pkg: 2.3.0.0 (CRAN)

Sample code:

library(keras)
library(tidyverse)

input1 <- layer_input(shape = 1)
input2 <- layer_input(shape = 1)

out <- layer_add(list(input1, input2))

model <- keras_model(list(input1, input2), out)

generator <- function() {
  list(list(1, 2), 3)
}

model %>% compile(loss = "mse", optimizer = "sgd")
2020-06-25 13:16:44.945200: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-06-25 13:16:44.975314: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-25 13:16:44.975767: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 1070 computeCapability: 6.1
coreClock: 1.645GHz coreCount: 16 deviceMemorySize: 7.92GiB deviceMemoryBandwidth: 238.66GiB/s
2020-06-25 13:16:44.976043: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-25 13:16:44.977732: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-25 13:16:44.979357: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-06-25 13:16:44.979802: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-06-25 13:16:44.981302: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-06-25 13:16:44.982329: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-06-25 13:16:44.985423: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-25 13:16:44.985694: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-25 13:16:44.986424: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-25 13:16:44.986821: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-06-25 13:16:44.987081: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-06-25 13:16:45.018159: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2799925000 Hz
2020-06-25 13:16:45.018581: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fdfa0000b60 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-06-25 13:16:45.018598: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-06-25 13:16:45.174748: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-25 13:16:45.175242: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5561a8761d60 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-06-25 13:16:45.175262: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce GTX 1070, Compute Capability 6.1
2020-06-25 13:16:45.175526: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-25 13:16:45.175940: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 1070 computeCapability: 6.1
coreClock: 1.645GHz coreCount: 16 deviceMemorySize: 7.92GiB deviceMemoryBandwidth: 238.66GiB/s
2020-06-25 13:16:45.176037: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-25 13:16:45.176073: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-25 13:16:45.176090: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-06-25 13:16:45.176107: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-06-25 13:16:45.176137: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-06-25 13:16:45.176156: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-06-25 13:16:45.176200: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-25 13:16:45.176287: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-25 13:16:45.176720: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-25 13:16:45.177085: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-06-25 13:16:45.177143: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-25 13:16:45.177807: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-25 13:16:45.177822: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]      0 
2020-06-25 13:16:45.177846: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0:   N 
2020-06-25 13:16:45.177944: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-25 13:16:45.178342: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-25 13:16:45.180569: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7029 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0, compute capability: 6.1)

generator_next(generator)
Error in py_has_attr_impl(x, name) : 
  Cannot convert object to an environment: [type=closure; target=ENVSXP].

model %>% predict(list(1, 2)) # Works
     [,1]
[1,]    3

model %>% predict_generator(generator, steps = 1) # Freezes

model %>% fit_generator(generator, steps_per_epoch = 10, validation_data = list(list(1, 2), 3)) # Freezes
1/10 [==>...........................] - ETA: 0s - loss: 0.0000e+00 # Frozen on this step, RStudio non responsive

Custom generators work fine when I use them directly in Python.
With more advanced generators, on a model with custom losses and metrics, I get:
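For reference, a minimal Python sketch of an equivalent generator (the constant batch from the repro above; the array shapes are my assumption) of the kind that can be passed to model.fit() from Python directly:

```python
import numpy as np

def generator():
    """Yield the same constant batch forever: two scalar inputs and one target."""
    while True:
        yield ([np.array([[1.0]]), np.array([[2.0]])], np.array([[3.0]]))
```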

Error in py_call_impl(callable, dots$args, dots$keywords) : 
  RuntimeError: in user code:

    /home/maju116/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py:878 test_function  *
        outputs = self.distribute_strategy.run(
    /home/maju116/anaconda3/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py:951 run  **
        return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
    /home/maju116/anaconda3/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py:2290 call_for_each_replica
        return self._call_for_each_replica(fn, args, kwargs)
    /home/maju116/anaconda3/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py:2649 _call_for_each_replica
        return fn(*args, **kwargs)
    /home/maju116/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py:849 test_step  **
        y, y_pred, sample_weight, regularization_losses=self.losses)
    /home/maju116/anaconda3/lib/python3.7/site-packag 


26. | stop(structure(list(message = "RuntimeError: in user code:


    /home/maju116/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py:878 test_function  *
        outputs = self.distribute_strategy.run(
    /home/maju116/anaconda3/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py:951 run  **
        return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
    /home/maju116/anaconda3/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py:2290 call_for_each_replica
        return self._call_for_each_replica(fn, args, kwargs)
    /home/maju116/anaconda3/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py:2649 _call_for_each_replica
        return fn(*args, **kwargs)
    /home/maju116/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py:849 test_step  **
        y, y_pred, sample_weight, regularization_losses=self.losses)
    /home/maju116/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/engine/compile_utils.py:204 __call__
        loss_value = loss_obj(y_t, y_p, sample_weight=sw)
    /home/maju116/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/losses.py:143 __call__
        losses = self.call(y_true, y_pred)
    /home/maju116/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/losses.py:246 call
        return self.fn(y_true, y_pred, **self._fn_kwargs)
    <string>:4 fn
        
    /home/maju116/R/x86_64-pc-linux-gnu-library/4.0/reticulate/python/rpytools/call.py:21 python_function
        raise RuntimeError(res[kErrorKey])

    RuntimeError: Evaluation error: ValueError: None values not supported..
",      call = py_call_impl(callable, dots$args, dots$keywords),      cppstack = structure(list(file = "", line = -1L, stack = c("/home/maju116/R/x86_64-pc-linux-gnu-library/4.0/reticulate/libs/reticulate.so(Rcpp::exception::exception(char const*, bool)+0x7b) [0x7f1e580a78bb]",      "/home/maju116/R/x86_64-pc-linux-gnu-library/4.0/reticulate/libs/reticulate.so(Rcpp::stop(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x2b) [0x7f1e580a792b]",  ...
25. wrapper at func_graph.py#968
24. wrapped_fn at def_function.py#441
23. func_graph_from_py_func at func_graph.py#981
22. _create_graph_function at function.py#2667
21. _define_function_with_shape_relaxation at function.py#2706
20. _maybe_define_function at function.py#2774
19. __call__ at function.py#2419
18. _call at def_function.py#611
17. __call__ at def_function.py#580
16. evaluate at training.py#1018
15. _method_wrapper at training.py#66
14. evaluate_generator at training.py#1444
13. new_func at deprecation.py#324
12. (structure(function (...) 
{
    dots <- py_resolve_dots(list(...))
    result <- py_call_impl(callable, dots$args, dots$keywords) ...
11. do.call(func, args)
10. call_generator_function(object$evaluate_generator, args)
9. evaluate_generator(., test_generator, steps = 1)
8. function_list[[k]](value)
7. withVisible(function_list[[k]](value))
6. freduce(value, `_function_list`)
5. `_fseq`(`_lhs`)
4. eval(quote(`_fseq`(`_lhs`)), env, env)
3. eval(quote(`_fseq`(`_lhs`)), env, env)
2. withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
1. test_yolo %>% evaluate_generator(test_generator, steps = 1)

Is there another way of using generators with TensorFlow 2.2, or am I missing something?

> reticulate::py_config()
python:         /home/maju116/anaconda3/bin/python3
libpython:      /home/maju116/anaconda3/lib/libpython3.7m.so
pythonhome:     /home/maju116/anaconda3:/home/maju116/anaconda3
version:        3.7.6 (default, Jan  8 2020, 19:59:22)  [GCC 7.3.0]
numpy:          /home/maju116/anaconda3/lib/python3.7/site-packages/numpy
numpy_version:  1.18.1
tensorflow:     /home/maju116/anaconda3/lib/python3.7/site-packages/tensorflow

NOTE: Python version was forced by RETICULATE_PYTHON

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /home/maju116/anaconda3/lib/libmkl_rt.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=pl_PL.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=pl_PL.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=pl_PL.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=pl_PL.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] XML_3.99-0.3     abind_1.4-5      progress_1.2.2   tensorflow_2.2.0
 [5] keras_2.3.0.0    forcats_0.5.0    stringr_1.4.0    dplyr_1.0.0     
 [9] purrr_0.3.4      readr_1.3.1      tidyr_1.1.0      tibble_3.0.1    
[13] ggplot2_3.3.1    tidyverse_1.3.0 

loaded via a namespace (and not attached):
 [1] reticulate_1.16   platypus_0.1.0    tinytex_0.23      tidyselect_1.1.0 
 [5] xfun_0.14         haven_2.3.1       lattice_0.20-41   colorspace_1.4-1 
 [9] vctrs_0.3.1       generics_0.0.2    base64enc_0.1-3   blob_1.2.1       
[13] rlang_0.4.6       pillar_1.4.4      withr_2.2.0       glue_1.4.1       
[17] DBI_1.1.0         dbplyr_1.4.4      modelr_0.1.8      readxl_1.3.1     
[21] lifecycle_0.2.0   munsell_0.5.0     gtable_0.3.0      cellranger_1.1.0 
[25] rvest_0.3.5       tfruns_1.4        fansi_0.4.1       broom_0.5.6      
[29] Rcpp_1.0.4.6      scales_1.1.1      backports_1.1.7   jsonlite_1.6.1   
[33] fs_1.4.1          hms_0.5.3         stringi_1.4.6     grid_4.0.2       
[37] cli_2.0.2         tools_4.0.2       magrittr_1.5      crayon_1.3.4     
[41] whisker_0.4       pkgconfig_2.0.3   zeallot_0.1.0     ellipsis_0.3.1   
[45] Matrix_1.2-18     xml2_1.3.2        prettyunits_1.1.1 reprex_0.3.0     
[49] lubridate_1.7.9   assertthat_0.2.1  httr_1.4.1        rstudioapi_0.11  
[53] R6_2.4.1          nlme_3.1-147      compiler_4.0.2 
Member

dfalbel commented Jun 25, 2020

@maju116 Thanks! This is a known issue with TF >= 2.1, see #986.
I have spent some time debugging this, and it looks like TensorFlow always evaluates the generators in a different thread, which leads to a deadlock somewhere.

The recommended workaround at the moment is to use keras::train_on_batch() and write your own training loop that reads from the generator.
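A minimal sketch of that workaround, using the toy model and generator from the repro above (the epoch/step counts are arbitrary; this is one way to structure the loop, not the only one):

```r
library(keras)

# Same toy model and constant-batch generator as in the repro.
input1 <- layer_input(shape = 1)
input2 <- layer_input(shape = 1)
model <- keras_model(list(input1, input2), layer_add(list(input1, input2)))
model %>% compile(loss = "mse", optimizer = "sgd")

generator <- function() {
  list(list(1, 2), 3)
}

epochs <- 2
steps_per_epoch <- 10

for (epoch in seq_len(epochs)) {
  for (step in seq_len(steps_per_epoch)) {
    # The generator is called on the R side, so no Python thread is involved.
    batch <- generator()
    loss <- model %>% train_on_batch(x = batch[[1]], y = batch[[2]])
  }
  cat(sprintf("epoch %d - loss: %f\n", epoch, loss))
}
```

Evaluation can be handled the same way with keras::test_on_batch() inside a loop over validation batches.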
