Error in using GPU Tensorflow Mac M1 when doing neural network model #13

Closed
denvercal1234GitHub opened this issue Jan 30, 2023 · 4 comments

denvercal1234GitHub commented Jan 30, 2023

Hi there,

When I followed the "training_non_default_regression_models" tutorial with the sample data, as shown below, I encountered the errors shown.

Would you mind helping me fix this in my R environment? It does not seem that my TensorFlow is actually using the GPU.

Thank you for your help.

regression_functions <- list(
    XGBoost = fitter_xgboost, # XGBoost
    ## Passed to fitter_nn, e.g. neural networks through keras::fit. See https://keras.rstudio.com/articles/tutorial_basic_regression.html
    NN = fitter_nn,
    SVM = fitter_svm, # SVM
    LASSO2 = fitter_glmnet, # L1-penalized 2nd degree polynomial model
    LM = fitter_linear # Linear model
)

extra_args_regression_params <- list(
     ## Passed to the first element of `regression_functions`, e.g. XGBoost. See ?xgboost for which parameters can be passed through this list
    list(nrounds = 500, eta = 0.05),

    # ## Passed to the second element of `regression_functions`, e.g. neural networks through keras::fit. See https://keras.rstudio.com/articles/tutorial_basic_regression.html
    #MacOS with AMD GPU here. I am using tensorflow for metal as soon as it was launched, with GPU acceleration. Sometimes I get the same message (Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.). But, it still uses the GPU. You can check that by opening Activity Monitor, then pressing Cmd + 3 and Cmd + 4, which shows you GPU and CPU usage.
      list(
		object = { ## Specifies the network's architecture, loss function and optimization method
		model = keras_model_sequential()
		model %>%
		layer_dense(units = backbone_size, activation = "relu", input_shape = backbone_size) %>% 
		layer_dense(units = backbone_size, activation = "relu", input_shape = backbone_size) %>%
		layer_dense(units = 1, activation = "linear")
		model %>%
		compile(loss = "mean_squared_error", optimizer = optimizer_sgd(lr = 0.005))
		serialize_model(model)
		},
		epochs = 1000, ## Number of maximum training epochs. The training is however stopped early if the loss on the validation set does not improve for 20 epochs. This early stopping is hardcoded in fitter_nn.
		validation_split = 0.2, ## Fraction of the training data used to monitor validation loss
		verbose = 0,
		batch_size = 128 ## Size of the minibatches for training.
	),
    # Passed to the third element, SVMs. See help(svm, "e1071") for possible arguments
    list(type = "nu-regression", cost = 8, nu=0.5, kernel="radial"),

    # Passed to the fourth element, fitter_glmnet. This should contain a mandatory argument `degree` which specifies the degree of the polynomial model (1 for linear, 2 for quadratic etc...). Here we use degree = 2, corresponding to our LASSO2 model. Other arguments are passed to getS3method("cv.glmnet", "formula").
    list(alpha = 1, nfolds=10, degree = 2),

    # Passed to the fifth element, fitter_linear. This only accepts a degree argument specifying the degree of the polynomial model. Here we use degree = 1 corresponding to a linear model.
    list(degree = 1)
)

Output

Metal device set to: Apple M1 Max

systemMemory: 64.00 GB
maxCacheSize: 24.00 GB

2023-01-30 14:04:18.688257: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-01-30 14:04:18.688306: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
WARNING:absl:`lr` is deprecated, please use `learning_rate` instead, or use the legacy optimizer, e.g.,tf.keras.optimizers.legacy.SGD.
[[1]]
[[1]]$nrounds
[1] 500

[[1]]$eta
[1] 0.05


[[2]]
[[2]]$object
   [1] 89 48 44 46 0d 0a 1a 0a 00 00 00 00 00 08 08 00 04 00 10 00 00 00 00 00 00 00 00 00 00 00 00 00 ff ff ff
  [36] ff ff ff ff ff 98 54 00 00 00 00 00 00 ff ff ff ff ff ff ff ff 00 00 00 00 00 00 00 00 60 00 00 00 00 00
  [71] 00 00 01 00 00 00 00 00 00 00 88 00 00 00 00 00 00 00 a8 02 00 00 00 00 00 00 01 00 07 00 01 00 00 00 18
 [106] 00 00 00 00 00 00 00 10 00 10 00 00 00 00 00 20 03 00 00 00 00 00 00 68 01 00 00 00 00 00 00 54 52 45 45
 [141] 00 00 01 00 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 00 00 00 00 00 00 00 00 00 18 00 00 00 00 00
 [176] 00 18 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 [211] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 [246] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 [281] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 [316] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 [351] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 [386] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 [421] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 [456] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 [491] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 [526] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 [561] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 [596] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 [631] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 [666] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 48 45 41 50 00 00 00 00 58 00 00 00 00 00 00 00 30 00 00 00
 [701] 00 00 00 00 c8 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 6d 6f 64 65 6c 5f 77 65 69 67 68 74 73 00 00
 [736] 00 6f 70 74 69 6d 69 7a 65 72 5f 77 65 69 67 68 74 73 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 28 00
 [771] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 11 00 10 00 00
 [806] 00 00 00 88 00 00 00 00 00 00 00 a8 02 00 00 00 00 00 00 0c 00 48 00 04 00 00 00 01 00 0e 00 14 00 08 00
 [841] 6b 65 72 61 73 5f 76 65 72 73 69 6f 6e 00 00 00 19 01 01 00 10 00 00 00 10 00 00 00 01 00 00 00 00 00 08
 [876] 00 00 00 00 00 01 00 00 00 00 00 00 00 06 00 00 00 00 08 00 00 00 00 00 00 01 00 00 00 0c 00 40 00 04 00
 [911] 00 00 01 00 08 00 14 00 08 00 62 61 63 6b 65 6e 64 00 19 01 01 00 10 00 00 00 10 00 00 00 01 00 00 00 00
 [946] 00 08 00 00 00 00 00 01 00 00 00 00 00 00 00 0a 00 00 00 00 08 00 00 00 00 00 00 02 00 00 00 0c 00 48 00
 [981] 04 00 00 00 01 00 0d 00 14 00 08 00 6d 6f 64 65 6c 5f 63 6f
 [ reached getOption("max.print") -- omitted 20656 entries ]

[[2]]$epochs
[1] 1000

[[2]]$validation_split
[1] 0.2

[[2]]$verbose
[1] 0

[[2]]$batch_size
[1] 128


[[3]]
[[3]]$type
[1] "nu-regression"

[[3]]$cost
[1] 8

[[3]]$nu
[1] 0.5

[[3]]$kernel
[1] "radial"


[[4]]
[[4]]$alpha
[1] 1

[[4]]$nfolds
[1] 10

[[4]]$degree
[1] 2


[[5]]
[[5]]$degree
[1] 1

if(length(regression_functions) != length(extra_args_regression_params)){
    stop("Number of models and number of lists of hyperparameters mismatch")
}
imputed_data <- infinity_flow(
	regression_functions = regression_functions,
	extra_args_regression_params = extra_args_regression_params,
	path_to_fcs = path_to_fcs,
	path_to_output = path_to_output,
	path_to_intermediary_results = path_to_intermediary_results,
	backbone_selection_file = backbone_selection_file,
	annotation = targets,
	isotype = isotypes,
	input_events_downsampling = input_events_downsampling,
	prediction_events_downsampling = prediction_events_downsampling,
	verbose = TRUE,
	#Note: there is an issue with serialization of the neural networks and socketing since I updated to R-4.0.1. If you want to use neural networks, please make sure to set cores = 1L
	cores = cores,
	neural_networks_seed = 12345
)

Output

Using directories...
	input: /Users/clusteredatom/Documents/DPHIL_DATA/scRNAseq/T230T240T246_CXCR5Project/scRNAseq_Analysis_Scripts_2022Nov22/HD_Flow/Infinity_Flow/basic_usage_tutorial/infinity_flow_example/fcs
	intermediary: /Users/clusteredatom/Documents/DPHIL_DATA/scRNAseq/T230T240T246_CXCR5Project/scRNAseq_Analysis_Scripts_2022Nov22/HD_Flow/Infinity_Flow/basic_usage_tutorial/infinity_flow_example/tmp
	subset: /Users/clusteredatom/Documents/DPHIL_DATA/scRNAseq/T230T240T246_CXCR5Project/scRNAseq_Analysis_Scripts_2022Nov22/HD_Flow/Infinity_Flow/basic_usage_tutorial/infinity_flow_example/tmp/subsetted_fcs
	rds: /Users/clusteredatom/Documents/DPHIL_DATA/scRNAseq/T230T240T246_CXCR5Project/scRNAseq_Analysis_Scripts_2022Nov22/HD_Flow/Infinity_Flow/basic_usage_tutorial/infinity_flow_example/tmp/rds
	annotation: /Users/clusteredatom/Documents/DPHIL_DATA/scRNAseq/T230T240T246_CXCR5Project/scRNAseq_Analysis_Scripts_2022Nov22/HD_Flow/Infinity_Flow/basic_usage_tutorial/infinity_flow_example/tmp/annotation.csv
	output: /Users/clusteredatom/Documents/DPHIL_DATA/scRNAseq/T230T240T246_CXCR5Project/scRNAseq_Analysis_Scripts_2022Nov22/HD_Flow/Infinity_Flow/basic_usage_tutorial/infinity_flow_example/output
Parsing and subsampling input data
	Downsampling to 1000 events per input file
	Concatenating expression matrices
	Writing to disk
Logicle-transforming the data
	Backbone data
	Exploratory data
	Writing to disk
	Transforming expression matrix
	Writing to disk
Harmonizing backbone data
	Scaling expression matrices
	Writing to disk
Fitting regression models
	Randomly selecting 50% of the subsetted input files to fit models
	Fitting...
		XGBoost

  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=06s  
	6 seconds
		NN

  |                                                  | 0 % ~calculating
2023-01-30 14:12:34.372079: W tensorflow/tsl/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
2023-01-30 14:12:34.481941: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2023-01-30 14:12:35.277328: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:418 : NOT_FOUND: could not find registered platform with id: 0x2b93988d0
2023-01-30 14:12:35.277355: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:418 : NOT_FOUND: could not find registered platform with id: 0x2b93988d0
2023-01-30 14:12:35.461522: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:418 : NOT_FOUND: could not find registered platform with id: 0x2b93988d0
2023-01-30 14:12:35.461546: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:418 : NOT_FOUND: could not find registered platform with id: 0x2b93988d0
2023-01-30 14:12:35.465332: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:418 : NOT_FOUND: could not find registered platform with id: 0x2b93988d0
2023-01-30 14:12:35.465353: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:418 : NOT_FOUND: could not find registered platform with id: 0x2b93988d0
Error in py_call_impl(callable, dots$args, dots$keywords) : 
  tensorflow.python.framework.errors_impl.NotFoundError: Graph execution error:
<... omitted ...>ages/keras/optimizers/optimizer_experimental/optimizer.py", line 1166, in _internal_apply_gradients
      return tf.__internal__.distribute.interim.maybe_merge_call(
    File "/Users/clusteredatom/Library/r-miniconda-arm64/envs/r-reticulate/lib/python3.8/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1216, in _distributed_apply_gradients_fn
      distribution.extended.update(
    File "/Users/clusteredatom/Library/r-miniconda-arm64/envs/r-reticulate/lib/python3.8/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1211, in apply_grad_to_update_var
      return self._update_step_xla(grad, var, id(self._var_key(var)))
Node: 'StatefulPartitionedCall_4'
could not find registered platform with id: 0x2b93988d0
	 [[{{node StatefulPartitionedCall_4}}]] [Op:__inference_train_function_609]
See `reticulate::py_last_error()` for details
> sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.2

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

Random number generation:
 RNG:     L'Ecuyer-CMRG 
 Normal:  Inversion 
 Sample:  Rejection 
 
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] tensorflow_2.11.0    keras_2.11.0         e1071_1.7-12         glmnetUtils_1.1.8    infinityFlow_1.3.1   flowCore_2.10.0      openxlsx_4.2.5.1    
 [8] readxl_1.4.1         stringr_1.5.0        ggplot2_3.4.0        patchwork_1.1.2.9000 SeuratObject_4.1.3   Seurat_4.3.0         dplyr_1.0.10        

loaded via a namespace (and not attached):
  [1] plyr_1.8.8             igraph_1.3.5           lazyeval_0.2.2         sp_1.6-0               splines_4.2.2          listenv_0.9.0         
  [7] scattermore_0.8        tfruns_1.5.1           digest_0.6.31          foreach_1.5.2          htmltools_0.5.4        fansi_1.0.4           
 [13] magrittr_2.0.3         tensor_1.5             cluster_2.1.4          ROCR_1.0-11            globals_0.16.2         matrixStats_0.63.0    
 [19] spatstat.sparse_3.0-0  cytolib_2.10.1         colorspace_2.1-0       ggrepel_0.9.2          xfun_0.36              jsonlite_1.8.4        
 [25] progressr_0.13.0       spatstat.data_3.0-0    zeallot_0.1.0          survival_3.5-0         zoo_1.8-11             iterators_1.0.14      
 [31] glue_1.6.2             polyclip_1.10-4        gtable_0.3.1           leiden_0.4.3           future.apply_1.10.0    shape_1.4.6           
 [37] BiocGenerics_0.44.0    abind_1.4-5            scales_1.2.1           DBI_1.1.3              spatstat.random_3.1-3  miniUI_0.1.1.1        
 [43] Rcpp_1.0.10            viridisLite_0.4.1      xtable_1.8-4           reticulate_1.27        matlab_1.0.4           proxy_0.4-27          
 [49] stats4_4.2.2           glmnet_4.1-6           htmlwidgets_1.6.1      httr_1.4.4             RColorBrewer_1.1-3     ellipsis_0.3.2        
 [55] ica_1.0-3              pkgconfig_2.0.3        sass_0.4.5             uwot_0.1.14            deldir_1.0-6           utf8_1.2.2            
 [61] here_1.0.1             tidyselect_1.2.0       rlang_1.0.6            reshape2_1.4.4         later_1.3.0            cachem_1.0.6          
 [67] munsell_0.5.0          cellranger_1.1.0       tools_4.2.2            xgboost_1.7.3.1        cli_3.6.0              generics_0.1.3        
 [73] ggridges_0.5.4         evaluate_0.20          fastmap_1.1.0          yaml_2.3.7             goftest_1.2-3          knitr_1.42            
 [79] fitdistrplus_1.1-8     zip_2.2.2              purrr_1.0.1            RANN_2.6.1             pbapply_1.7-0          future_1.30.0         
 [85] nlme_3.1-161           whisker_0.4.1          mime_0.12              compiler_4.2.2         rstudioapi_0.14        plotly_4.10.1.9000    
 [91] png_0.1-8              spatstat.utils_3.0-1   tibble_3.1.8           bslib_0.4.2            stringi_1.7.12         lattice_0.20-45       
 [97] Matrix_1.5-3           vctrs_0.5.2            pillar_1.8.1           lifecycle_1.0.3.9000   jquerylib_0.1.4        spatstat.geom_3.0-5   
[103] lmtest_0.9-40          RcppAnnoy_0.0.20       data.table_1.14.6      cowplot_1.1.1          irlba_2.3.5.1          raster_3.6-14         
[109] httpuv_1.6.8           R6_2.5.1               promises_1.2.0.1       KernSmooth_2.23-20     gridExtra_2.3          RProtoBufLib_2.10.0   
[115] parallelly_1.34.0      sessioninfo_1.2.2      codetools_0.2-18       MASS_7.3-58.2          assertthat_0.2.1       rprojroot_2.0.3       
[121] withr_2.5.0            sctransform_0.3.5      S4Vectors_0.36.1       parallel_4.2.2         terra_1.7-3            grid_4.2.2            
[127] tidyr_1.3.0            class_7.3-21           rmarkdown_2.20         Rtsne_0.16             spatstat.explore_3.0-5 Biobase_2.58.0        
[133] shiny_1.7.4            base64enc_0.1-3       

ebecht (Owner) commented Feb 2, 2023

Hi @denvercal1234GitHub

I am not familiar enough with tensorflow / keras to see where this error is coming from. Have you tested these packages independently of infinityFlow to see if they work on your machine? Also, have you made sure to disable parallelization by using cores = 1L in infinity_flow()?

Best,
Etienne
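
A standalone check of keras, outside infinityFlow, could look roughly like the sketch below. It is only an illustration: the toy data and the layer sizes are made up for the test and are not part of the tutorial, and it uses `learning_rate` rather than `lr`, since the log above reports `lr` as deprecated. If this small model trains, the TensorFlow installation itself is probably working.

```r
library(keras)
library(tensorflow)

# List the GPU devices TensorFlow can see (an empty list means CPU only).
print(tf$config$list_physical_devices("GPU"))

# Toy regression data, made up for this check (not the tutorial data).
set.seed(1)
x <- matrix(rnorm(1000 * 5), ncol = 5)
y <- x %*% c(1, -2, 0.5, 0, 3) + rnorm(1000, sd = 0.1)

# Small network with the same loss and optimizer family as in the issue.
model <- keras_model_sequential() %>%
  layer_dense(units = 5, activation = "relu", input_shape = 5) %>%
  layer_dense(units = 1, activation = "linear")

model %>% compile(
  loss = "mean_squared_error",
  optimizer = optimizer_sgd(learning_rate = 0.005)
)

# A few epochs are enough to see whether training runs at all.
history <- model %>% fit(
  x, y,
  epochs = 5, batch_size = 128,
  validation_split = 0.2, verbose = 0
)
print(history$metrics$loss)
```

The TensorFlow warning in the log also mentions the legacy optimizer; from R it should be reachable as `tf$keras$optimizers$legacy$SGD(learning_rate = 0.005)`. Whether that changes anything for the error reported here was not tested in this thread.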

denvercal1234GitHub (Author) commented

Thank you @ebecht for all your help and patience with all my crazy questions.

I was finally able to fix the error and run infinity_flow() with the neural network after activating the TensorFlow conda environment before starting R.
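
For readers who prefer to select the environment from inside R rather than in the shell, a rough equivalent with reticulate might look like the sketch below. This is an assumption, not what was done here: the environment name is hypothetical ("r-reticulate" is the one that appears in the traceback paths above), and the call has to happen before keras or tensorflow initialize Python.

```r
library(reticulate)

# Hypothetical environment name; replace with the environment that actually
# holds the TensorFlow install.
env_name <- "r-reticulate"

# Bind to the environment only if conda reports it, so the script does not
# stop on machines where it is missing.
envs <- tryCatch(conda_list()$name, error = function(e) character(0))
if (env_name %in% envs) {
  use_condaenv(env_name, required = TRUE)
}

# Show which Python (and therefore which TensorFlow install) R will use.
print(py_config())
```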

My M1 Mac laptop has 10 cores in total (8 performance and 2 efficiency), so I tried cores = 2L and obtained the outputs below.

Q1. What does the "L" in 2L indicate? Do you recommend using all 8 cores for infinity_flow()?

Q2. Although the output still said "I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support", the pipeline appeared to run fine and the GPU was in fact used (as seen in Activity Monitor). Is the output below what you also observed when you ran the neural network on your data?

[Five screenshots, 2023-03-22 14:09:28 to 14:10:32]

Thanks again!

ebecht (Owner) commented Mar 22, 2023

Hello again,

Q1. The L indicates that the argument is of type integer rather than numeric (double). More cores means it will run faster, and you can use as many as your hardware supports. It should not affect the quality of the results, only the speed of computation.

Q2. It's nice that you managed to use the GPU; I actually could not. Training and using the models should have been much faster then. Good for you!
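
To make the answer to Q1 concrete, here is a small base-R sketch of the difference between `2L` and `2`, plus a way to see how many cores R detects. The variable name `n_cores` is just for illustration.

```r
# 2L is an integer literal; a bare 2 is a double ("numeric").
print(class(2L))                     # "integer"
print(class(2))                      # "numeric"
print(is.integer(2L))                # TRUE
print(is.integer(2))                 # FALSE
print(identical(2L, as.integer(2)))  # TRUE

# Number of logical cores R can see on this machine.
n_cores <- parallel::detectCores()
print(n_cores)
```

Note that the comment in the infinity_flow() call earlier in this thread recommends cores = 1L when neural networks are used, so a higher value is worth testing rather than assuming.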

denvercal1234GitHub (Author) commented

Thanks Etienne!
