Questions about running DeepSpeech2 on multiple machines with the v2 code #2920

Closed
THUHJ opened this issue Jul 17, 2017 · 8 comments
THUHJ commented Jul 17, 2017

When running a multi-machine experiment with DeepSpeech2, I changed the trainer = paddle.trainer.SGD arguments to set is_local=False and also set pserver_spec. At runtime it reports:

Traceback (most recent call last):
  File "train.py", line 360, in <module>
    main()
  File "train.py", line 356, in main
    train()
  File "train.py", line 350, in train
    feeding=train_generator.feeding)
  File "/usr/local/lib/python2.7/dist-packages/paddle/v2/trainer.py", line 132, in train
    self.__pserver_spec__)
  File "/usr/local/lib/python2.7/dist-packages/paddle/v2/optimizer.py", line 79, in create_updater
    pserver_spec)
  File "/usr/local/lib/python2.7/dist-packages/paddle/v2/optimizer.py", line 52, in __create_new_remote_updater__
    self.__opt_conf__, pserver_spec)
  File "/usr/local/lib/python2.7/dist-packages/py_paddle/swig_paddle.py", line 2179, in createNewRemoteUpdater
    return _swig_paddle.ParameterUpdater_createNewRemoteUpdater(config, pserverSpec)
RuntimeError:  

There is nothing after the RuntimeError. What could be causing this?
Or is multi-machine training currently only possible with v1-style code? I could not find any documentation on v2 multi-machine training.


Yancey1989 commented Jul 17, 2017

  1. Did you start the parameter server?
  2. At startup, the v2 trainer needs the PServer address and related arguments added to the paddle.init call, for example:

paddle.init(use_gpu=False,
            pservers="127.0.0.1:7164",
            port=7164,
            num_gradient_servers=${YOUR TRAINER COUNT},
            ports_num=1,
            ports_num_for_sparse=1)

BTW, we will add the v2 multi-machine documentation as soon as possible.
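For the trainer and the parameter server to connect, the flags used to start the pserver have to mirror the paddle.init arguments above. As a hedged sketch only (the port, NIC name, and trainer count here are placeholders, not values confirmed in this issue):

```shell
# Start one parameter server; each flag mirrors a paddle.init argument.
paddle pserver --port 7164 \
    --ports_num 1 \
    --ports_num_for_sparse 1 \
    --num_gradient_servers 1 \
    --nics eth0

# Then launch the trainer script whose paddle.init points at 127.0.0.1:7164.
python train.py
```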


THUHJ commented Jul 17, 2017

Thanks a lot! I had not added the init call before; I will give it a try.


THUHJ commented Jul 17, 2017

After adding the PServer address arguments to init, I still get the same error message.

Additional information:
Running the lsof -i command shows

COMMAND    PID USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
paddle_ps 6081 root  121u  IPv4 3661426      0t0  TCP 8c4d2e756387:3000 (LISTEN)
paddle_ps 6081 root  123u  IPv4 3661428      0t0  TCP 8c4d2e756387:3001 (LISTEN)

The PServer is indeed listening on the specified ports.

Could there be another cause? Or is there some way to trace the details of this error? Since the RuntimeError carries no extra message, I am not sure how to debug it.
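One generic way to narrow this down (not from the thread itself, just a sketch) is to confirm from the trainer machine that the PServer ports actually accept TCP connections, independently of Paddle:

```python
import socket


def port_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        s = socket.create_connection((host, port), timeout=timeout)
    except (socket.error, OSError):
        return False
    s.close()
    return True


# For the setup in this thread one would check, e.g.:
#   port_reachable("10.30.40.109", 3000)
```

If this returns False from the trainer host, the problem is connectivity (firewall, NIC binding) rather than the Paddle configuration.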


Yancey1989 commented Jul 17, 2017

@THUHJ Could you paste your PServer launch arguments? Also, try removing the pserver_spec argument from paddle.trainer.SGD() and specify the PServer address only in paddle.init.


THUHJ commented Jul 17, 2017

PServer launch arguments:

paddle pserver --num_gradient_servers 1 \
    --nics eth0 \
    --port 3000 \
    --ports_num 1 \
    --ports_num_for_sparse 1

trainer = paddle.trainer.SGD(
    cost=cost, parameters=parameters, update_equation=optimizer,
    is_local=False,
    pserver_spec="10.30.40.109:3000")

paddle.init(use_gpu=args.use_gpu, trainer_count=args.trainer_count,
    pservers="10.30.40.109:3000",
    port=3000,
    num_gradient_servers=1,
    ports_num=1,
    ports_num_for_sparse=1)
train()

The local machine's IP is 10.30.40.109.

@typhoonzero

@THUHJ For now, do not pass pserver_spec="10.30.40.109:3000" when creating the trainer. That argument belongs to another feature that is still under development; just remove it.
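Applying that fix to the snippet pasted earlier, the trainer creation would look like this (a fragment only; cost, parameters, and optimizer are whatever train.py already defines):

```python
# pserver_spec removed; the PServer address is given only via paddle.init.
trainer = paddle.trainer.SGD(
    cost=cost,
    parameters=parameters,
    update_equation=optimizer,
    is_local=False)
```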

@lcy-seso lcy-seso changed the title from "Questions about multi-machine runs with the v2 code" to "Questions about running DeepSpeech2 on multiple machines with the v2 code" Jul 17, 2017
@Yancey1989 Yancey1989 self-assigned this Jul 17, 2017
@Yancey1989

I saw the multi-machine training documentation that @typhoonzero is writing. Great work! #2072 @THUHJ You are welcome to review the documentation :)

@JiayiFeng

The question has been answered, so I am closing this issue.

heavengate pushed a commit to heavengate/Paddle that referenced this issue Aug 16, 2021