Failed to run benchmark scripts in Android #291

Open
liute62 opened this issue May 16, 2019 · 20 comments


liute62 commented May 16, 2019

Hi there,

I'm new here. Following the tutorial, I ran:

benchmarking/run_bench.py -b specifications/models/caffe2/shufflenet/shufflenet.json --platforms android

After a long compilation, all compile and link tasks finished in the build_android folder of my pytorch repo, but then it threw an error:

cmake unknown rule to install xxxx

It looks like the caffe2_benchmark executable was generated but was not copied to the install folder, so I copied it there manually, namely to:
/home/new/.aibench/git/exec/caffe2/android/2019/4/5/fefa6d305ea3e820afe64cec015d2f6746d9ca88
Then I modified repo_driver.py to avoid compiling again and to run the function _runBenchmarkSuites.
But that failed:

In file included from ../third_party/zstd/lib/common/pool.h:20:0,
from ../third_party/zstd/lib/common/pool.c:14:
../third_party/zstd/lib/common/zstd_internal.h:382:37: error: unknown type name ‘ZSTD_dictMode_e’; did you mean ‘FSE_decode_t’?
ZSTD_dictMode_e dictMode,
^~~~~~~~~~~~~~~
FSE_decode_t

Questions:

  1. Any suggestions on how to run the tutorial correctly?
  2. How can I avoid the long compilation every time I run the following command?

benchmarking/run_bench.py -b specifications/models/caffe2/shufflenet/shufflenet.json --platforms android

Thanks!

Contributor

hl475 commented May 16, 2019

Hi @liute62, thanks for using FAI-PEP!

To answer your questions: (1) I haven't encountered this error in third_party before. This zstd actually comes from the pytorch repo, https://github.com/pytorch/pytorch/tree/master/third_party. I would suggest downloading the latest version of the repo from the GitHub page. (2) Once you have built caffe2_benchmark, you can copy it somewhere on your laptop and change the lines in https://github.com/facebook/FAI-PEP/blob/master/specifications/frameworks/caffe2/android/build.sh. In particular, you can comment out lines 5 and 6 and change line 7 to something like cp YOUR_BUILT_CAFF2_BENCHMARK $2. With this change, the build step still runs, but it just uses your prebuilt binary.

One more thing: according to the wiki, I think the command you want is benchmarking/run_bench.py -b specifications/models/caffe2/shufflenet/shufflenet.json, i.e., without --platforms android.

Contributor

sf-wind commented May 16, 2019

@liute62, can you please share the entire log somewhere?

To speed up the build, you can try an incremental build by specifying --platforms android/interactive.

Another way to speed up the build is to run it on a more powerful host with more cores. The number of parallel threads is capped at the number of cores in the system.

Author

liute62 commented May 17, 2019

Hi @hl475 and @sf-wind,
Thanks for the quick response! I modified build.sh and it worked! I got a lot of result output for benchmarking/run_bench.py -b specifications/models/caffe2/shufflenet/shufflenet.json,
for example:
NET latency: value median 146049.00000 MAD: 2139.00000
ID_0_Conv_gpu_0/conv3_0 latency: value median 6958.12500 MAD: 188.93500
ID_100_Conv_gpu_0/gconv3_9 latency: value median 5317.45000 MAD: 198.00000
ID_101_SpatialBN_gpu_0/gconv3_9_bn latency: value median 132.60650 MAD: 3.69650
ID_102_Conv_gpu_0/gconv1_19 latency: value median 1441.87500 MAD: 79.97000
ID_103_SpatialBN_gpu_0/gconv1_19_bn latency: value median 123.35950 MAD: 5.56950
ID_104_Sum_gpu_0/block9 latency: value median 127.13200 MAD: 16.63550
ID_105_Relu_gpu_0/block9 latency: value median 46.07390 MAD: 2.52530
ID_106_Conv_gpu_0/gconv1_20 latency: value median 1130.52000 MAD: 52.89000
ID_107_SpatialBN_gpu_0/gconv1_20_bn latency: value median 121.87600 MAD: 5.15750
ID_108_Relu_gpu_0/gconv1_20_bn latency: value median 42.08370 MAD: 2.70845
ID_109_ChannelShuffle_gpu_0/shuffle_10 latency: value median 144.16900 MAD: 4.03600

But it then failed with:
ERROR 19:45:31 utilities.py: 163: Post Connection failed HTTPConnectionPool(host='127.0.0.1', port=8000): Max retries exceeded with url: /benchmark/store-result (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f1b39761da0>: Failed to establish a new connection: [Errno 111] Connection refused',))
INFO 19:45:31 utilities.py: 170: wait 64 seconds. Retrying...

  1. Do you have any idea how to set up the server? Is it a data visualization platform?

  2. How should I parse the output above? For example:
    NET latency: value median 146049.00000 MAD: 2139.00000
    My understanding is that the median latency of a single run of the network is 146049 (ms?). What is MAD here? Also, how are the results obtained? Does it run entirely on the CPU, or does it already leverage libraries like OpenCL or the GPU on mobile?

  3. Does the benchmark also include any memory consumption measurements?

  4. For battery usage, I see that it needs an extra hardware device. Have you tried measuring battery usage with software methods?

Thanks!

Contributor

hl475 commented May 17, 2019

I hit the same HTTPConnectionPool problem last night when I tried the wiki. I changed one line of my config.txt file from the wiki to "--remote_reporter": null, and that made it work.
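The config.txt change described above can also be scripted. The following is a minimal sketch, assuming config.txt is a flat JSON object of flag names to values (which the `"--remote_reporter": null,` line suggests, but the thread does not confirm). The function name and the file path passed to it are hypothetical.

```python
import json
from pathlib import Path


def disable_remote_reporter(config_path):
    """Set "--remote_reporter" to null in a JSON-style config file.

    Assumption: the file holds a single JSON object mapping flag names
    to values. The path is supplied by the caller; no default location
    is assumed here.
    """
    path = Path(config_path)
    config = json.loads(path.read_text())
    config["--remote_reporter"] = None  # written out as JSON null
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

All other keys in the file are left unchanged; only the reporter entry is overwritten.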

  1. You can follow this page to set up the server; it also covers data visualization.
  2. (1) According to the json file, the median value is the median NET latency over 50 iterations.
    (2) MAD stands for median absolute deviation.
    (3) Once you have fixed remote_reporter in config.txt, run the command once more. The output log will contain a line like INFO 00:48:28 local_reporter.py: 67: Writing file for SM-G950U-7.0-24 (988837435645343543) SOME_PATH_ON_YOUR_LAPTOP. You can find all the results in that path.
    (4) This is not controlled by FAI-PEP; it depends on the framework you are using. In this case, it is better to check with PyTorch.
  3. I don't think we do that. cc @sf-wind to confirm.
  4. You can find more info on this page; this is how we currently measure battery usage.
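To make the "median" and "MAD" columns concrete, here is a small sketch of how a median and an unscaled median absolute deviation are computed from a list of per-iteration latencies. The sample values are made up for illustration, and whether FAI-PEP applies any scaling factor to its MAD is not stated in this thread.

```python
from statistics import median


def median_and_mad(samples):
    """Return (median, median absolute deviation) of a list of numbers."""
    med = median(samples)
    # MAD: the median of the absolute distances from the median.
    mad = median([abs(x - med) for x in samples])
    return med, mad


# Hypothetical latencies from five iterations (not real benchmark data).
latencies = [100, 102, 98, 110, 101]
print(median_and_mad(latencies))  # (101, 1)
```

The single slow iteration (110) barely moves either number, which is why the median and MAD are a common choice for noisy latency measurements.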

Contributor

sf-wind commented May 17, 2019

For 3: no, we don't measure memory consumption. It would not be difficult to add, though. It requires instrumenting the framework code, since PEP just runs on the host system. We use a separate mechanism internally to get the memory consumption.

@hl475, why do we have a default remote reporter specified in config.txt? Was it added by @ZhizhenQin for the remote lab? In that case, it only affects the lab, not the client.

Contributor

hl475 commented May 17, 2019

@sf-wind For remote_reporter, I think the logic is from [1] and [2].

Contributor

sf-wind commented May 17, 2019

Haha, I added [2]. I feel the right behavior is to move it to

app = self.repoCls(raw_args=raw_args)
and add the remote reporter only if --lab is specified. Can you make the change, @hl475?

Contributor

hl475 commented May 17, 2019

Sure, will do.

hl475 added a commit to hl475/FAI-PEP that referenced this issue May 20, 2019
Summary: facebook#291 (comment)

Differential Revision: D15390774

fbshipit-source-id: a38e7ad6a4ba8b13fecd563bc52af5dc2c93838c
hl475 added a commit to hl475/FAI-PEP that referenced this issue May 20, 2019
Summary:
Pull Request resolved: facebook#293

facebook#291 (comment)

Differential Revision: D15390774

fbshipit-source-id: 1afa823a71ee44d204f9ac1a2eee879ab87034ab
hl475 added a commit to hl475/FAI-PEP that referenced this issue May 20, 2019
Summary:
Pull Request resolved: facebook#293

facebook#291 (comment)

Differential Revision: D15390774

fbshipit-source-id: 3412bf7422bf35e9019b59fe1a218dcb1621742e
facebook-github-bot pushed a commit that referenced this issue May 21, 2019
Summary:
Pull Request resolved: #293

#291 (comment)

Reviewed By: sf-wind

Differential Revision: D15390774

fbshipit-source-id: 9e2e6c87d526a9949feae4a61faa86b15c42a9d3
Contributor

hl475 commented May 21, 2019

Hi @liute62, we have addressed the Post Connection failed problem in #293. Please give it another try.

Author

liute62 commented May 22, 2019

Hi @hl475 and @sf-wind,
Thanks for the kind response!

I just set up the server and saw that you have already fixed the HTTP connection errors. Nice job!

By the way, after opening the data visualization platform, I see this screen:

image

My questions:

  1. It says no data is available, but my Django database contains many benchmark result entries. How can I display them?

  2. I noticed that even after a single run of:

benchmarking/run_bench.py -b specifications/models/caffe2/shufflenet/shufflenet.json

I got many output entries with the same timestamp in the time table on the website, like:

image

Is this the normal behavior of the output and the data visualization platform?

Thanks!

Contributor

sf-wind commented May 22, 2019

  1. I believe you need to add some filters, especially the one called user_identifier. By default, it doesn't display everything.

  2. It does output a lot of entries. You can select some fields under columns, and then they will be displayed.

Author

liute62 commented May 22, 2019

  1. What does user_identifier refer to here? How can I find out my user_identifier? Is there any documentation for the filter conditions?

  2. I selected some random fields under columns, and also tried selecting all of them, but the screen still shows no data available after I press the submit button. Are there any specific rules for that?

Contributor

sf-wind commented May 22, 2019

That is strange. Are you using the remote lab flow described in https://github.com/facebook/FAI-PEP/tree/master/ailab?

I thought that if you select some fields to display, they should appear in the table section. Can you post a screenshot?

Author

liute62 commented May 22, 2019

I didn't set up nginx and uwsgi, but I assume Django alone is enough to run the system locally.

Step1:
After setting up the database and making the migrations:

(venv) (base) new@tower2:~/Documents/git/FAI-PEP/ailab$ python manage.py runserver

Django version 2.2.1, using settings 'ailab.settings'
Starting development server at http://127.0.0.1:8000/

Step2:
(venv) (base) new@tower2:~/Documents/git/FAI-PEP/benchmarking$ python run_bench.py --lab --claimer_id 1

INFO 15:10:02 lab_driver.py: 129: Running <class 'run_lab.RunLab'> with raw_args ['--app_id', None, '--token', None, '--root_model_dir', '/home/new/.aibench/git/root_model_dir', '--logger_level', 'info', '--benchmark_table', 'benchmark_benchmarkinfo', '--cache_config', '/home/new/.aibench/git/cache_config.txt', '--commit', 'master', '--commit_file', '/home/new/.aibench/git/processed_commit', '--exec_dir', '/home/new/.aibench/git/exec', '--file_storage', 'django', '--framework', 'caffe2', '--local_reporter', '/home/new/.aibench/git/reporter', '--model_cache', '/home/new/.aibench/git/model_cache', '--platform', 'android', '--remote_reporter', 'http://127.0.0.1:8000/benchmark/store-result|oss', '--remote_repository', 'origin', '--repo', 'git', '--repo_dir', '/home/new/Documents/git/pytorch', '--result_db', 'django', '--screen_reporter', '', '--server_addr', 'http://127.0.0.1:8000', '--status_file', '/home/new/.aibench/git/status', '--timeout', '300', '--claimer_id', '1']
[{"kind": "PBDM00-8.1.0-27", "hash": "35661df3"}]

Step3:
Open a browser at http://127.0.0.1:8000/benchmark/visualize
image

Contributor

sf-wind commented May 22, 2019

did you submit any test?

Author

liute62 commented May 23, 2019

did you submit any test?

After I launched the Django server, I just ran:

benchmarking/run_bench.py -b specifications/models/caffe2/shufflenet/shufflenet.json

I also noticed that the Django benchmark result model has been saved multiple times.

So what does "test" mean here? Are there extra steps I need to take?

Contributor

hl475 commented May 24, 2019

Hi @liute62, how do you submit your job remotely? Are you using

python run_bench.py -b <benchmark_file> --remote --devices <devices> --server_addr <server_name>

as described here?

In particular, this <server_name> has to match the one you use when starting the lab

python run_bench.py --lab --claimer_id <claimer_id> --server_addr <server_name> --remote_reporter "<server_name>/benchmark/store-result|oss" --platform android

as described here.
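One way to keep the two invocations consistent is to derive both from a single server address. The sketch below only assembles the argument lists using the flags quoted above; the helper name is hypothetical, and nothing is executed.

```python
import shlex


def lab_and_client_commands(server_addr, claimer_id, benchmark_file, devices):
    """Build the lab and client command lines from one shared server address."""
    server_addr = server_addr.rstrip("/")
    lab = [
        "python", "run_bench.py", "--lab",
        "--claimer_id", str(claimer_id),
        "--server_addr", server_addr,
        # The reporter URL is derived from the same address, so it cannot drift.
        "--remote_reporter", f"{server_addr}/benchmark/store-result|oss",
        "--platform", "android",
    ]
    client = [
        "python", "run_bench.py",
        "-b", benchmark_file,
        "--remote",
        "--devices", devices,
        "--server_addr", server_addr,
    ]
    return lab, client


lab, client = lab_and_client_commands(
    "http://127.0.0.1:8000", 1,
    "../specifications/models/caffe2/shufflenet/shufflenet.json",
    "PBDM00-8.1.0-27",
)
print(shlex.join(lab))     # shell-quoted; the "|" in the reporter URL is protected
print(shlex.join(client))
```

`shlex.join` requires Python 3.8 or later. The lab command goes in one terminal and the client command in another, as in the walkthrough later in this thread.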

Contributor

sf-wind commented May 24, 2019

@liute62, you just ran the tests locally. You need to run them remotely in order to see the results in the UI. Please follow the instructions @hl475 provided. Thanks.

Author

liute62 commented May 24, 2019

@hl475 @sf-wind Thanks for the clarification.

I'm writing down the details below as an example to accompany the tutorial, for others' reference.

I didn't set up nginx and uwsgi.
The steps run in the order 1), 2), 3).

1) Terminal 1:

(venv) (base) new@tower2:~/Documents/git/FAI-PEP/ailab$ python manage.py runserver

Django version 2.2.1, using settings 'ailab.settings'
Starting development server at http://127.0.0.1:8000/
Quit the server with CONTROL-C.

2) Terminal 2:

(venv) (base) new@tower2:~/Documents/git/FAI-PEP/benchmarking$ python run_bench.py --lab --claimer_id 1 --server_addr http://127.0.0.1:8000 --remote_reporter "http://127.0.0.1:8000/benchmark/store-result|oss" --platform android

INFO 14:49:09 lab_driver.py: 129: Running <class 'run_lab.RunLab'> with raw_args ['--app_id', None, '--token', None, '--root_model_dir', '/home/new/.aibench/git/root_model_dir', '--logger_level', 'info', '--benchmark_table', 'benchmark_benchmarkinfo', '--cache_config', '/home/new/.aibench/git/cache_config.txt', '--commit', 'master', '--commit_file', '/home/new/.aibench/git/processed_commit', '--exec_dir', '/home/new/.aibench/git/exec', '--file_storage', 'django', '--framework', 'caffe2', '--local_reporter', '/home/new/.aibench/git/reporter', '--model_cache', '/home/new/.aibench/git/model_cache', '--remote_repository', 'origin', '--repo', 'git', '--repo_dir', '/home/new/Documents/git/pytorch', '--result_db', 'django', '--screen_reporter', '', '--status_file', '/home/new/.aibench/git/status', '--timeout', '300', '--claimer_id', '1', '--server_addr', 'http://127.0.0.1:8000', '--remote_reporter', 'http://127.0.0.1:8000/benchmark/store-result|oss', '--platform', 'android']
[{"kind": "PBDM00-8.1.0-27", "hash": "35661df3"}]

3) Terminal 3:

(venv) (base) new@tower2:~/Documents/git/FAI-PEP/benchmarking$ python run_bench.py -b ../specifications/models/caffe2/shufflenet/shufflenet.json --remote --devices PBDM00-8.1.0-27 --server_addr http://127.0.0.1:8000

INFO 14:51:50 build_program.py: 41: + cp /home/new/Documents/git/pytorch/build_android/bin/caffe2_benchmark /tmp/tmpmvdlhx_n/program
INFO 14:51:50 upload_download_files_django.py: 21: Uploading /tmp/tmpmvdlhx_n/program to http://127.0.0.1:8000/upload/
INFO 14:51:50 upload_download_files_django.py: 31: File has been uploaded to http://127.0.0.1:8000/media/documents/2019/05/24/program_e5ieX2H
INFO 14:51:50 run_remote.py: 170: program: http://127.0.0.1:8000/media/documents/2019/05/24/program_e5ieX2H
Result URL => http://127.0.0.1:8000/benchmark/visualize?sort=-p10&selection_form=%5B%7B%22name%22%3A+%22columns%22%2C+%22value%22%3A+%22identifier%22%7D%2C+%7B%22name%22%3A+%22columns%22%2C+%22value%22%3A+%22metric%22%7D%2C+%7B%22name%22%3A+%22columns%22%2C+%22value%22%3A+%22net_name%22%7D%2C+%7B%22name%22%3A+%22columns%22%2C+%22value%22%3A+%22p10%22%7D%2C+%7B%22name%22%3A+%22columns%22%2C+%22value%22%3A+%22p50%22%7D%2C+%7B%22name%22%3A+%22columns%22%2C+%22value%22%3A+%22p90%22%7D%2C+%7B%22name%22%3A+%22columns%22%2C+%22value%22%3A+%22platform%22%7D%2C+%7B%22name%22%3A+%22columns%22%2C+%22value%22%3A+%22time%22%7D%2C+%7B%22name%22%3A+%22columns%22%2C+%22value%22%3A+%22type%22%7D%2C+%7B%22name%22%3A+%22columns%22%2C+%22value%22%3A+%22user_identifier%22%7D%2C+%7B%22name%22%3A+%22graph-type-dropdown%22%2C+%22value%22%3A+%22bar-graph%22%7D%2C+%7B%22name%22%3A+%22rank-column-dropdown%22%2C+%22value%22%3A+%22p10%22%7D%5D&filters=%7B%22condition%22%3A+%22AND%22%2C+%22rules%22%3A+%5B%7B%22id%22%3A+%22user_identifier%22%2C+%22field%22%3A+%22user_identifier%22%2C+%22type%22%3A+%22string%22%2C+%22input%22%3A+%22text%22%2C+%22operator%22%3A+%22equal%22%2C+%22value%22%3A+%22591648383115815%22%7D%5D%2C+%22valid%22%3A+true%7D
Job status for PBDM00-8.1.0-27 is changed to QUEUE
Job status for PBDM00-8.1.0-27 is changed to RUNNING
Job status for PBDM00-8.1.0-27 is changed to DONE
ID:0 NET latency: 91662.5

4) Visualization

http://127.0.0.1:8000/benchmark/visualize doesn't show anything on its own, but if you open the Result URL, you get:
image

5) More

  1. It would be great to attach each layer's latency to a built-in network graph visualization.
  2. It would be great to have Caffe support for older projects that are built on Caffe.

Contributor

sf-wind commented May 24, 2019

Right, you need to click the result URL to see the result. It just has some filters preset. In http://127.0.0.1:8000/benchmark/visualize, you can set the same filters to see the values.
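To see which filters a Result URL presets, its query string can be decoded with the standard library. This sketch uses a shortened copy of the Result URL posted above (the selection_form parameter is omitted for brevity); the function name is hypothetical.

```python
import json
from urllib.parse import parse_qs, urlsplit

# Shortened copy of the Result URL from this thread (selection_form omitted).
RESULT_URL = (
    "http://127.0.0.1:8000/benchmark/visualize?sort=-p10"
    "&filters=%7B%22condition%22%3A+%22AND%22%2C+%22rules%22%3A+%5B%7B"
    "%22id%22%3A+%22user_identifier%22%2C+%22field%22%3A+%22user_identifier%22"
    "%2C+%22type%22%3A+%22string%22%2C+%22input%22%3A+%22text%22"
    "%2C+%22operator%22%3A+%22equal%22%2C+%22value%22%3A+%22591648383115815%22"
    "%7D%5D%2C+%22valid%22%3A+true%7D"
)


def preset_filters(result_url):
    """Decode the JSON stored in the 'filters' query parameter."""
    query = parse_qs(urlsplit(result_url).query)
    return json.loads(query["filters"][0])


for rule in preset_filters(RESULT_URL)["rules"]:
    print(rule["field"], rule["operator"], rule["value"])
# user_identifier equal 591648383115815
```

The decoded rule is the user_identifier filter, which is the value to enter by hand on the plain visualize page to reproduce the same view.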

For the operator latency, you can adjust the filters to remove the NET entry; the plot will then show the operator latencies in a clearer format.
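The same idea, dropping the NET entry and looking only at operators, can be applied offline to the console output. This is a rough sketch that assumes the line format shown in the sample output earlier in this thread; the regular expression and function name are illustrative and not part of FAI-PEP.

```python
import re

# Matches lines such as:
#   ID_0_Conv_gpu_0/conv3_0 latency: value median 6958.12500 MAD: 188.93500
LINE_RE = re.compile(
    r"^(?P<name>\S+) latency: value median (?P<median>[\d.]+) MAD: (?P<mad>[\d.]+)$"
)

# A few lines copied from the output posted in this thread.
SAMPLE = """\
NET latency: value median 146049.00000 MAD: 2139.00000
ID_0_Conv_gpu_0/conv3_0 latency: value median 6958.12500 MAD: 188.93500
ID_100_Conv_gpu_0/gconv3_9 latency: value median 5317.45000 MAD: 198.00000
ID_105_Relu_gpu_0/block9 latency: value median 46.07390 MAD: 2.52530
"""


def operator_latencies(text):
    """Return (name, median, mad) per operator, slowest first, skipping NET."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and match.group("name") != "NET":
            rows.append(
                (match.group("name"),
                 float(match.group("median")),
                 float(match.group("mad")))
            )
    return sorted(rows, key=lambda row: row[1], reverse=True)


for name, med, mad in operator_latencies(SAMPLE):
    print(f"{name}: median={med} MAD={mad}")
```

Sorting by median puts the most expensive operators first, which is usually what one looks for in a per-operator breakdown.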
