
Why is the performance so terrible? #11

Open
FreeApophis opened this issue May 23, 2021 · 10 comments
Labels: help wanted (Extra attention is needed)

Comments

@FreeApophis commented May 23, 2021

The tor v3 vanity generator cathugger/mkp224o has no GPU support, yet it is faster on my 10-year-old CPU than the numbers shown in the README here.

I get about one vanity hash every 10 seconds for a 5-character prefix on my CPU.

I couldn't verify the README numbers myself because I don't have an Nvidia GPU, but this should be A LOT faster on a GPU than that.

Has anyone tested both implementations side by side? Is tor-v3-vanity really that slow? That does not sound right.

With scallion it was easily possible to use 8- and 9-character prefixes.

@marcialvieira commented Jun 8, 2021

Testing mkp224o on an i7-8565U, even with the best optimizations for my machine (--enable-binsearch --enable-amd64-64-24k --enable-intfilter=64), I'm only getting ~15MK/sec, while running tor-v3-vanity with a GTX 1660 I'm getting ~5GK/sec.

However, I noticed that only one CPU core is busy. Maybe if the candidates coming back from the GPU were handed off and validated on multiple threads, I would get more performance and take better advantage of the keys generated by the GPU (see the sketch below).
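Not having dug into tor-v3-vanity's internals, this is only a minimal sketch of what I mean, assuming the host gets back a batch of candidate seeds from the GPU; `expensive_cpu_check`, the seed layout, and the batch itself are made up for illustration and are not the project's API:

```rust
use std::thread;

// Placeholder for the real CPU-side verification of a GPU hit (invented for this sketch).
fn expensive_cpu_check(seed: &[u8; 32]) -> bool {
    seed.iter().fold(0u8, |a, b| a ^ b) == 0
}

fn main() {
    // Pretend this is one batch of candidate seeds copied back from the GPU.
    let candidates: Vec<[u8; 32]> = (0..100_000u32)
        .map(|i| {
            let mut seed = [0u8; 32];
            seed[..4].copy_from_slice(&i.to_le_bytes());
            seed
        })
        .collect();

    let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(4);
    let chunk = (candidates.len() + workers - 1) / workers;

    // Validate each chunk on its own core instead of funneling everything through one.
    thread::scope(|s| {
        for part in candidates.chunks(chunk) {
            s.spawn(|| {
                for seed in part {
                    if expensive_cpu_check(seed) {
                        println!("match: {:02x?}", &seed[..4]);
                    }
                }
            });
        }
    });
}
```

A channel-based pipeline (GPU thread producing batches, a worker pool consuming them) would be the more realistic shape, but the point is the same: the single-core validation step shouldn't be the bottleneck.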

@marcialvieira commented Jun 9, 2021

@FreeApophis is correct. The output information is confusing: it wasn't 5GK/sec, the output was showing me a cumulative count, so the correct rate is 297KK/sec.

BTW: I'm getting 368KK/s with mkp224o on a Raspberry Pi 2. :O

@FreeApophis (Author)

Thanks for the numbers. There is definitely something wrong with the implementation when a Raspberry Pi is faster than a GTX 1660.

@23cku0r commented Aug 12, 2021

4x 2080 Ti: [benchmark screenshot]

4x 3090: [benchmark screenshot]

@marcialvieira

As you can see from @23cku0r's post, his benchmark is about 8x my Raspberry Pi 2's performance, so just two CPU-based Pis have the equivalent of one 2080 Ti's GPU-based performance. lol

@megapro17

@dr-bonez added the "help wanted" label Nov 4, 2021
@dr-bonez (Owner) commented Nov 4, 2021

I took a look at the code again, and I don't see an obvious reason why it should be so much slower. This was a weekend pet project I threw together a while back just to try out the nvptx target for rust. I have too much going on right now to look into this, but if anyone takes the time to instrument the code and determine where the bottleneck is, I'm happy to address the problem.
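For anyone picking this up, here is a minimal sketch of the kind of host-side instrumentation that would narrow it down, with `launch_batch` standing in for whatever call does one kernel launch plus copy-back; the function name and the numbers inside it are invented for illustration, not the project's real code:

```rust
use std::time::{Duration, Instant};

// Stand-in for one GPU batch (kernel launch + stream sync + copy-back).
// Returns how many keys that launch attempted. Name and numbers are made up.
fn launch_batch() -> u64 {
    std::thread::sleep(Duration::from_millis(200)); // pretend the kernel takes ~200 ms
    60 * 256 // pretend one key per thread per launch
}

fn main() {
    let mut total_keys = 0u64;
    let start = Instant::now();
    for i in 1..=10u32 {
        let t0 = Instant::now();
        total_keys += launch_batch();
        eprintln!(
            "launch {:2}: {:?} this launch, {:.0} keys/sec overall",
            i,
            t0.elapsed(),
            total_keys as f64 / start.elapsed().as_secs_f64()
        );
    }
}
```

Comparing a host-side keys/sec like this against the pure-kernel time from nvprof/Nsight would at least show whether the time goes into the kernel itself or into launch overhead and copies.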

@dr-bonez (Owner) commented Nov 4, 2021

My best guess is that there's an issue with automatic block size detection. 256 threads with 272 blocks seems low for a 2080ti.
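For reference on where a number like 272 could come from, here is a rough sketch of the occupancy arithmetic; this is my assumption about the intent, not the project's actual autodetection code, and the Turing limits are hard-coded rather than queried from the device:

```rust
/// Blocks needed to fill every SM up to its resident-thread limit.
/// In real code, `sm_count` and `max_threads_per_sm` would come from the
/// CUDA device attributes instead of being passed in by hand.
fn full_occupancy_blocks(sm_count: u32, max_threads_per_sm: u32, threads_per_block: u32) -> u32 {
    sm_count * (max_threads_per_sm / threads_per_block)
}

fn main() {
    // RTX 2080 Ti (Turing): 68 SMs, 1024 resident threads per SM.
    let blocks = full_occupancy_blocks(68, 1024, 256);
    // Prints 272 blocks, i.e. the number reported above.
    println!("2080 Ti: {} blocks x 256 threads = {} threads in flight", blocks, blocks * 256);
}
```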

@ghost commented Nov 18, 2021

Something is definitely wrong here; this is my experience running it on my GTX 1080:

==27116== NVPROF is profiling process 27116, command: ./t3v -d keys hello
Launching kernel on device #0 with 256 threads and 60 blocks
Tried 2012160 / 33554432 (expected) keys.
Running for 30 seconds / 8 minutes, 21 seconds (expected).
Tried 4024320 / 33554432 (expected) keys.
Running for 1 minutes, 0 seconds / 8 minutes, 21 seconds (expected).
Tried 6036480 / 33554432 (expected) keys.
Running for 1 minutes, 30 seconds / 8 minutes, 21 seconds (expected).
^C==27116== Profiling application: ./t3v -d keys hello
==27116== Warning: 1 records have invalid timestamps due to insufficient device buffer space. You can configure the buffer space using the option --device-buffer-size.
==27116== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  90.9431s       397  229.08ms  213.42ms  253.10ms  render
                    0.00%  591.57us       794     745ns     351ns  3.0080us  [CUDA memcpy DtoH]
                    0.00%  226.64us       404     561ns     480ns  1.2480us  [CUDA memcpy HtoD]
      API calls:   99.87%  90.9558s       397  229.11ms  213.43ms  253.10ms  cuStreamSynchronize
                    0.11%  100.82ms         1  100.82ms  100.82ms  100.82ms  cuCtxCreate
                    0.01%  10.058ms       794  12.667us  5.8100us  93.444us  cuMemcpyDtoH
                    0.01%  5.0190ms       398  12.610us  9.2540us  72.085us  cuLaunchKernel
                    0.00%  1.7040ms       404  4.2170us  2.4680us  155.27us  cuMemcpyHtoD
                    0.00%  1.6310ms         1  1.6310ms  1.6310ms  1.6310ms  cuModuleLoadData
                    0.00%  255.92us       399     641ns     280ns  1.9780us  cuModuleGetFunction
                    0.00%  109.00us         6  18.166us  1.7130us  99.015us  cuMemAlloc
                    0.00%  9.9490us         1  9.9490us  9.9490us  9.9490us  cuStreamCreateWithPriority
                    0.00%  4.9050us         1  4.9050us  4.9050us  4.9050us  cuDeviceGetPCIBusId
                    0.00%  1.7310us         6     288ns     139ns     553ns  cuDeviceGetAttribute
                    0.00%     832ns         3     277ns     107ns     554ns  cuDeviceGetCount
                    0.00%     555ns         2     277ns     101ns     454ns  cuFuncGetAttribute
                    0.00%     500ns         2     250ns      98ns     402ns  cuDeviceGet
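A quick back-of-the-envelope on those numbers (my arithmetic on the log above, using marcialvieira's earlier mkp224o figure for comparison):

```rust
fn main() {
    // From the nvprof run above: 2,012,160 keys tried after 30 seconds on a GTX 1080.
    let t3v_rate = 2_012_160.0_f64 / 30.0; // ~67,000 keys/sec

    // mkp224o on an i7-8565U, as reported earlier in the thread (~15 MK/sec).
    let mkp224o_rate = 15.0e6_f64;

    println!("tor-v3-vanity (GTX 1080): {:>12.0} keys/sec", t3v_rate);
    println!("mkp224o (i7-8565U):       {:>12.0} keys/sec", mkp224o_rate);
    println!("ratio: mkp224o is ~{:.0}x faster", mkp224o_rate / t3v_rate);
}
```

That lines up with the profile itself: 60 × 256 = 15,360 keys per launch at ~229 ms per render call gives roughly the same rate, and GPU activity is essentially 100% in the render kernel, so the time appears to be going into the kernel rather than into launch overhead or copies.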

@scramblr

Same issues here. I figured I was just having bad luck, but no: running on an 8-GPU server produces fewer results than multi-processor mkp224o. I was really looking forward to this too, as it's the only GPU-based solution currently in existence for v3 onions.
