[SCI] Weird outputs of resnet50-OT #204

Open
wqruan opened this issue Dec 13, 2023 · 0 comments

Hi,
When I run resnet50-OT (SCI/build/bin) in inference mode (i.e., with the definition of the TRAINING macro removed), the output is:

  Total time taken = 2789929 milliseconds.
  Total data sent = 346429 MiB.
  Number of rounds = 5579
  Total comm (sent+received) = 376212 MiB.
  ------------------------------------------------------
  Total time in Conv = 2507.34 seconds.
  Total time in MatMul = 2516.68 seconds.
  Total time in BatchNorm = 47.581 seconds.
  Total time in Truncation = 115.27 seconds.
  Total time in Relu = 92.644 seconds.
  Total time in MaxPool = 16.519 seconds.
  Total time in AvgPool = 0.043 seconds.
  Total time in ArgMax = 0.078 seconds.
  Total time in MatAdd = 0 seconds.
  Total time in MatAddBroadCast = 0 seconds.
  Total time in MulCir = 0 seconds.
  Total time in ScalarMul = 0 seconds.
  Total time in Sigmoid = 0 seconds.
  Total time in Tanh = 0 seconds.
  Total time in Sqrt = 0 seconds.
  Total time in NormaliseL2 = 0 seconds.
  ------------------------------------------------------
  Conv data sent = 342519 MiB.
  MatMul data sent = 343677 MiB.
  BatchNorm data sent = 901.882 MiB.
  Truncation data sent = 945.903 MiB.
  Relu data sent = 766.831 MiB.
  Maxpool data sent = 136.633 MiB.
  Avgpool data sent = 0.239258 MiB.
  ArgMax data sent = 0.136467 MiB.
  MatAdd data sent = 0 MiB.
  MatAddBroadCast data sent = 0 MiB.
  MulCir data sent = 0 MiB.
  Sigmoid data sent = 0 MiB.
  Tanh data sent = 0 MiB.
  Sqrt data sent = 0 MiB.
  NormaliseL2 data sent = 0 MiB.
  ------------------------------------------------------
  Conv data (sent+received) = 355150 MiB.
  MatMul data (sent+received) = 356514 MiB.
  BatchNorm data (sent+received) = 5993.38 MiB.
  Truncation data (sent+received) = 7614.44 MiB.
  Relu data (sent+received) = 5165.8 MiB.
  Maxpool data (sent+received) = 921.391 MiB.
  Avgpool data (sent+received) = 2.24609 MiB.
  ArgMax data (sent+received) = 0.76857 MiB.
  MatAdd data (sent+received) = 0 MiB.
  MatAddBroadCast data (sent+received) = 0 MiB.
  MulCir data (sent+received) = 0 MiB.
  ScalarMul data (sent+received) = 0 MiB.
  Sigmoid data (sent+received) = 0 MiB.
  Tanh data (sent+received) = 0 MiB.
  Sqrt data (sent+received) = 0 MiB.
  NormaliseL2 data (sent+received) = 0 MiB.

The per-component running times sum to far more than the total running time (about 5296 s against 2790 s), and the per-component communication sums to far more than the total communication (about 688,948 MiB sent against a reported total of 346,429 MiB).

I therefore suspect that the running time and communication of the convolution and matmul components are being counted more than once. Could you help me explain this?
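
For reference, the mismatch can be checked directly from the log with a small script. This is only a sketch: the numbers are copied from the output above (zero entries omitted), and the helper name `total` and the `skip` parameter are made up for illustration. Leaving Conv out of the sums is a hypothesis to test (that Conv might be measured inside MatMul), not something confirmed by the SCI code.

```python
# Figures copied from the log above; components reporting 0 are omitted.
total_time_s = 2789929 / 1000   # "Total time taken", converted from ms to s
total_sent_mib = 346429         # "Total data sent"
total_comm_mib = 376212         # "Total comm (sent+received)"

time_s = {
    "Conv": 2507.34, "MatMul": 2516.68, "BatchNorm": 47.581,
    "Truncation": 115.27, "Relu": 92.644, "MaxPool": 16.519,
    "AvgPool": 0.043, "ArgMax": 0.078,
}
sent_mib = {
    "Conv": 342519, "MatMul": 343677, "BatchNorm": 901.882,
    "Truncation": 945.903, "Relu": 766.831, "MaxPool": 136.633,
    "AvgPool": 0.239258, "ArgMax": 0.136467,
}
comm_mib = {
    "Conv": 355150, "MatMul": 356514, "BatchNorm": 5993.38,
    "Truncation": 7614.44, "Relu": 5165.8, "MaxPool": 921.391,
    "AvgPool": 2.24609, "ArgMax": 0.76857,
}


def total(parts, skip=()):
    """Sum per-component values, optionally leaving some components out."""
    return sum(v for name, v in parts.items() if name not in skip)


checks = [
    ("time (s)", time_s, total_time_s),
    ("sent (MiB)", sent_mib, total_sent_mib),
    ("sent+received (MiB)", comm_mib, total_comm_mib),
]
for label, parts, reported in checks:
    # Compare the reported total with the full sum and with the sum minus Conv.
    print(f"{label}: reported={reported:.3f} "
          f"sum_all={total(parts):.3f} "
          f"sum_without_conv={total(parts, skip=('Conv',)):.3f}")
```

If the sums without Conv land close to the reported totals, that would be consistent with the Conv figures also being included in the MatMul figures, but only the maintainers can say how the counters are actually nested.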