Benchmarks #4912
Conversation
@@ -91,22 +91,6 @@ def test_train_with_configs(self):
        self.check_results_dict_not_empty(results.time_train_result)
        self.check_results_dict_not_empty(results.memory_train_result)

    def test_train_with_configs_torchscript(self):
no possible torchscript training atm
Codecov Report
@@ Coverage Diff @@
## master #4912 +/- ##
==========================================
- Coverage 77.28% 76.34% -0.95%
==========================================
Files 133 134 +1
Lines 22134 22369 +235
==========================================
- Hits 17107 17078 -29
- Misses 5027 5291 +264
Continue to review full report at Codecov.
@@ -5,12 +5,15 @@ include_trailing_comma = True
known_first_party = transformers
known_third_party =
    absl
    elasticsearch
@yjernite New packages for the example folder have to be added here to avoid problems with isort (learned from @sshleifer)
Love this, LGTM!
Great test coverage as well.
Crazy that there isn't a better way to do this in tf.
Left some nits, but feel free to ignore.
# as written in https://docs.python.org/2/library/timeit.html#timeit.Timer.repeat, min should be taken rather than the average
runtimes = timeit.repeat(_train, repeat=self.args.repeat, number=10,)

# as written in https://docs.python.org/2/library/timeit.html#timeit.Timer.repeat, min should be taken rather than the average
runtimes = timeit.repeat(func, repeat=self.args.repeat, number=10,)
What is number? Is this why the benchmarking is slow?
number defines how many times the function should be executed, and the runtime of those executions is summed up; repeat says how often that sum of executions should be measured. So it returns a list with len(...) = self.args.repeat, and each element in the list is the total time of running the function number times.
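For illustration, a minimal standalone sketch of how repeat and number interact (the function here is a hypothetical stand-in, not the benchmark code):

import timeit

def func():
    # hypothetical stand-in for the benchmarked forward pass
    sum(i * i for i in range(10_000))

# number=10: time 10 consecutive calls of func and sum them
# repeat=3: do that whole measurement 3 times
runtimes = timeit.repeat(func, repeat=3, number=10)
print(len(runtimes))       # 3, one entry per repeat
print(min(runtimes) / 10)  # per-call estimate: smallest sum divided by number

Taking the minimum rather than the average follows the timeit documentation referenced in the diff above.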
@@ -81,6 +81,31 @@
_torch_tpu_available = False


try:
    import psutil  # noqa: F401
This is a fairly innocuous dependency afaict, and you could add it to requirements.txt/setup.py
    no_inference=True,
    sequence_lengths=[8],
    batch_sizes=[1],
    no_multi_process=True,
multiprocess and pytest are not friends it seems?
Nope :D PyTest also starts a new process for each test itself, so multiprocessing inside multiprocessing breaks CUDA inits.
This is great. I really like that you can import the benchmarks if you want to use them at runtime, rather than the only option being to run a script.
Some remarks after playing with it:
- Maybe you should raise an error when no model_names are specified. Right now it crashes with UnboundLocalError: local variable 'inference_summary' referenced before assignment (PyTorch version at least).
- There seems to be an error in the way the runtimes are computed. PyTorch on GPU is slower than TensorFlow on CPU (10x slower), while PyTorch on CPU is 150x slower than TensorFlow on CPU.
Here are the results from my runs so far. The following is on CPU with TensorFlow (2ms per inference with bert-base-cased, seq len 8 and batch size 512 on a CPU??). I didn't test the memory usage so it's not in the results:
==================== INFERENCE - SPEED - RESULT ====================
--------------------------------------------------------------------------------
Model Name Batch Size Seq Length Time in s
--------------------------------------------------------------------------------
bert-base-cased 8 8 0.001
bert-base-cased 8 32 0.001
bert-base-cased 8 128 0.001
bert-base-cased 8 512 0.002
--------------------------------------------------------------------------------
==================== ENVIRONMENT INFORMATION ====================
- transformers_version: 2.11.0
- framework: Tensorflow
- eager_mode: False
- use_xla: False
- framework_version: 2.2.0
- python_version: 3.6.10
- system: Linux
- cpu:
- architecture: 64bit
- date: 2020-06-18
- time: 11:57:18.595804
- fp16: False
- use_multiprocessing: True
- cpu_ram_mb: 64333
- use_gpu: False
- use_tpu: False
Here's the test with PyTorch on GPU:
==================== INFERENCE - SPEED - RESULT ====================
--------------------------------------------------------------------------------
Model Name Batch Size Seq Length Time in s
--------------------------------------------------------------------------------
bert-base-cased 8 8 0.007
bert-base-cased 8 32 0.007
bert-base-cased 8 128 0.019
bert-base-cased 8 512 0.074
--------------------------------------------------------------------------------
==================== ENVIRONMENT INFORMATION ====================
- transformers_version: 2.11.0
- framework: PyTorch
- use_torchscript: False
- framework_version: 1.5.0
- python_version: 3.6.10
- system: Linux
- cpu:
- architecture: 64bit
- date: 2020-06-18
- time: 11:56:31.041360
- fp16: False
- use_multiprocessing: True
- cpu_ram_mb: 64333
- use_gpu: True
- num_gpus: 1
- gpu: N/A
- gpu_ram_mb: N/A
- gpu_power_watts: N/A
- gpu_performance_state: N/A
- use_tpu: False
I'm not sure that PyTorch on GPU is ~37x slower than TensorFlow on CPU 😄 I tried to debug but it's not easy to debug tf functions unfortunately
# coding=utf-8
# Copyright 2018 The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Benchmarking the library on inference and training in Tensorflow"""
Cool, nice addition to the existing run_benchmark.py. You preferred to split the files into two because the arguments are too different?
Hmm, yeah I thought it's a nicer standard to just always have one file for TF and one for PT. Also I didn't want to add a dataclass with a use_tf argument.
Thanks a lot for checking everything! Found the error :-) One just has to return a tensor out of the tf.function context so that it is actually computed. I guess TF's graph compilation optimizes the function so that variables that are not used outside of the @tf.function scope are not computed. Will update the notebooks and should then get more reasonable results :-)
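A minimal sketch of the fix described here (the layer and input shapes are placeholders, not the PR's code): the output tensor has to leave the tf.function scope, otherwise the graph optimizer can prune the forward pass as dead code and the timing becomes meaningless.

import tensorflow as tf

layer = tf.keras.layers.Dense(32)     # placeholder for a transformers TF model
inputs = tf.random.uniform((8, 128))  # placeholder batch

@tf.function
def forward_discarded():
    layer(inputs)  # result never leaves the function, so it may be optimized away

@tf.function
def forward_measured():
    return layer(inputs)  # returning the tensor forces the computation to happen

_ = forward_measured()  # only this version reflects the real inference cost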
And will definitely add a better error message.
The speed tests seem much more reasonable now, if you check the notebooks :-) @LysandreJik
GPU locally gives reasonable results for TF vs. PT. All tests were run in this environment, for TF 2.2 and PyTorch 1.4.0.

Results (tables omitted) for: PyTorch, PyTorch FP16, TF no eager mode, TF XLA.

Memory measurements

They also seem reasonable for the forward pass. Results (tables omitted) for: TF no eager mode (keeping in mind that nvidia-smi is not accurate here and TF always allocates more than it needs), PyTorch, PyTorch FP16.
Great!! Can't wait for the future PRs, this is an exciting subject!
Benchmarks
This PR adds the functionality to measure the following for TF and PT:
Tensorflow:
PyTorch:
How is memory measured?
CPU
We are always interested in the peak memory usage of the process. For CPU, the library psutil in combination with multiprocessing is leveraged.
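A rough sketch of what such a measurement can look like (an illustration of the idea, not the PR's actual implementation): the workload runs in a child process while the parent polls its resident set size with psutil and keeps the peak.

import time
from multiprocessing import Process

import psutil

def workload():
    # placeholder for the model call whose memory usage should be measured
    data = [bytearray(10**6) for _ in range(200)]
    time.sleep(1.0)
    del data

def peak_memory_mb(target) -> float:
    child = Process(target=target)
    child.start()
    tracked = psutil.Process(child.pid)
    peak = 0
    while child.is_alive():
        try:
            peak = max(peak, tracked.memory_info().rss)
        except psutil.NoSuchProcess:
            break
        time.sleep(0.01)
    child.join()
    return peak / 2**20

if __name__ == "__main__":
    print(f"peak memory: {peak_memory_mb(workload):.0f} MB")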
GPU

It is difficult to have exact memory measurements on GPU. Tensorflow allocates the full GPU memory by default. This can be disabled with tf.config.experimental.set_memory_growth=True, but Tensorflow still allocates more memory than it needs for efficiency, as far as I know.

=> Memory is therefore always measured to give the same maximal result as shown by nvidia-smi. This means that the memory for loading PyTorch / Tensorflow is also taken into account, which is for example not done when measuring via torch.cuda.max_memory_allocated.

Tensorflow also does not release GPU memory before the process is finished. Therefore, all measurement functions are wrapped into their own spawned process via Python's multiprocessing tools.
Also note that because TF does not release memory during the same process, both memory and inference are measured using a multiprocess approach in TF. TF also does not provide an official memory monitoring function, so the same result that nvidia-smi would show is used for TF.
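As a hedged illustration of this approach (the helper below simply shells out to nvidia-smi; the PR itself may query the driver differently): each measurement runs in its own spawned process and reports the same total figure that nvidia-smi shows.

import subprocess
from multiprocessing import Process, Queue

def gpu_memory_used_mb() -> int:
    # same number that nvidia-smi reports, so memory for loading the framework
    # (CUDA context, cached allocations) is included
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]
    )
    return int(out.decode().splitlines()[0])

def measure(queue):
    # placeholder: load the model and run the forward pass to be measured here
    queue.put(gpu_memory_used_mb())

if __name__ == "__main__":
    # one spawned process per measurement, since TF only releases GPU memory
    # when its process exits
    queue = Queue()
    worker = Process(target=measure, args=(queue,))
    worker.start()
    result = queue.get()
    worker.join()
    print(f"GPU memory used: {result} MB")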
TPU

Memory measurement is currently not supported.
How is speed measured?
For all functionality that requires compilation (TPU, XLA, Torchscript), 5 warmup calls of the function are done beforehand.
Afterwards, the minimum over self.args.repeat measurements is reported, where each measurement is the time averaged over 10 function calls.
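Putting the pieces together, a small sketch of the described procedure (the defaults mirror the description above; the benchmarked callable is a placeholder supplied by the caller):

import timeit

def measure_speed(func, repeat: int, warmup: int = 5, number: int = 10) -> float:
    # warmup calls so that compilation (TPU / XLA / Torchscript) is not timed
    for _ in range(warmup):
        func()
    # repeat measurements, each the total time of number calls;
    # the fastest one is reported as the per-call time
    runtimes = timeit.repeat(func, repeat=repeat, number=number)
    return min(runtimes) / number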
Example Colabs:

The colabs give quick examples for each functionality with little explanation for the moment:
Pytorch TPU: https://colab.research.google.com/drive/1GJFOdcBe1pW_FKWpA0jK_AOsIQ5epcvE?usp=sharing
Tensorflow TPU:
https://colab.research.google.com/drive/1t8DW1NxA4b1BsWSZ1ehFG9oT69l0h7os?usp=sharing
GPU: https://colab.research.google.com/drive/15XTPT_GPp42Zj7_f1W9X_T3NNXE9_1Te?usp=sharing
CPU: https://colab.research.google.com/drive/1OG2rZgo18KvliS-ratybld9pHD06-v5S?usp=sharing
Future PR:
- labels parameter as an input, adding measurement for training is left for a future PR
- model.half() to measure fp16 in Pytorch. See issue here: GPU memory issues (leak?) NVIDIA/apex#439. Wait until amp is supported in upstream torch 1.6.