From a0417629efe838cb4dae6232a3d423969290d1ec Mon Sep 17 00:00:00 2001
From: Zijie Li
Date: Mon, 4 Nov 2024 00:37:46 -0500
Subject: [PATCH 1/2] update benchmark readme

update new comment with memory usage included
---
 python/llm/dev/benchmark/README.md | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/python/llm/dev/benchmark/README.md b/python/llm/dev/benchmark/README.md
index 7f16746edab..4d19e7d5be5 100644
--- a/python/llm/dev/benchmark/README.md
+++ b/python/llm/dev/benchmark/README.md
@@ -59,6 +59,12 @@ with torch.inference_mode():
     output_str = tokenizer.decode(output[0], skip_special_tokens=True)
 ```
 
+### Sample Output
+```bash
+=========First token cost xx.xxxxs and 3.595703125 GB=========
+=========Rest tokens cost average xx.xxxxs (31 tokens in all) and 3.595703125 GB=========
+```
+
 ### Inference on multi GPUs
 Similarly, put this file into your benchmark directory, and then wrap your optimized model with `BenchmarkWrapper` (`model = BenchmarkWrapper(model)`).
 For example, just need to apply following code patch on [Deepspeed Autotp example code](https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/example/GPU/Deepspeed-AutoTP/deepspeed_autotp.py) to calculate 1st and the rest token performance:
@@ -81,8 +87,13 @@ For example, just need to apply following code patch on [Deepspeed Autotp exampl
 ```
 
 ### Sample Output
-Output will be like:
+You can also set `verbose = True`
+```python
+model = BenchmarkWrapper(model, do_print=True, verbose=True)
+```
+
 ```bash
-=========First token cost xx.xxxxs=========
-=========Last token cost average xx.xxxxs (31 tokens in all)=========
+=========First token cost xx.xxxxs and 3.595703125 GB=========
+=========Rest token cost average xx.xxxxs (31 tokens in all) and 3.595703125 GB=========
+Peak memory for every token: [3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125]
 ```

From 1c4cfa8eaf24c4647c30d50e50f9256cf55d6129 Mon Sep 17 00:00:00 2001
From: Zijie Li
Date: Mon, 4 Nov 2024 11:06:08 -0500
Subject: [PATCH 2/2] Update README.md

---
 python/llm/dev/benchmark/README.md | 23 +++++++++++------------
 1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/python/llm/dev/benchmark/README.md b/python/llm/dev/benchmark/README.md
index 4d19e7d5be5..160b4bf14d0 100644
--- a/python/llm/dev/benchmark/README.md
+++ b/python/llm/dev/benchmark/README.md
@@ -65,6 +65,17 @@ with torch.inference_mode():
 =========Rest tokens cost average xx.xxxxs (31 tokens in all) and 3.595703125 GB=========
 ```
 
+You can also set `verbose = True`
+```python
+model = BenchmarkWrapper(model, do_print=True, verbose=True)
+```
+
+```bash
+=========First token cost xx.xxxxs and 3.595703125 GB=========
+=========Rest token cost average xx.xxxxs (31 tokens in all) and 3.595703125 GB=========
+Peak memory for every token: [3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125]
+```
+
 ### Inference on multi GPUs
 Similarly, put this file into your benchmark directory, and then wrap your optimized model with `BenchmarkWrapper` (`model = BenchmarkWrapper(model)`).
 For example, just need to apply following code patch on [Deepspeed Autotp example code](https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/example/GPU/Deepspeed-AutoTP/deepspeed_autotp.py) to calculate 1st and the rest token performance:
@@ -85,15 +96,3 @@ For example, just need to apply following code patch on [Deepspeed Autotp exampl
     # Load tokenizer
     tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
 ```
-
-### Sample Output
-You can also set `verbose = True`
-```python
-model = BenchmarkWrapper(model, do_print=True, verbose=True)
-```
-
-```bash
-=========First token cost xx.xxxxs and 3.595703125 GB=========
-=========Rest token cost average xx.xxxxs (31 tokens in all) and 3.595703125 GB=========
-Peak memory for every token: [3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125, 3.595703125]
-```
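
For reference, a minimal end-to-end sketch of the usage the patched README describes: wrapping an ipex-llm low-bit model with `BenchmarkWrapper` and enabling `verbose=True` so peak memory per token is reported alongside first-token and rest-token latency. This is a hypothetical example, not part of the patch; the model id, prompt, and generation settings are placeholders, and the `benchmark_util` import assumes `benchmark_util.py` has been copied into the working directory as the README instructs.

```python
# Hypothetical usage sketch; model_path, prompt, and generation settings are placeholders.
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM  # low-bit optimized model class from ipex-llm
from benchmark_util import BenchmarkWrapper             # assumes benchmark_util.py is in this directory

model_path = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model id

# Load a 4-bit optimized model, then wrap it so every generate() call prints
# first-token latency, average rest-token latency, and peak memory usage.
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = BenchmarkWrapper(model, do_print=True, verbose=True)  # verbose=True also lists peak memory per token

prompt = "What is AI?"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
with torch.inference_mode():
    output = model.generate(input_ids, max_new_tokens=32)
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_str)
```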