
Add gpt-j-6B w/ deepspeed example #87

Closed
wants to merge 5 commits

Conversation

Oogy
Contributor

@Oogy Oogy commented Aug 8, 2022

@Oogy Oogy requested a review from salanki August 8, 2022 13:40
@Oogy Oogy changed the title Add gpt-j-6B w/ deepspeed Add gpt-j-6B w/ deepspeed example Aug 8, 2022
@Oogy Oogy requested review from wbrown and rtalaricw September 8, 2022 16:07
@Oogy
Contributor Author

Oogy commented Sep 8, 2022

Preliminary results:

./benchmark.py --url http://gpt-j-6b-deepspeed.tenant-sta-tweldon-workbench.knative.chi.coreweave.com
INFO:gpt-j-6b-benchmark-client: Connecting to http://gpt-j-6b-deepspeed.tenant-sta-tweldon-workbench.knative.chi.coreweave.com
INFO:gpt-j-6b-benchmark-client: Input Sequence Lenghth: 8
INFO:gpt-j-6b-benchmark-client: Request data: b'{"benchmark": true, "parameters": {"benchmark_sequence_length": 8, "max_length": 16}}'
INFO:gpt-j-6b-benchmark-client: Results: {"benchmark_results": {"input_sequence_length": 8, "generated_tokens": 8, "time": 0.17124605178833008}}
INFO:gpt-j-6b-benchmark-client: Input Sequence Lenghth: 8
INFO:gpt-j-6b-benchmark-client: Request data: b'{"benchmark": true, "parameters": {"benchmark_sequence_length": 8, "max_length": 1024}}'
INFO:gpt-j-6b-benchmark-client: Results: {"benchmark_results": {"input_sequence_length": 8, "generated_tokens": 1016, "time": 20.1104953289032}}
INFO:gpt-j-6b-benchmark-client: Input Sequence Lenghth: 8
INFO:gpt-j-6b-benchmark-client: Request data: b'{"benchmark": true, "parameters": {"benchmark_sequence_length": 8, "max_length": 2048}}'

Getting an error I've not encountered before on the last request in the log above, which does not finish:

    HTTPServerRequest(protocol='http', host='gpt-j-6b-deepspeed-predictor-default.tenant-sta-tweldon-workbench.svc.tenant.chi.local', method='POST', uri='/v1/models/eleutherai-gpt-j-6b:predict', version='HTTP/1.1', remote_ip='127.0.0.1')
    Traceback (most recent call last):
      File "/usr/local/lib/python3.8/dist-packages/tornado/web.py", line 1713, in _execute
        result = await result
      File "/usr/local/lib/python3.8/dist-packages/kserve/handlers/predict.py", line 70, in post
        response = await model(body)
      File "/usr/local/lib/python3.8/dist-packages/kserve/model.py", line 87, in __call__
        else self.predict(request)
      File "/app/service.py", line 224, in predict
        return self.benchmark(request_params)
      File "/app/service.py", line 255, in benchmark
        predicitions = self._predict(request_params)
      File "/app/service.py", line 208, in _predict
        "predictions": self.generator(
      File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/text_generation.py", line 176, in __call__
        return super().__call__(text_inputs, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/base.py", line 1067, in __call__
        return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
      File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/base.py", line 1075, in run_single
        outputs = self.postprocess(model_outputs, **postprocess_params)
      File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/text_generation.py", line 237, in postprocess
        text = self.tokenizer.decode(
      File "/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_base.py", line 3367, in decode
        return self._decode(
      File "/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_fast.py", line 548, in _decode
        text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
    OverflowError: out of range integral type conversion attempted
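
(For reference, the benchmark requests in the log above can be reproduced with a small client sketch like the one below. The endpoint path and payload shape are taken from the log and traceback; everything else is an assumption, not the actual benchmark.py.)

    import requests

    # Hypothetical client sketch; URL and payload shape copied from the log above.
    URL = "http://gpt-j-6b-deepspeed.tenant-sta-tweldon-workbench.knative.chi.coreweave.com"

    def benchmark_request(seq_len: int, max_len: int) -> dict:
        """POST one benchmark request and return the parsed JSON response."""
        payload = {
            "benchmark": True,
            "parameters": {"benchmark_sequence_length": seq_len, "max_length": max_len},
        }
        resp = requests.post(f"{URL}/v1/models/eleutherai-gpt-j-6b:predict", json=payload)
        resp.raise_for_status()
        return resp.json()

    # benchmark_request(8, 2048) corresponds to the request that hits the OverflowError above.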

request_params["BENCHMARK_SEQUENCE_LENGTH"] < request_params["MAX_LENGTH"]
)

sequence_start = random.randrange(
Contributor


You can iterate over these ranges:

self.input_sizes = [8, 512, 1536]
self.context_sizes = [16, 1024, 2048]

You would have to run it in a loop and take the 25th/50th/75th percentiles over 50 experiments, with a cutoff of 10 warm-up runs before measurement starts.
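
A rough sketch of that loop (hedged: the endpoint, the pairing of input/context sizes, and the helper below are assumptions, not part of this PR):

    import json
    import statistics

    import requests

    # Hypothetical endpoint; the real benchmark URL is passed to benchmark.py via --url.
    URL = "http://gpt-j-6b-deepspeed.example.com"

    input_sizes = [8, 512, 1536]
    context_sizes = [16, 1024, 2048]
    NUM_EXPERIMENTS = 50
    WARMUP = 10  # discard the first 10 runs before measuring

    def run_once(seq_len: int, max_len: int) -> float:
        """POST one benchmark request and return the reported generation time."""
        payload = {
            "benchmark": True,
            "parameters": {"benchmark_sequence_length": seq_len, "max_length": max_len},
        }
        resp = requests.post(f"{URL}/v1/models/eleutherai-gpt-j-6b:predict", json=payload)
        return resp.json()["benchmark_results"]["time"]

    summary = {}
    for seq_len, max_len in zip(input_sizes, context_sizes):  # or itertools.product(...)
        times = [run_once(seq_len, max_len) for _ in range(NUM_EXPERIMENTS)]
        measured = times[WARMUP:]
        p25, p50, p75 = statistics.quantiles(measured, n=4)  # 25th/50th/75th percentiles
        summary[f"{seq_len}x{max_len}"] = {"p25": p25, "p50": p50, "p75": p75}

    print(json.dumps(summary, indent=2))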

Comment on lines +247 to +249
sequence_end = sequence_start + request_params["BENCHMARK_SEQUENCE_LENGTH"]
random_sequence_encoded = self.dataset[sequence_start:sequence_end]
random_sequence = self.tokenizer.decode(random_sequence_encoded)
Contributor


Can you explain how you are randomizing the sequences here? Does randrange require two args? https://www.w3schools.com/python/ref_random_randrange.asp

Contributor Author


randrange() can take 1-3 args. With a single arg N, it just returns a random value between 0 and N (exclusive).

We pick a random start that is no closer than BENCHMARK_SEQUENCE_LENGTH to the end of self.dataset, so that when we add BENCHMARK_SEQUENCE_LENGTH to the start we don't go out of range.

Then we grab the slice of self.dataset from the start index to the end index.

The 3-arg form of randrange() could also be used, but being limited to increments of the step arg would reduce the number of distinct random sequences we could get.
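
A minimal sketch of that sampling (dataset and tokenizer stand in for self.dataset and self.tokenizer; the function name is made up):

    import random

    def sample_random_sequence(dataset, tokenizer, seq_len):
        """Decode a random seq_len-token slice of a pre-tokenized dataset."""
        # Pick a start at most len(dataset) - seq_len in, so start + seq_len stays in range.
        start = random.randrange(len(dataset) - seq_len)
        end = start + seq_len
        return tokenizer.decode(dataset[start:end])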

Comment on lines +262 to +268
return {
    "benchmark_results": {
        "input_sequence_length": request_params["BENCHMARK_SEQUENCE_LENGTH"],
        "generated_tokens": request_params["MAX_LENGTH"]
        - request_params["BENCHMARK_SEQUENCE_LENGTH"],
        "time": generation_time,
    }
Contributor


I think iterating over the sequences will change this. You can append each result to a JSON file and read it back at the end to get the quantiles across the experiments you run.
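
A minimal sketch of that append-then-summarize flow (using JSON Lines so each result can be appended; the file name is made up):

    import json
    import statistics

    RESULTS_FILE = "benchmark_results.jsonl"  # hypothetical path, one JSON object per line

    def append_result(result: dict) -> None:
        """Append one benchmark_results dict as a line of JSON."""
        with open(RESULTS_FILE, "a") as f:
            f.write(json.dumps(result) + "\n")

    def summarize() -> dict:
        """Read every result back and compute generation-time quantiles."""
        with open(RESULTS_FILE) as f:
            times = [json.loads(line)["time"] for line in f if line.strip()]
        p25, p50, p75 = statistics.quantiles(times, n=4)
        return {"p25": p25, "p50": p50, "p75": p75}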

@Oogy
Contributor Author

Oogy commented Sep 19, 2022

Closing pending DeepSpeed updates/improvements:

  1. Can't use do_sample=True beyond 129-token contexts without getting terrible output; do_sample=False can handle larger contexts, but its output is useless (see the parameter sketch after this list).
  2. With do_sample=False, the min/max length generation parameters are not respected beyond 129 tokens, and generation eventually fails to produce output longer than ~570 tokens.
  3. Can't batch. I can't find any issues specific to GPT-J batching, but batching issues with other models are reported. Additionally, given the context limitations above, the speed-up for single prompts now seems less worth it.
  4. Can't use low_cpu_mem_usage=True to speed up model loading; the output is bad. It's not required, but it brings model loading time down from a little over 2.5 minutes to a little over a minute.
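
For context, a sketch of the kind of transformers pipeline call these parameters belong to (values are made up for illustration; this is not the service's actual code, which additionally wraps the model with DeepSpeed inference):

    from transformers import pipeline

    # Illustrative only; the values below are placeholders.
    generator = pipeline(
        "text-generation",
        model="EleutherAI/gpt-j-6B",
        # model_kwargs={"low_cpu_mem_usage": True},  # item 4: faster load, but output degraded
    )

    output = generator(
        "Some prompt text",
        do_sample=True,    # item 1: terrible output beyond ~129-token contexts
        min_length=512,    # item 2: not respected past 129 tokens with do_sample=False
        max_length=1024,
    )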

Related GH Issues:
microsoft/DeepSpeed#2300
microsoft/DeepSpeed#2062
microsoft/DeepSpeed#2230
microsoft/DeepSpeed#2251
microsoft/DeepSpeed#2233

I'm not sure where to go from here and don't want to spend any more time on this without getting some more feedback.

@Oogy Oogy closed this Sep 19, 2022