
Add gpt-j-6B w/ deepspeed example #87

Closed
wants to merge 5 commits

Conversation

Oogy
Contributor

@Oogy Oogy commented Aug 8, 2022

@Oogy Oogy requested a review from salanki August 8, 2022 13:40
@Oogy Oogy changed the title Add gpt-j-6B w/ deepspeed Add gpt-j-6B w/ deepspeed example Aug 8, 2022
@Oogy Oogy requested review from wbrown and rtalaricw September 8, 2022 16:07
@Oogy
Contributor Author

Oogy commented Sep 8, 2022

Preliminary results:

./benchmark.py --url http://gpt-j-6b-deepspeed.tenant-sta-tweldon-workbench.knative.chi.coreweave.com
INFO:gpt-j-6b-benchmark-client: Connecting to http://gpt-j-6b-deepspeed.tenant-sta-tweldon-workbench.knative.chi.coreweave.com
INFO:gpt-j-6b-benchmark-client: Input Sequence Lenghth: 8
INFO:gpt-j-6b-benchmark-client: Request data: b'{"benchmark": true, "parameters": {"benchmark_sequence_length": 8, "max_length": 16}}'
INFO:gpt-j-6b-benchmark-client: Results: {"benchmark_results": {"input_sequence_length": 8, "generated_tokens": 8, "time": 0.17124605178833008}}
INFO:gpt-j-6b-benchmark-client: Input Sequence Lenghth: 8
INFO:gpt-j-6b-benchmark-client: Request data: b'{"benchmark": true, "parameters": {"benchmark_sequence_length": 8, "max_length": 1024}}'
INFO:gpt-j-6b-benchmark-client: Results: {"benchmark_results": {"input_sequence_length": 8, "generated_tokens": 1016, "time": 20.1104953289032}}
INFO:gpt-j-6b-benchmark-client: Input Sequence Lenghth: 8
INFO:gpt-j-6b-benchmark-client: Request data: b'{"benchmark": true, "parameters": {"benchmark_sequence_length": 8, "max_length": 2048}}'

Getting an error I've not encountered before on the last request in the log above, which does not finish:

    HTTPServerRequest(protocol='http', host='gpt-j-6b-deepspeed-predictor-default.tenant-sta-tweldon-workbench.svc.tenant.chi.local', method='POST', uri='/v1/models/eleutherai-gpt-j-6b:predict', version='HTTP/1.1', remote_ip='127.0.0.1')
    Traceback (most recent call last):
      File "/usr/local/lib/python3.8/dist-packages/tornado/web.py", line 1713, in _execute
        result = await result
      File "/usr/local/lib/python3.8/dist-packages/kserve/handlers/predict.py", line 70, in post
        response = await model(body)
      File "/usr/local/lib/python3.8/dist-packages/kserve/model.py", line 87, in __call__
        else self.predict(request)
      File "/app/service.py", line 224, in predict
        return self.benchmark(request_params)
      File "/app/service.py", line 255, in benchmark
        predicitions = self._predict(request_params)
      File "/app/service.py", line 208, in _predict
        "predictions": self.generator(
      File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/text_generation.py", line 176, in __call__
        return super().__call__(text_inputs, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/base.py", line 1067, in __call__
        return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
      File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/base.py", line 1075, in run_single
        outputs = self.postprocess(model_outputs, **postprocess_params)
      File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/text_generation.py", line 237, in postprocess
        text = self.tokenizer.decode(
      File "/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_base.py", line 3367, in decode
        return self._decode(
      File "/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_fast.py", line 548, in _decode
        text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
    OverflowError: out of range integral type conversion attempted
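
(For reference, the benchmark requests in the log above can be reproduced with a small client sketch like the one below. The endpoint path and payload shape are taken from the log and traceback; everything else is an assumption, not the actual benchmark.py.)

    import requests

    # Hypothetical client sketch; URL and payload shape copied from the log above.
    URL = "http://gpt-j-6b-deepspeed.tenant-sta-tweldon-workbench.knative.chi.coreweave.com"

    def benchmark_request(seq_len: int, max_len: int) -> dict:
        """POST one benchmark request and return the parsed JSON response."""
        payload = {
            "benchmark": True,
            "parameters": {"benchmark_sequence_length": seq_len, "max_length": max_len},
        }
        resp = requests.post(f"{URL}/v1/models/eleutherai-gpt-j-6b:predict", json=payload)
        resp.raise_for_status()
        return resp.json()

    # benchmark_request(8, 2048) corresponds to the request that hits the OverflowError above.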

request_params["BENCHMARK_SEQUENCE_LENGTH"] < request_params["MAX_LENGTH"]
)

sequence_start = random.randrange(
Contributor


You can iterate over these ranges:

self.input_sizes = [8, 512, 1536]
self.context_sizes = [16, 1024, 2048]

You would have to run it in a loop and take the 25th/50th/75th percentiles over 50 experiments, with a cutoff of 10 warm-up runs before measurement starts.
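
A rough sketch of that loop (hedged: the endpoint, the pairing of input/context sizes, and the helper below are assumptions, not part of this PR):

    import json
    import statistics

    import requests

    # Hypothetical endpoint; the real benchmark URL is passed to benchmark.py via --url.
    URL = "http://gpt-j-6b-deepspeed.example.com"

    input_sizes = [8, 512, 1536]
    context_sizes = [16, 1024, 2048]
    NUM_EXPERIMENTS = 50
    WARMUP = 10  # discard the first 10 runs before measuring

    def run_once(seq_len: int, max_len: int) -> float:
        """POST one benchmark request and return the reported generation time."""
        payload = {
            "benchmark": True,
            "parameters": {"benchmark_sequence_length": seq_len, "max_length": max_len},
        }
        resp = requests.post(f"{URL}/v1/models/eleutherai-gpt-j-6b:predict", json=payload)
        return resp.json()["benchmark_results"]["time"]

    summary = {}
    for seq_len, max_len in zip(input_sizes, context_sizes):  # or itertools.product(...)
        times = [run_once(seq_len, max_len) for _ in range(NUM_EXPERIMENTS)]
        measured = times[WARMUP:]
        p25, p50, p75 = statistics.quantiles(measured, n=4)  # 25th/50th/75th percentiles
        summary[f"{seq_len}x{max_len}"] = {"p25": p25, "p50": p50, "p75": p75}

    print(json.dumps(summary, indent=2))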

Comment on lines +247 to +249
sequence_end = sequence_start + request_params["BENCHMARK_SEQUENCE_LENGTH"]
random_sequence_encoded = self.dataset[sequence_start:sequence_end]
random_sequence = self.tokenizer.decode(random_sequence_encoded)
Contributor


Can you explain how you are randomizing the sequences here? Does randrange require two args? https://www.w3schools.com/python/ref_random_randrange.asp

Contributor Author


randrange() can take 1-3 args. With a single arg N, it just returns a random value between 0 and N (exclusive).

We pick a random start that is no closer than BENCHMARK_SEQUENCE_LENGTH to the end of self.dataset, so that when we add BENCHMARK_SEQUENCE_LENGTH to the start we don't go out of range.

Then we grab the slice of self.dataset from the start index to the end index.

The 3-arg form of randrange() could also be used, but being limited to increments of the step arg would reduce the number of distinct random sequences we could get.
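
A minimal sketch of that sampling (dataset and tokenizer stand in for self.dataset and self.tokenizer; the function name is made up):

    import random

    def sample_random_sequence(dataset, tokenizer, seq_len):
        """Decode a random seq_len-token slice of a pre-tokenized dataset."""
        # Pick a start at most len(dataset) - seq_len in, so start + seq_len stays in range.
        start = random.randrange(len(dataset) - seq_len)
        end = start + seq_len
        return tokenizer.decode(dataset[start:end])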

Comment on lines +262 to +268
return {
    "benchmark_results": {
        "input_sequence_length": request_params["BENCHMARK_SEQUENCE_LENGTH"],
        "generated_tokens": request_params["MAX_LENGTH"]
        - request_params["BENCHMARK_SEQUENCE_LENGTH"],
        "time": generation_time,
    }
Contributor


I think iterating over the sequences will change this. You can append each result to a JSON file and read it back at the end to get the quantiles across the experiments you run.
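
A minimal sketch of that append-then-summarize flow (using JSON Lines so each result can be appended; the file name is made up):

    import json
    import statistics

    RESULTS_FILE = "benchmark_results.jsonl"  # hypothetical path, one JSON object per line

    def append_result(result: dict) -> None:
        """Append one benchmark_results dict as a line of JSON."""
        with open(RESULTS_FILE, "a") as f:
            f.write(json.dumps(result) + "\n")

    def summarize() -> dict:
        """Read every result back and compute generation-time quantiles."""
        with open(RESULTS_FILE) as f:
            times = [json.loads(line)["time"] for line in f if line.strip()]
        p25, p50, p75 = statistics.quantiles(times, n=4)
        return {"p25": p25, "p50": p50, "p75": p75}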

@Oogy
Contributor Author

Oogy commented Sep 19, 2022

Closing pending DeepSpeed updates/improvements:

  1. Can't use do_sample=True beyond 129-token contexts without getting terrible output; do_sample=False can handle larger contexts, but its output is useless (see the parameter sketch after this list).
  2. With do_sample=False, the min/max length generation parameters are not respected beyond 129 tokens, and generation eventually fails to produce output longer than ~570 tokens.
  3. Can't batch. I can't find any issues specific to GPT-J batching, but batching issues with other models are reported. Additionally, given the context limitations above, the speed-up for single prompts now seems less worth it.
  4. Can't use low_cpu_mem_usage=True to speed up model loading; the output is bad. It's not required, but it brings model loading time down from a little over 2.5 minutes to a little over a minute.
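
For context, a sketch of the kind of transformers pipeline call these parameters belong to (values are made up for illustration; this is not the service's actual code, which additionally wraps the model with DeepSpeed inference):

    from transformers import pipeline

    # Illustrative only; the values below are placeholders.
    generator = pipeline(
        "text-generation",
        model="EleutherAI/gpt-j-6B",
        # model_kwargs={"low_cpu_mem_usage": True},  # item 4: faster load, but output degraded
    )

    output = generator(
        "Some prompt text",
        do_sample=True,    # item 1: terrible output beyond ~129-token contexts
        min_length=512,    # item 2: not respected past 129 tokens with do_sample=False
        max_length=1024,
    )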

Related GH Issues:
microsoft/DeepSpeed#2300
microsoft/DeepSpeed#2062
microsoft/DeepSpeed#2230
microsoft/DeepSpeed#2251
microsoft/DeepSpeed#2233

I'm not sure where to go from here and don't want to spend any more time on this without getting some more feedback.

@Oogy Oogy closed this Sep 19, 2022