Investigate storing results from ggml operations in F16 format #959
Labels: help wanted, high priority, performance, research 🔬
Currently, all `ggml` operations return the results in F32 format. The goal of this task is to see if there is an elegant way to add support for keeping the results in F16 format.
The result type would ideally be passed as a parameter to the `ggml_context`, and the change will also involve adding support for F16 operands in most of the existing operators. Ideally, we want to achieve this without duplicating the entire code base.

Note that internal floating-point accumulators in the different operations can and should remain in F32 format.
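One possible shape for the context parameter, sketched under the assumption that it would live in `ggml_init_params` (the `f16_results` field is hypothetical and not part of the current API):

```c
// Hypothetical extension of ggml_init_params -- the f16_results field
// does not exist in ggml and is shown here only as a sketch.
struct ggml_init_params {
    size_t mem_size;    // bytes
    void * mem_buffer;  // if NULL, memory will be allocated internally
    bool   no_alloc;    // don't allocate memory for the tensor data
    bool   f16_results; // NEW (hypothetical): operators return F16 dst tensors
};

struct ggml_init_params params = {
    .mem_size    = 128*1024*1024,
    .mem_buffer  = NULL,
    .no_alloc    = false,
    .f16_results = true, // request F16 intermediate results
};

struct ggml_context * ctx = ggml_init(params);
```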
Only when we store the results into the `dst` tensor would we cast them to F16.

Moving to F16 intermediate results would significantly reduce memory pressure and could lead to significant speed improvements. Hopefully, the loss in quality would be marginal, but in any case, there will always be the option of switching back to full F32 precision.
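For illustration, here is a minimal sketch of what the store path of a simple elementwise operator could look like, assuming the existing `ggml_fp32_to_fp16()` conversion helper and a contiguous `dst` (the function name is made up for the sketch):

```c
#include "ggml.h"

// Sketch: a scale op whose accumulator stays in F32; the result is
// casted to F16 only on the final store into dst (contiguous assumed).
static void compute_forward_scale_sketch(
        const struct ggml_tensor * src, struct ggml_tensor * dst, float s) {
    const int64_t n = (int64_t) ggml_nelements(src);
    const float * x = (const float *) src->data;

    if (dst->type == GGML_TYPE_F16) {
        ggml_fp16_t * y = (ggml_fp16_t *) dst->data;
        for (int64_t i = 0; i < n; i++) {
            const float acc = x[i]*s;      // compute in F32
            y[i] = ggml_fp32_to_fp16(acc); // cast only when storing
        }
    } else {
        float * y = (float *) dst->data;
        for (int64_t i = 0; i < n; i++) {
            y[i] = x[i]*s;                 // existing F32 path
        }
    }
}
```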
I am looking for suggestions and initial prototypes of how we can achieve this in an elegant way.
Related:
Edit: An initial quick-and-dirty implementation that simply goes over the existing LLaMA-related operators and changes the return type to F16 would be useful to determine if such functionality is useful and how much performance gain we can expect. If it is worth it, we can then think in more detail about how exactly to support it.
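Concretely, the quick-and-dirty variant would amount to a one-line change per operator, flipping the type of the result tensor it creates (a paraphrased sketch, not the exact upstream code):

```c
// In a graph-building function such as ggml_mul_mat(), the result
// tensor is currently created with a hard-coded F32 type:
struct ggml_tensor * result = ggml_new_tensor(ctx, GGML_TYPE_F32, n_dims, ne);

// The quick-and-dirty experiment simply switches the type, e.g.:
// struct ggml_tensor * result = ggml_new_tensor(ctx, GGML_TYPE_F16, n_dims, ne);
```

Running the existing LLaMA graph with this change and comparing perplexity and tokens/sec against the F32 baseline should be enough to judge whether the full design is worth pursuing.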