Directly send tensor via jit serialization #3088

Merged
merged 21 commits into NVIDIA:main on Dec 13, 2024
Conversation

ZiyueXu77
Collaborator

Fixes # .

Description

Directly send tensor without converting to numpy
Using jit serialization to avoid pickle
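
As a rough illustration of the approach described above, here is a minimal sketch of TorchScript-based tensor serialization that avoids pickle. The helper names (_TensorWrapper, tensor_to_bytes, bytes_to_tensor) are hypothetical and not necessarily what this PR implements.

```python
import io

import torch


class _TensorWrapper(torch.nn.Module):
    """Holds a single tensor so it can be saved with torch.jit (no pickle)."""

    def __init__(self, tensor: torch.Tensor):
        super().__init__()
        # A registered buffer is guaranteed to be stored in the TorchScript archive.
        self.register_buffer("value", tensor)

    def forward(self) -> torch.Tensor:
        return self.value


def tensor_to_bytes(tensor: torch.Tensor) -> bytes:
    """Serialize a tensor of any dtype (including bfloat16) without pickle."""
    buffer = io.BytesIO()
    torch.jit.save(torch.jit.script(_TensorWrapper(tensor)), buffer)
    return buffer.getvalue()


def bytes_to_tensor(data: bytes) -> torch.Tensor:
    """Recover the tensor on the receiving side."""
    return torch.jit.load(io.BytesIO(data))()
```

Because TorchScript stores the raw tensor data, dtypes with no numpy equivalent (such as torch.bfloat16) survive the round trip unchanged.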

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Quick tests passed locally by running ./runtest.sh.
  • In-line docstrings updated.
  • Documentation updated.

@ZiyueXu77 ZiyueXu77 marked this pull request as draft December 2, 2024 21:57
@ZiyueXu77 ZiyueXu77 marked this pull request as ready for review December 3, 2024 22:16
Collaborator

@chesterxgchen chesterxgchen left a comment


The logic seems to be only for LLM BF16. I think what we want to achieve is support for all tensors, whether they are LLM BF16 or not.

@ZiyueXu77
Collaborator Author

The logic seems to be only for LLM BF16. I think what we want to achieve is support for all tensors, whether they are LLM BF16 or not.

I think we are mixing up two processes: conversion for filtering, and conversion for communication.

Consider local-to-server communication (the reverse direction is similar) with quantization:
local model --> to_nvflare_converter --> quant_filter --> (decomposer) --> communication --> (composer) --> dequant_filter --> global

Currently our client API executor uses PTtoNumpy as the default to_nvflare_converter, so everything afterwards is in numpy, including the serialization step, and the tensor decomposer/composer is never called.

Now, if we use a simple "pass-through" to_nvflare_converter instead of PTtoNumpy, there are two implications:

  1. Filters need to handle tensors properly.
  2. A decomposer and composer will be needed to handle tensor communication, and currently that again goes through numpy.

Hence there are two places where tensor<->numpy conversion can happen: the converter for filters, and the decomposer for communication. The first means all subsequent computations (filters) run in numpy, while the second means only the communication/serialization goes through numpy - the data is recovered as tensors once received, so the whole pipeline is "virtually" still in tensors.

For the sake of serialization efficiency, my guess is that numpy may be more efficient than jit (@nvidianz to confirm); in that case jit is only needed for formats numpy does not support (e.g. bf16). Otherwise, we can use jit for all cases (and maybe "safe tensor" as suggested).
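
As a rough sketch of the two to_nvflare_converter options discussed above (function names are illustrative, not NVFLARE's actual classes):

```python
from typing import Dict

import numpy as np
import torch


def pt_to_numpy_converter(params: Dict[str, torch.Tensor]) -> Dict[str, np.ndarray]:
    """Filter path: everything downstream (filters, serialization) sees numpy."""
    out = {}
    for name, tensor in params.items():
        t = tensor.detach().cpu()
        if t.dtype == torch.bfloat16:
            # numpy has no bf16 dtype, so an upcast (and a 2x size increase) happens here
            t = t.to(torch.float32)
        out[name] = t.numpy()
    return out


def pass_through_converter(params: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
    """Tensor path: filters and the wire format both see native tensors, so filters
    must handle torch.Tensor and a tensor decomposer/composer is needed."""
    return params
```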

@chesterxgchen
Collaborator

(quoting @ZiyueXu77's reply above)

The reason for avoiding the to_numpy() conversion is to avoid losing the tensor compression ratio, i.e., to make sure the model does not grow after transfer. It doesn't matter whether this conversion happens in a filter or elsewhere: if we serialize the tensor with JIT in one place but call to_numpy() in another place before sending it over the wire, we have already lost the compression and the JIT serialization becomes pointless.

We need tensors to stay native throughout the communication pipeline.

@ZiyueXu77
Collaborator Author

ZiyueXu77 commented Dec 4, 2024

(quoting the exchange above)

No, this is not the case. Using numpy + jit for serialization will not lead to a bigger message; only the conversion for filtering purposes will, because there we want everything in numpy and therefore have to cast bf16 to float32 so that it can be converted.
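
A quick back-of-the-envelope check of this point (the wrapper class is illustrative and sizes are approximate): serializing bf16 natively keeps 2 bytes per element, while the float32 cast needed for numpy doubles it.

```python
import io

import torch


class _Wrap(torch.nn.Module):
    def __init__(self, t: torch.Tensor):
        super().__init__()
        self.register_buffer("t", t)

    def forward(self) -> torch.Tensor:
        return self.t


# A 1M-element bfloat16 tensor occupies 2 bytes per element in memory.
bf16_tensor = torch.randn(1_000_000).to(torch.bfloat16)

# Filter path: casting to float32 so numpy can hold it doubles the payload (~4 MB).
print("float32/numpy bytes:", bf16_tensor.to(torch.float32).numpy().nbytes)

# Serialization path: TorchScript keeps bf16 native, so the payload stays at
# ~2 MB plus a small archive overhead - the message does not get bigger.
buffer = io.BytesIO()
torch.jit.save(torch.jit.script(_Wrap(bf16_tensor)), buffer)
print("jit-serialized bytes:", len(buffer.getvalue()))
```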

nvidianz
nvidianz previously approved these changes Dec 11, 2024
@ZiyueXu77
Collaborator Author

/build

@ZiyueXu77 ZiyueXu77 enabled auto-merge (squash) December 11, 2024 21:42
nvidianz
nvidianz previously approved these changes Dec 13, 2024
@ZiyueXu77
Collaborator Author

/build

Collaborator

@chesterxgchen chesterxgchen left a comment


Need to change the package path.

@ZiyueXu77
Collaborator Author

/build

Collaborator

@chesterxgchen chesterxgchen left a comment


LGTM

@ZiyueXu77 ZiyueXu77 merged commit 38157c3 into NVIDIA:main Dec 13, 2024
20 checks passed