Fix the tensor-slicing copy for qkv parameters #2198

RezaYazdaniAminabadi · 2022-08-09T02:12:11Z

This addresses #2184 and #2113

@RezaYazdaniAminabadi

* Fix the layer-past for GPT based models (microsoft#2196) * Add gradient_average flag support for sparse grads (microsoft#2188) * Add gradient_average flag support for sparse grads * formatting fixes * Add tests Co-authored-by: Olatunji Ruwase <[email protected]> * Adding additional instructiosn in the compression tutorial on pre-training distillation and quantization for GPT (microsoft#2197) Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Jeff Rasley <[email protected]> * Log user config exactly (microsoft#2201) * Fix the tensor-slicing copy for qkv parameters (microsoft#2198) Co-authored-by: Olatunji Ruwase <[email protected]> * Refactor Distributed Tests (microsoft#2180) Refactor Distributed unit tests * fix table syntax (microsoft#2204) Co-authored-by: Conglong Li <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> * Correctly detect offload configuration (microsoft#2208) Co-authored-by: Jeff Rasley <[email protected]> * add cuda 11.7 (microsoft#2211) * add cuda 11.7 * formatting * use torch 1.9 (microsoft#2215) * [zero-3] print warning once and support torch parameter (microsoft#2127) * print warning only once. * add support for torch param and only warn on gpu 0 * remove type checking. will be done on a new PR with more tests. Co-authored-by: Jeff Rasley <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> * Add support of OPT models (microsoft#2205) * add opt replace policy * simplify inf. api * fix opt replace policy * fix use-cash & add relu * Add support of custom MLP act. function * Revert "simplify inf. api" This reverts commit 9e910fc. * fix the inference API (temp. solution) * fix code formatting * add unit tests for OPT models. * refactor pre-attention layer norm configuration * add support of opt-350m model * refactor the HF model config initialization * fix hf model config issue Co-authored-by: Reza Yazdani <[email protected]> Co-authored-by: Jeff Rasley <[email protected]> Co-authored-by: Reza Yazdani <[email protected]> * fix typos in readme. (microsoft#2218) Co-authored-by: Olatunji Ruwase <[email protected]> * [device abstraction] add device abstraction to allow other device than CUDA be used * Fix regression w. dist_init_required (microsoft#2225) * add doc for new bert example (microsoft#2224) * Remove the random-generator from context during inference (microsoft#2228) * Fix the tensor-slicing copy for qkv parameters * remove the random-generator from context during inference * formatting Co-authored-by: Jeff Rasley <[email protected]> * allow saving ckpt w/o ckpt json + bloom copy fix (microsoft#2237) * Correctly detect zero_offload (microsoft#2213) * Correctly detect offload configuration * Correctly detect offload configuration * Handle deprecated cpu offload setting * Correcly detect zero_offload setting * Minor tweak Co-authored-by: Jeff Rasley <[email protected]> Co-authored-by: Ammar Ahmad Awan <[email protected]> * update videos (microsoft#2249) * Refactor dist tests: Checkpointing (microsoft#2202) Refactor distributed tests: checkpointing Co-authored-by: Michael Wyatt <[email protected]> * Make OPT policy backward compatible with pre-OPT transformers versions (microsoft#2254) * fix ds-inference without policy (microsoft#2247) Co-authored-by: Jeff Rasley <[email protected]> * bump to 0.7.2 * Enable contiguous gradients with Z1+MoE (microsoft#2250) MoE training with zero stage 1 only works with `contiguous gradients=True`. * [rebase-202208] additional changes needed when rebase to 202208 * [rebase] cleanup direct cuda usage after merge * Correctly detect CPU optimizer usage (microsoft#2257) * Correctly detect CPU optimizer usage * Update nv-transformers-v100.yml (microsoft#2259) Co-authored-by: Jeff Rasley <[email protected]> * [precommit] fix pre-commit issues * Update half precision header guards (microsoft#2261) * fix microsoft#2240: wrong time unit in flops_profiler (microsoft#2241) Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Jeff Rasley <[email protected]> * bump to 0.7.3 * Add blob storage to CI runners (microsoft#2260) Add blob storage to CI runners and enable for transformers cache on inference tests * Update replace_module.py, test-gptj.py related fix (microsoft#2269) Fix RuntimeError: Boolean value of Tensor with more than one value is ambiguous when running test-gptj.py * Fix OrderedDict import for python3.6 (microsoft#2267) Co-authored-by: Olatunji Ruwase <[email protected]> * Ds inference/fix mp2 (microsoft#2270) * Trajepl: nebula load fix (microsoft#2182) Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: chenguo <[email protected]> * prevent torch ext folder mkdir at tmp (microsoft#2274) * Ds-inference Int8 support through ZeroQuant technology (microsoft#2217) Co-authored-by: Jeff Rasley <[email protected]> * add a new unit test for cuda ops (microsoft#2278) Co-authored-by: cmikeh2 <[email protected]> * Add to codeowners file (microsoft#2279) * [pin_memory] make pin_memory select device type * Memory Access Utility (microsoft#2276) Co-authored-by: Ammar Ahmad Awan <[email protected]> * Fp32 accuracy bug fix (microsoft#2285) Co-authored-by: Arash Bakhtiari <[email protected]> Co-authored-by: Arash Bakhtiari <[email protected]> * Refactor universal checkpointing and tensor fragments (microsoft#2253) * Refactor universal checkpointing and tensor fragments * Formatting * [ds-inference] fix progress bar (microsoft#2286) when loading the non-sharded checkpoint update the progress bar (fix by @RezaYazdaniAminabadi) - I've just tested it to work. Co-authored-by: Olatunji Ruwase <[email protected]> * Offload all gradients to nvme (microsoft#2282) * fused bias relu unittest (microsoft#2297) * fix for pytest picking up local deepspeed dir instead of installed deepspeed (microsoft#2299) * Fix for Zero3 when MP>1 and at least one batch param undefined (microsoft#2289) Co-authored-by: anthony.301 <[email protected]> Co-authored-by: Jeff Rasley <[email protected]> * [downstream] merge from xpu support downstream * Unit test for bias add kernel (microsoft#2298) * added unit test * Update pt_binding.cpp * formatting * Update test_bias_add.py * Update relu.cu with mem_access_utils (microsoft#2306) * Add tensor parallel inference unit tests (microsoft#2232) Co-authored-by: Reza Yazdani <[email protected]> Co-authored-by: Jeff Rasley <[email protected]> Co-authored-by: Sam Ade Jacobs <[email protected]> * Fix the residual add mp scaling for GPTNeoX (microsoft#2310) * Add unit tests for residual_add kernels (microsoft#2307) * add inference eval scripts (microsoft#2303) * Upgrade P40 tests to torch 1.8 (microsoft#2316) Co-authored-by: Jeff Rasley <[email protected]> * ZeRO-Inference blog (microsoft#2271) * ZeRO-Inference blog * ZeRO-Inference blog * Format fixes * Apply feedback * Feedback * Update docs/_posts/2022-08-27-zero-inference.md Co-authored-by: Stas Bekman <[email protected]> * Update docs/_posts/2022-08-27-zero-inference.md Co-authored-by: Stas Bekman <[email protected]> * Address feedback * Format fixes * More tweaks * long sequence, nvme offload * Add image Co-authored-by: Stas Bekman <[email protected]> Co-authored-by: Jeff Rasley <[email protected]> * ZeRO-Inference blog - wrap up (microsoft#2321) * ZeRO-Inference blog - Update README (microsoft#2322) * refactor to use mem_access (microsoft#2317) * add quant unit test (microsoft#2315) * add quant unit test * add codeowner * format fix * fix undefined symbol: curandSetPseudoRandomGeneratorSeed * modify ref fn name and add comment * add comments * add 4bit quant 16groups * fix * modify groups in ref code * parameterize tensor shape * single param * detach tensor * remove -lcurand flag * add back -lcurand flag Co-authored-by: Ammar Ahmad Awan <[email protected]> * only override forward if using cuda-graph (microsoft#2291) * Add more options to inference benchmark (microsoft#2325) * bump to 0.7.4 * MOE residual matmult unit test (microsoft#2323) MOE residual matmul unit tests Co-authored-by: Michael Wyatt <[email protected]> Co-authored-by: Ammar Ahmad Awan <[email protected]> * [device] port cuda device to literal_device() in new tests * MOE matmult with memaccess (microsoft#2336) * Fix formatting * Remove redundant variable * Refactor residual add kernels (microsoft#2333) Co-authored-by: Ammar Ahmad Awan <[email protected]> * [accel_runtime] add pin_memory to accelerator runtime interface. * mem access for quantize kernel (microsoft#2331) * mem access for quantize kernel * format * format fp32 * modify quant kernel * modify quant kernel2 * modify format * format * fix comments in pytest * fix comments in pytest * format * rerun Co-authored-by: Ammar Ahmad Awan <[email protected]> Co-authored-by: Connor Holmes <[email protected]> * increase min pre-commit versions (microsoft#2346) * Extend scratch buffer for long prompts (microsoft#2212) Co-authored-by: Reza Yazdani <[email protected]> Co-authored-by: Reza Yazdani <[email protected]> Co-authored-by: Jeff Rasley <[email protected]> * fix zero docs (microsoft#2350) * Inference profiling updates/fixes (microsoft#2348) (microsoft#2349) Co-authored-by: Jeff Rasley <[email protected]> Co-authored-by: Michael Wyatt <[email protected]> * Kernel Data Conversion Utility (microsoft#2327) * Unify macro definitions and constants in a single file * Conversion utility implementation. * Fix reversion from formatting * Bugfixes after testing with correct DeepSpeed * Inline markers are available on both HIP + CUDA * Add Onebit Optimzers in __init__ (microsoft#2340) Co-authored-by: Saeyeol Lee <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> * [accelerator abstraction] merge from microsoft#2320 * docs(mixture-of-experts-inference): fix typo in tuto (microsoft#2345) Co-authored-by: Olatunji Ruwase <[email protected]> * download cifar to blob storage (microsoft#2342) Co-authored-by: Olatunji Ruwase <[email protected]> * Refactor gptj_residual_add kernels for better readability (microsoft#2358) Co-authored-by: Reza Yazdani <[email protected]> * Updated issue templates (microsoft#2363) * Update issue templates * fix cuda invalid config error in dequant kernel (microsoft#2362) * format * remove round fn * Add missing pytest fixture scope (microsoft#2353) Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Michael Wyatt <[email protected]> * Extend residual_add kernel tests to conver pre_attn_norm (microsoft#2354) Co-authored-by: Jeff Rasley <[email protected]> * Refactor fused_bias_residual kernels for better readability (microsoft#2356) Co-authored-by: Olatunji Ruwase <[email protected]> * Capture error message during sweep tests (microsoft#2351) * Collect error messages in results.csv Co-authored-by: Michael Wyatt <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> * fix an exception when recursively casting dicts to fp16 (microsoft#2370) * Refactor remaining distributed tests (microsoft#2216) * batch of refactored tests * more test refactoring * fp16 test refactor * more refactors * added DistributedFixture class * applied DistributedFixture to first batch of tests as a trial * added DistributedFixture test and documentation * last tests * fixes for refactored tests * remove subdirs in workflow files * fix pytest syntax error * fix another syntax error * update imports * use DistFixture with elastic checkpoint test * missing import * update to shared class tmpdir for elastic test * moved test files * avoid duplicate test file name * last refactor and moving test files * formatting * fix broken import * testing forked AMD tests * update abstract method * use blob storage for accelerate and transformers tests * upgrade torch for acclerate CI Co-authored-by: Olatunji Ruwase <[email protected]> * Fix the MLP output tensor's shape (microsoft#2380) * allow building with latest CUDA (11.8), it is backwards compatible (microsoft#2390) * pin transformers version for unit tests (microsoft#2402) * Change type to tuple in replace_wo_policy isinstance check (microsoft#2387) Update the isinstance check inside the `replace_wo_policy` function to `tuple` and `str` instead of `dict`, since the layers are provided as a `tuple` type. Co-authored-by: Lev Kurilenko <[email protected]> Co-authored-by: Molly Smith <[email protected]> Co-authored-by: Lok Chand Koppaka <[email protected]> Co-authored-by: Ammar Ahmad Awan <[email protected]> * Checkpoint backwards-compatbility workaround (microsoft#2384) * Add predicated global load (microsoft#2373) Co-authored-by: Reza Yazdani <[email protected]> * change call site of literal_device, on_accel_device and accel_runtime to get_accelerator() call * add new interface definition from olruwase/accelerator_abstraction * MII blog post (microsoft#2418) Co-authored-by: Samyam Rajbhandari <[email protected]> Co-authored-by: Ammar Ahmad Awan <[email protected]> * Fix figure reference (microsoft#2419) * [docs] update news items * [docs] add mii repo link * Add SLURM Multinode Runner (microsoft#2404) Signed-off-by: Dashiell Stander <[email protected]> Co-authored-by: Dashiell Stander <[email protected]> Co-authored-by: Jeff Rasley <[email protected]> * Fix issue with corrupted output on long generation for GPT (microsoft#2359) Co-authored-by: Reza Yazdani <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> * MII blog title update on Readme * DeepSpeed-MII title change in website * Fix GPT Neo-X multi-gpu inference (microsoft#2401) Co-authored-by: Reza Yazdani <[email protected]> Co-authored-by: Jeff Rasley <[email protected]> * MII-Public and MII-Azure subheading in mii post * CI fixes related to triton (microsoft#2422) * [docs] update mii blog title (microsoft#2423) * add SD injection policy (microsoft#2381) Co-authored-by: Reza Yazdani <[email protected]> Co-authored-by: Reza Yazdani <[email protected]> * [accelerator abstraction] remove name() from interface, device_name() should be used. * merge with master (ec13da6) * fix checkpoint loading when it is a dictionary (microsoft#2425) * Make error regex more generic in collect_results.py (microsoft#2415) Co-authored-by: Jeff Rasley <[email protected]> * fixes microsoft#2389 (microsoft#2411) truncating expert param storage for checkpointing Co-authored-by: Alexander Jipa <[email protected]> Co-authored-by: Michael Wyatt <[email protected]> * Fix for inference gpt-j test (microsoft#2430) * fix for gpt-j failing due to tokenizer error * limit number of gpt-j tokens generated due to low memory * Fixing bug 2361 (microsoft#2410) * fixing bug 2361 * adding pytest for config initialization * chaning expected output to FusedAdam * remove print statement * running yapf on modified files * running pre-commit formatting Co-authored-by: Olatunji Ruwase <[email protected]> * Universal checkpoint for zero stage 1 (microsoft#2284) * Refactor universal checkpointing and tensor fragments * Formatting * Support zero stage1; Expand TP dim * Remove debug prints * Detect sharded optimizer state * Format fixes * Encode reshaping guide * More symbolic constants Co-authored-by: Michael Wyatt <[email protected]> * only add deps if extra is explictly called (microsoft#2432) * Add TestInjectionPolicy inference unittest class for testing custom injection policies (microsoft#2426) This PR adds a TestInjectionPolicy inference unittest class for testing custom injection policies. This test differs from the existing tests in that the injection_policy dictionary is explicitly specified when calling the DeepSpeed init_inference API. The google/t5-v1_1-small text2text-generation model and the roberta-large fill-mask model are added as tests with the injection policy explicitly specified. This is done to expand our unittest coverage to test the path where the replace_wo_policy function is invoked (see microsoftGH-2387). Co-authored-by: Lev Kurilenko <[email protected]> Co-authored-by: Michael Wyatt <[email protected]> * [memory estimators] new config args sync (microsoft#2431) Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Jeff Rasley <[email protected]> * parallelize writing of layer checkpoint files across data parallel instances (microsoft#1419) * parallelize layer checkpoints across data parallel groups * use partition_uniform to determine start/end index values * formatting fix * config: add option for parallel write of layer checkpoints in pipeline stage * yapf fixes * enable parallel layer write according to config param * avoid extraneous makedir when rank 0 writes all layers Co-authored-by: Olatunji Ruwase <[email protected]> * Fix broken link to DeepSpeed Megatron fork (microsoft#2440) Co-authored-by: Lev Kurilenko <[email protected]> * bump to 0.7.5 * [OpBuilder] Add op builder abstraction * convert op builder usage in merged code * merge diff files from upstream * [OpBuilder] add create_op_builder interface in abstract_accelerator.py * remove files that is deleted from upstream * [OpBuilder] add left over op builder usage in tests * [OpBuilder] fix op builder usage in tests * [OpBuilder] fix <op builder>.NAME usage in tests to follow op builder abstraction design * import get_accelerator from deepspeed.accelerator directly * [OpBuilder] remove unused function and sync with main * add missing import * revert changes in device.py to avoid conflict with main * fix alexnet_model to use /tmp instead of /blob * Mingzhi/solve pr108 b (microsoft#115) * move ALL_OPs from __init__.py to all_Op.py to solve circular import * delete deepspeedexamples * fix import * fix regression (microsoft#117) * fix pin_memory * fix regression * fix error Signed-off-by: Dashiell Stander <[email protected]> Co-authored-by: Reza Yazdani <[email protected]> Co-authored-by: Mikhail Druzhinin <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Minjia Zhang <[email protected]> Co-authored-by: Jeff Rasley <[email protected]> Co-authored-by: Michael Wyatt <[email protected]> Co-authored-by: Kamal Raj <[email protected]> Co-authored-by: Conglong Li <[email protected]> Co-authored-by: Ammar Ahmad Awan <[email protected]> Co-authored-by: Arash Bakhtiari <[email protected]> Co-authored-by: Reza Yazdani <[email protected]> Co-authored-by: Zhihong Chen <[email protected]> Co-authored-by: Siddharth Singh <[email protected]> Co-authored-by: Connor Holmes <[email protected]> Co-authored-by: 叶志晟 <[email protected]> Co-authored-by: Molly Smith <[email protected]> Co-authored-by: trajep <[email protected]> Co-authored-by: chenguo <[email protected]> Co-authored-by: Arash Bakhtiari <[email protected]> Co-authored-by: Stas Bekman <[email protected]> Co-authored-by: Quentin Anthony <[email protected]> Co-authored-by: anthony.301 <[email protected]> Co-authored-by: Sam Ade Jacobs <[email protected]> Co-authored-by: Guanhua Wang <[email protected]> Co-authored-by: Saeyeol Lee <[email protected]> Co-authored-by: Saeyeol Lee <[email protected]> Co-authored-by: Jean-Louis Queguiner <[email protected]> Co-authored-by: Matt Smith <[email protected]> Co-authored-by: Thomas-MMJ <[email protected]> Co-authored-by: lekurile <[email protected]> Co-authored-by: Lev Kurilenko <[email protected]> Co-authored-by: Molly Smith <[email protected]> Co-authored-by: Lok Chand Koppaka <[email protected]> Co-authored-by: Samyam Rajbhandari <[email protected]> Co-authored-by: Dashiell Stander <[email protected]> Co-authored-by: Dashiell Stander <[email protected]> Co-authored-by: Andrey Chernykh <[email protected]> Co-authored-by: Alexander Jipa <[email protected]> Co-authored-by: Alexander Jipa <[email protected]> Co-authored-by: Joe Mayer <[email protected]> Co-authored-by: Adam Moody <[email protected]> Co-authored-by: AGUL <[email protected]>

Fix the tensor-slicing copy for qkv parameters

73fc030

RezaYazdaniAminabadi requested review from jeffra, samyam, tjruwase, ShadenSmith, conglongli, awan-10, cli99, eltonzheng, minjiaz, duli2012, mrwyattii, yaozhewei, arashb, xiaoxiawu-microsoft and samadejacobs as code owners August 9, 2022 02:12

tjruwase added 2 commits August 9, 2022 05:14

Merge branch 'master' into ds-inference/fixmp

1132f9a

Merge branch 'master' into ds-inference/fixmp

f2b1c85

tjruwase approved these changes Aug 9, 2022

View reviewed changes

Merge branch 'master' into ds-inference/fixmp

29e0249

jeffra merged commit 8920308 into master Aug 10, 2022

jeffra deleted the ds-inference/fixmp branch August 10, 2022 16:35

mallorbc mentioned this pull request Aug 18, 2022

[BUG][master branch] garbage GPTJ output for multi-gpu inference #2233

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix the tensor-slicing copy for qkv parameters #2198

Fix the tensor-slicing copy for qkv parameters #2198

RezaYazdaniAminabadi commented Aug 9, 2022

Fix the tensor-slicing copy for qkv parameters #2198

Fix the tensor-slicing copy for qkv parameters #2198

Conversation

RezaYazdaniAminabadi commented Aug 9, 2022