Fixing file permissions #1
Merged
Conversation
ShadenSmith changed the title from "Adding executable perms to install.sh" to "Fixing file permissions" on Feb 3, 2020.
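(Hedged aside, not taken from the PR itself: the original title, "Adding executable perms to install.sh", suggests the change marks install.sh as executable. A minimal sketch of setting that bit programmatically via the POSIX chmod() call is shown below; in a git checkout the same result is usually recorded with git update-index --chmod=+x install.sh, and the PR may well have done it differently.)

    // Sketch only: mark install.sh executable via POSIX chmod().
    // Illustrates the permission change named in the PR title,
    // not the mechanism this PR actually used.
    #include <sys/stat.h>
    #include <cstdio>

    int main() {
        // 0755 = owner rwx, group/other rx -- a typical mode for an install script.
        if (chmod("install.sh", 0755) != 0) {
            std::perror("chmod install.sh");
            return 1;
        }
        return 0;
    }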
ShadenSmith force-pushed the install_perms branch from cf59f7f to 9bb70fb on February 3, 2020 at 18:54.
kouml pushed a commit to kouml/DeepSpeed that referenced this pull request on Apr 3, 2020:
Fixing file permissions.
ShadenSmith referenced this pull request in ShadenSmith/DeepSpeed on Sep 10, 2020:
* Tied module indexing bugfix.
* Train and inference pipeline schedules.
* Move code quality tests to Azure-hosted agents. (microsoft#368)
cli99 added a commit that referenced this pull request on Jan 13, 2021:
Co-authored-by: Cheng Li <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
rraminen pushed a commit to rraminen/DeepSpeed that referenced this pull request on Apr 28, 2021:
…cript Added ds_train_bert_bsz32k_seq512_pipeclean.sh
liamcli referenced this pull request in determined-ai/DeepSpeed on Sep 27, 2021:
Fix all Pipeline Module Parameters being sent to cuda:0
pengwa pushed a commit to pengwa/DeepSpeed that referenced this pull request on Oct 14, 2022:
* threaded tf_dl+presplit sentences+shuffled dataset with resume
* elaborate in readme
pengwa pushed a commit to pengwa/DeepSpeed that referenced this pull request on Oct 14, 2022:
Megatron + DeepSpeed + Pipeline Parallelism
pengwa pushed a commit to pengwa/DeepSpeed that referenced this pull request on Oct 14, 2022:
* Enable Megatron-LM workload on ROCm (microsoft#1)
* Enable Megatron workload on ROCm
* Added ds_pretrain_gpt_350M_dense_pipeclean.sh
* removed a file
* Removed an extra line
* Fix to resolve the below rsqrtf() error on ROCm:
  /root/Megatron-DeepSpeed/megatron/fused_kernels/layer_norm_hip_kernel.hip:298:10: error: no matching function for call to 'rsqrtf'
      return rsqrtf(v);
             ^~~~~~
  /opt/rocm-5.2.0/llvm/lib/clang/14.0.0/include/__clang_hip_math.h:521:7: note: candidate function not viable: call to __device__ function from __host__ function
      float rsqrtf(float __x) { return __ocml_rsqrt_f32(__x); }
      ^
* Simplified code
* Simplified the code
* Removed extra spaces
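(Hedged aside, not taken from the referenced commit: the quoted diagnostic says ROCm's rsqrtf() is a __device__-only function being called from host code. One common way to resolve that class of error is a host/device guard like the sketch below; rsqrt_portable() is a hypothetical helper name, and the actual fix in the commit may differ.)

    // Sketch of a host/device guard for the rsqrtf() error quoted above.
    // Compile with hipcc or nvcc; rsqrt_portable() is not from the DeepSpeed sources.
    #include <cmath>

    __host__ __device__ inline float rsqrt_portable(float v) {
    #if defined(__HIP_DEVICE_COMPILE__) || defined(__CUDA_ARCH__)
        return rsqrtf(v);            // device intrinsic, only valid in device compilation
    #else
        return 1.0f / std::sqrt(v);  // plain host fallback
    #endif
    }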
guoyejun pushed a commit to guoyejun/DeepSpeed that referenced this pull request on Nov 10, 2022:
don't gather partitioned activations for mp size 1 (microsoft#2454)
This was referenced Apr 28, 2023
loadams pushed a commit that referenced this pull request on Mar 6, 2024:
* Add workspace capability to DSKernel
* Add to injection pipeline
* Validated
loadams pushed a commit that referenced this pull request on Mar 6, 2024:
* Initialize the fp6-quant-kernel integration.
* Add necessary parameters of kernel interfaces and the linear layer selection logic.
* upload kernel code
* The simple script for debugging.
* fix typo
* update
* fix split k
* Fix some errors and add test case.
* Workspace for Inference Kernels (#1)
* Add transform_param functions and update format.
* kernel debug
* fix include
* Update core_ops.cpp
* Add split k support
* fix
* Fix kernel error
* update
* update
* Fix rebase errors.
* Add missed include.
* Fix the bug that the attribute uses the weight information for mem alloc.
* Avoid GPU preallocation during weight loading.
* Add support of larger shapes for gated activation kernel.
* update
* model update
* fix all weight preprocessing
* Add split-k heuristic.
* Avoid reading scale attribute on non-quantized tensors.
* Change the scales from attributes to new tensors. Provide the end-to-end script given HuggingFace model id.
* Hard-coded commented out the scales in the kernel to workaround the bug.
* Support the user config for quantization. Fix kernel bug.
* Per operator test functions.
* Multiply scales by 1e12 according to the kernel design.
* Revert "Workspace for Inference Kernels (#1)". This reverts commit 1528732.
* Remove the format-only changes.
* Put the quantization into the transform_param function.
---------
Co-authored-by: Shiyang Chen <[email protected]>
Co-authored-by: Haojun Xia <[email protected]>
No description provided.