[Audio LLM] Support audiollm for ASR, based on Whisper and Llama 3 #2532

Open
wants to merge 45 commits into main

Conversation

@Zth9730 (Contributor) commented May 20, 2024

Conducted an experiment on the LibriSpeech dataset for several steps:

[screenshot of training results]

@thsxbw commented May 23, 2024

Can you provide the config.yaml or experimental results?

@Zth9730 (Contributor, Author) commented May 23, 2024

Can you provide the config.yaml or experimental results?

Yes, there will be new commits later.

@fclearner (Contributor) commented:

Does this branch support Qwen?

@Zth9730 (Contributor, Author) commented Aug 7, 2024

audiollm.yaml

The hyperparameters, such as the learning rate and warmup steps, may not be optimal.

accum_grad: 1
cmvn: null
cmvn_conf:
  cmvn_file: null
  is_json_cmvn: null
dataset: audio_llm
dataset_conf:
  batch_conf:
    batch_type: static
    batch_size: 4
  cycle: 1
  data_style: audiosft
  data_style_conf:
    add_bos: true
    add_eos: true
    template: audio_llama3
  feats_type: log_mel_spectrogram
  filter_audio_conf:
    max_length: 3000
    min_length: 0
  filter_conf:
    token_max_length: 8192
    token_min_length: 1
  log_mel_spectrogram_conf:
    hop_length: 160
    n_fft: 400
    num_mel_bins: 128
    pad_or_trim: true
    padding: 0
  resample_conf:
    resample_rate: 16000
  shift: true
  shuffle: true
  shuffle_conf:
    shuffle_size: 1500
  shuffle_list: true
  shuffle_list_conf:
    shuffle_size: 15000
  sort: true
  sort_conf:
    sort_size: 500
  spec_aug: true
  spec_aug_conf:
    max_f: 10
    max_t: 50
    num_f_mask: 0
    num_t_mask: 0
  spec_sub: false
  spec_sub_conf:
    max_t: 30
    num_t_sub: 3
  spec_trim: false
  speed_perturb: false
decoder: decoder_only
decoder_conf:
  activation_type: swish
  attention_dropout_rate: 0.0
  attention_heads: 32
  dropout_rate: 0.0
  gelu_approximate: null
  gradient_checkpointing: true
  head_dim: 128
  hidden_size: 4096
  linear_units: 14336
  max_position_embeding: 8192
  n_kv_head: 8
  norm_eps: 1.0e-05
  normalize_before: true
  num_blocks: 32
  positional_dropout_rate: 0.0
  rms_norm_offset: false
  rope_style: llama
  rope_theta: 500000.0
  scale_embed: false
  use_sdpa: true
encoder: transformer
encoder_conf:
  activation_type: gelu
  attention_dropout_rate: 0.0
  attention_heads: 20
  dropout_rate: 0.1
  gradient_checkpointing: true
  input_layer: conv1d2
  key_bias: false
  linear_units: 5120
  normalize_before: true
  num_blocks: 32
  output_size: 1280
  pos_enc_layer_type: abs_pos_whisper
  positional_dropout_rate: 0.1
  static_chunk_size: -1
  use_dynamic_chunk: false
  use_dynamic_left_chunk: false
  use_sdpa: true
grad_clip: 1
input_dim: 128
log_interval: 40
max_epoch: 3
save_limited: 1
save_best_ckpt: true
model: audio_llm
model_conf:
  bottleneck_mid_dim: 512
  bottleneck_type: conv-linear
  conv_kernel_sizes:
  - 3
  - 3
  - 3
  length_normalized_loss: false
  linear_bias: false
  lsm_weight: 0.1
  tie_word_embedding: false
  freeze_decoder: true
  freeze_encoder: true
  freeze_llm_embed: false
optim: adamw
optim_conf:
  lr: 4.0e-05
  weight_decay: 0.01
output_dim: 128256
save_interval: 2000
save_states: model_only
scheduler: warmuplr
scheduler_conf:
  warmup_steps: 1000
tokenizer: huggingface
tokenizer_conf:
  model: meta-llama/Meta-Llama-3-8B
  special_tokens:
    <|begin_of_text|>: 128000
    <|end_header_id|>: 128007
    <|end_of_text|>: 128001
    <|eot_id|>: 128009
    <|start_header_id|>: 128006
vocab_size: 128256
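
The tokenizer_conf above uses the stock Llama-3 tokenizer from Hugging Face, and the special-token IDs listed there can be sanity-checked directly. A minimal sketch, assuming the transformers library and access to the gated meta-llama/Meta-Llama-3-8B repository:

from transformers import AutoTokenizer

# Requires an authenticated Hugging Face login with access to the gated repo.
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
for t in ["<|begin_of_text|>", "<|end_of_text|>",
          "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>"]:
    print(t, tok.convert_tokens_to_ids(t))
# Expected: 128000, 128001, 128006, 128007, 128009 (matching special_tokens above)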

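For reference, model_conf wires the frozen 1280-dim Whisper-style encoder into the frozen 4096-dim Llama-3 decoder through a conv-linear bottleneck (bottleneck_mid_dim: 512, conv_kernel_sizes: [3, 3, 3], linear_bias: false). A minimal PyTorch sketch of what such a projector could look like; the stride-2 convolutions, GELU activations, and all class and argument names are assumptions for illustration, not the PR's actual module:

import torch
import torch.nn as nn

class ConvLinearBottleneck(nn.Module):
    """Hypothetical conv-linear projector from encoder frames to LLM embeddings."""

    def __init__(self, enc_dim=1280, mid_dim=512, llm_dim=4096,
                 kernel_sizes=(3, 3, 3)):
        super().__init__()
        convs, in_dim = [], enc_dim
        for k in kernel_sizes:
            # Each stride-2 Conv1d halves the number of frames (an assumption).
            convs += [nn.Conv1d(in_dim, mid_dim, k, stride=2, padding=k // 2),
                      nn.GELU()]
            in_dim = mid_dim
        self.convs = nn.Sequential(*convs)
        self.proj = nn.Linear(mid_dim, llm_dim, bias=False)  # linear_bias: false

    def forward(self, x):                    # x: (batch, time, enc_dim)
        x = self.convs(x.transpose(1, 2))    # Conv1d wants (batch, channels, time)
        return self.proj(x.transpose(1, 2))  # (batch, time', llm_dim)

# 30 s of Whisper features -> 1500 encoder frames -> 188 frames fed to the LLM.
feats = torch.randn(2, 1500, 1280)
print(ConvLinearBottleneck()(feats).shape)   # torch.Size([2, 188, 4096])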
Decode script:

# $recog_set, $dir, $wave_data and $decode_checkpoint are assumed to be
# defined by the surrounding run script.
temperature=1.0
top_p=1.0
top_k=1
for test in $recog_set; do
  result_dir=$dir/${test}
  python wenet/bin/audiollm_recognize.py --gpu 0 \
    --config $dir/train.yaml \
    --data_type raw \
    --dtype bf16 \
    --test_data $wave_data/$test/data.list \
    --checkpoint $decode_checkpoint \
    --output_len 256 \
    --temperature $temperature \
    --top_p $top_p \
    --top_k $top_k \
    --result_dir $result_dir
  test_dir=$result_dir/temp${temperature}_topk${top_k}_topp${top_p}
  python tools/compute-wer.py --char=1 --v=1 \
    $wave_data/$test/text $test_dir/text > $test_dir/wer
done
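
Note that with top_k=1 the decoding above is effectively greedy; temperature and top_p only change anything once top_k is widened. A generic sketch of how the three knobs interact when sampling the next token from the LLM logits (an illustration, not the code in wenet/bin/audiollm_recognize.py):

import torch

def sample_next_token(logits, temperature=1.0, top_k=1, top_p=1.0):
    """Generic temperature / top-k / top-p sampling over a 1-D logits vector."""
    logits = logits / max(temperature, 1e-6)
    if top_k > 0:
        kth = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth, float("-inf"))  # keep k best
    probs = torch.softmax(logits, dim=-1)
    if top_p < 1.0:
        sorted_p, idx = torch.sort(probs, descending=True)
        keep = torch.cumsum(sorted_p, dim=-1) <= top_p
        keep[0] = True                              # always keep the best token
        probs = torch.zeros_like(probs).scatter(0, idx[keep], sorted_p[keep])
        probs = probs / probs.sum()
    return torch.multinomial(probs, 1).item()

# With top_k=1 this reduces to greedy decoding (always the argmax token).
print(sample_next_token(torch.randn(128256), temperature=1.0, top_k=1, top_p=1.0))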

@Zth9730 (Contributor, Author) commented Aug 7, 2024

Does this branch support Qwen?

You can refer to Zhou's code to convert the Qwen weights into wenet's format, and then it will be supported 🐶
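
This PR itself does not ship a conversion script; the idea is to load the Hugging Face Qwen checkpoint and rename its state_dict keys to whatever the wenet decoder_only module expects. A rough sketch of that kind of remapping, where every wenet-side key prefix in name_map is a placeholder and the real mapping has to come from the referenced conversion code:

import torch
from transformers import AutoModelForCausalLM

# Placeholder mapping: the left side follows the Hugging Face Qwen2 key layout,
# the right side is purely illustrative and must be replaced by the names the
# wenet decoder_only model actually uses.
name_map = {
    "model.embed_tokens.": "decoder.embed.",
    "model.layers.":       "decoder.decoders.",
    "model.norm.":         "decoder.final_norm.",
    "lm_head.":            "decoder.output_layer.",
}

hf = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-7B-Instruct",
                                          torch_dtype="auto")
converted = {}
for key, tensor in hf.state_dict().items():
    for src, dst in name_map.items():
        if key.startswith(src):
            key = dst + key[len(src):]
            break
    converted[key] = tensor
torch.save(converted, "qwen_as_wenet_init.pt")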

@Mddct (Collaborator) commented Aug 7, 2024

Please rebase onto main; part of the LLM code has already been merged into main.
