fix: Always initialize bias to zero for ColumnParallelLinear. #1490
This PR should fix the problem described in #1411.
In #1181, @zhuohan123 refactored the code of `ColumnParallelLinear` and removed the `self.bias.zero_()` statement, which results in some unknown behavior in the model init phase. vLLM runs a forward pass in `profile_num_available_blocks`, and `hidden_state` contains `nan` values when executing the second `DecodeLayer` forward.
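For reference, the change is essentially to zero the bias right after it is allocated. A minimal sketch of the idea, with simplified names (the real `ColumnParallelLinear` in vLLM takes more arguments and handles tensor parallelism):

```python
import torch
import torch.nn as nn


class ColumnParallelLinearSketch(nn.Module):
    """Simplified stand-in for vLLM's ColumnParallelLinear (illustration only)."""

    def __init__(self, input_size: int, output_size: int, bias: bool = True):
        super().__init__()
        # The weight is created with torch.empty and later overwritten by the
        # checkpoint loader, so its initial contents do not matter.
        self.weight = nn.Parameter(torch.empty(output_size, input_size))
        if bias:
            self.bias = nn.Parameter(torch.empty(output_size))
            # The fix: without this, the bias keeps whatever garbage values
            # were in the allocated memory unless a checkpoint provides a
            # bias tensor, which can poison the profiling forward pass.
            with torch.no_grad():
                self.bias.zero_()
        else:
            self.register_parameter("bias", None)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.nn.functional.linear(x, self.weight, self.bias)
```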
To be more precise, some weird things happen to the `self.bias` of `ColumnParallelLinear`:

- `self.bias` contains some large values from the second attention layer onward if we just initialize the whole model;
- these large values then produce `nan` values in `attn_output`.
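The large values are consistent with the bias being left as uninitialized memory: `torch.empty` does not clear the allocation, so the parameter can hold arbitrary floats. A quick way to see this in a plain PyTorch session:

```python
import torch

# torch.empty returns whatever bytes happen to be in the allocation, so the
# "bias" may contain arbitrarily large values; the result varies run to run.
bias = torch.empty(4096)
print(bias.abs().max())

# Zeroing it (what this PR restores) makes the init deterministic.
bias.zero_()
print(bias.abs().max())  # tensor(0.)
```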
If I add a breakpoint at this statement, step into its constructor, and run `__init__()` step by step, `self.bias` seems to be a zero tensor, but it still introduces some accuracy error when forwarding: `qkv, _ = self.qkv_proj(hidden_states)` gives a different result than in vLLM==0.2.0.

I'm not an expert in PyTorch and can't explain why these weird phenomena happen, but this PR fixes the bug.
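One way to confirm the symptom (and the fix) is to watch the intermediate activations during the profiling forward pass with a hook. This is only a debugging sketch; `model` and the hook wiring below are hypothetical, not vLLM APIs:

```python
import torch


def check_for_nan(name):
    """Forward hook that flags nan/inf in a module's output (debugging aid)."""
    def hook(module, inputs, output):
        out = output[0] if isinstance(output, tuple) else output
        if torch.is_tensor(out) and (torch.isnan(out).any() or torch.isinf(out).any()):
            print(f"nan/inf detected in output of {name}")
    return hook


# Hypothetical usage: attach hooks to every submodule before the profiling
# forward pass, then run the model once and watch for the warning.
# for name, module in model.named_modules():
#     module.register_forward_hook(check_for_nan(name))
```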