Dear DeepSeek-V2 team,
First of all, thank you for your incredible work on DeepSeek-V2. I am very interested in the model and have been exploring it in detail, and I have two questions about how inference is implemented.
In the paper, you mention that during inference only the compressed latent vector for keys and values (c_t^{KV}) is cached. However, in the HuggingFace implementation, `key_states` and `value_states` still appear to be cached separately during inference. Could you clarify how this aligns with the approach described in the paper?
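To make the first question concrete, here is a minimal NumPy sketch of the two caching strategies being compared. It is an illustration under toy assumptions, not the HuggingFace code: the names `W_DKV`, `W_UK`, `W_UV`, and `append_latent` are hypothetical, the dimensions are arbitrary, and the decoupled RoPE key described in the paper is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy sizes chosen for illustration only, not the real model configuration.
d_model, d_c, n_heads, d_head = 64, 8, 4, 16

W_DKV = rng.standard_normal((d_model, d_c))           # down-projection to the latent
W_UK = rng.standard_normal((d_c, n_heads * d_head))   # up-projection for keys
W_UV = rng.standard_normal((d_c, n_heads * d_head))   # up-projection for values


def append_latent(cache, h_t):
    """Append only the compressed latent c_t = h_t @ W_DKV to the cache."""
    c_t = h_t @ W_DKV
    return np.vstack([cache, c_t[None, :]])


latent_cache = np.empty((0, d_c))
full_k = np.empty((0, n_heads * d_head))
full_v = np.empty((0, n_heads * d_head))

for _ in range(5):  # five decoding steps
    h_t = rng.standard_normal(d_model)
    # Strategy A: cache the latent only.
    latent_cache = append_latent(latent_cache, h_t)
    # Strategy B: cache fully expanded keys and values.
    c_t = h_t @ W_DKV
    full_k = np.vstack([full_k, (c_t @ W_UK)[None, :]])
    full_v = np.vstack([full_v, (c_t @ W_UV)[None, :]])

# Keys and values can be rebuilt from the latent cache on demand.
k_rebuilt = latent_cache @ W_UK
v_rebuilt = latent_cache @ W_UV

latent_floats = latent_cache.size
full_floats = full_k.size + full_v.size
print(latent_floats, full_floats)
```

In this sketch both strategies yield the same keys and values; the difference is only in how many numbers are kept per token (`d_c` versus `2 * n_heads * d_head`).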
Additionally, the paper describes absorbing W^{UV} into W^O and W^{UK} into W^Q so that keys and values never need to be materialized during inference. I could not locate this absorption step in the code either. Could you point me to where it is implemented, or explain how it is meant to be done?
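The absorption asked about here follows from associativity of matrix multiplication. The sketch below checks it numerically for a single head, under stated assumptions: toy dimensions, no RoPE component, and hypothetical names (`W_Q_abs`, `W_O_abs`, `C`) that do not come from the released code.

```python
import numpy as np

rng = np.random.default_rng(0)
# Single head, toy sizes for illustration only.
d_model, d_c, d_head, T = 32, 8, 16, 6

W_Q = rng.standard_normal((d_model, d_head))   # query projection
W_UK = rng.standard_normal((d_c, d_head))      # key up-projection
W_UV = rng.standard_normal((d_c, d_head))      # value up-projection
W_O = rng.standard_normal((d_head, d_model))   # output projection

h_t = rng.standard_normal(d_model)             # current hidden state
C = rng.standard_normal((T, d_c))              # cached latents c_1 .. c_T


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


# Naive path: expand keys and values from the latents, then attend.
q = h_t @ W_Q
K = C @ W_UK
V = C @ W_UV
attn = softmax(K @ q / np.sqrt(d_head))
out_naive = (attn @ V) @ W_O

# Absorbed path: fold W_UK into the query side and W_UV into the output side,
# so attention runs directly on the cached latents.
W_Q_abs = W_Q @ W_UK.T      # shape (d_model, d_c)
W_O_abs = W_UV @ W_O        # shape (d_c, d_model)
q_lat = h_t @ W_Q_abs
attn_abs = softmax(C @ q_lat / np.sqrt(d_head))
out_absorbed = (attn_abs @ C) @ W_O_abs

print(np.allclose(out_naive, out_absorbed))
```

The two paths agree because `(C @ W_UK) @ (W_Q.T @ h_t)` equals `C @ (W_UK @ W_Q.T @ h_t)`, and likewise for the value and output projections. The scaling still uses `d_head`, since the logits are unchanged.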
Thank you again for your fantastic work, and I really look forward to your guidance on these points.
Best regards,
lucas