Hi, I'm currently researching the impact of different retrieval-augmented generation (RAG) techniques on LLM performance. We are attempting to replicate the CrossCodeEval results from the "StarCoder 2 and The Stack v2: The Next Generation" paper as a baseline.
However, we have encountered issues replicating the results reported in section 7.6.2 of the paper when using the publicly available CrossCodeEval data and code from its GitHub repository, together with the hyperparameters specified in that section. The paper reports a Code ES of 74.52 and an ID F1 of 68.81 for StarCoder2-7B on Python code generation, whereas our replication yields a Code ES of 67.92 and an ID F1 of 58.08.
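For clarity, here is roughly how we compute the two metrics we are comparing. To my understanding the CrossCodeEval evaluation script uses fuzz.ratio for edit similarity; the regex-based identifier extraction below is a simplification of its parser-based extraction, so please treat this as a sketch of what we measure rather than the exact scoring code:

```python
# Minimal sketch of the two metrics (not the exact CrossCodeEval scorer).
# Edit similarity (ES) via fuzz.ratio follows common code-completion evaluation;
# the identifier extraction here is a simplified stand-in for the benchmark's
# parser-based extraction.
import re
from fuzzywuzzy import fuzz  # pip install fuzzywuzzy python-Levenshtein

def code_es(prediction: str, target: str) -> float:
    """Character-level edit similarity between prediction and ground truth (0-100)."""
    return fuzz.ratio(prediction.strip(), target.strip())

def identifier_f1(prediction: str, target: str) -> float:
    """F1 over the sets of identifiers appearing in prediction vs. ground truth."""
    ident = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
    pred_ids = set(ident.findall(prediction))
    gt_ids = set(ident.findall(target))
    tp = len(pred_ids & gt_ids)
    if tp == 0 or not pred_ids or not gt_ids:
        return 0.0
    precision = tp / len(pred_ids)
    recall = tp / len(gt_ids)
    return 2 * precision * recall / (precision + recall)
```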
We noticed the option to use the BigCode Evaluation Harness for testing, as mentioned in your repository, but we could not find a CrossCodeEval task in the bigcode-project/bigcode-evaluation-harness project. Therefore, we used the open-source CrossCodeEval GitHub code and dataset directly, with the hyperparameters given in section 7.6.2.
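Concretely, our generation loop looks roughly like the sketch below. The model id bigcode/starcoder2-7b is the public checkpoint we used; the decoding settings (greedy decoding, max_new_tokens, the prompt truncation length) are placeholders here that we fill in from section 7.6.2 and the CrossCodeEval scripts, so please let us know if any of them differ from your setup:

```python
# Rough sketch of how we ran inference. Greedy decoding is an assumption;
# max_new_tokens and max_prompt_tokens below are placeholders to be replaced
# by the values from section 7.6.2 / the CrossCodeEval scripts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigcode/starcoder2-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.truncation_side = "left"  # keep the in-file prefix, drop old context first
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def complete(cross_file_context: str, in_file_prefix: str,
             max_prompt_tokens: int = 2048, max_new_tokens: int = 50) -> str:
    # CrossCodeEval prompts prepend the retrieved cross-file context to the in-file prefix.
    prompt = cross_file_context + in_file_prefix
    inputs = tokenizer(prompt, return_tensors="pt",
                       truncation=True, max_length=max_prompt_tokens).to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    completion = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:],
                                  skip_special_tokens=True)
    # Keep only the first generated line, since (as we understand it) the
    # benchmark evaluates line-level completions after post-processing.
    return completion.split("\n")[0]
```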
My experiment environment is:
8× A100 40GB (DGX node)
Ubuntu 20.04
CUDA 12.1
PyTorch 2.1.2
Could you please provide any insights or additional guidance that might help us better replicate the benchmark results? Any assistance or further details you could offer would be greatly appreciated.
Thank you for your time and support.