Numerous egregious issues with this paper #3
Hi @wemoveon2, thanks for sharing these comments. We noticed you may have certain misconceptions here; the clarifications for them are:
Thank you for the clarifications; however, I have further questions regarding some of these points.
Principle 4 is contradictory to Principles 15 and 22. It's very confusing. Principle 4: Principle 15: Principle 22:
Hi @AreChen, thank you for your query. To provide clarity on the principles and directly address your concerns, let me explain further. Each principle has been crafted with a distinct aim to facilitate effective communication. While Principle 4 advocates for a positive tone in dialogues, Principles 15 and 22 provide a structured approach to specific requests. It's understood that the presence of 'don't' within these instructions could appear to contradict the affirmative stance of Principle 4. However, our main focus in Principles 15 and 22 is on guiding users in how to frame requests effectively, irrespective of a positive or negative tone. The focus here is on the manner of articulation: Principle 15 encourages interactive learning and meaningful engagement with the content for a deeper understanding, and Principle 22 is about fine-tuning the text while maintaining the original tone and intent.
@wemoveon2 Thanks for writing that up. I agree with a lot of the points, but would like to see some elaboration on the evaluation methodology. Without the human reviewers' criteria for "Boosting" (and/or criteria/examples for Correctness), it's hard to draw any conclusions whatsoever from this paper. What did your evaluators see? Where did you source them from?
I wish others would agree, but one of the main reasons I took the time to review this paper was that numerous influencers on LinkedIn and Twitter were touting its results as "removing the need for prompt engineering". I really dislike how work in prompt engineering is not seen as proper science in the wider academic (AI) community, and I see this paper and its authors as propagators of this stereotype. Of course, I could stand corrected if independent parties demonstrate that this paper's results are replicable across the widely used and accessible LLMs, not just the authors' own fine-tuned variants, whose training data could easily have been contaminated by their benchmark.
@darinkishore Tbh, I read the whole paper (at least 3x) just because I felt I was missing something related to the evaluation methodology.
@wemoveon2 I appreciate your time doing this. This is how we're supposed to approach these studies: with all due respect for those who wrote them. I appreciate the authors' effort, but the paper is missing some important clarifications. The lack of "proper science" you mentioned is a poisonous idea that backfires on the whole industry as we grow in popularity (and promises), and it's staggering that this isn't being perceived by the academic community. I don't know how many readers noticed flaws and did not open issues or discuss them publicly, but even a single one is alarming and definitely reinforces the stereotype.
Thanks for all these comments. I'm not an academic and I'm not an influencer, but I, too, was wondering about the heatmap. I couldn't figure out how you came up with the values on page 9, and I couldn't find the methodology in the paper, so I thought I would check here. The only way the data on page 9 makes sense to me is if 20 people were shown the results and asked to judge which result was better, the principled or the non-principled one. And I'm not even getting into the double-blind side of things, or the correctness vs. quality distinction. I would also have to assume that all 20 participants (if my guess about the methodology is correct) clearly understood the evaluation criteria. I see that this is v1, so I think it has to be reviewed, and maybe a second round of tests will be done. I hope so, anyway. I would really like to know whether these principles are really solid before I write about them.
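To make my guess concrete, the only arithmetic I can imagine behind those heatmap cells is something like the sketch below. This is purely my assumption, not anything from the paper: each of 20 evaluators picks the better of the two responses, and the cell is the percentage who preferred the principled one.

```python
# Purely hypothetical reconstruction of how a heatmap cell could be computed
# from pairwise human judgments (not taken from the paper).
def preference_percentage(judgments):
    """judgments: one 'principled' or 'unprincipled' verdict per evaluator."""
    wins = sum(1 for verdict in judgments if verdict == "principled")
    return 100 * wins / len(judgments)

# Example: 15 of 20 evaluators prefer the principled response -> 75.0
print(preference_percentage(["principled"] * 15 + ["unprincipled"] * 5))
```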
Here's a list of issues others and I have found with your paper, code, data, methodology, and experiment design:
Issues pertaining to overall experiment design and methodology
Issues pertaining to code, implementation, and the actual data
With the code you've released, it's literally impossible to generate the responses as shown for Prompt 14, since all you're doing is calling the model with the same prompt without ever updating it with the model's questions or the user's responses (a sketch of the kind of loop that would be needed follows the code references below):
ATLAS/generate.py, lines 40 to 43 in 03511d3
ATLAS/generate.py, line 44 in 03511d3
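For concreteness, here is a minimal sketch, not the authors' code, of the kind of multi-turn loop generate.py would need in order to produce a Prompt-14-style interactive exchange. It assumes the OpenAI Python client; the function name `interactive_generate`, the `get_user_reply` callback, and the turn limit are illustrative and are not anything in ATLAS:

```python
# Illustrative only: a multi-turn loop that feeds the model's follow-up
# questions and the user's answers back into the conversation, instead of
# resending the same prompt on every call.
from openai import OpenAI

client = OpenAI()

def interactive_generate(initial_prompt, get_user_reply, model="gpt-4", max_turns=5):
    messages = [{"role": "user", "content": initial_prompt}]
    for _ in range(max_turns):
        response = client.chat.completions.create(model=model, messages=messages)
        assistant_turn = response.choices[0].message.content
        messages.append({"role": "assistant", "content": assistant_turn})
        user_reply = get_user_reply(assistant_turn)  # e.g. input() or a scripted answer
        if not user_reply:  # nothing more from the user -> stop
            break
        messages.append({"role": "user", "content": user_reply})
    return messages
```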
There are significant issues with your paper which make your findings "dubious" to say the least. Was this written by freshman undergrads over two to three weeks? The paper comes off as sloppy, and the way it was written makes me think the authors were just trying to fill pages without regard for the quality of the content. Almost a fifth of the pages are dedicated to just the Gemini and GPT4 references, when no other (decent) paper that cites either of them does so in this manner. I get that this was released on arXiv, but how such glaring flaws weren't caught by your advisor is honestly beyond me.