Omar Khattab | September 4th, 2024 (corresponding tweet)
Grad students often reach out to talk about structuring their research, e.g. how do I do research that makes a difference in the current, rather crowded AI space? Too many feel that long-term projects, proper code releases, and thoughtful benchmarks are not incentivized — or are perhaps things you do quickly and guiltily to then go back to doing 'real' research.
This post distills thoughts on impact I've been sharing with folks who ask. Impact takes many forms, and I will focus only on making research impact in AI via open-source work through artifacts like models, systems, frameworks, or benchmarks. Because my goal is partly to refine my own thinking, to document concrete advice, and to gather feedback, I'll make rather terse, non-trivial statements. Please let me know if you disagree; I'll update here if I change my mind.
Here are the guidelines:
- Invest in projects, not papers.
- Select timely problems with large headroom and "fanout".
- Think two steps ahead and iterate fast.
- Put your work out there and own popularizing your ideas.
- Funnel the excitement you build: Tips on growing open-source research.
- Continue investing in your projects, via new papers.
The fifth bullet "tips on growing open-source research" deserves its own, longer post. I may write that next.
The first guideline, investing in projects rather than papers, is a crucial mental shift. Everything else follows from it.
Junior students learn to place a lot of value in publishing their first couple of papers. And that's reasonable: it's how you learn to conduct research, explore initial topics, and demonstrate early progress. But this is a stage that you must depart eventually: your fulfillment and growth in the longer term will depend less on paper count and more on your impact and the overarching research story you communicate. Unfortunately, too many PhD students perceive most actions that may create impact as 'disincentivized'. This perplexed me until I realized they meant that those actions could slow down your ability to produce the next paper. But your ability to produce the next paper so quickly is just not that important.
Instead of thinking of your work as a series of isolated papers, I suggest that you ask yourself: What is the larger vision, the sub-area, or the paradigm you will lead? What difference is your work seeking to make? You will publish individual papers to explore and build support for it, but your bigger vision should be something you iterate on quite intentionally. It needs to be a lot larger than a publication unit and certainly something that you haven't completely solved yet.
One way to do this is to structure some of your research papers around a coherent artifact — like a model, system, framework, or benchmark — that you maintain in open source. This strategy is more costly than "run some experiments and put out a quick repo that you forget about". But it forces you to find a problem that has the right characteristics for impact and helps ensure that new research you do is actually coherent and useful: you wouldn't introduce hacky features you don't believe are worthwhile to the artifact you have been growing and maintaining.
Not every paper you write will deserve indefinite investment; many of your papers will be exploratory one-offs. To find a direction that can be turned into a larger project, use the following criteria.
First, the problem must be timely. You can define this in many ways, but one strategy that works well in AI is to seek a problem space that will be 'hot' in 2-3 years but is nowhere near mainstream yet.
Second, the problem must have large "fanout", i.e. potential impact on many downstream problems. Basically, these are problems whose outcomes could benefit or interest enough people. Researchers, like everyone else, care about what helps them achieve their goals, so your problem could be something that helps others build things or achieve research or production goals. You can apply this filter to work on theoretical foundations, systems infrastructure, novel benchmarks, new models, and many other things.
Third, the problem must have large headroom. If you tell people that their systems could be 1.5x faster or 5% more effective, that's rarely going to beat inertia. In my view, you need to find problems where there's non-zero hope that you'll make things, say, 20x faster or 30% more effective, at least after years of work. Of course, you don't need to get all the way there to succeed, and you should not wait until you've completely made it all the way there to release the first paper or artifact.
To avoid being too abstract, let me illustrate using ColBERT. In late 2019, research applying BERT for retrieval was popular, but the approaches were very expensive. It was natural to ask if we could make this dramatically more efficient. What might make this a good problem? First, it was timely. You could correctly expect that, by 2021 (1.5 years later), many researchers would be seeking efficient BERT-based retrieval architectures. Second, it had large headroom. New ML paradigms tend to have large headroom, since most early work in a new paradigm ignores efficiency. Indeed, the original approaches could take 30 seconds to answer a query, whereas nowadays much higher quality retrieval can be done in 30 milliseconds, 1000x faster. Third, it had large fanout. Scalable retrieval is a nice “foundational” problem: everyone needs to build something atop retrievers but few want to build them.
Now that you have a good problem, resist the urge to pick the immediate low-hanging fruit as your approach! Sooner or later, many others will be thinking about that 'obvious' approach.
Instead, think at least two steps ahead. Identify the path most people are likely to take, when this timely problem eventually becomes more mainstream. Then identify the limitations of that path itself, and start working on understanding and tackling those limitations.
What might this look like in practice? Let's revisit the ColBERT case study. The obvious way to build efficient retrievers with BERT is to encode each document into a single vector. Interestingly, there was only limited IR work doing that by late 2019. For example, the best-cited work in this category (DPR) only had its first preprint released in April 2020. Given this, you might think that the right thing to do in 2019 was to build a great single-vector IR model via BERT. In contrast, thinking just two steps ahead would be to ask: everyone will be building single-vector methods sooner or later, so where will the single-vector approach get fundamentally stuck? And indeed, that question led to the late interaction paradigm and widely-used models.
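To make the contrast concrete, here is a minimal sketch, in plain PyTorch rather than the actual ColBERT code, of the two scoring styles; the tensor shapes are illustrative, and details like normalization and masking are omitted.

```python
import torch

# Single-vector retrieval: pool each query/document into one embedding,
# so relevance collapses into a single dot product (cheap, but lossy).
def single_vector_score(q_vec: torch.Tensor, d_vec: torch.Tensor) -> torch.Tensor:
    # q_vec: [dim], d_vec: [dim]
    return q_vec @ d_vec

# Late interaction (ColBERT-style): keep one embedding per token, then
# sum, over query tokens, the maximum similarity to any document token.
def late_interaction_score(q_embs: torch.Tensor, d_embs: torch.Tensor) -> torch.Tensor:
    # q_embs: [num_query_tokens, dim], d_embs: [num_doc_tokens, dim]
    sim = q_embs @ d_embs.T                 # [num_query_tokens, num_doc_tokens]
    return sim.max(dim=1).values.sum()      # MaxSim per query token, then sum
```

The point of the sketch is the bottleneck: the single-vector version forces all of a document's content through one embedding, which is exactly where that approach struggles as queries and documents get harder.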
As another example, consider DSPy. In February 2022, as prompting was becoming decently powerful, it was clear that people would want to do retrieval-based QA with prompting, not with fine-tuning as before. A natural thing to do would be to build a method for just that. Thinking two steps ahead would be to ask: where will such approaches get stuck? Ultimately, retrieve-then-generate (or "RAG") approaches are perhaps the simplest possible pipeline involving LMs. For the same reasons people would be interested in RAG, it was clear that they would increasingly be interested in (i) expressing more complex modular compositions and (ii) figuring out how the resulting sophisticated pipelines should be supervised or optimized, via automated prompting or finetuning of the underlying LMs. That's DSPy.
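As a rough illustration of what "modular composition" means here, the sketch below shows a tiny RAG program in the style of DSPy's basic examples; treat it as a simplified sketch rather than up-to-date API documentation, since the exact interfaces have evolved over time.

```python
import dspy

# Assumes an LM and a retriever have been configured, e.g.
# dspy.settings.configure(lm=..., rm=...)

class RAG(dspy.Module):
    """A minimal retrieve-then-generate pipeline expressed as a module."""

    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.generate_answer(context=context, question=question)
```

The point is not the few lines of RAG logic; it's that once pipelines are modules, you can compose more of them and hand the whole program to an optimizer instead of hand-tuning prompts.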
The second half of this guideline is "iterate fast". This was perhaps the very first piece of research advice I received from my advisor Matei Zaharia, in week one of my PhD: by identifying a version of the problem you can iterate quickly on and receive feedback about (e.g., latency or validation scores), you greatly improve your chances of solving hard problems. This is especially important if you will be thinking two steps ahead, which is already hard and uncertain enough.
At this point, you’ve identified a good problem and then iterated until you discovered something cool and produced an insightful writeup. Don’t move on to the next paper! Instead, focus on putting your work out there — I mean this with a flavor of vulnerability — and seeking to really engage with people, not just about your one-time paper release but about the big picture you are actively developing. Or, even better, about the useful open-source artifact you're building and maintaining that captures your key ideas.
A common first step is to release the paper as a preprint on arXiv and then release a “thread” (or similar) announcing the paper’s release. When you do this, make sure your thread begins with a concrete, substantial, and accessible claim. The goal isn’t to tell people that you released a paper — that doesn't carry inherent value. The goal is to communicate your key argument in a direct, vulnerable, and engaging way, something concrete that people can agree or disagree with. (I know this is hard. But it is necessary.)
Perhaps more importantly, this whole process does not end after the first "release". It starts with the release. Given that you're now investing in projects, not just papers, your ideas and your scientific communication persist year-round, well beyond isolated paper releases. Let me illustrate why this matters. When I help grad students “tweet” about their work, it's not uncommon that their initial post doesn’t get as much traction as hoped. Students typically assume this validates their fear of posting about their research and take it as yet another sign that they should just move on to the next paper. Obviously, this is not correct.
A lot of personal experience, second-hand experience, and observation suggests that persistence at this stage is massively helpful (and, by the way, exceedingly rare). That is, with few exceptions, getting traction for good ideas requires telling people the key things many times in different contexts, and evolving your thoughts and how you communicate them, either until the community can absorb these ideas over time or until the field reaches the stage of development where those ideas are easier to appreciate.
While getting people excited about your findings is great, you can often have much greater impact by releasing, contributing to, and growing open-source artifacts that channel your ideas to whatever downstream applications are relevant. This is not easy: uploading code files with a README to GitHub is not enough. A good repository will be the “home” of your project, more so than any of the individual papers you’ll publish.
Good open-source research needs two, almost independent, qualities. First, it needs to be good research, i.e. novel, timely, well-scoped, accurate, etc. Second, it needs to have clear downstream utility and low friction. This is the important part: people will repeatedly avoid (and other people will repeatedly use!) your OSS artifacts for all the "wrong" reasons, all the time. Just to illustrate, your research may be the objective "state of the art", but people will prioritize a lower-friction alternative nine times out of ten. Conversely, people will often use your tools for reasons that, to you as a graduate student, miss the point, e.g. because they don't fully utilize your most innovative components. This is not something to resist; it's something to understand and to build on.
Given all of this, here is a list of milestones to observe when working on the open-source side of your research releases.
Milestone 0: Make your release usable! There’s no point making a code release that no one can execute. Make your release usable to other researchers trying to use your work (as a baseline or similar). These are folks in your area who want to replicate your runs — and maybe hill-climb past them and cite you. These folks will have more patience than other types of users. Despite this, you will observe a dramatic difference in academic impact based on how easy it is to tinker with your code.
Milestone 1: Make your release useful! Beyond people in your narrow area, you should make sure your release is useful to the (potentially much larger!) audience that wants to actually use your project to build something else. This milestone rarely comes naturally in AI research. You should allocate plenty of time to thinking about the (research, production, etc.) problems people are trying to solve where your artifact can help. If you do this right, a lot of its effects will be reflected in everything from project design to the APIs you expose and the documentation/examples you show.
Milestone 2: Make your release approachable! This is hard for us, AI researchers, but you should be aware that a useful release, where technically everything is available and explained somewhere, does not equal a release most of your potential users will find approachable enough to invest in learning or trying it. "You need the thing, then you need the ramp" is a great post from Andrej Karpathy about this. Ben Clavie has written extensively about this too, largely demonstrated by him taking our work on ColBERT and making it what feels like 10x more approachable.
Milestone 3: Build the case for why the obvious alternative fails, and be patient. We started by talking about thinking two steps ahead. This is crucial in my opinion, but it means that most people won't understand why they need to adopt a solution for problems they can't yet clearly see. I think it's part of your job to build that case up over time. Collect evidence and communicate in accessible ways why the obvious alternative (thinking only one step at a time) fails.
Milestone 4: Understand that there are categories of users, and leverage that to grow. When I started both ColBERT and DSPy, the original audience I sought was researchers and expert ML engineers. I learned over time to let go of that and to understand that you can reach much larger audiences, but that they require different things. Before anything else, stop blocking potential categories of users, indirectly or even directly; this is more common than one may think. Then, when seeking users, aim for a balance between two types. On the one hand, expert builders with advanced usecases may require substantial investment on your part but will often advance some usecase in a researchy sense, which can be rewarding. On the other hand, public builders, who generally aren't ML experts but often build in public and share their learnings, will drive a much larger fraction of any massive growth in your funnel of excitement and will teach you a lot about your initial assumptions. You need both.
Milestone 5: Turn interest into a growing community! The real success of an OSS effort is in the community presence and the growth that happens independently of your efforts. A good community should generally be organic, but you need to put in active effort to help it form, e.g. by welcoming contributions and discussions and by seeking opportunities to turn interest into contributions or discussion forums of some sort (e.g., Discord or GitHub).
Milestone 6: Turn interest into active, collaborative, and modular downstream projects. Chances are, your OSS project in its early stages hasn't solved all elements in your vision. A well-designed project will often have multiple modular pieces that can allow you to initiate research collaborations (or other efforts) where new team members can not only advance but own substantial pieces of the project and, as a result, achieve faster or larger impact for their ideas, while improving the project substantially. For example, DSPy currently has separate teams leading the research and development efforts on prompt optimization, programming abstractions, and reinforcement learning. ColBERT has components like the external API, the underlying retrieval infrastructure, and core modeling that have largely been advanced by different people in different projects.
Let me summarize. Open-source research adoption needs good research and good open-source artifacts. This balance is hard but can be quite rewarding once you get it right. Personally, it took me a long time to grasp and internalize this. I owe that to repeated feedback from my PhD advisors Chris Potts and Matei Zaharia and to valuable input from Heather Miller and Jeremy Howard. Research is evaluated by the delta over prior knowledge, but software must be effective on its own before people can meaningfully utilize that "delta". And for software to be effective, its documentation also needs to be so: people won't see all the downstream ways they should use it unless you show them, at least until an independent community grows enough to take on that role.
Having said all of this, the most important tip in this entire section is to release, actually release, release often, and learn from that.
When you read the previous guideline (#5), it's very natural to wonder: Where might a graduate student find all this time to spend on OSS? When can they do actual research?
The answer in practice is that it's possible for most of the time you spend on OSS to involve doing new, exciting research. The two are not as separate as they may look. In fact, a major advantage of the style of research in this guide is that it creates recognizably important problems in which you have a very large competitive advantage. How so? For one, being at the frontier of this OSS effort gives you intuitive recognition of new problems extremely early. You develop a more instinctive understanding of the problem than you would have otherwise. For another, the communities you build will often provide direct feedback on prototypes of your approaches and give you access to wonderful collaborators who understand the significance of the problems. You also get useful "distribution channels" that ensure that every new paper you do in this space will have a receptive audience and will reinforce your existing platform.
Just to illustrate, ColBERT isn't just one paper from early 2020. It's probably around ten papers now, with investments into improved training, lower memory footprint, faster retrieval infrastructure, better domain adaptation, and better alignment with downstream NLP tasks. Similarly, DSPy isn't one paper but is a large collection of papers on programming abstractions, prompt optimization, and downstream programs. So many of these papers are written by different, amazing primary authors, whose work has individually been highly impactful, in part through the OSS channels that create a receptive audience. A good open-source artifact creates modular pieces that can be explored, owned, and grown by new researchers and contributors.