Summary
Elicit has gotten exciting coverage on Twitter the last few days, leading to an influx of new users [1, 2, 3]. Welcome! We’re so excited to have you and grateful for your interest.
Alongside the overwhelmingly positive response, some people wisely pointed out the need for more transparency about who is building Elicit, how it works, and where it doesn’t. We’ll start the conversation with this note, but we expect this will be an ongoing dialogue.
Elicit is an early product using early technology, attempting to help with complex topics. You are the researcher and the expert, not Elicit. Elicit results should be taken as a starting point for your further review and evaluation.
While we think researchers are a very careful and rigorous group on average, there is research with questionable methodology and even fraud. Elicit does not yet know how to evaluate whether one paper is more trustworthy than another, except by giving you some imperfect heuristics like citation count, journal, critiques from other researchers who cited the paper, and certain methodological details (sample size, study type, etc.). We’re actively researching how best to help with quality evaluation but, today, Elicit summarizes the findings of a bad study just like it summarizes the findings of a good study.
Similarly, when you are impressed by Elicit results, it’s in large part because some researchers poured their blood, sweat, and tears into doing the actual research and presenting it. They also deserve your admiration :)
Confirm that Elicit’s summaries and extracted information are correct by clicking each row and reviewing the abstract or full text of the paper. Search for both sides of your question to minimize confirmation bias.
Elicit is built by Ought, a non-profit machine learning research lab with a team of eight people distributed across the Bay Area, Austin, New York, and Oristà. Our team brings experience from academia, mature tech companies, and startups. Ought is funded by grants from organizations like Open Philanthropy and the Future of Life Institute, and by individuals like Jaan Tallinn and others who identify with the effective altruism and longtermism communities. Our funders and team are primarily motivated by making sure that artificial intelligence goes well for the world, in part by making it useful for high-quality work like research. Elicit is the only project that Ought currently works on.
Elicit is an early-stage product, with updates and improvements every week (as documented on our mailing list). As of April 25, 2022, the literature review workflow is implemented as follows (in the interest of sharing a lot of information quickly, we’ve unfortunately had to assume a lot of technical context):
You enter a question.
We search for relevant papers, shown one per row in the results table:
We retrieve the title and abstract for the top 1,000 results from the Semantic Scholar API for the keywords you provide, applying additional filters if you select them (keywords, dates, or study type).
If you star papers and click “show more like starred”, we instead retrieve paper candidates by expanding the citation network of the starred papers forward and backward, again using the Semantic Scholar API.
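As a rough illustration of these two retrieval paths, here is a minimal Python sketch against the public Semantic Scholar Graph API. The endpoint paths, field names, and response shapes are our best reading of that API’s public documentation, not Elicit’s actual code; fetching the full 1,000 candidates would just mean paging the search call.

```python
import requests

S2_API = "https://api.semanticscholar.org/graph/v1"

def keyword_search(query, limit=100, offset=0):
    """Fetch titles and abstracts for a keyword query, one page at a time."""
    resp = requests.get(
        f"{S2_API}/paper/search",
        params={"query": query, "fields": "title,abstract",
                "limit": limit, "offset": offset},
    )
    resp.raise_for_status()
    return resp.json().get("data", [])

def expand_citation_network(paper_id):
    """Collect papers that cite, and are cited by, a starred paper."""
    neighbors = []
    # Each response entry wraps the neighboring paper under a key such as
    # "citingPaper" or "citedPaper" (per the Graph API docs).
    for direction, key in (("citations", "citingPaper"), ("references", "citedPaper")):
        resp = requests.get(
            f"{S2_API}/paper/{paper_id}/{direction}",
            params={"fields": "title,abstract"},
        )
        resp.raise_for_status()
        neighbors.extend(item[key] for item in resp.json().get("data", []))
    return neighbors
```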
We generate additional information for the top 8 papers, mostly shown in the supplementary columns you can add to the results table.
We generate what the paper’s abstract implies about your question using a GPT-3 Davinci model finetuned on roughly 2,000 examples of questions, abstracts, and takeaways.
This column was previously called “Answer to your question” but that sounded too strong and it was unclear where the answer was coming from. We’ve renamed it to “Takeaway from abstract.”
We use the prompt-based GPT-3 Davinci Instruct model for some of the supplementary columns (e.g., number of participants, number of studies), and a finetuned Curie model for others (e.g., intervention, dose). For the prompt-based Instruct model, the prompt looks like this, with “...” replaced with the query and paper details:
Answer the question "..." based on the extract from a research paper. Try to answer, but say "... not mentioned in the paper" if you really don't know how to answer. Include everything that the paper excerpt has to say about the answer. Make sure everything you say is supported by the extract. Answer in one phrase or sentence.

Paper title: ...
Paper excerpt: ...
Question: ...
Answer:
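To make the prompt-based path concrete, here is a sketch of how a template like the one above might be filled in and sent to a completions-style API. The model name, decoding parameters, and exactly what gets substituted into each “...” placeholder are illustrative assumptions, not Ought’s actual settings.

```python
import openai  # 2022-era OpenAI completions interface

# Illustrative template; how each "..." is filled in is an assumption on our part.
PROMPT_TEMPLATE = """Answer the question "{question}" based on the extract from a research paper. \
Try to answer, but say "{question} not mentioned in the paper" if you really don't know how to answer. \
Include everything that the paper excerpt has to say about the answer. \
Make sure everything you say is supported by the extract. \
Answer in one phrase or sentence.

Paper title: {title}
Paper excerpt: {excerpt}
Question: {question}
Answer:"""

def extract_column(question, title, excerpt):
    """Ask an instruct model to extract one supplementary-column value."""
    prompt = PROMPT_TEMPLATE.format(question=question, title=title, excerpt=excerpt)
    response = openai.Completion.create(
        model="text-davinci-002",  # illustrative stand-in for "GPT-3 Davinci Instruct"
        prompt=prompt,
        max_tokens=64,
        temperature=0,  # deterministic, extraction-style output
    )
    return response["choices"][0]["text"].strip()
```

Temperature 0 is a common choice for extraction tasks like this, where you want the most likely completion rather than a creative one.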
We use a finetuned GPT-3 Davinci model to compute the “Takeaway suggests yes/no” column.
We use a bag-of-words SVM to classify which studies are randomized controlled trials.
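A bag-of-words SVM of this kind can be put together in a few lines with scikit-learn; the sketch below, trained on a couple of hypothetical labeled abstracts, is our own minimal version rather than the classifier Elicit actually ships.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical training data: abstracts labeled 1 if the study is an RCT, else 0.
abstracts = [
    "We randomly assigned 120 participants to treatment or placebo ...",
    "This retrospective cohort study examined medical records of ...",
]
labels = [1, 0]

# Bag-of-words features (unigrams and bigrams) feeding a linear SVM.
rct_classifier = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    LinearSVC(),
)
rct_classifier.fit(abstracts, labels)

# Predict whether a new abstract describes a randomized controlled trial.
print(rct_classifier.predict(["Participants were randomized to one of two arms ..."]))
```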
If you open the paper detail modal, we surface the citations most likely to criticize the paper’s methodology by first ranking citations from Semantic Scholar with the GPT-3 Ada search endpoint, then refining that ranking with a finetuned GPT-3 Curie model.
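In outline, this is a two-stage ranking: a cheap scorer ranks every citation for how likely it is to criticize the methodology, and a more accurate but more expensive model re-scores only the shortlist. The callables in this sketch are hypothetical stand-ins for the GPT-3 Ada search endpoint and the finetuned Curie model; the real prompts and scoring are not public.

```python
def rank_critical_citations(citations, coarse_scorer, fine_scorer, top_k=20):
    """Two-stage ranking: coarse-score everything, then re-rank a shortlist.

    `coarse_scorer` and `fine_scorer` are hypothetical callables that take a
    citation's text and return a relevance score; they stand in for the cheap
    and expensive GPT-3 calls described above.
    """
    # Stage 1: rank all citations with the cheap scorer and keep the top_k.
    shortlist = sorted(citations, key=coarse_scorer, reverse=True)[:top_k]
    # Stage 2: re-rank the shortlist with the slower, more accurate scorer.
    return sorted(shortlist, key=fine_scorer, reverse=True)
```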
As we go through these steps, we stream information back to you as soon as it’s computed. For example, we return the titles of papers before we’ve computed the claims.
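One way to stream results like this is to emit each piece of a row as soon as its computation finishes instead of waiting for the whole table. The async sketch below is a toy illustration with a hypothetical compute_takeaway function, not Elicit’s actual pipeline.

```python
import asyncio

async def stream_results(question, papers, compute_takeaway):
    """Yield each paper's title right away, then takeaways as they finish.

    `compute_takeaway` is a hypothetical async stand-in for the slower
    language-model calls; titles stream out before any of them complete.
    """
    async def takeaway_with_id(paper):
        return paper["id"], await compute_takeaway(question, paper)

    # Titles are cheap, so emit them immediately.
    for paper in papers:
        yield {"type": "title", "paper_id": paper["id"], "title": paper["title"]}

    # Run the slow model calls concurrently and emit each result as it completes.
    tasks = [asyncio.create_task(takeaway_with_id(paper)) for paper in papers]
    for finished in asyncio.as_completed(tasks):
        paper_id, takeaway = await finished
        yield {"type": "takeaway", "paper_id": paper_id, "takeaway": takeaway}
```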
Much of this workflow will change in the future based on user feedback and internal evaluations, and we already have several changes planned for the near term.
To help you calibrate how much you can rely on Elicit, we’ll share some of the limitations you should be aware of as you use it:
Elicit uses large language models, which have only been around for about three years. While already useful, this early-stage technology is far from “artificial general intelligence that takes away all of our jobs.”
The models aren’t explicitly trained to be faithful to a body of text. We’ve had to customize the models to make sure their summaries and extractions reflect what the abstract actually says, rather than what the model thinks is likely to be true in general (sometimes called "hallucination"). While we’ve made a lot of progress and try hard to err on the side of Elicit saying nothing rather than saying something wrong, in some cases Elicit can miss the nuance of a paper or misunderstand what a number refers to.
Elicit is a very early stage tool and we launch things uncomfortably beta to iterate quickly with user feedback. It’s more helpful to think of Elicit-generated content as around 80-90% accurate, definitely not 100% accurate.
Other people have also helpfully shared thoughts on limitations [1, 2].
As we discussed at the start, Elicit is only as good as the papers underlying it. There are some bad papers we have yet to filter out and there are some important papers not yet in our dataset.
In the same way that good research involves looking for evidence for and against various arguments, we recommend searching for papers presenting multiple sides of a position to avoid confirmation bias.
Elicit works better for some questions and domains than others. We eventually want to help with all domains and types of research but, to date, we’ve focused on empirical research (e.g. randomized controlled trials in social sciences or biomedicine) so that we can apply lessons from the systematic review discipline.
This section is far too short: we tried to share enough to keep you from over-relying on Elicit, but this is not a comprehensive list of possible limitations.
Lastly, more users than we expected might mean that the app breaks. So far, our engineering team has done a phenomenal job keeping the site up. But if you encounter an error, please let us know at [email protected] and thanks for understanding.
Given these limitations, here are some ways to relate to Elicit that can be useful without leading to undue confidence in Elicit’s abilities.
Elicit’s tabular interface with columns that highlight key information about studies aims to make it easier for you to review a study in the context of other studies and to get a preliminary understanding of more studies. This can’t replace digging into the studies and understanding them carefully, but Elicit may be more effective than other tools at showing varied or conflicting perspectives.
The same query may get you different results in different paper search databases, either because different search tools index more or different papers, or because they rank papers differently. Elicit can supplement other search tools to help you discover different papers. The converse is also true: search engines with different ranking algorithms may surface different papers at the top even if they had exactly the same data as Elicit.
Generally, Elicit users today find it helpful as a starting point. They may have a question, but not the best keywords. Some papers might seem relevant, but they may not know whether digging into them would mean getting stuck at a local optimum. Overall, there are far more papers than any of us could ever read, even in an ideal world. Elicit can help with the prioritization decision by showing information about papers and letting you sort or filter by that information.
There are short-term limitations that we expect to overcome as we continue working on Elicit, such as better coverage of papers, higher accuracy of Elicit-generated answers, etc.
But there are also fundamental questions that we will probably wrestle with for a very long time.
It's unlikely that we’ll manage to always get this right. We chose to build a research assistant for many reasons, one of which is because researchers are a particularly rigorous and skeptical group. Researchers keep us honest and provide detailed feedback. There is already a wealth of knowledge about research methodologies and best practices that we have been learning from.
We’d really like Elicit to be a tool that we build together, a tool where you see the tangible impact of your feedback on every page. Many of you have already spent so much time with us sharing your screens, showing us your notes, sending us your papers, and finding ways Elicit can be better.
Seriously, thank you so much. It has not stopped blowing our minds that people are so encouraging and helpful.
If you’d like to learn more, Supervise Process, not Outcomes explains our worldview about machine learning systems and how to make them more aligned with users. The Plan for Elicit describes our progress to date and roadmap. Our mailing list describes the most recent 20 feature launches.