Question Answering on Scientific Research Papers

It’s hard for researchers to keep up with the ever-increasing number of papers getting published every day.

Is there some way we can make the consumption of scientific content more efficient?

We believe there is.

Imagine the typical process of reading a research paper. You first read the title and the abstract and get a pretty good idea of what the paper is about. You might then have follow-up questions about the ideas introduced in the abstract, which you try to answer by reading the rest of the paper. Let’s look at an example: the title and abstract of a paper that appeared (and won the best paper award) at NAACL 2004.

Catching the Drift: Probabilistic Content Models, with Applications to Generation and Summarization

We consider the problem of modeling the content structure of texts within a specific domain, in terms of the topics the texts address and the order in which these topics appear. We first present an effective knowledge-lean method for learning content models from un-annotated documents, utilizing a novel adaptation of algorithms for Hidden Markov Models. We then apply our method to two complementary tasks: information ordering and extractive summarization. Our experiments show that incorporating content models in these applications yields substantial improvement over previously-proposed methods.

Some questions you might have after reading this abstract are:

  • What domains did they experiment with?
  • How do they adapt algorithms for Hidden Markov Models?
  • How do they define the information ordering task?
  • What previously proposed methods do they compare against?
  • Did they experiment with languages other than English?

To answer these questions, you have to scroll through the paper, locate and read the sections that seem relevant, and infer the answers.

What if instead there were an automated system that could reliably point you to the parts of the paper that you’re looking for, or even answer your questions directly? We believe such a question-answering (QA) system could help researchers who are just looking to quickly get targeted pieces of information out of a research paper, and it is with this practical use case in mind that we introduce a new QA task, Question Answering on Scientific Research Papers (or Qasper for short).

An illustration of a QA interface over research papers

You can read about our definition of the task, our new dataset, and the baseline models we built for the task in our NAACL 2021 paper here. We discuss some key ideas from the paper below.

A New Dataset:

To build a Qasper system, we need appropriate data to train the underlying model. The NLP community has built many QA datasets in the past decade, but the existing datasets focus on tasks that are not quite like what we are trying to do.

Reading Comprehension tasks like SQuAD, HotpotQA, and DROP focus on verifying whether a model can process specific types of information presented in a document. Hence, these datasets contain questions written by crowd-workers who already knew the answers to the questions they were writing.

Models built for such datasets are not expected to work well for Qasper since the follow-up questions you might have after reading an abstract are information-seeking in nature.

Information-Seeking Question Answering tasks like Natural Questions, TyDiQA, and BioASQ come with datasets containing questions asked by real people who did not know the answers to those questions. However, they were not asked in the context of specific documents and were later linked to potentially relevant documents. We believe that having a good understanding of what is in a document will prompt readers to ask questions that are more strongly grounded in documents, like the ones shown above.

Since existing datasets are not directly applicable to our task, we built a new QA dataset.

We hired graduate students and NLP practitioners for two separate tasks:

Writing Questions: We showed the annotators only the titles and abstracts of papers and asked them to write questions that the abstracts do not answer but that the full papers are expected to answer. We encouraged them to write questions they think can be answered in a sentence or two, to limit the scope of the task to answering specific questions.

Providing Answers and Evidence: To obtain answers, we showed the annotators entire papers along with the questions. For each question, we asked them to highlight a minimal set of paragraphs, figures, or tables that together provide all the information required to answer it; this serves as the evidence for answering the question. After selecting the evidence, they were asked to provide a concise answer that is either a span in the paper, a written-out phrase, “Yes”, or “No”. Some questions may not be answerable from the papers at all; in those cases, we asked the annotators to mark them as unanswerable.

By separating the two tasks, we ensured that the question writers did not know the answers to the questions, and thus ensured that the data we collected is more realistic. We also paid the data providers an hourly rate, and not a per-question rate, incentivizing quality over quantity. The process resulted in 5,049 questions over 1,585 NLP papers, with about 44% of the questions having multiple annotations.
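
For concreteness, here is a rough sketch of what a single collected annotation might look like. The field names and structure are hypothetical, chosen to mirror the description above (question, evidence, and the four answer types), and are not necessarily the released dataset’s exact schema.

```python
# Illustrative only: a hypothetical record mirroring the fields described above,
# not necessarily the exact schema of the released Qasper dataset.
example_record = {
    "paper_title": "Catching the Drift: Probabilistic Content Models, "
                   "with Applications to Generation and Summarization",
    "question": "What domains did they experiment with?",
    # A minimal set of paragraphs (or figure/table captions) that together
    # provide all the information needed to answer the question.
    "evidence": [
        "…paragraph text copied from the paper…",
    ],
    # Exactly one of the following answer types applies per annotation:
    "answer": {
        "extractive_spans": [],   # spans copied verbatim from the paper
        "free_form_answer": "",   # a concise answer written by the annotator
        "yes_no": None,           # True / False for boolean questions
        "unanswerable": False,    # True if the paper does not answer the question
    },
}
```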

Building a Model:

Qasper requires processing entire documents, so we chose the Longformer model to encode the long contexts. The dataset has different types of answers: extractive (when the annotators select spans in the paper), abstractive (when the answers are written out), boolean (yes/no), and null (when the questions are not answerable). To handle all of these types, we used Longformer in its Encoder-Decoder setting (LED), training it to encode entire papers and directly generate answers. For example, for questions that are unanswerable, the model is trained to generate the string “Unanswerable”.
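
As a rough sketch of what answer generation with LED looks like in code, the snippet below uses the Hugging Face transformers library with a generic (not Qasper-trained) LED checkpoint. The checkpoint name, input formatting, and lengths are illustrative assumptions, not the exact configuration from our experiments.

```python
# Minimal sketch of answer generation with a Longformer Encoder-Decoder (LED).
# Checkpoint, input format, and lengths are illustrative assumptions only.
import torch
from transformers import AutoTokenizer, LEDForConditionalGeneration

model_name = "allenai/led-base-16384"  # a generic LED checkpoint, not a Qasper-tuned model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = LEDForConditionalGeneration.from_pretrained(model_name)

question = "What domains did they experiment with?"
paper_text = "..."  # full text of the paper, e.g. its paragraphs concatenated

# Encode the question together with the (long) paper; LED handles long inputs
# with sparse local attention plus global attention on selected tokens.
inputs = tokenizer(question, paper_text, return_tensors="pt",
                   truncation=True, max_length=16384)

# Give the leading question tokens global attention so every paragraph can attend to them.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, : len(tokenizer(question)["input_ids"])] = 1

output_ids = model.generate(inputs["input_ids"],
                            attention_mask=inputs["attention_mask"],
                            global_attention_mask=global_attention_mask,
                            max_new_tokens=64)
# A Qasper-trained model would generate the answer string directly,
# e.g. "Unanswerable" for questions the paper does not answer.
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```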

In addition to generating answers, the model is also trained to select a set of paragraphs as evidence. For this sub-task, we simply train the model to make a binary decision for each paragraph in the paper on whether it should be included in the evidence. We train the model to minimize the answer and evidence losses jointly. Note that for some questions (about 12% of the data) the evidence includes figures and/or tables; since the model cannot handle multiple modalities, we ignore evidence that is not text.
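
To make the joint training concrete, here is a minimal sketch of how the two losses might be combined, assuming the decoder’s cross-entropy loss over answer tokens is already computed and paragraph token boundaries are known. The mean pooling, the linear head, and the equal weighting of the two losses are illustrative assumptions, not the exact implementation from our paper.

```python
import torch
import torch.nn as nn

# Hypothetical joint-loss sketch: answer-generation loss plus a binary
# evidence-selection loss over paragraphs. Names and pooling are assumptions.
class JointQasperLoss(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.evidence_head = nn.Linear(hidden_size, 1)  # one logit per paragraph
        self.bce = nn.BCEWithLogitsLoss()

    def forward(self, generation_loss, encoder_states, paragraph_spans, evidence_labels):
        """
        generation_loss: cross-entropy loss from the decoder over answer tokens
        encoder_states:  (seq_len, hidden_size) encoder outputs for one document
        paragraph_spans: list of (start, end) token indices, one per paragraph
        evidence_labels: (num_paragraphs,) float tensor of 0/1 evidence labels
        """
        # Mean-pool the encoder states inside each paragraph span.
        pooled = torch.stack([encoder_states[s:e].mean(dim=0)
                              for s, e in paragraph_spans])
        logits = self.evidence_head(pooled).squeeze(-1)
        evidence_loss = self.bce(logits, evidence_labels)
        # Optimize both objectives together (the relative weighting is a design choice).
        return generation_loss + evidence_loss
```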

You can read our paper for more details on our experiments and results.

Here are the high-level results:

The Longformer Encoder-Decoder models generally do significantly worse than humans: about 27 F1 points lower at generating answers and 32 F1 points lower at selecting evidence. When models are trained to answer questions given the gold evidence, they do significantly better, improving by up to 24 F1 points on extractive and abstractive answers, which indicates that most of the difficulty lies in selecting the appropriate evidence. Manual error analysis shows that the two most common error classes are incorrectly predicting questions to be unanswerable and generating answers with the wrong entity type, indicating that the model (unsurprisingly) lacks domain knowledge.
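
For reference, the answer scores here are token-overlap F1, roughly in the style of SQuAD-like QA evaluation. Below is a minimal sketch of how such a score can be computed; it uses plain whitespace tokenization and skips the text normalization an official evaluation script would typically apply.

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1, a simplified sketch of SQuAD-style answer scoring."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Partial overlap between prediction and gold answer yields F1 = 0.67.
print(round(token_f1("two news domains", "two newswire domains"), 2))
```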

You can test the model yourself by playing with our demo.

Next Steps:

Now that we have a dataset, we are closer to having a usable QA system over research papers, but we are not fully there yet. If you test the model using our demo, you will see that it does well on some common questions like What datasets do they use?, but not on questions that require a deeper understanding of the papers. One such example is the question How do they adapt algorithms for Hidden Markov Models? on the paper we cited at the beginning of this article. You can test it out on the paper here.

So what needs to be done to make a usable Qasper system? Here are some potential directions.

Domain-specific pre-training: The Longformer Encoder-Decoder model we used is based on BART, which was pre-trained on web text and publicly available books, text that is quite different from research papers. It is possible that pre-training on research papers, or at least adapting a pre-trained model to the research-paper domain, could significantly improve the model’s end-task performance.

More task-specific data: Our current Qasper dataset is smaller than general-domain document-level QA datasets, and training on more data is expected to result in better models. Moreover, the dataset currently includes only NLP papers. While models trained on this set may perform well on related fields like Machine Learning, they are not expected to transfer directly to, say, Biomedicine. Hence, we need more data, including data from other fields of research. However, collecting such data is expensive because it requires domain experts rather than crowd-workers, so we also need more efficient data collection methods.

Modeling innovations: The models still have a lot of room for improvement, as indicated by the large gap between model and human performance. We invite the NLP community to work on this task.


Allen Institute for AI: https://allenai.org/

AI2 Blog post: https://medium.com/ai2-blog/question-answering-on-scientific-research-papers-f6d6da9fd55c


A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers

Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A. Smith, Matt Gardner | Published 2021 | Computer Science | arXiv

Readers of academic research papers often read with the goal of answering specific questions.

Question Answering systems that can answer those questions can make consumption of the content much more efficient. However, building such tools requires data that reflect the difficulty of the task arising from complex reasoning about claims made in multiple parts of a paper. In contrast, existing information-seeking question answering datasets usually contain questions about generic factoid-type information.

We therefore present QASPER, a dataset of 5,049 questions over 1,585 Natural Language Processing papers. Each question is written by an NLP practitioner who read only the title and abstract of the corresponding paper, and the question seeks information present in the full text.

The questions are then answered by a separate set of NLP practitioners who also provide supporting evidence for their answers. We find that existing models that do well on other QA tasks do not perform well on answering these questions, underperforming humans by at least 27 F1 points when answering them from entire papers, motivating further research in document-grounded, information-seeking QA, which our dataset is designed to facilitate.

Qasper | 2021

A dataset containing 1,585 papers with 5,049 information-seeking questions asked by regular readers of NLP papers and answered by a separate set of NLP practitioners.

License: CC BY


Current Version: 0.1

Clicking Download will provide a link to download the training and development sets of the latest version of the dataset.

Test set and official evaluator

Once you are ready to evaluate your finalized model on the test set, click here to download the test split from the latest version of the data and the official evaluation script.

Authors: Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A. Smith, Matt Gardner

The Qasper Demo

The goal of the Qasper project is to make it easier for you to read academic papers. We do so by adding a Question Answering interface over papers: after you read the title and the abstract of a paper, you can ask questions and have the interface navigate to the relevant parts of the paper.

You can try it out on some papers below.

Note that the underlying model is still not very accurate, but it represents the current state-of-the-art.
