Since its establishment in 2016, Megagon Labs has conducted research in Natural Language Processing and published research papers at top-notch venues such as ACL, NAACL, EMNLP, and TACL.
This year’s ACL-IJCNLP 2021 was a fully virtual conference, a format that is becoming the new standard given the current situation. There were 710 accepted papers out of 3,350 submissions, covering a wide range of topics. In this blog, we discuss some of the highlights of ACL-IJCNLP 2021.
We had a great time exploring and discussing some of the state-of-the-art research in Natural Language Understanding, Language Modeling, Explainability, Summarization, and Question Answering. Below we share summaries of papers that we found interesting through the lens of our research goals, followed by snippets from keynotes, panels, and other discussions.
Natural Language Understanding and Language Modeling
Database reasoning over text. This paper envisions Neural Databases, a class of systems for querying facts represented as short natural language sentences. The data, therefore, is not represented using a predefined schema as in traditional database systems. A Neural Database uses transformers as localized answer-derivation engines and uses database-style operations (support set generation, select-project-join, aggregation) to answer complex database-style queries. The paper shows that while SPJ queries can be answered easily with a transformer, set-based and aggregation queries need database-style treatment. The authors released a novel dataset and codebase on GitHub.
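To illustrate the flavor of this pipeline, here is a minimal sketch (our own, not the authors’ system): each fact acts as a tiny support set, a generic extractive QA model stands in for the neural answer-derivation engine, and the aggregation happens outside the transformer. The model name and the confidence threshold are illustrative assumptions.

```python
from transformers import pipeline

# Facts stored as short natural language sentences, with no predefined schema.
facts = [
    "Nicholas lives in Vancouver.",
    "Sheryl lives in Vancouver.",
    "Teuvo lives in Helsinki.",
]

# A generic extractive reader stands in for the localized answer-derivation engine.
reader = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

query = "Who lives in Vancouver?"
answers = []
for fact in facts:                       # each fact acts as a (trivial) support set
    out = reader(question=query, context=fact)
    if out["score"] > 0.5:               # keep confident extractions only (illustrative threshold)
        answers.append(out["answer"])

print(answers)                           # select-project over individual facts
print("COUNT:", len(answers))            # aggregation happens outside the transformer
```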
Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. This paper investigates why fine-tuning pre-trained language models with hundreds of millions of parameters on datasets with only hundreds or thousands of labeled examples produces state-of-the-art results. The authors hypothesize that pre-trained models have a low intrinsic dimension; that is, a low-dimensional reparameterization exists that is as effective for fine-tuning as the full parameter space. They empirically show that pre-training implicitly minimizes intrinsic dimension. Surprisingly, larger models tend to have a lower intrinsic dimension and are thus effective on smaller datasets.
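As a concrete illustration of the low-dimensional reparameterization idea, here is a toy PyTorch sketch (our own, not the authors’ code): the “pre-trained” weights θ₀ stay frozen and training only updates a small vector d, which a fixed random projection P maps back into the full parameter space, i.e., θ = θ₀ + Pd. The model size and intrinsic dimension below are illustrative.

```python
import torch
import torch.nn.functional as F

# Toy setting: a linear classifier stands in for a large pre-trained model.
D_in, n_classes = 768, 2
theta0 = torch.randn(D_in * n_classes + n_classes) * 0.02  # "pre-trained" weights, frozen
D = theta0.numel()                                          # full parameter dimension
d_int = 16                                                  # assumed intrinsic dimension

P = torch.randn(D, d_int) / d_int ** 0.5    # fixed random projection into the full space
d = torch.zeros(d_int, requires_grad=True)  # the only trainable parameters

def forward(theta, x):
    """Run the classifier with weights taken from a flat parameter vector."""
    W = theta[: D_in * n_classes].view(n_classes, D_in)
    b = theta[D_in * n_classes:]
    return F.linear(x, W, b)

opt = torch.optim.Adam([d], lr=1e-2)
x, y = torch.randn(32, D_in), torch.randint(0, n_classes, (32,))
for _ in range(200):
    theta = theta0 + P @ d                  # theta = theta_0 + P d
    loss = F.cross_entropy(forward(theta, x), y)
    opt.zero_grad()
    loss.backward()                         # gradients flow only to d
    opt.step()
print("final loss:", loss.item())
```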
When Do You Need Billions of Words of Pretraining Data? This paper investigates what knowledge or skills Transformer LMs learn from large-scale pretraining that cannot be learned from less data. A series of MiniBERTs are trained on varying amounts of pre-training data and then evaluated (and probed) to understand their abilities. The authors find that good NLU task performance requires far more data than learning good representations of linguistic features: the linguistic knowledge of models pre-trained on 100M words and on 30B words is similar. There are skills critical to solving downstream NLU tasks that LMs can only acquire with billions of words of pre-training data; these could include factual and commonsense knowledge.
Knowledgeable or Educated Guess? Revisiting Language Models as Knowledge Bases. This paper claims that benchmarks like LAMA and subsequent prompt-based methods (which optimize prompts using the LAMA dataset) do not reflect how well these methods extract factual knowledge from LMs. The authors use prompt-based retrieval to extract factual knowledge from LMs. Some of their main findings are: (a) prompt-based retrieval generates similar predictions for quite different datasets, (b) the prediction distribution is prompt-biased, and (c) better prompts overfit to the answer distribution rather than improving the prompt’s retrieval ability. The paper includes case studies on how to improve a prompt’s retrieval ability and shows that type guidance can greatly improve prompt performance.
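For context, LAMA-style probing boils down to cloze-style prompting of a masked LM. Below is a minimal sketch using Hugging Face’s fill-mask pipeline; the prompt, model, and relation are illustrative, not taken from the paper.

```python
from transformers import pipeline

# LAMA-style factual probing: query a masked LM with a cloze-style prompt.
fill = pipeline("fill-mask", model="bert-base-uncased")

prompt = "Dante was born in [MASK]."   # illustrative place-of-birth template
for pred in fill(prompt, top_k=5):
    print(f"{pred['token_str']:>12}  {pred['score']:.3f}")

# Swapping the subject while keeping the template often yields a similar prediction
# distribution, which is the prompt-bias effect the paper highlights.
```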
It is impossible to cover all the interesting papers! Here are some other notable mentions:
- On the Gap between Adoption and Understanding in NLP. This paper discusses several issues in current NLP research trends and makes recommendations for addressing them.
- How Reliable are Model Diagnostics? There is recent impetus for developing suites of probes for understanding models beyond simple metrics like accuracy or BLEU. This paper, however, finds that current likelihood-based and representation-based model diagnostics are not very reliable.
- UnNatural Language Inference. State-of-the-art NLI models are largely invariant to word order and often accept permuted examples. Permutations that preserve local bigrams tend to be accepted. The average entropy of models on permuted examples is low, indicating that they are quite confident in these predictions. Human accuracy on a permuted dataset (examples accepted by the model) is 60%.
Explainability and Interpretability
As last year, there were several papers on the faithfulness of attention-based explainability. Chrysostomou and Aletras try to improve the faithfulness of attention-based explanations by introducing task-specific, non-contextualised information for each token to scale the original attention weights.
Recently, sparse attention models have been introduced to highlight influential inputs for better interpretability. However, through a set of experiments, Meister et al. did not find any plausible mapping from sparse attention to a sparse set of influential inputs. The authors conclude that further research is needed before claiming that sparse attention increases model interpretability.
Shapley values from game theory have been widely adopted for model explainability in the ML community. Ethayarajh and Jurafsky encourage NLP practitioners to adopt Shapley values as well because they offer more specific interpretations with theoretical guarantees, as well as group-wise interpretations. Specifically, the paper proves that attention weights and leave-one-out values, which have been predominantly used in interpretable NLP work, cannot be Shapley values, whereas attention flows are Shapley values under certain conditions.
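For readers unfamiliar with Shapley values, the sketch below computes them exactly for a short token sequence by averaging each token’s marginal contribution over all subsets. The value function is a toy stand-in for a model score, not anything from the paper; exact computation is exponential in the number of tokens, which is why practical methods approximate it.

```python
from itertools import combinations
from math import factorial

def shapley_values(tokens, value_fn):
    """Exact Shapley values: the weighted average marginal contribution of each
    token over all subsets of the remaining tokens (feasible only for short inputs)."""
    n = len(tokens)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for subset in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                with_i = value_fn([tokens[j] for j in sorted(subset + (i,))])
                without_i = value_fn([tokens[j] for j in subset])
                phi[i] += weight * (with_i - without_i)
    return phi

# Toy value function standing in for a model's score on the retained tokens.
positive = {"great", "love"}
score = lambda toks: sum(t in positive for t in toks)

tokens = ["i", "love", "this", "great", "movie"]
print(dict(zip(tokens, shapley_values(tokens, score))))
```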
Sevastjanova et al. explore how contextualization is captured in language models using visual analytics. This paper treats word functionality as a continuum between function and content words, and compares self-similarity scores of different word groups. Their visual analytics tool LMExplorer visualizes layer-wise embeddings and their contextualization scores. Using the visualization, the paper observes that BERT can model content words and function words well, but not semi-functional, semi-content words. Overall, BERT learns content words more easily and earlier than function words.
Summarization
MultimodalSum, as the name implies, takes images and tables as input in addition to reviews for opinion summarization. The model has three separate encoders, one per modality, connected to a single decoder that fuses hidden representations from the three sources (texts, images, and tables). MultimodalSum outperforms existing opinion summarization techniques that only use textual information. A qualitative analysis in the paper shows that MultimodalSum can generate a summary containing a description that does not explicitly appear in the original reviews, which suggests that the other modalities provide complementary information for summary generation.
Re-ranking generations is a common approach for NLP tasks such as machine translation, but not (yet) for summarization. PASS is a two-stage framework that introduces re-ranking for review summarization. To summarize multiple reviews into a single summary, PASS perturbs the input reviews to create multiple review sets, which are then fed into a summarizer that generates a summary for each (synthetically created) review set. PASS then selects the most coherent generated summary using a model trained to rank text by coherence.
It is getting difficult to find papers that do not use any pre-trained LMs. For summarization and generation tasks, pre-trained encoder-decoder models such as BART or T5 are commonly used. However, little is known about the extent to which knowledge from pre-trained models contributes to summarization performance. This paper sheds light on this question. The authors analyze the token distributions produced by a fine-tuned model, the decoder of the fine-tuned model, and the decoder of the original LM. Based on the distance between the token distributions of these models, the paper categorizes each token into one of four categories to discuss whether the input source and fine-tuning are necessary.
Prompt learning is a hot topic in the NLP community. Prefix-tuning prepends a sequence of continuous task-specific vectors to the input to “switch” the model’s behavior for each task. The use of continuous vectors is the major difference from existing prompt-learning methods: although the “prompt” is no longer interpretable, it can flexibly control the behavior of the LM. For training, prefix-tuning optimizes the task-specific continuous vectors while keeping the language model parameters frozen. Prefix-tuning thus offers a lightweight alternative to conventional fine-tuning, which requires storing a snapshot of a huge LM for each task. The authors applied prefix-tuning to GPT-2 for table-to-text generation and to BART for summarization. The experimental results show that prefix-tuning achieves performance comparable to fine-tuning with 1000x fewer parameters. For a literature review, it is worth taking a look at this survey paper recently uploaded to arXiv.
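The sketch below conveys the core idea under simplifying assumptions: it prepends trainable vectors only at the embedding layer of a frozen GPT-2 (real prefix-tuning conditions every layer through the key-value cache), and the table-to-text training string is made up.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")
for p in lm.parameters():
    p.requires_grad = False            # the language model stays frozen

prefix_len, dim = 10, lm.config.n_embd
prefix = torch.nn.Parameter(torch.randn(prefix_len, dim) * 0.02)  # task-specific vectors
opt = torch.optim.Adam([prefix], lr=5e-4)

# Made-up table-to-text example: linearized table -> target description.
text = "name: Blue Spice | type: coffee shop -> Blue Spice is a coffee shop."
ids = tok(text, return_tensors="pt").input_ids
embeds = lm.get_input_embeddings()(ids)                       # (1, T, dim)
inputs = torch.cat([prefix.unsqueeze(0), embeds], dim=1)      # prepend the prefix

# Label prefix positions with -100 so they are ignored by the LM loss.
labels = torch.cat([torch.full((1, prefix_len), -100), ids], dim=1)
loss = lm(inputs_embeds=inputs, labels=labels).loss
opt.zero_grad()
loss.backward()
opt.step()                                                    # only the prefix is updated
print("loss:", loss.item())
```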
Question Answering
Mind Your Outliers! Investigating the Negative Impact of Outliers on Active Learning for Visual Question Answering. This paper shows that active learning methods on visual question answering tasks are often ineffective because of collective outliers. These outliers can be detected using dataset maps that help distinguish between easy-to-learn useful data and hard-to-learn collective outliers. The authors show that the sample efficiency of active learning methods increases significantly as the number of collective outliers in the active learning pool decreases.
Few-Shot Question Answering by Pre-training Span Selection. This paper shows that aligning the pre-training scheme with the downstream task eliminates the need for large fine-tuning datasets. The authors propose a new pre-training scheme for reading comprehension. Given a passage with multiple sets of recurring spans, they mask all but one instance of each recurring span with a special [QUESTION] token and ask the model to select the correct span for each such token. At fine-tuning time, they simply append a [QUESTION] token to the question. Their model obtains 72.7 F1 on SQuAD with only 128 examples, outperforming the baselines by 17 points (SpanBERT) to 30 points (RoBERTa).
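A toy version of the data construction might look like the following (our own illustration, not the authors’ code): find a span that recurs in a passage, keep one occurrence, and replace the others with a [QUESTION] token that the model must resolve back to the kept span.

```python
from collections import Counter

QUESTION_TOKEN = "[QUESTION]"

def make_recurring_span_example(passage, min_len=2):
    """Toy construction of a recurring-span-selection example: find a repeated
    n-gram, keep its first occurrence, and replace later ones with [QUESTION]."""
    tokens = passage.split()
    # Look for the longest n-gram (>= min_len tokens) that occurs more than once.
    for n in range(len(tokens) // 2, min_len - 1, -1):
        ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        repeated = [g for g, c in ngrams.items() if c > 1]
        if repeated:
            span = " ".join(repeated[0])
            cut = passage.find(span) + len(span)           # keep the first occurrence
            masked = passage[:cut] + passage[cut:].replace(span, QUESTION_TOKEN)
            return masked, span   # the model must select `span` for each [QUESTION]
    return passage, None

passage = ("Ada Lovelace wrote the first algorithm . Many historians regard "
           "Ada Lovelace as the first programmer .")
masked, answer = make_recurring_span_example(passage)
print(masked)
print("answer span:", answer)
```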
Challenges in Information-Seeking QA. This paper addresses the challenges in Information-Seeking QA datasets such as Natural Questions and TyDi-QA. Using controlled experiments, the authors show that answerability prediction (that is, whether the answer is long, short, or an empty string) and paragraph retrieval remain open problems for the task. They further examine the unanswered questions and make the following suggestions to improve answer coverage: (a) consider additional sources of information besides Wikipedia, (b) address ambiguous queries instead of marking them as unanswerable, (c) enable abstractive answers for non-factoid questions.
UnitedQA: A Hybrid Approach for Open Domain Question Answering. This paper combines an extractive reader (ELECTRA) and a generative reader (T5) for open-domain question answering. The authors’ hybrid approach outperforms single models and homogeneous ensembles and establishes new state-of-the-art performance on NaturalQuestions and TriviaQA (by about 3 points).
On the Efficacy of Adversarial Data Collection for Question Answering: Results from a Large-Scale Randomized Study. The authors show that models trained on adversarial data perform well on other adversarial benchmarks but worse on datasets that were not collected adversarially. Using adversarial data for augmentation is better than training on adversarial data alone.
Summary: From Accuracy to Real Impact
Keynote speakers Christopher Potts and Alex Christia, and ACL President Rada Mihalcea, emphasized the need to shift the focus of our research contributions away from “accuracy” and towards novelty and impact. This was also a central theme in several other sessions, including the GreenNLP panel and the business meeting.
The community should build tools that work at scale, across languages and cultures, and that enable people with limited resources to contribute to research in NLP. The key messages that resonated with us were:
- “Do exactly what you said you’d do.” We should collectively think about disseminating knowledge not only to the scientific community, but also to NLP practitioners and non-experts. When we optimize for paper acceptance, we are incentivized to downplay limitations even when we are aware of them. This can be especially misleading to practitioners and leaders.
- We should promote responsible use of benchmark datasets and focus system assessments on real-world concerns.
- Beyond results focused on averages, edge cases, deployment feasibility, and performance on underrepresented groups are also important. Relevant metrics include efficiency, interpretability, applicability, robustness, and scalability.
- Communicating progress in NLP is a social responsibility. Engage with the media in more productive ways: cut the hype and convey results in a manner that actually educates the public.
Some other snippets from various conversations:
- Q: Are models always desired to be “human-like”? We expect models to do correctly what humans do correctly, but what about mistakes? Should we also expect models to make the same types of errors that humans make?
- Adina Williams: If you are interested in investigating how well models act as cognitive models of human behavior, then it makes sense to try to discover how models differ from humans. However, if you want a model for a specific engineering goal (such as finding the best answer to a query), then it should be fine to only care about the positive side; that is, we would try to make models achieve what humans can do on the task. In fact, that might be ideal given an engineering goal.
- Q: How do we rethink neural IR from a user perspective?
- Christopher Potts: Traditional IR methods that score documents based on keywords enable tracking the provenance of results and frequent updates to the corpus, but they don’t offer a precise answer to user queries. In contrast, modern end-to-end systems digest the corpus offline and offer a synthesized answer to a user query; however, they don’t support provenance and can’t be updated easily. Practical systems, therefore, should use neural retrieval to find relevant sentences, extract information from them, and synthesize the answer. This approach supports provenance, updatability, and synthesis.
- GreenNLP panel suggestions for reviewers:
- Phil Blunsom: Training a bigger model on more data will make things better. But we don’t need to keep publishing that result.
- Noah Smith: NLP should be accessible to people with limited resources. We can’t afford to turn NLP into something doable only by people with GPU farms just because language is a very complex problem.
- Panelists: Reviewers should ask for more experiments when scientific evidence supporting the main claims of the paper is missing. We should push for papers to explicitly formulate hypotheses.
Attending a virtual conference is not always easy because of time-zone differences, but we had a great time at ACL-IJCNLP 2021 thanks to the organizers, speakers, and authors. We are becoming more accustomed to attending virtual conferences than ever before. We hope to see you (in person) at the next one!
Written by Nikita Bhutani, Hannah Kim, Yoshihiko Suhara, and Megagon Labs