ACL 2022 Highlights

This year, Dublin, Ireland hosted ACL 2022, a hybrid conference on computational linguistics (CL). Our team sponsored and attended the conference. ACL commemorated its 60th anniversary with a special theme track on language diversity, from low-resource to endangered languages, aimed at stimulating discussions that drive CL and natural language processing (NLP) research towards promoting language diversity. There were 701 papers accepted out of 3,378 submissions on a range of topics. In this blog, we provide an overview of the invited talks and panel discussions. In addition, we discuss our top paper picks on information extraction, language understanding, prompting, language generation, and explainability, which are also relevant to ongoing research at Megagon Labs.

Next Big Ideas Talks Panel

A new initiative at this year's ACL was a plenary session in which leading researchers shared their thoughts on important research directions in NLP. Here are some of those big ideas:

  1. Heng Ji emphasized the need for more structured information in NLP. For example, Ji suggested moving from sentence-level or document-level extraction towards corpus-level extraction.
  2. Mirella Lapata, who is on the Megagon Labs advisory board, urged researchers to look at stories as data-rich ways of studying interesting natural language understanding (NLU) problems around shape, structure, and recurrent themes in natural language text.
  3. Dan Roth emphasized the importance of reasoning abilities in NLU for making decisions. Reasoning abilities cannot be acquired directly through training, nor can they be evaluated with a single model.
  4. Thamar Solorio advocated for multi-lingual technologies. She argued that we need language technologies that are actually representative of the ways in which people use language.
  5. Marco Baroni promoted modularity wherein pre-trained models can learn to interact with each other to solve new tasks together.
  6. Eduard Hovy urged researchers to invest in schemas and representations to guide the models as they cannot learn knowledge that is rare or implicit in the training data. We need to put more effort into identifying what data might indirectly contain the relevant knowledge (e.g. goals, event schemas, action plans, entity properties) and what methods might elicit this knowledge.  
  7. Hang Li also advocated for symbolic reasoning and suggested that neuro-symbolic architectures are the way to go.

Spotlight Talks by Young Rising Stars (STIRS)

STIRS was an initiative put forth by ACL 2022 to help foster the growth of young and promising researchers. Spotlight Talks by Young Rising Stars featured 10-minute talks by Eunsol Choi, Ryan Cotterell, Sebastian Ruder, Swabha Swayamdipta, and Diyi Yang. We summarize a few selected ones below.

  1. Sebastian Ruder pointed out major challenges that arise when scaling NLP systems to the next 1,000 languages. These included multimodality, computational efficiency, real-world evaluation, and language varieties. This was in line with the special theme track “Language Diversity: from Low-Resource to Endangered Languages.” Ruder’s talk also aligned with other panels and papers at the conference that emphasized the need to study under-represented languages.
  2. Diyi Yang urged attendees to look at the social impact of NLP models when they are adopted in the real world. She suggested teams should augment the NLP pipeline with components inspired by social science research.
  3. Swabha Swayamdipta’s talk centered on the generalization of NLP models. She emphasized the need for better tools to analyze model-data relationships before investing more resources into data collection and annotation.

Our Picks

Let’s discuss some of our favorite picks from the conference, organized by topic.

Evaluating Factuality and Hallucinations in Natural Language Generation

Factually inconsistent and hallucinated content has become a major obstacle to applying natural language generation techniques, and it has received increasing attention in recent years. Yet evaluating factuality and hallucination remains an unsolved problem. At this year’s ACL, three papers studied this issue.

This study performs a systematic analysis of factual errors and hallucinations in text-simplification benchmarks and models. Through crowd-sourced manual evaluation, the authors discovered that both kinds of errors are common in popular text simplification datasets. Moreover, they categorized three types of errors (insertion, deletion, and substitution) and quantified the error distributions on reference examples and system outputs. They also compared the human-annotated results with existing automatic metrics for semantic similarity and factuality. The results show that existing semantic-similarity metrics correlate well with deletion errors but poorly with other types of errors. The study confirms that evaluating factuality is a challenging and unsolved task.

This study explores abstractive summarization systems and finds that hallucinated content in generated summaries can sometimes be factual. Such factually correct but hallucinated content can in fact improve the quality of the summaries. To distinguish “good” hallucinations from “bad” hallucinations, the authors proposed novel methods for predicting a summary’s hallucination and factuality status using the prior and posterior probabilities of the corresponding entity. The authors further applied the proposed hallucination detector as a reward function in a reinforcement learning summarizer and demonstrated that it significantly improves the factuality of the generated content.
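The core signal is intuitive: compare how probable an entity is for the language model before it sees the source (prior) versus after the source is provided (posterior). Below is a minimal sketch of that comparison; it is our own illustration rather than the paper’s code, and the model choice (GPT-2 via Hugging Face transformers), prompt formats, and example texts are assumptions.

```python
# Sketch: score an entity in a generated summary by its prior probability
# (no source document) vs. its posterior probability (conditioned on the source).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def entity_log_prob(context: str, summary_prefix: str, entity: str) -> float:
    """Sum of log-probabilities of the entity tokens given the preceding text."""
    prefix_ids = tokenizer(context + summary_prefix, return_tensors="pt").input_ids
    entity_ids = tokenizer(" " + entity, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, entity_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    start = prefix_ids.shape[1]                      # first entity position
    targets = input_ids[:, start:]                   # entity token ids
    token_lp = log_probs[0, start - 1 : start - 1 + targets.shape[1]]
    return token_lp.gather(-1, targets[0].unsqueeze(-1)).sum().item()

source = "Nike reported quarterly revenue of 12.2 billion dollars."
summary_prefix = "Summary: Quarterly revenue was reported by"
entity = "Nike"

prior = entity_log_prob("", summary_prefix, entity)                      # no source
posterior = entity_log_prob("Document: " + source + "\n", summary_prefix, entity)
print(f"prior={prior:.2f}  posterior={posterior:.2f}")
# One plausible reading of the signal: entities that become much more probable
# once the source is given are better supported by it, while entities that are
# only likely a priori are candidates for "bad" hallucinations.
```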

The authors perform a causal-inspired analysis to assess pre-trained language models’ (PLMs) ability to produce factual content. Essentially, this paper tries to answer a fundamental question: How do PLMs capture factual knowledge? To address it, the paper studies which words in the remaining context a PLM relies on when producing factual content. In particular, it categorizes the context words into three groups: knowledge-dependent words (KD), positionally close words (PC), and highly co-occurring words (HC). Through this analysis, the authors discovered that PLMs tend to rely more on the less effective groups (PC and HC) than on the more effective knowledge-dependent clues.

While all three papers address ways of moving towards more factual and less hallucinated content, they are still either task-specific or model-specific. Thus, we can expect more work on effective, task-agnostic evaluation metrics for text generation in the future.

Prompting

Prompting is a new paradigm in natural language processing in the post-GPT-3 era: it adapts a large pre-trained language model to a desired task by providing natural language instructions, demonstrating a few training examples, or prepending continuous embeddings. At least 20 papers about prompting were accepted at this year’s ACL. Our favorites include the following:

This study analyzes how the order of training examples affects in-context few-shot learning. The authors found that the order of training examples used in the prompt causes large variations in the final model’s performance. Based on these observations, they proposed a method to automatically determine a good order of training examples using the entropy of the pre-trained language model’s predictions (see the sketch below). Experimental results show that their proposed method improves accuracy regardless of the choice of language model.
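To make the idea concrete, here is a minimal sketch of ranking demonstration orders by prediction entropy on a small probing set. This is our own illustration of the general recipe, not the paper’s code; the sentiment task, the placeholder `fake_predict` function, and the probe inputs are hypothetical.

```python
# Sketch: score every permutation of the few-shot demonstrations by the average
# entropy of the model's predicted label distribution on a small probing set.
import itertools
import math
from typing import Callable, List, Sequence, Tuple

def label_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy of a predicted label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def rank_orders(
    examples: List[Tuple[str, str]],
    probe_inputs: List[str],
    predict_probs: Callable[[str, str], Sequence[float]],
) -> List[Tuple[float, Tuple[Tuple[str, str], ...]]]:
    """One plausible heuristic in the spirit of entropy-based selection:
    prefer orders whose probe predictions are not collapsed onto one label."""
    scored = []
    for order in itertools.permutations(examples):
        prompt = "\n".join(f"Review: {x}\nSentiment: {y}" for x, y in order)
        avg_entropy = sum(
            label_entropy(predict_probs(prompt, probe)) for probe in probe_inputs
        ) / len(probe_inputs)
        scored.append((avg_entropy, order))
    return sorted(scored, reverse=True)

# Usage with a stand-in model: replace `fake_predict` with a call to a real LM
# that returns per-label probabilities for the probe appended to the prompt.
def fake_predict(prompt: str, probe: str) -> Sequence[float]:
    return [0.6, 0.4]  # placeholder label distribution

demos = [("great movie", "positive"), ("boring plot", "negative")]
print(rank_orders(demos, ["fun but long"], fake_predict)[0])
```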

This study examines the benefits of prepending retrieved training examples as a prompt during training and testing for more complex NLP tasks such as summarization, machine translation, and question answering. The authors retrieved similar training examples using BM25 and prepended them to the input as demonstrations (a minimal sketch follows). They found that this simple approach improves performance on many complex NLP tasks, and that demonstrating training examples yields additional gains even in fully supervised settings.
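As a concrete illustration, here is a minimal sketch of BM25-based demonstration retrieval, assuming the third-party rank_bm25 package; the toy sentiment corpus and the prompt format are our own placeholders, not the paper’s setup.

```python
# Sketch: retrieve the most similar training examples with BM25 and prepend
# them to the test input as demonstrations.
from rank_bm25 import BM25Okapi

train_inputs = [
    "the movie was wonderful and moving",
    "the plot dragged and the acting was flat",
    "a delightful family comedy",
]
train_outputs = ["positive", "negative", "positive"]

bm25 = BM25Okapi([text.split() for text in train_inputs])

def build_prompt(test_input: str, k: int = 2) -> str:
    """Prepend the k most similar training examples (by BM25 score) to the input."""
    scores = bm25.get_scores(test_input.split())
    top_idx = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    demos = "\n".join(
        f"Input: {train_inputs[i]}\nOutput: {train_outputs[i]}" for i in top_idx
    )
    return f"{demos}\nInput: {test_input}\nOutput:"

print(build_prompt("a wonderful and heartfelt comedy"))
```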

Unlike the above studies using natural language prompts, this study focuses on continuous prompting, which directly optimizes sequences of embeddings to control frozen large pre-trained language models (a minimal sketch of soft-prompting appears below). In particular, the authors found that introducing an intermediate stage of soft-prompt learning (between pre-training of the language model and soft-prompting on downstream tasks) greatly improved performance with soft-prompting and also reduced its variance. They also found that the similarity of prompt embeddings allows us to predict which intermediate training tasks can improve the performance of soft-prompting on downstream tasks.
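For readers new to continuous prompts, here is a minimal sketch of soft-prompting, assuming PyTorch and Hugging Face transformers with GPT-2 as the frozen backbone; the prompt length, learning rate, and example text are arbitrary placeholders, and the intermediate soft-prompt learning stage studied in the paper is not shown.

```python
# Sketch: prepend a small matrix of trainable prompt embeddings to the input
# embeddings of a frozen GPT-2 model; only the prompt embeddings are updated.
import torch
from torch import nn
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
for param in model.parameters():                      # freeze the pretrained LM
    param.requires_grad = False

num_prompt_tokens, hidden_size = 20, model.config.n_embd
soft_prompt = nn.Parameter(torch.randn(num_prompt_tokens, hidden_size) * 0.02)
optimizer = torch.optim.AdamW([soft_prompt], lr=1e-3)

def loss_with_soft_prompt(text: str) -> torch.Tensor:
    """LM loss with the soft prompt prepended to the input embeddings."""
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    token_embeds = model.get_input_embeddings()(input_ids)      # (1, T, H)
    prompt_embeds = soft_prompt.unsqueeze(0)                    # (1, P, H)
    inputs_embeds = torch.cat([prompt_embeds, token_embeds], dim=1)
    # ignore the prompt positions in the loss with label id -100
    labels = torch.cat(
        [torch.full((1, num_prompt_tokens), -100, dtype=torch.long), input_ids], dim=1
    )
    return model(inputs_embeds=inputs_embeds, labels=labels).loss

loss = loss_with_soft_prompt("This restaurant was surprisingly good.")
loss.backward()        # gradients flow only into soft_prompt
optimizer.step()
print(float(loss))
```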

Various Topics

This paper studies the generalization and memorization capabilities of models in noisy and low-resource scenarios. The authors found that training is almost unaffected by label noise and that it is possible to reach near-optimal results even on extremely noisy datasets (a minimal sketch of such a noise-injection setup follows the list). When fine-tuning BERT and other models, they empirically identified three phases of learning:

  • Phase 1: Fitting, where the model learns the simplest patterns, reflected in sharp increases in accuracy.
  • Phase 2: Settling, where the increase in performance plateaus and neither the validation nor the training performance changes considerably.
  • Phase 3: Memorization, where the model starts to memorize the noisy training examples.
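To make the experimental setting concrete, here is a minimal sketch of injecting symmetric label noise into a toy classification dataset before fine-tuning; it is our own illustration, and the dataset, noise rate, and labels are placeholders rather than the paper’s setup.

```python
# Sketch: flip a fraction of the labels at random so generalization vs.
# memorization can be tracked while fine-tuning on the noisy data.
import random
from typing import List, Tuple

def add_label_noise(
    data: List[Tuple[str, int]], num_labels: int, noise_rate: float, seed: int = 0
) -> List[Tuple[str, int]]:
    """Flip each label to a different random class with probability `noise_rate`."""
    rng = random.Random(seed)
    noisy = []
    for text, label in data:
        if rng.random() < noise_rate:
            label = rng.choice([l for l in range(num_labels) if l != label])
        noisy.append((text, label))
    return noisy

clean = [("great food", 1), ("terrible service", 0), ("loved it", 1), ("awful", 0)]
print(add_label_noise(clean, num_labels=2, noise_rate=0.5))
```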

This paper studies the problem of data sparsity in attribute value extraction from e-commerce websites. Prior work addressed this problem by modeling extraction as a QA task: given an attribute (e.g., brand name) as the query and product data as the context, the goal is to extract the attribute’s value (e.g., Nike) from the context. However, previous models struggled to generalize when they encountered rare and ambiguous attributes. This paper proposes knowledge-driven query expansion techniques to mimic inference scenarios where perfect knowledge of values may not be available. Specifically, the authors use knowledge dropout and token mixing to create imperfect, synthetic examples for training (a minimal sketch of the knowledge-dropout idea follows). Models trained with these techniques generalize better on rare and ambiguous attributes.
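Here is a minimal sketch of the knowledge-dropout flavor of query expansion, as we understand it from the description above; the function name, the QA-style query template, and the toy product data are our own assumptions, not the paper’s implementation.

```python
# Sketch: expand an attribute query with known values, but randomly drop some
# of them so the model also learns to extract values with incomplete knowledge.
import random
from typing import List

def expand_query(
    attribute: str, known_values: List[str], dropout: float = 0.3, seed: int = 0
) -> str:
    """Build a QA-style query from an attribute plus a possibly incomplete value list."""
    rng = random.Random(seed)
    kept = [v for v in known_values if rng.random() > dropout]
    hint = f" (known values: {', '.join(kept)})" if kept else ""
    return f"What is the {attribute} of this product?{hint}"

context = "Air Zoom Pegasus 38 running shoes by Nike, size 10, black/white."
query = expand_query("brand name", ["Nike", "Adidas", "Asics"])
print(query)   # a QA model would extract the value (e.g., "Nike") from `context`
```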

Conclusion

We hope you found some inspiration in the papers and big ideas that came out of this year’s ACL. Follow us on LinkedIn, Twitter, or Facebook for more articles on the leading NLP, machine learning, and database conferences and discussion topics.

Written by: Nikita Bhutani, Hayate Iso, Xiaolan Wang, and Megagon Labs

