ACL 2023 was held in Toronto, Canada on July 9-14. Megagon Labs sponsored the conference and a group of our researchers, alongside Dunja Mladenic and Marko Grobelnik, organized a workshop. This year, ACL highlighted the theme of “Reality Check” to stimulate discussion on the current state of NLP research. The conference also hosted an “industry track” for the first time, which indicates the growing importance of industry research in the community. ACL 2023 received 4,864 submissions (+1,486 from last year), of which 301 were made through ACL Rolling Review (ARR). Interested readers can check out a very detailed report released by the program chairs. (Note: One of the ACL 2023 program chairs, Naoaki Okazaki, is on Megagon Labs’ advisory board!) The overall acceptance rate to the main conference was 20.73% (long/short=22.13%/15.54%).
This article is meant to provide an overview of ACL 2023 with a focus on papers highlighting recent exciting breakthroughs such as large language models (LLMs). If you’d like more info on the overall conference, check out our ACL 2023 Conference Highlights article. Here we briefly capture the papers that stood out to our attending research scientists and research engineers.
Let’s dive into some ACL papers.
LLMs with External Tools
LLMs still face challenges, for instance, in arithmetic and logical reasoning due to their entirely data-driven nature. Recent advancements show that these limitations can be alleviated using external tools and resources. The concept of augmenting data-driven models with task-specific sub-modules has been around for a while, but the exceptional reasoning and planning capabilities of LLMs have recently drawn significant interest from diverse research communities. This fusion of LLMs with external tools not only broadens their capabilities but also introduces numerous possibilities across diverse applications.
MultiTool-CoT: GPT-3 Can Use Multiple External Tools with Chain of Thought Prompting
Prior work has demonstrated that LLMs can use external tools like web search and calculators for complex reasoning. Building on this line of work, the paper introduces a method that integrates multiple tools via chain-of-thought prompting in an in-context learning setup. The authors' presentation focused on empirical analyses using the NumGLUE dataset, which requires numerical reasoning and specialized domain knowledge. The results highlight GPT-3 (text-davinci-003)'s capability to effectively combine a knowledge retriever, a symbolic processing module, and a calculator to answer questions, underscoring the potential of enhancing LLMs with auxiliary tools to tackle complex NLP problems.
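The core mechanic can be sketched as an intercept-and-resume loop: the model generates chain-of-thought text, and whenever it emits a tool-call marker, generation pauses, the tool runs, and its result is appended before generation continues. The marker syntax, tool registry, and scripted stand-in for GPT-3 below are our illustrative assumptions, not the paper's actual prompt format:

```python
import re

# Hypothetical tool registry: trigger name -> callable. Demo only --
# never eval() untrusted input in real code.
TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
}

def run_with_tools(llm_step, question, max_steps=8):
    """Alternate between LLM generation and tool calls.

    llm_step(transcript) returns the next chunk of chain-of-thought text.
    A chunk like '<<calculator: 12*7>>' is intercepted, executed, and the
    result is appended to the transcript before generation resumes.
    """
    transcript = question
    for _ in range(max_steps):
        chunk = llm_step(transcript)
        transcript += chunk
        call = re.search(r"<<(\w+):\s*(.+?)>>", chunk)
        if call:
            tool, arg = call.group(1), call.group(2)
            transcript += f" [tool result: {TOOLS[tool](arg)}]"
        if "ANSWER:" in chunk:
            break
    return transcript

# Scripted stand-in for the LLM: emits a tool call, then a final answer.
script = iter(["\nThere are 12 boxes of 7 pens: <<calculator: 12*7>>",
               "\nANSWER: 84"])
out = run_with_tools(lambda t: next(script), "How many pens in 12 boxes of 7?")
```

The same loop generalizes to multiple tools by adding entries to the registry, which is the essence of combining a retriever, symbolic module, and calculator in one reasoning chain.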
Previous studies primarily focused on generating plans or programs to access external tools and resources with LMs. This study, however, emphasizes the discriminative power of LMs. The introduced framework, Pangu, couples a symbolic agent with an LM, where the agent dynamically produces plans based on the environment, such as databases or the physical world, while the LM assesses each plan’s feasibility. In this paper, Pangu was tested on the knowledge base question-answering task and demonstrated consistent performance improvements with multiple LMs. (Additional evaluation results for other tasks are also presented in the appendix.) The implementation of Pangu is available on the authors’ GitHub repository (https://github.com/dki-lab/Pangu).
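The generate-versus-discriminate distinction can be illustrated with a small sketch (all names and the toy scoring function are ours, not the authors' API): a symbolic agent enumerates only plans that are valid in the environment, and the LM's job is reduced to scoring each candidate rather than generating one token-by-token:

```python
# A minimal sketch of Pangu-style discrimination: a symbolic agent proposes
# grounded candidate plans, and an LM ranks them.

def agent_expand(partial_plan, environment):
    """Enumerate one-step extensions of a partial plan that are valid
    in the environment (e.g., relations that exist in a knowledge base)."""
    return [partial_plan + [step] for step in environment.get(tuple(partial_plan), [])]

def lm_score(plan):
    """Stand-in for an LM plausibility score of a verbalized plan.
    A real system would use the LM's log-likelihood of the plan string."""
    preferred = ["born_in", "capital_of"]
    return sum(1.0 for step in plan if step in preferred)

def search(environment, beam=2, depth=2):
    candidates = [[]]
    for _ in range(depth):
        expanded = [p for c in candidates for p in agent_expand(c, environment)]
        if not expanded:
            break
        # Keep the plans the LM judges most plausible (beam search).
        candidates = sorted(expanded, key=lm_score, reverse=True)[:beam]
    return candidates[0]

# Toy "knowledge base" mapping partial plans to valid next steps.
env = {(): ["born_in", "works_at"], ("born_in",): ["capital_of", "mayor_of"]}
best = search(env)
```

Because the agent only proposes executable plans, the LM never hallucinates an action that the environment cannot ground, which is the key appeal of the discriminative formulation.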
Addressing LLM Limitations

Other notable limitations of data-driven models include hallucination, factual inaccuracy, and bias. In this section, we discuss papers that address some of these challenges through various interventions: pipelines that detect and repair erroneous outputs, analyses of how LLMs fare on long-tail distributions, and feedback mechanisms that continuously improve LLM performance.
LLMs are prone to hallucination and often generate text that lacks factual correctness. Such inaccuracies can erode end users' trust in any system built on LLMs. This paper employs a two-stage pipeline to improve both trust and model performance on complex reasoning tasks solved with Chain-of-Thought (CoT) prompting. The authors first use a self-consistency-based metric, computed over several CoT prompts for the same task, to evaluate the reliability of the LLM's outputs; they then leverage external knowledge sources to post-edit the reasoning chains of unreliable outputs and increase prediction factuality. Built on top of GPT-3, the proposed framework improves accuracy on multiple open-domain question-answering tasks. Relatedly, another contemporary work presented at ACL proposes a natural language-based feedback mechanism that uses reinforcement learning to repair LLM outputs.
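The first stage can be sketched in a few lines (the threshold and interface are our illustrative assumptions, not the paper's implementation): sample several CoT answers for the same question, and treat low agreement among them as a signal that the output needs knowledge-grounded post-editing:

```python
from collections import Counter

def self_consistency(answers, threshold=0.6):
    """Return (majority answer, agreement ratio, needs_repair flag).

    Low agreement across sampled chain-of-thought runs marks the
    output as unreliable, triggering the repair stage.
    """
    top, count = Counter(answers).most_common(1)[0]
    agreement = count / len(answers)
    return top, agreement, agreement < threshold

# Five sampled CoT runs for the same question:
answer, agreement, needs_repair = self_consistency(["1912", "1912", "1912", "1915", "1912"])
```

Only outputs flagged `needs_repair` would be sent through the second, more expensive stage of retrieving external knowledge and editing the reasoning chain.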
LLMs' sole reliance on their parameters to encode world knowledge has limitations, especially for knowledge under-represented in the pre-training corpus. This paper characterizes LLMs' strengths and limitations in memorizing factual knowledge through large-scale knowledge probing experiments, and it proposes an adaptive retrieval-augmented language model to improve performance on long-tail information. The authors introduce PopQA, a new open-domain QA dataset of 14,000 questions covering factual information in the long tail that popular QA datasets may have missed. They convert knowledge triples from Wikidata (with diverse levels of popularity, as measured by Wikipedia page views) into natural language questions, anchored to the original entities and relationship types. The experiments on PopQA show that LMs struggle with less popular factual knowledge, and that scaling fails to appreciably improve memorization of long-tail facts; LLMs remain competitive, however, on questions about high-popularity entities. The paper therefore proposes an efficient retrieval-augmented LM that retrieves non-parametric memories only when necessary, i.e., for less popular entities. Experimental results show that this adaptive retrieval-augmented LM significantly improves QA performance while reducing inference costs.
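The adaptive retrieval decision reduces to a popularity gate. The threshold value, function names, and stubs below are ours, not the paper's API; they only illustrate the control flow of retrieving evidence just for long-tail entities:

```python
POPULARITY_THRESHOLD = 10_000  # illustrative; would be tuned on held-out data

def adaptive_answer(question, entity_views, retrieve, lm_answer):
    """Answer with retrieval only when the entity is unpopular,
    using Wikipedia page views as the popularity proxy."""
    if entity_views < POPULARITY_THRESHOLD:
        # Long-tail entity: parametric memory is unreliable, so
        # augment the prompt with retrieved passages.
        return lm_answer(question, context=retrieve(question))
    # Popular entity: trust parametric memory and skip retrieval cost.
    return lm_answer(question, context=None)

# Stubs standing in for a retriever and an LLM:
retrieve = lambda q: ["(retrieved passage about the entity)"]
lm_answer = lambda q, context=None: "grounded" if context else "parametric"

tail = adaptive_answer("Who is the mayor of a small town?", 800, retrieve, lm_answer)
head = adaptive_answer("Who is Barack Obama?", 5_000_000, retrieve, lm_answer)
```

Since most real-world queries concern popular entities, skipping retrieval for them is where the reported inference savings come from.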
Situated Reasoning

The comprehension of language often requires understanding the situation: a partial snapshot of relevant objects, their attributes, and interrelations at a particular time and location. For instance, in the sentence “the bat flew past Chris,” “bat” could mean either the animal or a baseball bat, depending on the situation. This dependency on the situation is even more evident when interpreting human utterances and actions. The field of situated reasoning, which addresses such situational understanding, has long been an important research focus, a fact highlighted by a rich line of studies on situation semantics. To advance NLP systems in this area, several ACL papers have introduced new, exciting challenges.
To develop interactive systems that function well across sociocultural contexts, this paper presents “NormBank,” a social knowledge resource containing over 155,000 examples of normative reasoning grounded in situational conditions. These conditions are defined using a new taxonomy, SCENE, which encompasses factors ranging from agents’ roles and attributes to the physical, social, and cultural aspects of a situation. For example, while “drinking coffee” is a typical action for a spectator at an athletic event, it is less so for an athlete in the middle of a game. This example highlights how slight changes in situational conditions can significantly affect our interpretation of human actions. Empirical results show that PLMs can acquire this kind of knowledge through fine-tuning, and that the knowledge learned from NormBank is useful for moral reasoning tasks. These findings indicate both the potential of neural models to build on NormBank and its practical significance.
Naoki Otani, a research scientist who recently joined Megagon Labs, explored situated conversational systems during his PhD. This paper introduces a new dataset called SUGAR, featuring situated conversations in help-seeking scenarios. The dataset highlights a key challenge in conversational AI: enabling models to respond proactively to user requests based on situational context. For example, consider potential responses to a user asking to open a window. The appropriate response depends on the situation, such as weather conditions (you don’t want to open the window when heavy rain is pouring outside) and environmental factors (if there is an air circulator in the room, activating it might be helpful). While many modern systems produce relevant answers, they could be enhanced by a richer situational understanding. SUGAR is designed to advance research in this direction by pairing each conversation example with detailed situational information, such as location, time, and environment. Experiments indicate that even sophisticated neural models find this task challenging, offering promising research avenues for enhancing conversational systems. For more insights, also refer to Naoki’s position paper presented at the NLP4ConvAI workshop.
Translation and Multilingualism
To apply NLP techniques to diverse practical problems worldwide, working with multilingual data is vital. Although machine translation (MT) is a well-established NLP task, and MT systems are already in use in many settings, the advent of LLMs has introduced novel applications and research avenues. Furthermore, the huge amounts of multilingual data available on the internet now facilitate the development of unified, multilingual MT systems that handle tens or hundreds of languages at the same time.
Multilingual language models are known to do well at machine translation, but the origin of this capability has been unclear. This paper traces it to so-called incidental bilingualism: occasional bilingual signals that occur naturally in the training data. Although only 1.4% of PaLM's training data is bilingual, the model achieves good zero-shot and few-shot machine translation performance. Conversely, removing these bilingual instances dramatically degrades translation quality. This underscores the importance of training data composition for learning a multilingual language model.
This paper also studies machine translation with large language models (again PaLM). To achieve few-shot machine translation, an LLM generally needs a prompt containing several parallel translation pairs. This paper found that the choice of translation pairs has a significant impact on translation performance; the difference in BLEU score can be as large as 40 points in some cases. Compared with random sampling, the best practice they found is to choose examples similar to the sentence requiring translation, using a k-nearest-neighbors (kNN) search. Unfortunately, even with well-chosen translation pairs in the prompt, LLM-based translation still underperforms state-of-the-art machine translation systems such as Google Translate.
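Similarity-based example selection is easy to sketch. Below, a bag-of-words cosine stands in for the real sentence embeddings used in such work (that substitution, and the toy pool, are our assumptions); the kNN step is just ranking the pool by similarity to the source sentence and keeping the top k:

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Bag-of-words cosine similarity -- a cheap stand-in for
    embedding-based similarity."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def select_examples(source, pool, k=2):
    """Pick the k translation pairs whose source side is most similar
    to the sentence we need to translate."""
    return sorted(pool, key=lambda pair: cosine(source, pair[0]), reverse=True)[:k]

pool = [("The cat sat on the mat", "Le chat s'est assis sur le tapis"),
        ("Stock prices fell sharply", "Les cours boursiers ont chuté brutalement"),
        ("The dog sat on the rug", "Le chien s'est assis sur le tapis")]
examples = select_examples("The cat sat on the rug", pool)
```

The selected pairs would then be formatted into the few-shot prompt ahead of the sentence to translate.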
Research on LLMs generally focuses on high-resource languages such as English and Chinese. Even multilingual LLMs cover at most about 100 languages in their training data. This paper created a large multilingual corpus covering 511 languages, filtered from data sources spanning 2,266 languages, and trained a multilingual LLM on it. The authors observed that the quality of a multilingual LLM is determined by a combination of factors. Importantly, they found that related languages in the same language family can support each other, improving performance on downstream tasks. This observation hints at why it is difficult to improve the performance of multilingual LLMs on Japanese tasks: Japanese does not belong to any of the major language families, so it can hardly benefit from related languages.
Machine Unlearning

Machine unlearning aims to remove specific information from trained machine learning (ML) models, addressing needs such as privacy, fairness, and compliance. While this could be done simply by retraining models after omitting the sensitive data, the increasing cost of model training makes updating models without full retraining more attractive. Despite growing attention in fields such as computer vision and data mining, interest at this year's ACL was relatively limited. However, with the rise of large language models and the concerns they raise, we expect increased focus on this issue in the NLP community toward developing safer NLP systems.
This paper provides a nice introduction to machine unlearning in the realm of language models, offering a comprehensive review of the major related work. While most previous NLP studies focused on data preprocessing or differential privacy techniques, the proposed approach applies gradient ascent, rather than descent, on the target token sequences to make the model “forget” them. The results show that this method effectively suppresses the generation of the target sequences, and that sequential unlearning is more effective than attempting to forget all sensitive data at once. The study further shows that the success of unlearning varies with the data type or domain, paving the way for future research.
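The ascent-versus-descent idea can be shown on a toy model. The paper operates on LM token sequences; here a one-parameter logistic model stands in (entirely our simplification), with "unlearning" implemented as ascent steps on the loss of the example to forget, which provably raises the model's loss on it:

```python
from math import exp, log

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def nll(w, x, y):
    """Negative log-likelihood of one example under p = sigmoid(w * x)."""
    p = sigmoid(w * x)
    return -log(p) if y == 1 else -log(1.0 - p)

def grad(w, x, y):
    # d(nll)/dw for the logistic model.
    return (sigmoid(w * x) - y) * x

# "Pre-train" by gradient descent on two examples.
w = 0.0
data = [(2.0, 1), (-1.0, 0)]
for _ in range(100):
    for x, y in data:
        w -= 0.1 * grad(w, x, y)

forget_x, forget_y = 2.0, 1
before = nll(w, forget_x, forget_y)

# Unlearn: gradient *ascent* on the loss of the example to forget.
for _ in range(50):
    w += 0.5 * grad(w, forget_x, forget_y)
after = nll(w, forget_x, forget_y)  # loss on the forgotten example rises
```

In the LM setting, the "loss" is the likelihood of the target token sequence, so ascent directly suppresses the model's tendency to regenerate it.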
This paper introduces a new framework, KGA, that tunes a model to maintain the “knowledge gap,” which represents the distribution differences between models trained on different data configurations. Unlike existing methods that update models to converge toward a single distribution, the proposed method makes weaker assumptions about the target distributions. Some readers may find this method reminiscent of soft-label knowledge distillation, which is likewise concerned with the internal/output distributions of ML models. The authors present novel evaluation metrics for unlearning and demonstrate KGA's effectiveness across several NLP tasks, including classification, translation, and response generation.
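Our reading of the "knowledge gap" can be made concrete with toy numbers (all values below are illustrative): the gap is a distribution distance, such as KL divergence, between two models' predictions on the same input, and KGA aims to shrink the gap on forgotten data toward the natural gap seen on unseen data rather than forcing it to zero:

```python
from math import log

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Predictions on a to-be-forgotten example from a model trained with that
# data (p_with) and a reference model trained without it (p_without):
p_with = [0.7, 0.2, 0.1]
p_without = [0.4, 0.4, 0.2]
gap_on_forget = kl(p_with, p_without)

# On data neither model was trained on, the two models still differ slightly:
q_with = [0.5, 0.3, 0.2]
q_without = [0.45, 0.35, 0.2]
gap_on_unseen = kl(q_with, q_without)

# KGA's fine-tuning objective (not shown) would push gap_on_forget toward
# gap_on_unseen, i.e., make the forgotten data look like never-seen data.
```

Matching gaps rather than distributions is what gives the method its flexibility compared with approaches that force convergence to a single target distribution.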
NLP stands as one of today’s most exciting research domains. We hope this article has illuminated recent breakthroughs and innovations for you. To learn more about NLP, machine learning, and databases, follow us on LinkedIn, Twitter, or Facebook.