The NAACL 2024 conference proved to be a milestone for the natural language processing (NLP) field, showcasing the latest trends and cutting-edge research. While we do not intend to cover every theme and paper presented at the conference, we’re excited to share a quick rundown of three key themes that emerged from the array of presented papers. The first trend, targeted evaluation, focuses on developing precise, domain-specific evaluation of generated text, ensuring that models are assessed with accuracy and relevance. The second, reasoning, delves into enhancing the logical and inferential capabilities of NLP systems. Finally, the combination of fine-tuning and retrieval-augmented generation (RAG) is gaining momentum, highlighting methods that boost model adaptability and improve the efficiency of information retrieval. In this piece, we take a closer look at each of these trends.
Targeted Evaluation: Fairness and Hallucination
As generative models have become widespread, evaluating their output has emerged as the Achilles’ heel of their real-world adoption. In recent years, using large language models (LLMs) as judges to evaluate generated text has become common practice. However, the lack of trustworthiness and weak correlation with human judgment have surfaced as significant challenges, prompting researchers to seek more targeted evaluation methods. Along the same lines, more research has begun to tackle the reliability of human evaluation in NLG, focusing specifically on the quality of evaluation guidelines, such as [1], which received an outstanding paper award.
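To ground this discussion, here is a minimal sketch of the LLM-as-judge setup described above. The `call_llm` helper and the rubric are placeholders we introduce for illustration; they do not reproduce any specific paper’s protocol.

```python
# Minimal LLM-as-judge sketch: score a generated answer against a rubric.
# `call_llm` is a hypothetical helper standing in for any chat-completion API.

JUDGE_PROMPT = """You are a strict evaluator.
Question: {question}
Candidate answer: {answer}
Rate the answer from 1 (unusable) to 5 (excellent) for factual accuracy
and relevance. Reply with a single integer only."""

def call_llm(prompt: str) -> str:
    """Placeholder for an actual LLM call (e.g., a hosted chat-completion API)."""
    raise NotImplementedError

def judge(question: str, answer: str) -> int:
    reply = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    return int(reply.strip())

# Averaging judge scores over a test set gives an automatic quality estimate,
# but as the papers above note, such scores may correlate poorly with humans.
```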
One primary target of evaluation at NAACL 2024 was the crucial challenge of fairness in generated output. These works either studied biases in novel tasks such as ranking [2] and model explanations [3], or introduced new resources for bias evaluation, such as [4], which presents a new benchmark for studying hate speech across geographically and culturally diverse English posts and received the best resource paper award.
Besides fairness, one of the most problematic aspects of using generated text from LLMs in various real-world contexts is hallucination—generating responses that are factually incorrect, nonsensical, or irrelevant to the input. Many works at NAACL investigated this issue by introducing new techniques to reduce hallucination [5], using LLMs to detect hallucinations [6], or probing the reasons and extent of hallucinations in various tasks [7].
Reasoning
Improving and understanding the reasoning capabilities of LLMs has a significant impact on their applicability, as reasoning is directly related to tasks such as code generation, problem-solving, and planning. At NAACL, several works focused on better understanding these capabilities through various lenses, including creativity [8] and deductive reasoning [9], the latter of which received a best paper award. Moreover, many researchers are attempting to improve the reasoning capabilities of LLMs by introducing internal or external modifications to the reasoning steps. Examples include generating Prolog programs [10] or relying on symbolic solvers [11].
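As a rough illustration of the solver-offloading idea (in the spirit of [10] and [11], not a reproduction of either), the sketch below assumes a hypothetical LLM call that translates a word problem into a formal expression and lets SymPy do the actual solving:

```python
# Sketch of offloading arithmetic reasoning to a symbolic solver.
# The LLM only translates text into a formal expression; SymPy does the math.
import sympy as sp

def llm_translate(problem: str) -> str:
    """Hypothetical LLM call that returns an equation string.
    Here the output is hard-coded for the example problem."""
    return "3*x + 7 - 22"

problem = "Three times a number plus seven equals twenty-two. What is the number?"
x = sp.symbols("x")
equation = sp.sympify(llm_translate(problem))  # parse the LLM's formal output
solution = sp.solve(equation, x)               # the solver, not the LLM, computes the answer
print(solution)  # [5]
```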
Fine-tuning and RAG
As LLMs become more powerful, many practitioners are adopting them for their own use cases. However, beyond general issues such as bias and hallucination, these models often lack up-to-date and domain-specific knowledge. To address these issues, a common practice is to fine-tune models, which helps them acquire domain-specific knowledge and patterns/structures. Alternatively, practitioners adopt retrieval-augmented generation (RAG) techniques, which significantly assist with handling missing or unknown information.
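For readers newer to RAG, here is a minimal sketch of the retrieve-then-generate pattern; the `embed` and `generate` functions are placeholders rather than any particular library’s API.

```python
# Minimal RAG sketch: retrieve the most relevant documents, then condition
# the generator on them. `embed` and `generate` are hypothetical placeholders.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding function (e.g., a sentence-encoder API)."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Placeholder LLM generation call."""
    raise NotImplementedError

def rag_answer(question: str, documents: list[str], k: int = 3) -> str:
    doc_vecs = np.stack([embed(d) for d in documents])
    q_vec = embed(question)
    # Cosine similarity between the question and every document.
    scores = doc_vecs @ q_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-9
    )
    top_docs = [documents[i] for i in np.argsort(-scores)[:k]]
    context = "\n\n".join(top_docs)
    return generate(f"Answer using only the context below.\n\n{context}\n\nQuestion: {question}")
```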
Many works at NAACL, targeting these pressing issues, took a closer look at training LLMs. Authors in [12] thoroughly investigated the impact of different components on training an LLM, winning an outstanding paper award. In [13], a new paradigm was proposed to fine-tune LLMs in low-resource settings. Additionally, [14] introduced a new instruction-tuning method for LLMs to generate “I don’t know” responses instead of attempting to answer when they lack the necessary knowledge, addressing an issue closely related to hallucination. This paper also won an outstanding paper award.
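To give a flavor of the refusal idea in [14], here is a heavily simplified sketch of refusal-aware data construction (our illustration, not the paper’s exact procedure): questions the model already answers correctly keep their gold answers, while the rest are relabeled with an explicit refusal before instruction tuning.

```python
# Simplified, hypothetical sketch of refusal-aware instruction data
# construction, loosely inspired by [14].

def model_answer(question: str) -> str:
    """Placeholder for querying the model before fine-tuning."""
    raise NotImplementedError

def build_refusal_aware_data(qa_pairs: list[tuple[str, str]]) -> list[dict]:
    data = []
    for question, gold in qa_pairs:
        prediction = model_answer(question)
        # Keep the gold answer only when the model already gets it right;
        # otherwise teach the model to abstain.
        target = gold if prediction.strip() == gold.strip() else "I don't know."
        data.append({"instruction": question, "output": target})
    return data  # feed this into a standard instruction-tuning loop
```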
Moreover, RAG-based systems have gained significant attention in recent years as a way to address issues in LLMs such as missing information and hallucinations, and to improve the quality and accuracy of their output. Many works at NAACL focused on enhancing RAG-based generation. For example, [15] introduced a framework to select the best strategy for improving accuracy in answering complex questions. Similarly, [16] proposed a new decomposition approach that first plans and then retrieves the necessary information, thereby improving the decision-making capabilities of LLMs.
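The decomposition idea can be sketched as a two-stage loop that plans sub-questions first and retrieves evidence per sub-question afterwards. This is our simplified illustration rather than the exact method of [16]; `generate` and `retrieve` are the same kind of placeholders used in the RAG sketch above.

```python
# Sketch of a plan-then-retrieve loop for complex questions. `generate` and
# `retrieve` are hypothetical placeholders (e.g., the helpers sketched above).

def generate(prompt: str) -> str:
    """Placeholder LLM generation call."""
    raise NotImplementedError

def retrieve(query: str, k: int = 3) -> list[str]:
    """Placeholder retriever returning the top-k relevant passages."""
    raise NotImplementedError

def plan_then_retrieve(question: str) -> str:
    # Stage 1: plan -- decompose the complex question into sub-questions.
    plan = generate(f"Break this question into 2-4 sub-questions, one per line:\n{question}")
    sub_questions = [q.strip() for q in plan.splitlines() if q.strip()]

    # Stage 2: retrieve evidence and answer each sub-question.
    notes = []
    for sub_q in sub_questions:
        context = "\n".join(retrieve(sub_q))
        notes.append(generate(f"Context:\n{context}\n\nAnswer briefly: {sub_q}"))

    # Final synthesis conditioned on the intermediate answers.
    return generate(f"Question: {question}\nIntermediate findings:\n" + "\n".join(notes))
```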
Conclusion
NAACL 2024 was a fantastic event! Kudos to the organizing committee and everyone involved for making it a success. Drawing from our experience at the conference, the Megagon Labs team has crafted this blog post to highlight three major trends: targeted evaluation, reasoning, and fine-tuning/RAG. These trends represent significant advancements in the field of NLP and showcase the innovative approaches researchers are taking to enhance the capabilities of LLMs.
Looking ahead, we believe several future directions can further advance these areas:
- Fairness: Introduce and investigate new forms of bias to address a wider range of fairness issues in generated outputs, ensuring more equitable and inclusive NLP models.
- Hallucinations: Go beyond addressing factual hallucinations in knowledge-intensive settings, making models more adaptable to real-world scenarios by reducing irrelevant or nonsensical outputs.
- Reasoning: Develop benchmarks that challenge LLMs with more complex and novel forms of reasoning based on real-world use cases. This involves moving beyond simple reasoning tasks and prompt engineering approaches, requiring a deeper understanding of these models and incorporating other forms of reasoning, such as symbolic reasoning.
- Fine-tuning: Gain a better understanding of the impact of different model-based and data-based components, as well as the use of real versus synthetic data in fine-tuning and instruction tuning. This will help create comprehensive guidelines for adapting LLMs in various domains and settings.
- RAG: Develop a better understanding of the complex relationship between LLMs’ inherent knowledge and the information derived from retrieved documents, and how that interplay affects the retrieval component. This will enhance the effectiveness of RAG-based systems in providing accurate and relevant responses.
By addressing these future directions, the NLP community can continue to push the boundaries of what is possible with LLMs, making them more reliable, adaptable, and useful across diverse applications. Follow Megagon Labs on LinkedIn and Twitter as we delve into these key themes and work to enhance the current state of LLMs in each area.
Sources Cited
[2] Do Large Language Models Rank Fairly? An Empirical Study on the Fairness of LLMs as Rankers
[3] Discovering and Mitigating Indirect Bias in Attention-Based Model Explanations
[5] Trusting Your Evidence: Hallucinate Less with Context-Aware Decoding
[6] Language Models Hallucinate, but May Excel at Fact Verification
[7] Deceptive Semantic Shortcuts on Reasoning Chains: How Far Can Models Go Without Hallucination?
[8] MacGyver: Are Language Models Creative Problem Solvers?
[9] Evaluating the Deductive Competence of Large Language Models
[10] Arithmetic Reasoning with LLM: Prolog Generation & Permutation
[11] LeanReasoner: Boosting Complex Logical Reasoning with Lean
[13] Generalizable and Stable Finetuning of Pretrained Language Models on Low-Resource Texts
[14] R-Tuning: Instructing Large Language Models to Say “I Don’t Know”
Written by Pouya Pezeshkpour, Estevam Hruschka, Seiji Maekawa, and Megagon Labs
Follow us on LinkedIn and Twitter to stay up to date on industry trends.