Order Matters: Assessing LLM Sensitivity in Multiple-Choice Tasks

Large language models (LLMs) have demonstrated remarkable capabilities in various NLP tasks. However, previous works have shown that these models are sensitive to prompt wording, the choice of few-shot demonstrations, and the order of those demonstrations, which makes fair assessment difficult. As these models become more powerful, it becomes imperative to understand and address these limitations. In this paper, we focus on LLMs’ robustness on multiple-choice questions (MCQ), a task commonly used to study the reasoning and fact-retrieving capabilities of LLMs. Within this context, we aim to address the following research questions: (1) To what extent do LLMs exhibit sensitivity to the order of options in multiple-choice questions? (2) What factors contribute to LLMs’ sensitivity to the order of options? (3) How can we improve LLMs’ robustness to the order of options? In answering these questions, we show not only that LLMs are very sensitive to the order of options but also that this sensitivity is not easily resolved.

LLMs’ Sensitivity to Order of Options in MCQ Tasks

We examine several state-of-the-art LLMs across five different MCQ benchmarks. Our aim is to observe how the predictions of these LLMs change when the order of options is altered, as depicted below:

Figure 1: GPT-4 sensitivity to reordering options: upon changing the order of choices, GPT-4 changes its prediction from “hen house” to “outside of the bedroom window” (the example is from the CSQA dataset).

To measure sensitivity, we calculate the gap between the minimum and maximum accuracy that an LLM can reach when the options are reordered. Several key findings emerge from our study: (1) GPT-4 shows a significantly smaller sensitivity gap to option order compared to other LLMs. (2) Despite GPT-4’s high accuracy, exceeding 90%, there remains a notable sensitivity gap of 13.1%, underscoring that even high-performing models are affected by option order changes. (3) Sensitivity to option order does not completely correlate with overall model performance, suggesting that other factors might influence this sensitivity. (4) While the domain and number of MCQ options impact model performance, they do not clearly correlate with sensitivity.

Table 1: Zero-shot order sensitivity; all three LLMs display a notable level of sensitivity to the order of options across various benchmarks.
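The sensitivity gap described above can be sketched in a few lines of Python. This is a minimal illustration, not the exact evaluation code behind the table: it assumes a hypothetical `predict(question, options)` callable that returns the index of the chosen option, and it assumes every permutation of the options is tried, which the post does not spell out.

```python
from itertools import permutations

def sensitivity_gap(items, predict):
    """Return (min_accuracy, max_accuracy, gap) under oracle reordering.

    items:   list of (question, options, gold_option_text) tuples.
    predict: hypothetical callable (question, ordered_options) -> index of
             the option the model picks in that ordering.
    """
    always_correct = 0   # questions answered correctly under every ordering
    ever_correct = 0     # questions answered correctly under some ordering
    for question, options, gold in items:
        outcomes = []
        for ordering in permutations(options):
            picked = ordering[predict(question, ordering)]
            outcomes.append(picked == gold)
        always_correct += all(outcomes)
        ever_correct += any(outcomes)
    total = len(items)
    min_acc = always_correct / total
    max_acc = ever_correct / total
    return min_acc, max_acc, max_acc - min_acc

# Toy stand-in for a model with pure positional bias: it always picks
# whatever option happens to be listed first.
always_first = lambda question, options: 0
print(sensitivity_gap([("q1", ["a", "b", "c"], "a")], always_first))
```

Enumerating all permutations grows factorially with the number of options, so a sampled subset of orderings may be more practical for questions with many choices.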

Can Demonstrations in Few-shot Setting Resolve the Sensitivity?

The answer is no. We select the most similar instances as demonstrations by calculating the Euclidean distance between vector representations of questions using Sentence-RoBERTa. The results, visualized in the figure below, include error bars showing the sensitivity gap for each case. From the results, we make the following observations. First, the sensitivity gap remains substantial even when more demonstrations are included in the few-shot setting. Second, as performance improves, the sensitivity gap tends to shrink. However, adding more demonstrations does not necessarily reduce the sensitivity gap. This highlights that while demonstrations may marginally improve robustness, they do not entirely mitigate the models’ sensitivity to option order.

Figure 2: Order sensitivity in the few-shot setting: The error bars represent the range of minimum and maximum accuracy achievable in each task through oracle reordering. Our observations are as follows: (1) The sensitivity gap consistently remains substantial in the few-shot setting. (2) As performance improves, the sensitivity gap shrinks. (3) Adding more demonstrations does not necessarily result in a reduction of the gap.
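The demonstration selection step mentioned above (nearest questions by Euclidean distance) might look roughly like the following. This sketch assumes the question embeddings have already been computed, for example with a Sentence-RoBERTa encoder; the function name `select_demonstrations` and the toy vectors are illustrative only.

```python
import math

def select_demonstrations(query_vec, pool_vecs, k):
    """Return indices of the k pool questions closest to the query.

    query_vec: embedding of the test question (sequence of floats).
    pool_vecs: embeddings of candidate demonstration questions.
    Distances are Euclidean, computed with math.dist (Python 3.8+).
    """
    ranked = sorted(
        range(len(pool_vecs)),
        key=lambda i: math.dist(query_vec, pool_vecs[i]),
    )
    return ranked[:k]

# Toy 2-dimensional vectors standing in for real sentence embeddings.
pool = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
print(select_demonstrations([0.9, 1.0], pool, k=2))
```

The returned indices are ordered from most to least similar, so the caller can decide where in the prompt the closest demonstration should appear.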

Why Do LLMs Show Sensitivity to the Order of Options?

After analyzing instances where reordering the options changed LLM predictions, we propose the following conjecture:

The sensitivity of LLMs in MCQs stems from two forces: (1) LLMs’ uncertainty about the correct answer from the top choices, and (2) positional bias, which leads LLMs to prefer certain options based on their placement.

We show the impact of uncertainty on sensitivity through several experiments. We first show that the sensitivity gap exhibits a strong correlation with the error rate of LLMs in MCQ tasks:

Figure 3: Correlation between the sensitivity gap and error rate for GPT-4 and InstructGPT across various MCQ tasks (each point represents the performance of an LLM on one of the benchmarks).
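One simple way to quantify a relationship like the one in this figure is a Pearson correlation between per-benchmark error rates and sensitivity gaps. The post does not state which coefficient was used, so treat this as one reasonable choice; the numbers in the usage lines are made-up placeholders, not results from the study.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Placeholder values for illustration only (one entry per benchmark).
error_rates = [0.10, 0.20, 0.30]
gaps = [0.05, 0.10, 0.15]
print(pearson(error_rates, gaps))
```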

Then, using log probabilities and self-approximation of confidence in LLMs, we further validate our conjecture. To examine the impact of positional bias on LLMs’ order sensitivity, we maintain only the top choices in their original order while eliminating the rest. After doing so, we observe that LLMs’ performance either remains nearly unchanged or shows minor improvements or declines. This observation provides further evidence of positional bias influencing order sensitivity in LLMs. We also identify specific patterns in the options that either amplify or reduce the model’s positional bias: to reduce the bias, the top-2 choices should be placed close to each other, while to amplify it, they should be placed as far apart as possible.

Calibrating LLMs for MCQ Tasks

Now that we have shown that LLMs are sensitive to the order of options, the question that remains is how to address this sensitivity. One potential solution we have considered is the calibration of an LLM’s predictions. We adopt two calibration strategies: (1) taking a majority vote from the model’s predictions across 10 random reorders, and (2) multi-evidence calibration (MEC), where the model is asked to explain its reasoning before making a prediction. The effects of these strategies are detailed in the table below:

Table 2: Impact of calibration methods on LLMs’ performance.

These two calibration methods result in improvements of up to 8 percentage points across various models and benchmarks. Additionally, the impact of MEC on LLMs’ performance differs from that of majority voting, raising questions about its suitability for MCQ tasks.
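The majority-vote strategy described above can be sketched as follows. As before, `predict(question, options)` is a hypothetical stand-in for a model call that returns the index of the chosen option; the seed parameter and the tie-breaking behavior are assumptions of this sketch rather than details from the study.

```python
import random
from collections import Counter

def majority_vote(question, options, predict, n_reorders=10, seed=0):
    """Return the option text chosen most often across random reorderings.

    predict: hypothetical callable (question, ordered_options) -> index.
    Ties are broken in favor of the option that was voted for first.
    """
    rng = random.Random(seed)          # seeded for reproducibility
    votes = Counter()
    for _ in range(n_reorders):
        shuffled = list(options)
        rng.shuffle(shuffled)          # one random reorder of the options
        picked = shuffled[predict(question, shuffled)]
        votes[picked] += 1             # vote by option text, not position
    return votes.most_common(1)[0][0]

# Toy order-invariant stand-in for a model: always picks the longest option.
longest = lambda question, options: options.index(max(options, key=len))
print(majority_vote("q", ["x", "yy", "z"], longest))
```

Counting votes by option text rather than by position is what lets the predictions from different orderings be aggregated.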

Conclusion and Future Directions

In this work, we show that LLMs are sensitive to the order of options in multiple-choice question tasks. Through our investigation, we also observe a similar form of sensitivity in other tasks involving multiple fragments, such as odd word detection, sorting lists of items, and ranking documents. Despite our effort to validate our conjecture through various experiments, we believe a thorough exploration of the training data is necessary to better understand the root cause of order sensitivity in language models. Finally, although we investigate two calibration methods, introducing better calibration techniques is essential for mitigating LLMs’ order sensitivity.

This work will be presented at Findings of ACL: NAACL 2024. For more insights, read our publication.

Written by Pouya Pezeshkpour, Estevam Hruschka, and Megagon Labs

