Hybrid Active Learning for Low-Resource LM Fine-tuning

While fine-tuning pre-trained language models has become a standard NLP practice, data labeling remains a significant bottleneck. To alleviate this problem, active learning (AL) methods have been applied to various NLP tasks such as sentiment analysis and document classification.

Even with state-of-the-art AL approaches, the number of labels needed to fine-tune language models remains significant, making them prohibitive when human annotations are expensive and limited. For example, labeling tens of thousands of data samples may be impractical in domains such as medicine or law, considering the cost and time of labeling as well as the overhead of finding and training domain experts. With the increasing effectiveness of pre-trained language models, active fine-tuning has become increasingly promising for downstream applications in low-resource settings (e.g., using fewer than 1,000 human-labeled samples).

Given such a low-resource setup, the interactivity of AL methods is another challenge. Long wait times (latency) between active labeling iterations can break the interactivity of the model development process, creating a significant bottleneck for data science practitioners. Low latency is essential in the early stages of model building, where NLP researchers and practitioners aim to explore model performance over fast AL iterations.

Therefore, it is essential to understand how the design of AL algorithms can impact labeling cost and acquisition latency, especially in a low-resource and interactive setting.

Existing active acquisition functions

Active learning methods interactively and iteratively acquire new sets of data points to label for training/fine-tuning the models. The strategy used to acquire the most informative data is often referred to as the acquisition function, a crucial part of active learning algorithm design. 

Existing AL acquisition strategies are typically based on either uncertainty or diversity. Uncertainty-based methods select data points that the current model is most uncertain about, which are usually points near decision boundaries. On the other hand, diversity-based methods aim to maximize the diversity among selected data points. There are also hybrid approaches that combine both aspects into their optimization objectives. However, we observed that all of these approaches acquire redundant samples in each active iteration, as illustrated in Figure 1 below.
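To make the uncertainty side concrete, here is a minimal sketch of entropy-based acquisition, the simplest of the uncertainty strategies. The function and array names are our own, for illustration only:

```python
import numpy as np

def entropy_acquire(probs, k):
    """Select the k unlabeled points whose predictive distribution
    has the highest entropy (an uncertainty-based acquisition)."""
    # probs: (n_samples, n_classes) predicted class probabilities
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argsort(-entropy)[:k]  # indices of the k most uncertain points

# Toy example: 4 unlabeled points, 2 classes
probs = np.array([[0.5, 0.5],    # maximally uncertain
                  [0.9, 0.1],
                  [0.6, 0.4],
                  [0.99, 0.01]])
print(entropy_acquire(probs, 2))  # -> [0 2]
```

Note that nothing here prevents the k most uncertain points from lying in the same small region of embedding space, which is exactly the within-iteration redundancy discussed next.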



Figure 1: Illustration of the sample redundancy challenge on the AgNews dataset (Zhang et al., 2015).*

Existing methods usually suffer from redundant samples within or between iterations, wasting labeling budget on unnecessary data points. Based on an investigation of existing methods, we propose a novel active learning method: TYROGUE.

Our proposed framework: TYROGUE

We identified two key designs that improve the effectiveness and efficiency of sample acquisition: (1) random sampling to reduce the unlabeled pool considered for acquisition, and (2) decoupling the diversity and uncertainty objectives in hybrid acquisition.

D1. Random sampling to reduce acquisition latency. The first design applies random sampling to the unlabeled data pool to obtain a smaller candidate set to which the acquisition function is applied. Such filtering reduces acquisition latency, a bottleneck when applying existing methods in an interactive setting. Despite the significant reduction in computational cost, we have shown empirically that such sampling barely hurts performance in a low-resource setting.
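A minimal sketch of this subsampling step (function and parameter names are illustrative, not the exact TYROGUE implementation):

```python
import numpy as np

def subsample_pool(pool_indices, candidate_size, rng=None):
    """Randomly subsample the unlabeled pool before scoring, so the
    (often expensive) acquisition function runs on a small candidate
    set instead of the full pool."""
    if rng is None:
        rng = np.random.default_rng(0)
    pool_indices = np.asarray(pool_indices)
    if len(pool_indices) <= candidate_size:
        return pool_indices
    return rng.choice(pool_indices, size=candidate_size, replace=False)

# E.g., score 2,000 random candidates instead of a 100,000-point pool
candidates = subsample_pool(np.arange(100_000), 2_000)
```

The per-iteration acquisition cost then scales with the candidate set size rather than the full pool size, which is what keeps the loop interactive.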

D2. Employing diversity and uncertainty sampling independently to reduce redundancy. The second design combines diversity and uncertainty sampling so as to avoid intra- and inter-iteration redundancies. Existing hybrid methods may suffer from these redundancies because they unify the uncertainty and diversity objectives into a single acquisition function; such strategies often exhibit affinity toward one objective over the other. The basic idea is a two-step selection, as shown in Figure 2 below. The first step is 1) diversity sampling, e.g., selecting cluster centers, to reduce intra-iteration redundancy. The second step is 2) uncertainty sampling, e.g., selecting data points with high entropy, to avoid inter-iteration redundancy.
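The two-step idea can be sketched as follows, using plain k-means on the candidate embeddings for the diversity step and predictive entropy for the uncertainty step. This is an illustrative approximation of the two-step structure, not the exact TYROGUE algorithm:

```python
import numpy as np

def two_step_acquire(emb, probs, k, iters=10, seed=0):
    """Two-step hybrid acquisition sketch:
    (1) diversity  - cluster candidate embeddings into k groups;
    (2) uncertainty - pick the highest-entropy point in each group."""
    rng = np.random.default_rng(seed)
    # Step 1: plain k-means over the embeddings (diversity).
    centers = emb[rng.choice(len(emb), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(emb[:, None] - centers[None], axis=2)
        assign = dists.argmin(axis=1)
        for c in range(k):
            if (assign == c).any():
                centers[c] = emb[assign == c].mean(axis=0)
    # Step 2: within each cluster, keep the most uncertain point (entropy).
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    picks = []
    for c in range(k):
        members = np.flatnonzero(assign == c)
        if members.size:
            picks.append(members[entropy[members].argmax()])
    return np.array(picks)
```

Because each pick comes from a different cluster, the batch is spread across the embedding space (less intra-iteration redundancy), while the per-cluster entropy criterion steers each pick toward the current decision boundary (less inter-iteration redundancy).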


Figure 2: Overall pipeline of TYROGUE.


Evaluation results

To demonstrate the reduction in labeling cost and acquisition latency, we compared TYROGUE with state-of-the-art uncertainty-based (Entropy), diversity-based (FTbertKM), and hybrid (BADGE, ALPS, and CAL) methods. The comparison was made over eight popular datasets covering tasks ranging from topic classification to natural language inference and paraphrase detection.

To evaluate labeling cost reduction, we measured the number of labeled data points needed to achieve prediction performance comparable to models fine-tuned on the entire training set (i.e., fully supervised). We set the target F1 score to 85% and 95% of the fully supervised model's score. Figure 3 shows that with TYROGUE, models can achieve the same prediction F1 using up to 43% fewer labeled training examples than the second-best acquisition algorithm.

Figure 3: Average labeling cost (number of data samples) per iteration to achieve 85% and 95% of the F1 score of a model trained on the entire training set.

To ensure an interactive experience for iterative model development and debugging, the latency of acquisition algorithms matters. Figure 4 reports the time needed to select the next batch of samples to annotate for each acquisition method, averaged over all active iterations and five random trials. TYROGUE reduces the runtime by up to 11x (compared with CAL on QQP) and is the fastest algorithm on six of the eight datasets.

Figure 4: Average per-iteration acquisition time over 5 random runs. Unlike other approaches, TYROGUE’s runtime does not increase with the size of the datasets, thereby significantly reducing acquisition latency.


Interesting future work

Adaptive acquisition: The trade-off between uncertainty and diversity is essential for active acquisition algorithms. We believe TYROGUE and the observations in this work lay the foundation for future work on adaptive acquisition functions that balance both objectives. We aim to investigate strategies for attaining the optimal balance of uncertainty and diversity by taking into account aspects such as model performances and dataset characteristics.

Adoption in practical systems: We believe our multi-step adaptive approach can be incorporated into any annotation platform. Such frameworks can enable rapid iterations in the early stages of model building. Therefore, understanding how TYROGUE can be integrated into existing annotation platforms is an interesting research problem.

Transparency and control for practitioners: Interactive AL is still understudied and merits further investigation. Our proposed design affords users control in balancing the acquisition objectives. However, it is imperative to understand how aspects such as the transparency of the framework and the interpretability of the model may impact users' experience as they reason over the control parameters.

Please check out our EMNLP Findings paper if you are interested in more details.

* (a) shows a 2D projection of BERT embeddings, where colors indicate ground-truth class labels. (b) Uncertainty-based methods tend to acquire similar data points from a specific area within an iteration (see the red box). (c) Diversity-based methods tend to acquire data points similar to the samples acquired in previous iterations (see the blue circles). (d) Hybrid methods may suffer from either sample redundancy issue depending on which objective they prioritize, i.e., diversity (BADGE [Ash et al., 2020] and ALPS [Yuan et al., 2020]) vs. uncertainty (CAL [Margatina et al., 2021]).

Written by: Dan Zhang and Megagon Labs.

