Deep Learning is revolutionizing almost all fields of computer science, including computer vision, natural language processing, and data management. However, the success of deep neural nets depends heavily on the existence of large, high-quality labeled training datasets. For example, the best image classifiers are trained on ImageNet, which contains 14M labeled images, and the best Question-Answering models require a dataset like SQuAD 2.0, which contains 150K high-quality QA pairs. Collecting such datasets can be quite time-consuming and expensive.
To address this, Data Augmentation (DA) has become a common practice in machine learning. Data Augmentation is a technique that transforms existing labeled examples to generate additional training examples. For example, in computer vision, one can transform an image (e.g., of a cat) by flipping, rotating, or cropping it to obtain new images of the same content.
Similar techniques have also been applied to Natural Language Processing (NLP). For example, one can transform an input sentence by deleting a word or replacing a word with its synonym and expect the meaning of the sentence to stay “roughly” unchanged.
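As a rough illustration of these two text operators, here is a minimal Python sketch. The synonym table and the function names (delete_word, replace_with_synonym, augment) are hypothetical and chosen only for this example; a real pipeline would likely draw synonyms from a lexical resource or word embeddings.

```python
import random

# Hypothetical toy synonym table, for illustration only.
SYNONYMS = {"big": ["large", "huge"], "movie": ["film"]}

def delete_word(tokens, rng):
    """Return a copy of tokens with one randomly chosen token removed."""
    if len(tokens) <= 1:
        return list(tokens)  # never delete the only token
    i = rng.randrange(len(tokens))
    return tokens[:i] + tokens[i + 1:]

def replace_with_synonym(tokens, rng):
    """Replace one randomly chosen token that has a known synonym."""
    candidates = [i for i, t in enumerate(tokens) if t.lower() in SYNONYMS]
    if not candidates:
        return list(tokens)  # nothing replaceable, return unchanged copy
    i = rng.choice(candidates)
    out = list(tokens)
    out[i] = rng.choice(SYNONYMS[tokens[i].lower()])
    return out

def augment(sentence, label, n=4, seed=0):
    """Generate n augmented (sentence, label) pairs, reusing the original label."""
    rng = random.Random(seed)
    ops = [delete_word, replace_with_synonym]
    tokens = sentence.split()
    return [(" ".join(rng.choice(ops)(tokens, rng)), label) for _ in range(n)]

if __name__ == "__main__":
    for text, label in augment("a big movie about cats", "positive"):
        print(label, "|", text)
```

Note that every augmented sentence simply inherits the original label, which is exactly the assumption that can break when an operator changes the meaning too much.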
There have also been many success stories of applying DA to data management tasks, including entity matching, data cleaning, and information extraction. However, there are several technical challenges in applying DA in practice. In this blog post, we focus on two of those challenges: (i) label corruption vs. diversity, and (ii) efficiency.
Challenge 1: label corruption vs. diversity
Transformation operators used in data augmentation can generate examples with corrupted labels, for example when an operator changes the semantics of a sentence to the point where the original label no longer applies. On the other hand, when transformation operators are applied conservatively, they may not generate sufficiently diverse examples.
Consider an example from query intent classification: the query “Where is the Orange Bowl” has the intent of seeking a location. A word replacement operator might replace the word “Where” with “What”, thus changing the intent to seeking information.
Because of the risk of label corruption, people tend to rely on simple transformations, such as dropping the word “the”, that are less likely to change the meaning of the sequence. However, the resulting sequences can be so similar to the original sequence that the machine learning model learns little additional information from them.
The case becomes even harder when combining multiple operators: one can easily generate diverse sequences but preserving the label is almost impossible.
Challenge 2: inefficient process of DA
The second challenge that we would like to address is the inefficient process of applying DA in NLP. DA introduces a whole new set of hyper-parameters for tuning. For example, in word replacement, one needs to decide which words to replace and how to sample the target words. If there is more than one operator, the developer needs to go through the process of picking an operator each time, training the model (which can be very slow), and observing the results. If the results are not good enough, this process has to be repeated until the result is satisfactory. The situation becomes even harder when combining multiple operators, which blows up the search space combinatorially. The final result can be sub-optimal even after a long and cumbersome process.
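To get a feel for how quickly this search space grows, the following back-of-the-envelope Python sketch counts candidate configurations. The counting model (ordered chains of operators, each with a fixed number of hyper-parameter settings) and the numbers used are illustrative assumptions, not figures from our experiments.

```python
def count_configs(num_operators, settings_per_operator, max_chain_length):
    """Count ordered operator chains of length 1..max_chain_length.

    Assumption for illustration: each step in a chain independently picks
    one operator and one of its hyper-parameter settings.
    """
    choices_per_step = num_operators * settings_per_operator
    return sum(choices_per_step ** n for n in range(1, max_chain_length + 1))

if __name__ == "__main__":
    # 5 operators x 4 settings = 20 choices per step.
    # Chains of length 1, 2, 3: 20 + 400 + 8000 = 8420 configurations.
    print(count_configs(5, 4, 3))
```

If each configuration requires training the model once to evaluate it, even this small setup is far beyond what a developer can explore by hand.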
Rotom: a multi-purposed data augmentation framework for sequence classification
The goal of Rotom is to train high-quality machine learning models while requiring only a small number (e.g., 200) of labeled examples. Rotom uses a simple task formulation: sequence classification. As a result, Rotom covers a wide range of data management and NLP tasks, including entity matching, error detection in data cleaning, text classification, and more. Just like the Pokémon of the same name, Rotom can serve in different applications and is very good at transformation.
InvDA: DA as self-supervised sequence generation
Rotom addresses the first challenge by formulating DA as a sequence-to-sequence (seq2seq) generation task: given an original sequence from the training set as input, the model generates an augmented training sequence. By formulating the task as seq2seq, we allow Rotom to generate augmentations that are arbitrarily different from the original sequence. However, training a high-quality seq2seq model usually requires a large number of labeled sequence pairs (as in machine translation). To overcome this label requirement, we apply the idea of self-supervision.
We observe that although high-quality augmentations are hard to obtain, the original training examples can be regarded as high-quality results of augmenting some other sequences. For example, “Where is the Orange Bowl” is a good augmentation from the corrupted sequence “the Bowl ? orangish arena”. Based on this observation, we trained a seq2seq operator InvDA to “invert” a series of corruption operators. At prediction time, we can apply the InvDA operator on an original training sequence and expect InvDA to enrich the sequence with natural additional information.
This simple method works surprisingly well as it can generate natural yet diverse augmentations such as “Where is the Indianapolis Bowl in New Orleans?” or “Where is the Syracuse University Orange Bowl?”.
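The self-supervised setup can be sketched as follows: corrupt each original sequence with a few random operators, then use (corrupted, original) as the source and target of a seq2seq training pair. This is a minimal Python sketch under stated assumptions. The specific corruption operators, the FILLER_TOKENS list, and the function names are hypothetical, and the seq2seq model itself is omitted.

```python
import random

# Hypothetical replacement vocabulary, for illustration only.
FILLER_TOKENS = ["?", "arena", "stadium", "game"]

def token_delete(tokens, rng):
    """Remove one random token (no-op on single-token sequences)."""
    if len(tokens) <= 1:
        return list(tokens)
    i = rng.randrange(len(tokens))
    return tokens[:i] + tokens[i + 1:]

def token_swap(tokens, rng):
    """Swap one random pair of adjacent tokens."""
    out = list(tokens)
    if len(out) >= 2:
        i = rng.randrange(len(out) - 1)
        out[i], out[i + 1] = out[i + 1], out[i]
    return out

def token_replace(tokens, rng):
    """Replace one random token with a token from the filler vocabulary."""
    out = list(tokens)
    out[rng.randrange(len(out))] = rng.choice(FILLER_TOKENS)
    return out

CORRUPTIONS = [token_delete, token_swap, token_replace]

def corrupt(tokens, rng, num_ops=3):
    """Apply a series of randomly chosen corruption operators."""
    out = list(tokens)
    for _ in range(num_ops):
        out = rng.choice(CORRUPTIONS)(out, rng)
    return out

def make_invda_pairs(sequences, pairs_per_sequence=2, seed=0):
    """Build (corrupted source, original target) pairs; no labels are needed."""
    rng = random.Random(seed)
    pairs = []
    for seq in sequences:
        tokens = seq.split()
        for _ in range(pairs_per_sequence):
            pairs.append((" ".join(corrupt(tokens, rng)), seq))
    return pairs

if __name__ == "__main__":
    for src, tgt in make_invda_pairs(["Where is the Orange Bowl"]):
        print(repr(src), "->", repr(tgt))
```

A seq2seq model trained on such pairs learns to map a degraded sequence back to a natural one, so feeding it an already clean sequence nudges it to add plausible detail.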
The Meta-learning Framework
Next, although we have InvDA and maybe other powerful operators, there is still no guarantee that the augmented examples indeed help boost the target model’s performance. To this end, we develop a meta-learning framework that automatically selects and combines augmented examples generated by multiple operators.
The framework consists of two policy models, namely the filtering model and the weighting model, in addition to the target model to be trained. Intuitively, the filtering model examines all augmented examples from all DA operators and decides which ones to keep and which ones to discard. Next, the weighting model assigns weights to the remaining examples and assembles them into a training batch to train the target model. We design the filtering model to be a simple linear classification model, while the weighting model is a more heavy-duty Transformer-based model.
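The data flow through the two policy models can be sketched in plain Python as below. This is only an illustration of the filter-then-weight pattern, not Rotom's actual implementation: the feature vectors, the score_fn standing in for the Transformer-based weighting model, and the use of softmax to normalize weights are all assumptions made for this example.

```python
import math

def keep_example(features, w, b):
    """Toy linear filter: keep an example when w . x + b is positive."""
    return sum(wi * xi for wi, xi in zip(w, features)) + b > 0

def normalize(scores):
    """Softmax, used here as one possible way to turn scores into weights."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def build_weighted_batch(candidates, w, b, score_fn):
    """Filter candidates, then weight the survivors.

    candidates: list of (example, feature_vector) from all DA operators.
    score_fn:   hypothetical stand-in for the weighting model.
    Returns a list of (example, weight) with weights summing to 1.
    """
    kept = [ex for ex, feats in candidates if keep_example(feats, w, b)]
    if not kept:
        return []
    weights = normalize([score_fn(ex) for ex in kept])
    return list(zip(kept, weights))

def weighted_loss(batch, loss_fn):
    """Weighted sum of per-example losses used to update the target model."""
    return sum(weight * loss_fn(ex) for ex, weight in batch)

if __name__ == "__main__":
    candidates = [("aug-1", [1.0, 0.2]), ("aug-2", [-1.0, 0.5]), ("aug-3", [2.0, 0.1])]
    batch = build_weighted_batch(candidates, w=[1.0, 0.0], b=0.0,
                                 score_fn=lambda ex: float(len(ex)))
    print(batch)
```

Separating a cheap filter from a more expensive weighting step means the heavier model only has to score the examples that survive filtering.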
To train these policy models, we apply the idea of meta-learning: by learning from the past experience of “teaching” the target model, the filtering and weighting models are expected to gradually learn how to generate training batches of higher quality. Our training algorithm jointly trains the policy and the target models following a pattern commonly seen in automatic machine learning. Please check out our paper for more details.
Experiments: low-resourced Entity Matching, Error Detection, and TextCLS
In our experiments, Rotom achieves significant improvements in low-resourced settings, i.e., when training ML models with only a small number of labels. We configured Rotom to combine regular DA operators, such as word replacement or deletion, with the seq2seq operator InvDA.
For Entity Matching (EM), we compared Rotom with DeepMatcher [SIGMOD ‘18] on 5 standard EM benchmark datasets. While using only 6.4% of the training labels from these tasks (750 labels each), Rotom achieves a 4.6% F1 score improvement over DeepMatcher.
For Error Detection, we compared Rotom with Raha [SIGMOD ‘19], the previous SOTA error detection method in the low-resourced setting. On the 5 evaluated datasets, Rotom achieves a 7.6% F1 score improvement using only 200 labeled cells for each task, which is strictly fewer than the number of labels used by Raha. An interesting observation is that in some cases (e.g., the “beers” and “movies” datasets) where neither regular DA operators nor InvDA alone yields a performance improvement, Rotom effectively combines the two operators and significantly boosts the model’s performance.
For Text Classification, Rotom outperforms two recently proposed data augmentation techniques for NLP tasks (Hu et al. ‘19 and Kumar et al. ‘20). Please check out our papers for more interesting results and findings!
Written by Yuliang Li and Megagon Labs