A multi-purpose data augmentation framework for training high-quality machine learning models
Deep learning is revolutionizing almost all fields of computer science, including computer vision, natural language processing, and data management. However, the success of deep neural networks depends heavily on the availability of large, high-quality labeled training datasets. To reduce this dependence, Data Augmentation (DA) has become a common practice in machine learning: it generates additional training examples from existing ones via data transformation.
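To make "generating additional training examples from existing ones via data transformation" concrete, here is a minimal sketch of two generic token-level DA operators. The operator names (`token_delete`, `token_swap`), the `augment` helper, and the sample sentence are illustrative assumptions, not operators taken from Rotom.

```python
import random

def token_delete(tokens, rng):
    """Drop one randomly chosen token (no-op for sequences of length <= 1)."""
    if len(tokens) <= 1:
        return list(tokens)
    i = rng.randrange(len(tokens))
    return tokens[:i] + tokens[i + 1:]

def token_swap(tokens, rng):
    """Swap one randomly chosen pair of adjacent tokens."""
    if len(tokens) < 2:
        return list(tokens)
    i = rng.randrange(len(tokens) - 1)
    out = list(tokens)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out

def augment(example, operators, rng):
    """Apply each operator once; every augmented copy keeps the original label."""
    text, label = example
    tokens = text.split()
    return [(" ".join(op(tokens, rng)), label) for op in operators]

# Hypothetical labeled example for a question-type classifier.
original = ("where is the orange bowl ?", "location")
rng = random.Random(0)  # fixed seed so runs are repeatable
augmented = augment(original, [token_delete, token_swap], rng)
for text, label in augmented:
    print(label, "|", text)
```

Simple operators like these assume the transformation preserves the label, which is not always true; a deleted or moved token can change the meaning of the sequence.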
We propose Rotom, a multi-purpose data augmentation framework for training high-quality machine learning models while requiring only a small number (e.g., 200) of labeled examples. Rotom formulates every task as sequence classification, a simple formulation that covers a wide range of data management and NLP tasks, including entity matching, error detection in data cleaning, text classification, and more. Rotom leverages (1) pre-trained Seq2Seq models to generate diverse yet natural augmented sequences and (2) meta-learning to train effective policy models that combine the sequences generated by multiple DA operators.
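The sequence classification formulation can be illustrated with a small sketch that casts three of the tasks named above into (sequence, label) pairs. The helper names and the `[COL]`, `[VAL]`, and `[SEP]` markers are assumptions chosen for illustration; the exact serialization Rotom uses is not specified here. The sketch covers only the task formulation, not the Seq2Seq generation or the meta-learned policy.

```python
def serialize_record(record):
    """Flatten a record (dict of column -> value) into one token sequence."""
    return " ".join(f"[COL] {col} [VAL] {val}" for col, val in record.items())

def entity_matching_example(left, right, is_match):
    """Entity matching: classify a pair of serialized records as match (1) or not (0)."""
    text = f"{serialize_record(left)} [SEP] {serialize_record(right)}"
    return text, int(is_match)

def error_detection_example(record, column, is_error):
    """Error detection: classify one cell, given its row as context, as dirty (1) or clean (0)."""
    text = f"{serialize_record(record)} [SEP] [COL] {column} [VAL] {record[column]}"
    return text, int(is_error)

def text_classification_example(text, label):
    """Text classification is already a sequence classification task."""
    return text, label

# Hypothetical inputs for each task.
left = {"title": "iphone 12", "price": "799"}
right = {"title": "apple iphone 12 64gb", "price": "799.00"}
em = entity_matching_example(left, right, True)

row = {"city": "Seatle", "state": "WA"}
ed = error_detection_example(row, "city", True)

tc = text_classification_example("where is the orange bowl ?", "location")

for text, label in (em, ed, tc):
    print(label, "|", text)
```

Because every task reduces to the same (sequence, label) shape, a single classifier architecture and a single set of sequence-level DA operators can, in principle, be reused across all of them.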
Rotom: A Meta-Learned Data Augmentation Framework for Entity Matching, Data Cleaning, Text Classification, and Beyond
Zhengjie Miao, Yuliang Li, Xiaolan Wang – SIGMOD 2021