Yuliang Li, Xiaolan Wang, Zhengjie Miao, Wang-Chiew Tan
Recent years have seen the development of novel data augmentation (DA) techniques for creating the additional training data needed by machine-learning-based solutions. In this tutorial, we provide a comprehensive overview of the DA techniques developed by the data management community for data preparation and data integration. In addition to surveying task-specific DA operators that leverage rules, transformations, and external knowledge to create additional training data, we explore advanced DA techniques such as interpolation, conditional generation, and DA policy learning. Finally, we describe the connections between DA and other machine learning paradigms such as active learning, pre-training, and weakly supervised learning. We hope that this discussion sheds light on future research directions toward a holistic data augmentation framework for high-quality dataset creation.

PVLDB Reference Format: Yuliang Li, Xiaolan Wang, Zhengjie Miao, and Wang-Chiew Tan. Data Augmentation for ML-driven Data Preparation and Integration.