The ACM SIGMOD conference is the leading forum for the principles, techniques, and applications of database management systems and data management technology. This year it took place in mid-June in Seattle, WA, and 2023 also marked the conference's return to all in-person activities. SIGMOD had 26 sponsors this year, including Megagon Labs.
The conference consisted of the research track, the industry track, the demonstration track, 11 tutorials, and 10 workshops. During our trip to SIGMOD, Eser Kandogan served as co-chair of the demo track, Jin Wang presented a tutorial on dataset discovery from data lakes, and Zhengjie Miao served as a mentor at the HILDA workshop. All of us attended several research sessions. This year the research track accepted 186 papers, for an acceptance rate of 28.2%. There were three submission cycles for the research track; the last cycle received the most submissions and the first the fewest. SIGMOD also adopted a new publication policy this year: all peer-reviewed papers from the research and industry tracks are published in the new journal “Proceedings of the ACM on Management of Data,” or PACMMOD, while other content is published in a companion volume.
There were three keynote speakers at SIGMOD this year. Dr. Don Chamberlin from IBM gave an excellent talk about the evolution of the relational data model and SQL, or as he put it, the “structured English query language.” He emphasized the designers' focus on the “human” in shaping the language, noting that they even ran a user study at a university early on. Another key message was how much the project struggled in its early days (yes, even a technology as consequential as SQL), and how the community and standardization helped it gain widespread adoption. These two aspects, a focus on the “human” and the role of community and standardization, carry important lessons for any research project. Don also presented several open challenges for designing the SQL language with human involvement.
Dr. Vanessa Murdock from Amazon gave a keynote introducing her research on the Mixed Methods Machine Learning (MXML) paradigm, which focuses on human-in-the-loop machine learning. Professor Shazia Sadiq from The University of Queensland used her keynote to discuss practices for applying DEI principles in education.
This year there were 28 accepted demonstrations, spanning a wide range of topics from star schemas, distributed transactional databases, and query planning to natural-language-based data exploration. The best demo paper award went to KAMEL, which uses a BERT-based model for trajectory imputation and improves accuracy especially on sparse trajectories. The runners-up were NaturalMiner and ArgusEyes. Overall, with in-person participation this year, the demo session was lively and successful: people got to play with the demos firsthand and interact with the presenters.
Three keynote speakers and 13 papers at the HILDA (Human-In-the-Loop Data Analytics) workshop covered different aspects of human-data interaction. The first keynote was “Curation as Programming: AI, Data Management, and Mediated Knowledge Interaction” by Prof. Bill Howe from the University of Washington, which covered data curation for building AI applications that require objectively correct input, such as urban data and legal documents. The second keynote was “Scaling data-driven Decision-making through Human-AI Interaction” by Anamaria Crisan from Tableau Research. Dr. Crisan emphasized the importance of understanding people, their data, and the processes that influence decision-making with AI, and then presented findings on data visualization as the medium of interaction, one that fosters understanding and provides guidance for examining potentially bad decisions. The last keynote speaker, Yunyao Li from Apple, shared her decade-long experience keeping humans in the loop throughout the entire life cycle of knowledge graph construction, growth, and services.
Although this workshop aims to exchange early results, there were already many interesting papers. For example, the paper “DIG: The Data Interface Grammar” from Columbia University proposed a formal grammar that unifies the declaration of data programs, interfaces, and interactions. There were also works on supporting human users in examining the results of table understanding and dataset discovery tasks through carefully designed visualizations, both of which relate to ongoing projects at Megagon Labs.
There were two panels at SIGMOD: Future of Database System Architectures and Personal Data for Personal Use. Unfortunately, I missed the first one and was only able to attend the latter. The Personal Data panel, led by Alon Halevy and Wang-Chiew Tan, focused on collecting, summarizing, and using personal information to build AI models. With so much personal data spread across social media websites and apps, there is no question that it could drive interesting applications. With the recent advances in AI, the question now is not “could we” but “should we.” This led to several interesting conversations around ethics, and at times heated discussions about how far model development should go, even when the models have high predictive power.
This paper studied the problem of fairness-aware data preparation for machine learning applications. The core insight is that the individual fairness of a training set can be improved by flipping the ground-truth labels of some instances. The authors formalized this as an optimization problem, proved it NP-hard, and gave an approximation algorithm based on linear programming. They also provided theoretical analysis to justify the efficiency of the proposed algorithms.
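The paper's LP-based algorithm is more sophisticated, but the core label-flipping idea can be illustrated with a toy greedy sketch (my own simplification, not the authors' method): treat any pair of similar instances with different labels as an individual-fairness violation, and repeatedly flip the single binary label whose flip removes the most violations.

```python
import numpy as np

def violations(X, y, tau=1.0):
    """Count individual-fairness violations: pairs of instances
    within distance tau of each other whose labels differ."""
    count = 0
    for i in range(len(y)):
        for j in range(i + 1, len(y)):
            if np.linalg.norm(X[i] - X[j]) <= tau and y[i] != y[j]:
                count += 1
    return count

def greedy_label_flip(X, y, budget, tau=1.0):
    """Greedily flip up to `budget` binary labels, each time choosing
    the flip that reduces the violation count the most."""
    y = y.copy()
    for _ in range(budget):
        base = violations(X, y, tau)
        best_i, best_gain = None, 0
        for i in range(len(y)):
            y[i] ^= 1                      # tentatively flip label i
            gain = base - violations(X, y, tau)
            y[i] ^= 1                      # undo the tentative flip
            if gain > best_gain:
                best_i, best_gain = i, gain
        if best_i is None:                 # no flip helps; stop early
            break
        y[best_i] ^= 1
    return y
```

For example, two nearly identical instances with opposite labels form one violation, and a single flip reconciles them. The greedy heuristic gives no approximation guarantee, which is exactly why the authors resort to linear programming.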
This paper introduced a new variant of the entity matching problem, called MIER, which aims to capture varying interpretations of real-world entities based on different user needs. The authors defined MIER by examining real application scenarios and generalizing the classic entity matching definition. They then proposed the FlexER framework, which models the intent-based representations of record pairs as a multiplex graph, as a solution to the new problem. Within this framework, a variant of a GNN captures the inherent relationships in the graph, which are essential for matching.
This paper studied the table union search problem, one of the essential tasks in dataset discovery from data lakes, from a new perspective that considers the semantic relationships between pairs of columns. The researchers presented two new methods for discovering such relationships: the first utilizes an existing knowledge base (KB), while the second, called “synthesized KB,” leverages knowledge extracted from the data lake itself. The researchers also introduced new open benchmarks representing small and large real data lakes, which we have used in our work at Megagon Labs.
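To see what the paper's KB-based semantic matching improves upon, here is the kind of crude value-overlap baseline it goes beyond (my own toy sketch, not the authors' method): score two tables as unionable by the fraction of columns in one table that find a partner column in the other with high Jaccard overlap of value sets.

```python
def jaccard(a, b):
    """Jaccard similarity between two columns' value sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def unionability(table_a, table_b, threshold=0.5):
    """Fraction of columns in table_a (a dict of column name -> values)
    that have some column in table_b with value overlap >= threshold.
    A crude syntactic baseline: it misses semantically related columns
    that share few literal values, which is the gap KB-based methods fill."""
    if not table_a:
        return 0.0
    matched = sum(
        1 for col_a in table_a.values()
        if any(jaccard(col_a, col_b) >= threshold
               for col_b in table_b.values())
    )
    return matched / len(table_a)
```

Columns like “Berlin, Paris” and “Germany, France” score zero here despite being semantically related, which is exactly the case a knowledge base (existing or synthesized from the lake) can recover.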
The paper addressed the challenge of generating data preparation pipelines (sequences of operators such as filtering, missing-value imputation, and transposition) for machine learning tasks. It explored combining the more general human-generated pipelines (HI-pipelines) from domain experts with the more specific, optimized machine-generated pipelines (AI-pipelines) to create better-performing HAI-pipelines. The proposed HAIPipe framework uses an enumeration-sampling strategy to select the combined pipeline and reinforcement learning to optimize the AI-pipelines.
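HAIPipe's actual enumeration-sampling search and RL optimizer are more involved, but the combine-and-select idea can be sketched minimally (my own toy, assuming pipelines are just lists of Python functions): enumerate candidate combinations of the human pipeline with machine-suggested operators, score each candidate on validation data, and keep the best.

```python
from itertools import combinations

def enumerate_candidates(hi_pipe, ai_ops):
    """Enumerate combined pipelines: keep the human pipeline's operator
    order and append any subset of machine-suggested operators.
    (A toy stand-in for HAIPipe's enumeration-sampling search.)"""
    for k in range(len(ai_ops) + 1):
        for subset in combinations(ai_ops, k):
            yield hi_pipe + list(subset)

def run_pipeline(pipe, data):
    """Apply each operator in order to the data."""
    for op in pipe:
        data = op(data)
    return data

def select_best(hi_pipe, ai_ops, data, score):
    """Score every candidate on validation data; return the best pipeline."""
    return max(enumerate_candidates(hi_pipe, ai_ops),
               key=lambda p: score(run_pipeline(p, data)))
```

For instance, if the human pipeline drops missing values and a machine-suggested operator filters outliers, the search keeps the outlier filter only when the validation score says it helps, which is the essential HI-plus-AI trade-off the framework navigates at scale.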
From the research introduced above, we can see some trends at SIGMOD. First, the core topics of traditional database systems remain popular: there are still many research sessions on joins, transactions, indexing, and query processing, and systems papers remain the preferred form for those topics. Second, machine learning continued to be an important methodology for data management problems. While applying machine learning to DB systems themselves is not as popular as before, machine learning is being applied to a broader range of data formats, such as graph, spatial, and temporal data. Finally, representation learning for tabular data is becoming a popular topic, with researchers contributing new benchmark datasets, tasks, and application scenarios.