The ACM SIGMOD conference is the leading forum for the principles, techniques, and applications of database management systems and data management technology. This year it took place in mid-June in Philadelphia, PA. It was the first time SIGMOD had held in-person activities since the pandemic began. The whole event ran in a hybrid format; for each session, the speaker could present either in person or remotely via Zoom. According to statistics from the organizers, more than 550 people attended the conference in person. There were 28 sponsors for SIGMOD this year, and Megagon Labs was a Gold sponsor among them.
The conference consisted of the research track, the industry track, the demonstration track, 8 tutorials, and 9 workshops. During my trip to SIGMOD, I attended one workshop, presented in one research session, and listened to two other research sessions. This year the research track accepted 151 papers with an acceptance rate of 29.3%. There were two submission cycles for the research track, and the number of submissions in each cycle was almost the same.
Keynotes and Awards
There were three keynote speakers at SIGMOD this year. Professor Barbara Liskov from MIT gave an excellent talk about her adventures in academia and shared many valuable thoughts with young researchers. Professor Laks V.S. Lakshmanan from UBC introduced his research on combating misinformation in social networking applications. Professor Christopher Ré from Stanford University provided a big-picture perspective on trending AI techniques, then introduced the trend of data-centric AI, which is closely related to data management.
Personally, I was most impressed by the keynote given by Professor Christopher Ré on the trends of AI. In this talk, he described three trends in the development of AI techniques: data-centric AI, declarative machine learning, and foundation models. Specifically, the foundation models he discussed are very large pre-trained language models, which can be applied to a wide spectrum of NLP and data science applications. In his talk, one typical application scenario of foundation models was data cleaning and integration. These research efforts have a large overlap with Megagon Labs’ ongoing projects related to matching, a commonality that illustrates important directions for future work.
There was one best paper award and two honorable mentions for the research track. The best paper was the joint work by HKUST and Duke University on the topic of data privacy. The two honorable mentions were about the topics of query optimization and transaction processing. All three papers featured novel theoretical results and exploration of brand new methodologies. The best industry paper was awarded to the Photon system by Databricks.
This year, the winner of the Jim Gray Dissertation Award was Dr. Chenggang Wu from UC Berkeley. His research focused on serverless computing, which is a hot topic in cloud computing. He addressed many crucial problems, from storage management to transaction processing to system implementation, based on the idea of serverless computing. There were also two honorable mentions. All of this research included solid system implementations as well as real application scenarios.
A Few Interesting Papers
I would like to introduce some papers published in SIGMOD 2022 that are closely related to the research happening at Megagon Labs. For each paper, I will give a high-level summary. Included are links to the original paper and related resources (if there are any).
In this paper, the authors study the problem of entity resolution. The key observation is that there are many correlations between records within each dataset, and the performance of entity resolution can be further improved by leveraging such signals. To this end, the authors suggest constructing a hierarchical heterogeneous graph for the dataset with three kinds of nodes: entity, attribute, and token. They then propose HHG, an end-to-end framework that takes two datasets of entity entries as input, learns representations by constructing the graph, and enhances them with contextual features. Resource
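To make the three node types more concrete, here is a minimal sketch of how such a graph could be laid out for a handful of records. It is not the authors' HHG implementation: the function name `build_hierarchical_graph`, the tuple-based node encoding, the whitespace tokenizer, and the choice to share token nodes across the whole dataset are all assumptions made for illustration, and the sample records are invented.

```python
from collections import defaultdict


def build_hierarchical_graph(records):
    """Build an undirected graph with entity, attribute, and token nodes.

    `records` maps an entity id to a dict of attribute name -> text value.
    Nodes are tagged tuples; the result maps each node to its neighbors.
    (Hypothetical layout for illustration, not the paper's construction.)
    """
    graph = defaultdict(set)
    for entity_id, attributes in records.items():
        entity_node = ("entity", entity_id)
        for attr_name, value in attributes.items():
            # Assumption: each attribute node belongs to exactly one entity.
            attr_node = ("attribute", entity_id, attr_name)
            graph[entity_node].add(attr_node)
            graph[attr_node].add(entity_node)
            for token in value.lower().split():
                # Assumption: token nodes are shared across records, so
                # records that use the same word become connected.
                token_node = ("token", token)
                graph[attr_node].add(token_node)
                graph[token_node].add(attr_node)
    return graph


# Invented sample records from two hypothetical datasets.
records = {
    "a1": {"title": "Apple iPhone 13", "brand": "Apple"},
    "b7": {"title": "iPhone 13 by Apple", "brand": "Apple Inc"},
}
graph = build_hierarchical_graph(records)

# Attribute nodes that mention the token "iphone", in both records.
iphone_neighbors = graph[("token", "iphone")]
```

In this layout, two records are linked whenever their attributes share a token, which is one simple way to expose the kind of cross-record correlation the paper exploits. A learned model would then compute representations over the graph, which this sketch does not attempt.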
This paper highlights the problem of interactive visualization of SQL queries, a topic at the intersection of the database and HCI fields. The motivation is to support visualization of SQL queries in a multi-view interface. To this end, interactive visualizations need to address the computation overhead of switching between different views in response to user interactions. The authors propose a framework that supports incremental computation in this process. Specifically, they introduce a novel index structure called Difftree. There are also other techniques to improve query processing, such as interaction schema mapping and rule-based methods for search-space pruning.
This paper outlines the problem of providing explanations for the results of link prediction over knowledge graphs. The problem setting is that given two entities, the model aims to predict the tail relation between them. To provide more insight into this dynamic, the paper’s authors first define two kinds of explanations, namely sufficient and necessary ones. Then, they introduce the notion of a mimic to denote the intermediate results produced by insertions and deletions on entities, which are used to generate candidate explanations. The paper also proposes a lightweight post-training process on only the mimics to help find the final explanations. Resource
This paper focuses on the problem of representation learning over tabular datasets. The motivation is that the correlation between different tables might be crucial in many downstream applications. To make use of such information, the authors build a graph over the set of tables and develop techniques to learn representations at different granularities. The learned representations are then used for different classification and regression tasks to illustrate the effectiveness of the proposed techniques.
From the research introduced above, we can see some trends in SIGMOD. First, SIGMOD appreciates novel methodology that examines the core problems of classic relational DBMSs, such as indexing, query optimization, and transaction processing. Second, a trend in recent years is that methodologies from other research fields, such as machine learning, natural language processing, and human-computer interaction, are widely adopted to address data management problems. For example, 7 out of 27 research sessions in SIGMOD this year were about ML for data management or data management for ML. This seems likely to remain a popular topic in upcoming years. Finally, an emerging research direction in SIGMOD, as well as in other database venues, is Responsible Data Management. This year was the first time there was a research session on this topic. In addition, two tutorials, given by Fatemeh et al. and Aditya et al., introduced existing work and future directions on this topic.