Alon Halevy, Flip Korn, Natalya F. Noy, Christopher Olston, Neoklis Polyzotis, Sudip Roy, Steven Euijong Whang
Enterprises increasingly rely on structured datasets to run their businesses. These datasets take a variety of forms, such as structured
files, databases, spreadsheets, or even services that provide access
to the data. The datasets often reside in different storage systems,
may vary in their formats, and may change every day. In this paper,
we present Goods, a project to rethink how we organize structured
datasets at scale, in a setting where teams use diverse and often
idiosyncratic ways to produce the datasets and where there is no
centralized system for storing and querying them. Goods extracts
metadata ranging from salient information about each dataset (owners, timestamps, schema) to relationships among datasets, such as
similarity and provenance. It then exposes this metadata through
services that allow engineers to find datasets within the company,
to monitor datasets, to annotate them so that others can
use them, and to analyze relationships among them. We
discuss the technical challenges that we had to overcome in order
to crawl and infer the metadata for billions of datasets, to maintain the consistency of our metadata catalog at scale, and to expose
the metadata to users. We believe that many of the lessons that we
learned are applicable to building large-scale enterprise-level data-management systems in general.