Alon Halevy, Flip Korn, Natalya F. Noy, Christopher Olston, Neoklis Polyzotis, Sudip Roy, Steven Euijong Whang
Recently, search engines have invested significant effort in
answering entity–attribute queries from structured data, but
have focused mostly on queries for frequent attributes. In
parallel, several research efforts have demonstrated that there
is a long tail of attributes, often thousands per class of entities, that are of interest to users. Researchers are beginning
to leverage these new collections of attributes to expand the
ontologies that power search engines and to recognize
entity–attribute queries. Because of the sheer number of potential
attributes, such tasks require us to impose some structure
on this long and heavy tail of attributes.
This paper introduces the problem of organizing the attributes by expressing the compositional structure of their
names as a rule-based grammar. These rules offer a compact
and rich semantic interpretation of multi-word attributes,
while generalizing from the observed attributes to new unseen ones. The paper describes an unsupervised learning
method to generate such a grammar automatically from a
large set of attribute names. Experiments show that our
method can discover a precise grammar over 100,000 attributes of Countries while providing a 40-fold compaction
over the attribute names. Furthermore, our grammar enables us to increase the precision of attributes from 47% to
more than 90% with only a minimal curation effort. Thus,
our approach provides an efficient and scalable way to expand ontologies with attributes of user interest.
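
As a rough illustration of how a rule-based grammar can compactly describe many multi-word attribute names, the sketch below expands a couple of rules over small word classes. Everything in it is hypothetical: the class names (Crop, Activity), the rule templates, the helper functions, and the way compaction is computed as names per rule are assumptions made for the example, not the grammar, the learning method, or the metric used in the paper.

```python
import string
from itertools import product

# Hypothetical word classes; in practice these would be learned, not hand-written.
CLASSES = {
    "Crop": ["coffee", "wheat", "rice"],
    "Activity": ["production", "exports", "imports"],
}

# Hypothetical rules: each {Slot} stands for any member of the named class.
RULES = ["{Crop} {Activity}", "total {Crop} {Activity}"]


def expand(rule, classes):
    """Return every attribute name that a single rule generates."""
    # Collect the slot names that appear in the rule template.
    slots = [field for _, field, _, _ in string.Formatter().parse(rule) if field]
    names = []
    # Try every combination of class members for the slots.
    for combo in product(*(classes[slot] for slot in slots)):
        names.append(rule.format(**dict(zip(slots, combo))))
    return names


def generate(rules, classes):
    """Return the set of attribute names covered by all rules."""
    covered = set()
    for rule in rules:
        covered.update(expand(rule, classes))
    return covered


attributes = generate(RULES, CLASSES)
# Toy compaction measure (an assumption): attribute names covered per rule.
compaction = len(attributes) / len(RULES)
```

In this toy setting, adding a new word such as "tea" to the Crop class makes names like "tea exports" derivable without adding a rule, which is the kind of generalization to unseen attributes that the abstract describes.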