Entities are integral to understanding natural language text. To this end, the task of entity set expansion (ESE) aims to obtain a comprehensive set of entities (e.g., ‘mini bar’, ‘tv unit’) for a concept (e.g., room features) given a textual corpus and a seed set of entities (e.g., ‘coffee’, ‘iron’) for each concept of interest. Since obtaining large-scale training data for the task is expensive, existing approaches focus on low-resource settings where the seed set is small (< 10 entities per concept).
Figure: Low-resource entity set expansion
Despite recent progress, the reported success of ESE methods is largely limited to benchmarks that focus on named entities (e.g., ‘countries’, ‘diseases’) and on well-written text such as Wikipedia rather than user-generated reviews. Evaluation is also limited to the top 10-50 predictions, regardless of the actual size of the entity set of a concept. As a result, it is unclear whether the reported effectiveness of ESE methods is conditional on datasets, domains, and evaluation methods. In this work, we investigate how well existing ESE methods generalize to user-generated text, which is widely used in real-world applications and is known to have more distinctive characteristics than well-written text.
Characteristics of User-generated Text
Since there are no existing benchmarks for user-generated text, we created new benchmarks for three domains (hotels, restaurants, and jobs) and found new characteristics that distinguish them from benchmarks on well-curated text. The figure below illustrates characteristics of the Wiki benchmark (well-curated text) and the Tripadvisor benchmark (user-generated text).
2) Ground truth for concept-entity pairs in Wiki can be obtained by referring to external resources or common sense. However, some concepts in Tripadvisor are open-ended and subjective, leading to ambiguity. For instance, an entity ‘civic center’ may be either an attraction or a nearby location depending on the context in the review. We refer to such entities as vague entities.
3) Non-named entities (e.g., ‘coffee’ and ‘tv unit’) are typically noun phrases that are not proper names. Although prevalent in all domains, they are largely ignored in existing benchmarks. Even so, Tripadvisor has twice as many non-named entities as Wiki.
4) Different concepts within a domain may exhibit diverse cardinality, i.e., concepts can have a different number of entities in the corresponding entity set. Therefore, simply evaluating top-k predictions may not provide a reliable estimate of performance. The table below shows the distribution of concept sizes across multiple benchmarks.
To expand the seed set, ESE methods typically rank candidate entities extracted from a textual corpus. These methods can be broadly classified into (a) corpus-based methods, which rank candidate entities using contextual features and patterns learned from the corpus, and (b) language model-based methods, which rank candidate entities by probing prior knowledge in a large pre-trained language model. We selected the following representative ESE methods:
b) Embedding baseline (Emb-Base): a simple corpus-based baseline that derives the embedding of an entity by averaging the BERT context embeddings of the sentences that mention it.
c) CGExpan: a state-of-the-art language model-based method that iteratively uses Hearst patterns as prompts for the language model, in addition to other features such as concept-name guidance.
d) LM Probing Baseline (LM-Base): a simple language model-based baseline that excludes additional features such as iterative expansion and concept-name guidance.
Figure: Experiment setup (methods and benchmarks)
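To make the Emb-Base idea concrete, here is a minimal sketch of ranking candidates by averaged context embeddings. It is not the implementation used in the study: the toy 2-d vectors stand in for BERT sentence embeddings, and scoring candidates by cosine similarity to the centroid of the seed entities is our assumption for illustration. The function names (`entity_embeddings`, `expand`) are hypothetical.

```python
import numpy as np

def entity_embeddings(mentions):
    """Average the context vectors of all sentences that mention each entity."""
    return {
        entity: np.mean(np.asarray(vectors, dtype=float), axis=0)
        for entity, vectors in mentions.items()
    }

def cosine(u, v):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def expand(seed, embeddings, top_k):
    """Rank non-seed candidates by similarity to the seed centroid (assumed scoring rule)."""
    centroid = np.mean([embeddings[e] for e in seed], axis=0)
    candidates = [e for e in embeddings if e not in seed]
    ranked = sorted(
        candidates,
        key=lambda e: cosine(embeddings[e], centroid),
        reverse=True,
    )
    return ranked[:top_k]

# Toy context vectors: one row per sentence mentioning the entity.
# In practice these would come from a BERT encoder.
mentions = {
    "coffee": [[1.0, 0.0], [0.9, 0.1]],
    "iron": [[0.8, 0.2]],
    "mini bar": [[0.9, 0.2], [1.0, 0.1]],
    "tv unit": [[0.7, 0.3]],
    "civic center": [[0.0, 1.0]],
}

embeddings = entity_embeddings(mentions)
ranking = expand(["coffee", "iron"], embeddings, top_k=3)
print(ranking)
```

With these toy vectors, the candidates whose contexts resemble those of the seeds (‘mini bar’, ‘tv unit’) rank above the unrelated ‘civic center’.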
We now summarize some of the key insights from our study.
1) Benchmarks based on user-generated text have up to 10x more multifaceted entities, 2x more non-named entities, and 43% higher vagueness than well-curated benchmarks.
2) Existing evaluation metrics tend to overestimate the real-world performance of ESE methods and may be unreliable for evaluating concepts with large entity sets. The figure below shows the performance drop for example concepts at different values of k. We propose to estimate mean average precision (MAP) at gold-k (k_g), where k_g equals the concept size. This metric adapts to concepts of various sizes and gives better estimates of recall.
3) Compared to baselines, state-of-the-art methods tend to underperform on user-generated text, indicating that they do not generalize effectively beyond well-curated text.
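The gold-k idea from the second insight can be sketched in a few lines. This is an illustrative version rather than the exact formula from the paper: normalizing average precision by min(k, concept size) is a common convention that we assume here, and the entities and rankings are made up for the example.

```python
def average_precision_at_k(ranked, gold, k):
    """Average precision over the top-k predictions of one concept.

    Assumption: the sum of precisions at hit positions is normalized
    by min(k, |gold|), a common convention for AP@k.
    """
    gold = set(gold)
    if not gold or k <= 0:
        return 0.0
    hits = 0
    total = 0.0
    for rank, entity in enumerate(ranked[:k], start=1):
        if entity in gold:
            hits += 1
            total += hits / rank  # precision at this hit position
    return total / min(k, len(gold))

def map_at_gold_k(predictions, gold_sets):
    """Mean AP where each concept is cut off at its own gold set size (k_g)."""
    scores = [
        average_precision_at_k(predictions[c], gold_sets[c], len(gold_sets[c]))
        for c in gold_sets
    ]
    return sum(scores) / len(scores)

# Hypothetical gold sets and system rankings for two concepts.
gold_sets = {
    "room features": {"coffee", "iron", "mini bar", "tv unit"},
    "attractions": {"civic center"},
}
predictions = {
    "room features": ["coffee", "iron", "lobby", "pool", "mini bar"],
    "attractions": ["civic center", "museum"],
}

ap_fixed_k = average_precision_at_k(
    predictions["room features"], gold_sets["room features"], k=2
)
ap_gold_k = average_precision_at_k(
    predictions["room features"], gold_sets["room features"], k=4
)
print(ap_fixed_k, ap_gold_k, map_at_gold_k(predictions, gold_sets))
```

In this toy case, a fixed cutoff of k = 2 scores ‘room features’ as perfect even though half of its gold entities are never retrieved, while the cutoff at k_g = 4 exposes the missing entities.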
Our findings indicate that user-generated text poses new challenges for the entity set expansion task, especially because entities can be vague, non-named, and multifaceted. We found that state-of-the-art methods do not generalize well to user-generated text and are often outperformed by simpler baselines. Thus, there is potential for future research on developing entity set expansion methods for user-generated text.
We release new benchmarks at: https://github.com/megagonlabs/eseBench.
Our paper, “Low-resource Entity Set Expansion: A Comprehensive Study on User-generated Text” by Yutong Shao, Nikita Bhutani, Sajjadur Rahman, and Estevam Hruschka, was accepted to NAACL Findings 2022.