Low-resource Entity Set Expansion on User-generated Text: Insights and Takeaways

Entities are integral to understanding natural language text. To this end, the task of entity set expansion (ESE) aims to obtain a comprehensive set of entities (e.g., ‘mini bar’, ‘tv unit’) for a concept (e.g., room features), given a textual corpus and a seed set of entities (e.g., ‘coffee’, ‘iron’) for each concept of interest. Since obtaining large-scale training data for the task is expensive, existing approaches focus on low-resource settings where the seed set is small (< 10 entities per concept).

Figure: Low-resource entity set expansion

Despite recent progress, the reported success of ESE methods is largely limited to benchmarks that focus on named entities (e.g., ‘countries’, ‘diseases’) and well-written text such as Wikipedia, as opposed to user-generated text such as reviews. Evaluation is also limited to the top 10-50 predictions, regardless of the actual size of a concept's entity set. As a result, it is unclear whether the reported effectiveness of ESE methods is conditional on specific datasets, domains, and evaluation methods. In this work, we investigate how well existing ESE methods generalize to user-generated text, which is widely used in many real-world applications and is known to have more distinctive characteristics than well-written text.

Characteristics of User-generated Text

Since there are no existing benchmarks for user-generated text, we created new benchmarks for three domains (hotels, restaurants, and jobs) and found characteristics that distinguish them from benchmarks on well-curated text. The figure below contrasts the Wiki benchmark (well-curated text) with the Tripadvisor benchmark (user-generated text).

1) Concepts in Wiki are well-defined while concepts in Tripadvisor are domain-specific, often with overlapping semantics. As a result, an entity can belong to multiple concepts, referred to as multi-faceted entities (highlighted in blue). 

2) Ground truth for concept-entity pairs in Wiki can be obtained by referring to external resources or common sense. However, some concepts in Tripadvisor are open-ended and subjective, leading to ambiguity. For instance, an entity ‘civic center’ may be either an attraction or a nearby location depending on the context in the review. We refer to such entities as vague entities.

3) Non-named entities (e.g., ‘coffee’ and ‘tv unit’) are noun phrases that are not proper names. Although prevalent across domains, they are largely ignored in existing benchmarks. Even so, Tripadvisor has twice as many non-named entities as Wiki.

4) Different concepts within a domain may exhibit diverse cardinality, i.e., the number of entities in a concept's entity set can vary widely. Therefore, simply evaluating the top-k predictions may not provide a reliable estimate of performance. The table below shows the distribution of concept sizes across multiple benchmarks.
Table: Benchmark data and concept cardinality for each of the above sources

Experimental Setup

To expand the seed set, ESE methods typically rank candidate entities extracted from a textual corpus. These methods can be broadly classified into: (a) corpus-based methods that rank candidate entities using contextual features and patterns learned from the corpus, (b) language model-based methods that rank candidate entities by probing prior knowledge in a large pre-trained language model. We selected the following representative ESE methods:

a) SetExpan: a state-of-the-art corpus-based method that iteratively ranks entity candidates by filtering out noisy skip-gram features.

b) Embedding baseline (Emb-Base): a simple corpus-based baseline that derives an entity embedding by averaging the BERT contextual embeddings of the sentences that mention the entity.

c) CGExpan: a state-of-the-art language model-based method that iteratively uses Hearst patterns as prompts for the language model, in addition to other features such as concept-name guidance.

d) LM Probing Baseline (LM-Base): a simple language model-based baseline that excludes additional features such as iterative expansion and concept-name guidance.
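To make the embedding baseline concrete, here is a minimal sketch of how such a method can rank candidates. This is a hypothetical, simplified implementation, not the paper's code: the `encoder` argument is a placeholder for a contextual encoder such as BERT, and entity mentions are matched by naive substring lookup.

```python
import math

def embed_contexts(entity, corpus, encoder):
    """Average the context vectors of the sentences that mention the entity.

    `encoder` stands in for a sentence encoder such as BERT; here it is any
    callable mapping a sentence to a list of floats (a hypothetical placeholder).
    Mentions are detected by simple substring matching for illustration.
    """
    vecs = [encoder(sent) for sent in corpus if entity in sent]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_candidates(seeds, candidates, corpus, encoder):
    """Rank candidate entities by similarity to the centroid of seed embeddings."""
    seed_vecs = [v for v in (embed_contexts(s, corpus, encoder) for s in seeds)
                 if v is not None]
    dim = len(seed_vecs[0])
    centroid = [sum(v[i] for v in seed_vecs) / len(seed_vecs) for i in range(dim)]
    scored = []
    for cand in candidates:
        v = embed_contexts(cand, corpus, encoder)
        if v is not None:
            scored.append((cand, cosine(centroid, v)))
    return sorted(scored, key=lambda x: x[1], reverse=True)
```

With a real encoder, candidates whose mention contexts resemble the seeds' contexts rise to the top of the ranking; the expanded set is then read off the top of this list.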

Figure: Experiment setup (methods and benchmarks)

We evaluated these methods on two types of benchmarks: well-curated text-based ones such as Wikipedia (Wiki) and news articles (APR), and user-generated text-based benchmarks such as hotel (Tripadvisor) and restaurant (Yelp) reviews. The well-curated text (WCT) benchmarks have typically been used to evaluate the performance of SOTA methods such as SetExpan and CGExpan. For the purpose of this benchmarking study, we created the user-generated text (UGT) benchmarks. Due to the diversity of concept sizes across WCT and UGT benchmarks, we introduced a new metric called mean average precision at gold-k (MAP@gold-k) to rigorously profile the ESE methods. Here, gold-k refers to the actual size of a concept's entity set. For example, for the concept countries, gold-k is 195.
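The metric can be sketched in a few lines. This is an illustrative implementation of standard average precision evaluated at k = gold-k (the gold set size per concept), averaged over concepts; the paper's exact implementation may differ in details.

```python
def average_precision_at_k(ranked, gold, k):
    """AP@k: precision at each rank where a gold entity is retrieved,
    averaged over min(k, |gold|)."""
    if not gold:
        return 0.0
    hits, score = 0, 0.0
    for i, entity in enumerate(ranked[:k], start=1):
        if entity in gold:
            hits += 1
            score += hits / i
    return score / min(k, len(gold))

def map_at_gold_k(predictions, gold_sets):
    """MAP@gold-k: for each concept, evaluate AP with k set to the size of
    that concept's gold entity set, then average over concepts."""
    aps = [average_precision_at_k(predictions[c], gold_sets[c], len(gold_sets[c]))
           for c in gold_sets]
    return sum(aps) / len(aps)
```

Because k tracks each concept's true cardinality, a concept with 195 gold entities is judged on 195 predictions rather than a fixed top 10-50, so the score also reflects recall.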


We now summarize some of the key insights from our study.

1) Benchmarks based on user-generated text have up to 10x more multi-faceted entities, 2x more non-named entities, and 43% higher vagueness than benchmarks based on well-curated text.

2) Existing evaluation metrics tend to overestimate the real-world performance of ESE methods and may be unreliable for concepts with large entity sets. The figure below shows the performance drop for example concepts at different values of k. We therefore propose to estimate mean average precision (MAP) at gold-k (kg), where kg equals the concept size. This metric adapts to concepts of various sizes and gives a better estimate of recall.

3) Compared to simple baselines, state-of-the-art methods tend to underperform on user-generated text, indicating that they do not generalize effectively beyond well-curated text.

4) Performance drops on entities with the distinctive characteristics, i.e., multi-faceted, vague, and non-named entities. The figure below shows the performance of the ESE methods on non-named entities (red bar in subfigure-a) and vague entities (red bar in subfigure-b) compared to entities without these characteristics (green bars in both subfigures). State-of-the-art methods suffer a larger performance drop, so these distinctive entity characteristics partially explain their lower performance on user-generated text.

Concluding Remarks

Our findings indicate that user-generated text poses new challenges for the entity set expansion task, especially as entities can be vague, non-named, and multi-faceted. We found that state-of-the-art methods do not generalize well to user-generated text and are often outperformed by simpler baselines. Thus, there is ample room for future research on entity set expansion methods for user-generated text.

Please check out our paper for more interesting results and findings. We also release new benchmarks at: https://github.com/megagonlabs/eseBench.

Written by: Nikita Bhutani, Sajjadur Rahman, and Megagon Labs

Follow us on LinkedIn and Twitter to stay up to date with new research and projects.
