Paraphrases are vital resources for a wide variety of natural language processing (NLP) applications. Consequently, several paraphrase mining techniques have been developed. While these mining techniques are successful at discovering generic paraphrases, they often fail at identifying domain-specific paraphrases. To solve this, we built Essentia. Using word-alignment graphs, this system extracts domain-specific paraphrases from a set of input sentences, even if the set contains fewer than a handful of sentences.
The Problem With Current Paraphrase Mining Techniques
Whether it’s text-to-text generation, machine translation, or textual entailment recognition, paraphrases play a crucial role in numerous NLP tasks. In the case of the first two categories, they’re integral for creating organic and diverse output texts.
Imagine if a chatbot only greeted users with “How can I help you?” This would quickly become dull and irritating. Thanks to paraphrases, this chatbot can utilize the expressions “Can I help you?” and “What can I do for you?” interchangeably in user interactions.
Existing paraphrase mining techniques focus on generic paraphrases rather than domain-specific paraphrases. As the term implies, domain-specific paraphrases are expressions that only apply to a specific niche. For instance, “having a late checkout” is a paraphrase of “extending the checkout” in the hospitality domain. Likewise, “get a table,” “make a reservation,” and “book a table” are prime examples of paraphrases in the restaurant domain.
To put this in proper perspective, let’s examine three sentences in the restaurant domain:
- Can I please get a table for tonight?
- Can I book a table for dinner?
- Can I make a reservation for tonight?
The expressions “get a table” and “make a reservation” are not semantically similar in generic contexts, so general mining systems would not consider them paraphrases of each other. But to a human, it’s clear that these sentences do align contextually. Furthermore, general paraphrase mining techniques usually rely on large corpora and statistical methods to find paraphrases, whereas domain-specific corpora are often too small for such methods. Sometimes, a corpus of dialogue system scripts contains only a few hundred sentences.
Introducing Essentia
We developed Essentia to address the limited data availability restricting automatic discovery of domain-specific paraphrases. This novel system mines high-quality domain-specific paraphrases from small corpora. In a nutshell, Essentia creates a graph-based representation of a set of sentences, called a word-alignment graph, in which a word that is shared (in a similar context) by multiple sentences is represented by a single graph node. From this graph, it then outputs a set of domain-specific paraphrases.
In our previous restaurant domain example, Essentia mines {“book a table,” “make a reservation,” “get a table”} from those three example sentences. Context is key to this capability. While these expressions are not semantically similar, Essentia’s word-alignment graph reveals that they do share contextual similarities; it recognizes that the surrounding words before and after the phrases share the same patterns.
To understand this better, let’s delve deeper into Essentia’s pipeline.
How Essentia's Pipeline Mines Domain-Specific Paraphrases
Essentia contains three main components: a word aligner, a word-alignment graph generator, and a paraphrase generator.
The word aligner identifies identical words and synonyms between sentences. In the case of our restaurant domain example, it recognizes the following shared words: “Can,” “I,” and “for.” The word-alignment graph generator then uses these shared words to represent the input sentences in a graph-based data structure called a word lattice.
The word lattice representation merges identical words and synonyms together by consulting the well-known paraphrase database (PPDB) and examining the POS tags of each token. The merge is done only if the two sentences exhibit a certain degree of semantic similarity (measured through the cosine similarity of their embeddings computed using FastText). Words that cannot be merged are represented as independent paths. Once this is done, the paraphrase generator mines domain-specific paraphrases from the word lattice by discovering parallel paths between nodes.
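The sentence-level similarity check mentioned above can be sketched in a few lines. This is only an illustration, not Essentia's actual code: the tiny vectors stand in for real FastText sentence embeddings, and both the function names and the 0.7 threshold are hypothetical, since the article does not state which cutoff is used.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero embedding as dissimilar to everything
    return dot / (norm_u * norm_v)


def may_merge(embedding_a, embedding_b, threshold=0.7):
    """Allow merging only if the two sentence embeddings are similar enough.

    `threshold` is an assumed value chosen for illustration.
    """
    return cosine_similarity(embedding_a, embedding_b) >= threshold


# Toy stand-ins for sentence embeddings (real ones would come from FastText).
sentence_a = [1.0, 2.0, 3.0]
sentence_b = [2.0, 4.0, 6.0]   # same direction as sentence_a
sentence_c = [3.0, 0.0, -1.0]  # orthogonal to sentence_a

print(may_merge(sentence_a, sentence_b))
print(may_merge(sentence_a, sentence_c))
```

Gating on sentence-level similarity keeps words from being merged across sentences that merely share vocabulary but talk about different things.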
In our example, {“book a table,” “get a table,” “make a reservation”} fall between the nodes “I” and “for.” As a result, Essentia reports this group as a set of paraphrase candidates. These candidates are then given to human annotators on crowd-sourcing platforms, who filter out the incorrectly mined paraphrases.
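To make the idea of parallel paths concrete, here is a minimal sketch of a lattice built from the three example sentences. It rests on several assumptions that are not part of Essentia itself: the shared words are supplied by hand in `ANCHORS` rather than computed by a word aligner, “please” is listed in `OPTIONAL` as if it had already been flagged as an unnecessary word, and the function names and the `<s>` / `</s>` boundary markers are purely illustrative.

```python
from collections import defaultdict

SENTENCES = [
    "Can I please get a table for tonight?",
    "Can I book a table for dinner?",
    "Can I make a reservation for tonight?",
]
ANCHORS = {"can", "i", "for"}  # assumed aligner output: shared words
OPTIONAL = {"please"}          # assumed to be flagged as unnecessary already


def tokenize(sentence):
    return sentence.lower().rstrip("?.!").split()


def build_lattice(sentences, anchors, optional):
    """Shared words become one node; other words stay sentence-specific."""
    edges = defaultdict(set)
    for sid, sentence in enumerate(sentences):
        prev = "<s>"
        for pos, word in enumerate(tokenize(sentence)):
            if word in optional:
                continue
            # Shared nodes are plain strings; private nodes are tuples.
            node = word if word in anchors else (word, sid, pos)
            edges[prev].add(node)
            prev = node
        edges[prev].add("</s>")
    return edges


def parallel_paths(edges):
    """Group the phrases that connect the same pair of shared nodes."""
    groups = defaultdict(set)

    def walk(start, node, words):
        if isinstance(node, str):  # reached the next shared node
            if words:
                groups[(start, node)].add(" ".join(words))
            return
        for nxt in edges.get(node, ()):
            walk(start, nxt, words + [node[0]])

    for start in [n for n in edges if isinstance(n, str)]:
        for nxt in edges[start]:
            walk(start, nxt, [])
    # Only pairs joined by more than one distinct path are candidates.
    return {pair: phrases for pair, phrases in groups.items() if len(phrases) > 1}


candidates = parallel_paths(build_lattice(SENTENCES, ANCHORS, OPTIONAL))
for (left, right), phrases in candidates.items():
    print(left, "...", right, sorted(phrases))
```

Under these assumptions, the paths between “i” and “for” are “book a table,” “get a table,” and “make a reservation.” The sketch also groups “tonight” and “dinner” after “for,” which is not a true paraphrase pair and illustrates why candidates still need filtering.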
The paraphrase generator can also mine words that are unnecessary in a sentence. It does this by recognizing nodes that form a loop, such as “please” in the first sentence in our example. Such words can be removed without affecting the core meaning of the sentences.
Evaluating Essentia
We evaluate Essentia by comparing the domain-specific paraphrases it mines with those in PPDB, a widely used generic paraphrase database that is considered the most extensive one available. It turns out that our domain-specific paraphrases complement and augment PPDB. Specifically, PPDB contained only 4% of the 726 correct extractions that Essentia made.
We also evaluate Essentia on a public dataset called Snips that contains a collection of queries submitted to smart conversational devices like Google Home or Alexa. Essentia manages to mine non-trivial domain-specific paraphrases such as {“show me the way”, “get my directions”} from the dataset.
Future Possibilities For Essentia and Word-Alignment Graphs
Our evaluation results suggest that Essentia has considerable potential, and future developments could go in several directions.
One possibility is to use Essentia to derive domain-specific sentence templates from corpora. These could be useful for natural language generation in question-answering systems or dialogue systems. Essentia could also potentially recognize linguistic patterns other than paraphrases; this would allow us to identify the essential constituents of a sentence.
For now, our next goal is to expand Essentia’s capabilities so it can handle noisy corpora where sentences do not necessarily come from the same domain. For example, given two sentences “Can you tell me the direction to the parking lot?” and “How can I order a delivery?”, can we identify paraphrases {“Can you tell me”, “How can I”} that convey the request part of these sentences? Stay tuned! We’ll be sure to post updates to our blog soon.
Want to learn more about Essentia? Check out our research paper! Do you have any questions? Feel free to contact us today!
Written by Chen Chen, Behzad Golshan and Megagon Labs