Open information extraction (Open-IE) is one of the primary processes for building knowledge bases that power question answering systems. But research in this field has solely focused on extracting information (e.g., arg1-rel-arg2 triples) from individual sentences in a text, drastically limiting its potential to extract information from multi-sentence texts such as conversations. To solve this problem, we built NeurON, an end-to-end system that can extract information from question-answer pairs in conversational data.
What Is Open Information Extraction?
Open-IE refers to the task of automatically generating structured, machine-readable representations of information from unstructured data sources like web documents and free-form text. Despite its crucial role in building knowledge bases, Open-IE has mostly been limited to single-sentence use cases; extracting value from multi-sentence texts such as conversations has long been overlooked.
Current Open-IE approaches have two limitations when it comes to extracting information from question-answer pairs in conversations:
- Relevant insights that span multiple sentences are unavailable.
- Information from a sentence is hard to interpret correctly without understanding the context. This is especially the case for question-answer pairs, where it can be difficult to understand the answer text without the context provided by the question.
Conversational data contain the exact knowledge that users care about and offer an objective-oriented method for extending knowledge bases. Owing to the limitations of existing Open-IE technologies, we focus on developing methods to extract information from question-answer pairs in conversations.
Conversational Data: The Key to Contextual Question-Answer Pairs
Let’s look at an example to make this problem concrete. Consider hotel knowledge bases, which power websites or conversational interfaces for hotel guests. These knowledge bases typically provide information about the hotel’s amenities such as complimentary breakfasts, free Wi-Fi, or spa services.
However, it’s not unusual for hotel knowledge bases to lack deeper data about these features. For instance, what’s on the menu for breakfast, and when does it start? What are the Wi-Fi credentials? And what’s the spa’s cancellation policy? To answer these queries, you may have to ask an employee of the facility.
Given the wide range of information that may be of interest to guests, it’s not exactly clear how to extend these knowledge bases most effectively. Fortunately, conversational logs present a promising avenue. Many hotels keep these records which contain a wealth of actual questions from real guests and answers from the hotel staff. Therefore, they can be used as a resource to extend knowledge base capabilities.
Facts in knowledge bases normally consist of a relation and a set of arguments, stored as a finite ordered list of elements known as a tuple. In these scenarios, tuples usually follow the format <arg1, rel, arg2>, for example <gym, is located on, third floor>.
Harvesting tuple facts from conversational data presents significant challenges. In particular, the Open-IE system must interpret the question and the answer jointly. When a guest asks about the gym and the answer only names a floor, the system must realize that <third floor> refers to the location of the <gym>. Similarly, when a guest asks about the pool and the answer only gives a time, it must recognize that <6:00am daily> is the pool’s opening time.
Existing Open-IE systems operate over individual sentences and ignore the context carried by the question in a question-answer pair. Without knowing the question, they either misinterpret the answer or fail to extract anything from it.
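To make the tuple format concrete, here is a minimal sketch of how such facts could be represented. The `Fact` type and the wording of the question-answer pairs are assumptions made for illustration; only the gym and pool details come from the discussion above.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    """A knowledge-base fact in <arg1, rel, arg2> form (hypothetical type)."""
    arg1: str
    rel: str
    arg2: str


# Hypothetical question-answer pairs; the exact wording is invented.
qa_pairs = [
    ("Where is the gym?", "It is located on the third floor."),
    ("When does the pool open?", "It opens at 6:00am daily."),
]

# Tuples a question-aware extractor would ideally produce.
facts = [
    Fact("gym", "is located on", "third floor"),
    Fact("pool", "opens at", "6:00am daily"),
]

for (question, answer), fact in zip(qa_pairs, facts):
    # arg1 appears only in the question and arg2 only in the answer,
    # so neither sentence alone is enough to build the tuple.
    assert fact.arg1 in question.lower()
    assert fact.arg2 in answer.lower()
    print(f"<{fact.arg1}, {fact.rel}, {fact.arg2}>")
```

In both cases the first argument comes from the question and the second from the answer, which is why a sentence-level extractor would miss the fact.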
Building an End-To-End Open-IE System
To explicitly model both queries and responses in question-answer pairs, we frame Open-IE from conversational data as a multi-source sequence-to-sequence generation problem. An encoder-decoder framework is a popular choice for sequence-to-sequence generation.
In this framework, the encoder encodes the input sequence into an internal representation, and the decoder uses that representation to generate an output sequence. In the standard single-sentence setting, the input sequence is a sentence, and the output sequence is a tuple with special placeholders (e.g., <arg1>pool</arg1>open<arg2>6:00am</arg2>).
In a conversational setting, both the question and the answer are integral for extracting meaningful tuples. Thus, we propose a multi-encoder, constrained-decoder framework for NeurON that uses two encoders. The two encoders encode the question and answer sequences into internal representations. The decoder then uses these representations to generate the output tuple sequence.
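As a rough illustration of the placeholder format, the sketch below turns a tagged output sequence back into an <arg1, rel, arg2> tuple. The `parse_tagged` helper is hypothetical and assumes the relation sits untagged between the two argument spans, as in the example above.

```python
import re
from typing import Optional, Tuple

# Assumed layout: <arg1>...</arg1> relation <arg2>...</arg2>
TAGGED = re.compile(r"^<arg1>(.+?)</arg1>(.+?)<arg2>(.+?)</arg2>$")


def parse_tagged(sequence: str) -> Optional[Tuple[str, str, str]]:
    """Return (arg1, rel, arg2), or None if the sequence is malformed."""
    match = TAGGED.match(sequence.strip())
    if match is None:
        return None
    arg1, rel, arg2 = (part.strip() for part in match.groups())
    # Reject sequences where any element is only whitespace.
    if not (arg1 and rel and arg2):
        return None
    return arg1, rel, arg2


print(parse_tagged("<arg1>pool</arg1>open<arg2>6:00am</arg2>"))
# ('pool', 'open', '6:00am')
print(parse_tagged("pool open 6:00am"))
# None
```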
NeurON's Unique Capabilities
Encoder-decoder frameworks have been used extensively for machine translation and summarization. But they pose some key challenges for information extraction from conversational data.
First, it’s imperative to correctly model both questions and answers. NeurON employs different encoders for questions and answers in order to handle the differences in their respective vocabularies.
Second, the model must learn certain constraints, such as:
- Arguments and relations are sub-spans from the input sequence.
- Output sequences must follow proper syntax and contain two arguments and one relation (e.g., <gym, is located on, third floor>).
NeurON uses a constrained decoder to enforce these requirements as hard constraints.
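The two hard constraints listed above can be sketched as a simple validity check. This is not NeurON's decoder: it only checks a finished candidate after the fact, whereas a constrained decoder would apply such rules while generating. The function names, the tokenizer, and the reading of "sub-span" as a contiguous token run in either the question or the answer are all assumptions.

```python
import re
from typing import List, Sequence


def tokenize(text: str) -> List[str]:
    # Lowercased word tokens; ':' is kept so times like "6:00am" stay whole.
    return re.findall(r"[\w:]+", text.lower())


def is_subspan(span: Sequence[str], source: Sequence[str]) -> bool:
    """True if `span` occurs as a contiguous run of tokens in `source`."""
    n = len(span)
    if n == 0:
        return False
    return any(
        list(source[i:i + n]) == list(span)
        for i in range(len(source) - n + 1)
    )


def satisfies_hard_constraints(
    candidate: Sequence[str], question: str, answer: str
) -> bool:
    # Syntax constraint: exactly two arguments and one relation.
    if len(candidate) != 3:
        return False
    sources = [tokenize(question), tokenize(answer)]
    # Sub-span constraint: every element must come from the input text.
    return all(
        any(is_subspan(tokenize(part), source) for source in sources)
        for part in candidate
    )


question = "Where is the gym?"
answer = "It is located on the third floor."
print(satisfies_hard_constraints(
    ("gym", "is located on", "third floor"), question, answer))   # True
print(satisfies_hard_constraints(
    ("gym", "is located on", "second floor"), question, answer))  # False
print(satisfies_hard_constraints(
    ("gym", "third floor"), question, answer))                    # False
```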
Lastly, the model must recognize auxiliary information that’s irrelevant to knowledge bases. Greetings are a prime example of this in the hotel domain. Since existing facts in knowledge bases are representative of the domain, NeurON can incorporate this prior information as soft constraints in the decoder. This allows it to rank various output sequences based on their relevance.
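One way to picture soft constraints is as a re-ranking step that nudges candidates toward relations already seen in the knowledge base. The sketch below is only an illustration of that idea: the scoring formula, the Jaccard overlap measure, the `weight` parameter, and the sample scores are all assumptions, not a description of how NeurON computes relevance.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]
Candidate = Tuple[Triple, float]  # (tuple, model log-probability)


def relevance(rel: str, kb_relations: Set[str]) -> float:
    """Best Jaccard token overlap between `rel` and any known KB relation."""
    rel_tokens = set(rel.lower().split())
    best = 0.0
    for known in kb_relations:
        known_tokens = set(known.lower().split())
        union = rel_tokens | known_tokens
        if union:
            best = max(best, len(rel_tokens & known_tokens) / len(union))
    return best


def rerank(
    candidates: List[Candidate], kb_relations: Set[str], weight: float = 1.0
) -> List[Candidate]:
    """Sort candidates by model score plus a weighted relevance bonus."""
    return sorted(
        candidates,
        key=lambda c: c[1] + weight * relevance(c[0][1], kb_relations),
        reverse=True,
    )


# Hypothetical KB relations and decoder scores.
kb_relations = {"is located on", "opens at"}
candidates = [
    (("you", "are", "welcome"), -0.9),              # greeting, not a KB fact
    (("gym", "is located on", "third floor"), -1.2),
]

for triple, score in rerank(candidates, kb_relations):
    print(triple, score)
```

With these made-up numbers, the greeting has the higher model score but no overlap with known relations, so the gym tuple is ranked first after the bonus is applied.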
Evaluating NeurON's Open-IE Performance
To train NeurON, we bootstrapped high-quality training examples using StanfordIE, a state-of-the-art (SOTA) Open-IE system. We compared NeurON with two SOTA baselines: BI-LSTM-CRF and Neural Open IE. Both were trained for tuple extraction from conversational question-answer pairs.
We evaluated the systems’ performance on two conversational datasets: Concierge and AmazonQA. Our experiments revealed that, by incorporating hard and soft constraints, our constrained decoder improves extraction performance.
More specifically, NeurON outperforms SOTA sentence-based models in precision and recall of tuple extraction from question-answer pairs by as much as 13.3%. We also found that it can discover 15-25% more tuples than the other extraction methods.
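For readers unfamiliar with the metrics, precision and recall for tuple extraction can be computed as in the sketch below. It assumes a predicted tuple counts as correct only if it exactly matches a gold tuple; how matches were actually judged in the evaluation is not stated here, and the sample tuples are invented.

```python
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]


def precision_recall(
    predicted: Iterable[Triple], gold: Iterable[Triple]
) -> Tuple[float, float]:
    """Exact-match precision and recall over sets of tuples."""
    predicted_set, gold_set = set(predicted), set(gold)
    true_positives = len(predicted_set & gold_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall


# Hypothetical gold and predicted tuples.
gold = [
    ("gym", "is located on", "third floor"),
    ("pool", "opens at", "6:00am daily"),
    ("breakfast", "starts at", "7:00am"),
    ("spa", "closes at", "9:00pm"),
]
predicted = [
    ("gym", "is located on", "third floor"),
    ("pool", "is", "open"),
]

print(precision_recall(predicted, gold))  # (0.5, 0.25)
```

Precision is the share of extracted tuples that are correct, while recall is the share of gold tuples that were found, so a system that discovers more correct tuples raises recall.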
NeurON has immense potential for extending the capabilities of knowledge bases and opens up several possibilities for future research. We are interested in investigating whether it can be expanded to work with open-domain question-answer corpora and be used for creating knowledge bases for open-domain question answering. We are also looking to apply NeurON to other NLP problems such as text comprehension and data augmentation. Stay tuned and be sure to check our blog regularly — we’ll update it with the latest developments on NeurON!
Written by Nikita Bhutani and Megagon Labs