Community-based Question Answering (CQA) is popular across many web platforms and domains. It enables users to post questions about specific entities, such as products, services, and companies, and to obtain answers from other users who have previous experience with the same entity. With the growth of online services, CQA has become essential for many purposes, including online shopping, hotel and restaurant booking, and job searching.
While CQA greatly helps users make decisions, digesting information from the original question-and-answer pairs (QA pairs) has also become increasingly difficult. Due to its community-based nature, CQA tends to accumulate a large number of heavily repetitive QA pairs, which makes it hard for users, especially those without a specific intent (i.e., a concrete question), to find and digest key information (see Figure 1).
Figure 1. An example of Community-based Question Answering in an online shopping context with an overflow of QA pairs for a specific entity.
To help users facing this difficulty (an overflow of QA pairs for a specific entity of interest), we proposed “CQA Summarization” as a new Natural Language Processing (NLP) task.
As shown in Figure 2, CQA Summarization takes as input the QA pairs about a single entity and produces as output a summary written in declarative sentences.
Figure 2. An example of a possible CQA Summarization task input and output.
Together with the new NLP task, we also created a publicly available corpus (named CoQASum) and a strong baseline model (named DedupLED) for CQA Summarization. These three contributions will be presented to the research community in our paper accepted at the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP 2022), a top-tier NLP conference to be held in December 2022 in Abu Dhabi.
In the remainder of this blog post, we explore the novel CQA Summarization task and present the two scientific artifacts created by our lab: CoQASum and DedupLED.
The CoQASum Corpus
The CoQASum corpus was created so that the research community can contribute new models and approaches to the CQA Summarization task. The main goal was to build a collection of QA pairs paired with human-written summaries. One of the biggest challenges in creating such a collection is that reading and summarizing a set of QA pairs is not an easy task, even for humans. The main difficulties to overcome during CoQASum creation were: (1) the large number of QA pairs for each entity, (2) the heavy repetition and noise in both questions and answers, and (3) the difficulty of converting questions and answers into declarative summaries.
To construct CoQASum, Megagon Labs researchers designed a multi-stage annotation framework that first simplifies this complex annotation task into more straightforward annotation tasks and then enriches the collected annotations. Figure 3 depicts the schematic procedure of the multi-stage annotation framework, which consists of three steps. For each entity and its corresponding QA pairs in the original corpus, the first step is to select representative seed QA pairs and ask annotators to rewrite them as declarative sentences, which are then concatenated into a raw summary. In the second step, we ask highly skilled annotators to polish the raw summary into a more fluent summary. In the last step, we enrich the seed QA pairs by selecting semantically similar QA pairs from the original corpus.
Figure 3. Schematic procedure of the multi-stage annotation framework used to create CoQASum.
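The enrichment idea in Step 3 can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the actual framework likely measures semantic similarity with sentence embeddings, while here simple token-overlap (Jaccard) similarity stands in for it, and the `enrich` helper and its threshold are invented for the example.

```python
# Hypothetical sketch of Step 3: enriching seed QA pairs by selecting
# similar QA pairs from the corpus. Token-level Jaccard similarity is a
# stand-in for whatever semantic similarity measure is actually used.

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two texts."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def enrich(seed_pairs, corpus_pairs, threshold=0.4):
    """Select corpus QA pairs that closely match any seed QA pair."""
    return [qa for qa in corpus_pairs
            if max(jaccard(qa, seed) for seed in seed_pairs) >= threshold]

seeds = ["Is the battery replaceable? Yes, it is replaceable."]
corpus = [
    "Can the battery be replaced? Yes it can.",
    "Is the battery replaceable by the user? Yes, it is easily replaceable.",
    "What color options are available? Black and silver.",
]
print(enrich(seeds, corpus))  # keeps only the near-paraphrase of the seed
```

With a stricter threshold the enriched set shrinks toward exact paraphrases; the right operating point depends on how noisy the underlying corpus is.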
The multi-stage annotation framework was used to create the CoQASum benchmark based on the Amazon QA dataset (Wan and McAuley, 2016; McAuley and Yang, 2016). Starting from the Amazon QA dataset as the original corpus, 1,440 entities were selected from 17 product categories, yielding 39,485 input QA pairs and 1,440 reference summaries. Additionally, CoQASum contains the rewritten declarative sentences from the QA pair rewriting task in Step 1: three annotations for each of the 11,520 seed QA pairs (k = 8 seed QA pairs per entity).
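The quoted counts fit together as a quick sanity check (the variable names below are just for illustration):

```python
# Sanity-checking the corpus statistics quoted above.
n_entities = 1440            # entities selected from 17 product categories
k_seed_per_entity = 8        # k = 8 seed QA pairs per entity
annotations_per_seed = 3     # rewrites collected per seed QA pair

n_seed_pairs = n_entities * k_seed_per_entity
print(n_seed_pairs)                          # 11520 seed QA pairs
print(n_seed_pairs * annotations_per_seed)   # 34560 rewrite annotations in total
```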
The statistics of CoQASum (presented in Table 1) confirm that the average word counts of the input QA pairs, raw summaries, and reference summaries are consistent across categories. The novel n-gram distributions also confirm that CoQASum offers a fairly abstractive summarization task. Some product categories, such as “Office Products” and “Patio Lawn and Garden,” have lower novel n-gram ratios, indicating that the task becomes relatively extractive for them. The word-count difference between the raw summary and the reference summary, which indicates that the raw summary still contains redundant information, supports the value and quality of the summary-writing task in Step 2.
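For readers unfamiliar with the metric: the novel n-gram ratio is the fraction of a summary's n-grams that never appear in its source, so higher values mean a more abstractive dataset. A minimal sketch (the paper's exact tokenization and counting may differ):

```python
# Illustrative computation of the novel n-gram ratio of a summary
# relative to its source text.

def ngrams(tokens, n):
    """All n-grams of a token list, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novel_ngram_ratio(source: str, summary: str, n: int = 2) -> float:
    """Fraction of summary n-grams that do not occur in the source."""
    src = ngrams(source.lower().split(), n)
    summ = ngrams(summary.lower().split(), n)
    if not summ:
        return 0.0
    return len(summ - src) / len(summ)

source = "is the battery replaceable yes it is replaceable"
summary = "the battery is replaceable"
print(novel_ngram_ratio(source, summary, n=2))  # 1 of 3 bigrams is novel
```

A purely extractive summary (copied spans) scores near 0, while heavy paraphrasing pushes the ratio toward 1.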
The DedupLED Model

As previously mentioned, one of the defining characteristics of CQA data is the large amount of redundant and duplicated information. Current state-of-the-art summarization models, such as pre-trained encoder-decoder models, do not explicitly implement any deduplication functionality. Thus, the DedupLED model was created by simply adding a deduplication layer to the Longformer Encoder-Decoder (LED), a state-of-the-art summarization model. More technical details about DedupLED can be found in our paper.
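To make the intuition concrete: DedupLED's deduplication is a component inside the model (see the paper for the actual mechanism), but the underlying idea can be illustrated as input-side near-duplicate filtering of QA pairs before summarization. Everything below, including the `dedup_qa_pairs` helper and its similarity threshold, is an assumption for illustration only.

```python
# Illustrative only: filtering near-duplicate QA pairs with a string
# similarity measure. DedupLED itself performs deduplication inside the
# model rather than as a preprocessing step.
from difflib import SequenceMatcher

def dedup_qa_pairs(qa_pairs, threshold=0.8):
    """Keep a QA pair only if it is not a near-duplicate of one kept earlier."""
    kept = []
    for qa in qa_pairs:
        if all(SequenceMatcher(None, qa.lower(), k.lower()).ratio() < threshold
               for k in kept):
            kept.append(qa)
    return kept

qa_pairs = [
    "Does it come with a charger? Yes, a charger is included.",
    "Does it come with a charger? Yes, the charger is included.",
    "How long is the warranty? One year.",
]
print(dedup_qa_pairs(qa_pairs))  # the near-duplicate second pair is dropped
```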
To better understand the impact of adding the deduplication layer to the traditional LED model, our researchers conducted comparative experiments to evaluate DedupLED, LED, and other classical (extractive, abstractive, and hybrid) summarization models. CoQASum data was split into train/validation/test sets, which consisted of 1152/144/144 entities, respectively. Table 2 shows the results of the experiments based on automatic evaluation. These results reveal that DedupLED outperformed all other classical summarization models on the CoQASum dataset. In addition to the automated evaluation, human evaluation was also conducted to judge the quality of generated summaries by the different summarization models. The human evaluation performance trend aligns with the automatic evaluation performance. Thus, it is possible to conclude that adding the deduplication layer to LED can generate a positive impact for the new CQA Summarization task.
Table 2. Performance of the models on CoQASum. R1/R2/RL/BS denotes ROUGE-1/2/L F1 and BERTScore F1, respectively.
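The entity-level split reported above (1,152/144/144 train/validation/test entities) can be reproduced with a deterministic shuffle; the exact entity assignment in CoQASum may differ, and the `split_entities` helper and seed are assumptions of this sketch.

```python
# A minimal sketch of an entity-level train/validation/test split with the
# sizes reported for CoQASum (1152/144/144).
import random

def split_entities(entity_ids, n_val=144, n_test=144, seed=0):
    """Shuffle entity ids deterministically, then slice off test and val."""
    ids = list(entity_ids)
    random.Random(seed).shuffle(ids)  # fixed seed for reproducibility
    test = ids[:n_test]
    val = ids[n_test:n_test + n_val]
    train = ids[n_test + n_val:]
    return train, val, test

train, val, test = split_entities(range(1440))
print(len(train), len(val), len(test))  # 1152 144 144
```

Splitting by entity (rather than by QA pair) keeps all QA pairs about one entity in the same partition, which avoids leakage between train and test.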
We proposed CQA Summarization, a new task focused on summarizing QA pairs in Community-based Question Answering. In addition, we developed a multi-stage annotation framework and used it to create CoQASum, a benchmark for the task. We also evaluated a collection of extractive and abstractive summarization methods and established DedupLED as a strong baseline. An empirical evaluation confirmed the positive impact of DedupLED's deduplication layer compared with classical summarization models. More details about this research can be found in our paper, to be presented at EMNLP 2022 in Abu Dhabi.