Unlocking the Potential of Transformers for Long-Form Text Matching: A Simple yet Powerful Approach

Long-form text matching is critical in natural language processing (NLP) and information retrieval (IR) applications. Traditional methods often struggle to capture global matching semantics and to manage the computational overhead of long inputs. In our research, we propose a simple yet effective solution: sequence pair classification with Transformer models, which we show outperforms state-of-the-art Siamese network-based methods.

Challenges in Long-Form Text Matching

Long-form text matching presents several challenges:

  1. Global Matching Semantics: In long-form texts, crucial matching signals are sparsely distributed across the document. Traditional methods often miss these signals, leading to suboptimal performance.
  2. Hierarchical Structure: Long documents consist of a hierarchical structure with sentences and words. Capturing this multi-level structure is essential for accurate matching.
  3. Handling Long Texts: Efficiently processing long texts is challenging due to the computational limitations of traditional neural networks and the input length constraints of Transformer models, like BERT, which typically handle up to 512 tokens.

Most recent solutions are based on Siamese networks. We argue that in such methods, the interaction between terms in the two texts happens too late, which can cause important matching signals to be lost.
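To make the contrast concrete, here is a toy Python sketch (our own illustration, not the paper's code; the bag-of-words "encoder" is purely a stand-in for a real Transformer tower) showing where the two texts first interact in each architecture:

```python
from collections import Counter
import math

def encode(tokens):
    """Toy 'encoder': a bag-of-words count vector, standing in for a Transformer."""
    return Counter(tokens)

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

tokens_a = "transformers match long documents".split()
tokens_b = "matching long documents with transformers".split()

# Siamese / bi-encoder: each text is encoded WITHOUT ever seeing the other;
# the only cross-text interaction is this final similarity between pooled vectors.
late_score = cosine(encode(tokens_a), encode(tokens_b))

# Sequence pair / cross-encoder: both texts share ONE input sequence, so
# self-attention can relate tokens across the two texts from the first layer on.
joint_input = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b
```

In the Siamese setting, any token-level matching evidence must survive pooling into a single vector per document before the texts meet; in the sequence pair setting, the joint input makes that evidence available to every attention layer.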

Proposed Method: Sequence Pair Classification

Our approach diverges from traditional Siamese networks by adopting a sequence pair classification framework that lets the two texts interact at the token level. This simple method captures global semantics between documents, relating relevant tokens across texts starting from the very first attention layer of the encoder.

We implement sequence pair classification in a straightforward way. We first tokenize the two documents in a pair into separate token sequences, then concatenate them with a [SEP] token in between and a [CLS] token at the beginning. To ensure both documents are represented fairly, especially when one document would otherwise dominate the token budget, we pre-calculate a token allocation so that each document retains an adequate number of tokens after truncation. The concatenated sequence is fed into a Transformer encoder (DistilBERT, RoBERTa, or Longformer), and the output at the [CLS] token is used for the final classification. Building on these features, users can also make further design choices in the output layer on top of the Transformer-based encoder to incorporate more matching signals. This architecture captures interactions between terms in the two texts at an early stage, leveraging the self-attention mechanism to learn rich matching signals across different granularities.
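The input construction above can be sketched in a few lines of Python. This is our own simplification, not the authors' released code: a list of string tokens and string special tokens stand in for a real subword tokenizer, and the allocation policy (let a short document keep all its tokens, split evenly otherwise) is an assumption, since the post only says the split is pre-calculated.

```python
def allocate_budget(len_a, len_b, max_len=512, num_specials=2):
    """Split the token budget so neither document monopolizes it.

    num_specials reserves room for [CLS] and [SEP]. If one document fits in
    half the budget, the other may take over the unused share; otherwise the
    budget is split evenly. (Assumed policy, for illustration.)
    """
    budget = max_len - num_specials
    half = budget // 2
    if len_a <= half:                        # A fits in its half; B gets the rest
        return len_a, min(len_b, budget - len_a)
    if len_b <= half:                        # symmetric case
        return min(len_a, budget - len_b), len_b
    return half, budget - half               # both long: even split

def build_pair_input(tokens_a, tokens_b, max_len=512):
    """Build [CLS] A [SEP] B, truncating each document to its allocated budget."""
    n_a, n_b = allocate_budget(len(tokens_a), len(tokens_b), max_len)
    return ["[CLS]"] + tokens_a[:n_a] + ["[SEP]"] + tokens_b[:n_b]
```

With a Hugging Face tokenizer, a similar effect can often be obtained from its built-in pair encoding (e.g. `truncation="longest_first"`), but pre-computing the split gives explicit control over each document's share.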

Figure 1. The Overview of the Sequence Pair Architecture

Experimental Setup and Results

We evaluated our method on several benchmark datasets (AAN-abstract, OC, S2ORC, PAN, and AAN-body) and compared it with state-of-the-art methods. The three variants of our method are named after the Transformer encoder used: SEQ-D (DistilBERT), SEQ-R (RoBERTa), and SEQ-L (Longformer). As shown in Figure 2, sequence pair classification significantly improves accuracy and F1 scores, especially for longer documents, demonstrating the efficacy of capturing richer matching signals through early interaction.
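For reference, the three variants could plausibly be instantiated from public checkpoints as follows. The checkpoint identifiers are common Hugging Face names, assumed for illustration rather than taken from the paper:

```python
# Assumed mapping of each variant to a public checkpoint and its input window.
ENCODER_VARIANTS = {
    "SEQ-D": {"checkpoint": "distilbert-base-uncased",      "max_tokens": 512},
    "SEQ-R": {"checkpoint": "roberta-base",                 "max_tokens": 512},
    "SEQ-L": {"checkpoint": "allenai/longformer-base-4096", "max_tokens": 4096},
}

def max_document_tokens(variant, num_specials=2):
    """Token budget left for document text once special tokens are reserved."""
    return ENCODER_VARIANTS[variant]["max_tokens"] - num_specials
```

Each variant would then be loaded with `AutoModelForSequenceClassification.from_pretrained(checkpoint)` and fine-tuned on labeled document pairs; Longformer's larger window is what lets SEQ-L see far more of each document per pair.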

Figure 2: The Main Experimental Results

Implications and Future Directions

Our study represents a notable advancement in long-form text matching, with practical implications for various NLP and IR applications, such as document matching, news deduplication, citation recommendation, plagiarism detection, and job targeting. The success of the sequence pair classification architecture highlights the potential for Transformer models to handle complex text-matching tasks more effectively than traditional Siamese networks.

Future research should focus on optimizing the computational efficiency of self-attention mechanisms in Transformers to reduce overhead. Exploring the trade-offs between sequence length and attention sparsity could lead to more efficient models. Additionally, integrating large language models (LLMs) like GPT-4 could further enhance long-form text matching by leveraging their extensive pre-trained knowledge, despite the challenge of input length limitations.


We explore a simple baseline method based on Transformer encoders by casting the problem as sequence pair classification. The experimental results show that this simple method achieves promising results and outperforms state-of-the-art methods in the field, illustrating the superiority of the sequence pair classification architecture over the Siamese network-based one widely adopted by existing solutions. Future studies could treat the baseline proposed in this work as a cornerstone and propose new techniques to alleviate the computational overhead of the self-attention mechanism in Transformers while preserving performance.

Written by Chen Shen and Megagon Labs

The research will be presented at DASFAA 2024. Follow us on LinkedIn or Twitter for more.

