Paraphrase Generation for Long Text

We recently worked on a project aimed at producing new articles from a few input articles on the same subject. Each input article ranged from 5 to 15 paragraphs and covered roughly similar content. We framed this problem as a multi-document summarization task and attempted to use modern generative models, such as T5, to produce coherent, grounded articles on the same subject that would not copy from the sources to the point of plagiarism. To better understand this problem, I conducted a survey of contemporary paraphrasing approaches in NLP.

We specifically targeted document-level paraphrasing; the goal was to find an approach that addressed long text. However, nearly all of the papers reviewed addressed sentence-level paraphrasing, a much easier paraphrasing task. Nevertheless, studying these approaches helped us understand the landscape better. The sparsity of document-level paraphrasing approaches presents an opportunity for further research.

In this blog post, we will define the problem of paraphrasing. We will explain the challenges of document-level paraphrasing, including evaluation, with particular attention to the business domain. Following this, we will briefly describe the results of the survey study and identify key ideas.


Definition of Paraphrasing

There are three main dimensions to the quality of a paraphrase: lexical, syntactic, and semantic. The lexical dimension is the word choice or vocabulary of the paraphrase. A common way to vary word choice is to replace words with their synonyms. Syntax is the overall structure of the sentence. For example: Is the sentence written in the active or passive voice? How are the clauses organized? Semantics refers to the meaning of the sentence. A paraphrase should not change the sentence’s meaning, but it should change the syntax and word choice as much as possible.

The general definition of paraphrasing is to rephrase an input such that the surface level (syntax and lexical) is different, while the meaning (semantics) is the same. For example, “The person ate the cheese” and “The cheese was eaten by someone” are paraphrases of one another. In this example, the high-level sentence structure is changed, in addition to specific nouns being replaced by their synonyms. However, the specific event that is being described remains largely unchanged. This captures the essence of a paraphrase.

Moving to Document-level Paraphrasing

Performing paraphrasing at the paragraph or document level is considerably more difficult than sentence-level paraphrasing. The most immediate problem is the length of the input. Contemporary approaches to sentence-level paraphrasing use pretrained language models to generate sentences. However, these models are often unprepared for the much longer inputs of documents. Therefore, you cannot simply transfer sentence-level approaches to the document level; significant architectural changes are needed.

Tackling a longer source input isn’t the only challenge with document-level paraphrasing. You must also consider the structure of the paragraph and the entire document, which is notably more complex than a sentence. The document must follow some coherent structure in which each paragraph follows a reasonable order. Following this, each paragraph should stay consistent within itself. A paragraph that strays far from its initial subject is a poor-quality one. 

For example, it might not be possible to move the first sentence in the paragraph to the end of the paragraph, as the first sentence may introduce an important concept. How can you rephrase a sentence, given the paragraph or document that it is in? Lin et al. (2021) explore this problem in a recent paper where they define two actions to perform a document-level paraphrase: sentence rewriting and reordering.

Evaluating Document-level Paraphrasing

Because there is more room to make errors in document-level paraphrasing, it is important to measure performance across these novel dimensions. 

Similar to sentence-level paraphrasing, we have measures of similarity and grammaticality. Are the newly generated sentences grammatical? Do they make sense? Are they copying significant sections of the original sentence? Measuring these at the paragraph or document-level is more difficult, as we do not know how the sentences are aligned. 

With document-level paraphrasing, there are new factors to consider beyond similarity and grammaticality. Is the new paragraph coherent? Do we conclude a section before we even begin it? Do we use pronouns before we introduce the antecedent? We need a high-level understanding of the flow of information in the paragraph and document in order to make this assessment. Most models are unable to capture this amount of context, so it becomes necessary to construct a dedicated approach to measuring coherence. Designing such a metric is a worthy endeavor in itself.

Business Needs for Document-level Generation

There is an enormous amount of content on the web in the form of online articles. These articles are produced by companies to increase their reach on search engines (known as Search Engine Optimization, SEO).

These businesses have a human-intensive pipeline approach to publishing articles. It starts with a rough draft that evolves into a polished product that can be shared publicly. The process begins with topic selection, in which a background study is done to gather content from other sources. These sources are distilled through a summarization process. A final editor reviews the content and checks for quality and copyediting issues. This pipeline has traditionally been carried out by humans, but recent advances suggest initial drafts can be generated by an automated system.

While it is not ideal to take the raw outputs of a generative model, such as GPT-3, and publish them directly, they can serve as a rough draft to be edited by humans. In this way, a model can help reduce the amount of tedious work an initial editor would have to do.


The quality demands of a business-level paraphrasing application can be different from those of research. Putting content on a company website has certain quality requirements that likely exceed current research results. Avoiding offensive language, for instance, is a higher priority than proper grammar. For this reason, it is necessary to construct business-specific metrics for evaluating paraphrasing quality.

An example of this would be a metric that measures the use of gendered terms in language. The term “actress” is an example of a gendered term, so a text that uses it would receive a lower gender metric score.
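
A minimal sketch of such a metric follows. The term list and the scoring formula (the share of words that are not flagged) are illustrative assumptions, not the metric used in the project; a real lexicon would need to be curated.

```python
import re

# Hypothetical, deliberately tiny lexicon used only for illustration.
GENDERED_TERMS = {"actress", "waitress", "stewardess", "chairman", "policeman"}


def gender_neutrality_score(text: str) -> float:
    """Return the fraction of words NOT in the gendered-term list (1.0 is best)."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 1.0  # nothing to flag in an empty text
    flagged = sum(1 for word in words if word in GENDERED_TERMS)
    return 1.0 - flagged / len(words)
```

With this formula, a sentence containing “actress” scores below 1.0, while the same sentence with “actor” scores exactly 1.0.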

Another example would be a plagiarism metric. In writing articles, a company often aims to rank highly in a search engine’s results. To this end, avoiding plagiarism is paramount. While a variety of text comparison metrics such as ROUGE and BLEU can be used, measuring plagiarism is difficult, with so many potential sources to consider.
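
As a rough illustration of the overlap idea behind such metrics, the sketch below computes the share of a candidate’s word n-grams that appear verbatim in any known source. It is a simplified stand-in, not ROUGE or BLEU themselves, and the function names, whitespace tokenization, and default n-gram size are assumptions made for the example.

```python
def ngrams(text: str, n: int) -> set:
    """Return the set of lowercase word n-grams in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def copied_ngram_ratio(candidate: str, sources: list, n: int = 3) -> float:
    """Fraction of the candidate's distinct n-grams found verbatim in any source."""
    candidate_ngrams = ngrams(candidate, n)
    if not candidate_ngrams:
        return 0.0  # candidate is shorter than n words
    source_ngrams = set()
    for source in sources:
        source_ngrams |= ngrams(source, n)
    return len(candidate_ngrams & source_ngrams) / len(candidate_ngrams)
```

A ratio near 1.0 suggests heavy copying from the listed sources. The check only covers sources that are actually supplied, which is exactly the limitation described above.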


Overview of Sentence-level Paraphrasing Research

The majority of papers reviewed focused on sentence-level generation approaches. Across these papers, there are a few common approaches that establish a theme in the literature. Training models without large, labeled corpora is a hot trend in NLP, and paraphrasing is no exception. Many papers aim to train unsupervised models by generating synthetic training data that can be used to train a generative model. Another trend present in paraphrasing research is controllability: what kind of output can we expect from the model? Finally, many papers follow a similar scheme in how they evaluate the performance of their paraphrasing approaches. Furthermore, some even use these scoring methods as signals for the models themselves.

Synthetic Training Data

In the unsupervised setting, it is necessary to construct a dataset that can be used to train or fine-tune a model. These datasets can take different shapes depending on the way they are used to train a model. One particular approach has been to train an initial model that is then used to generate training examples for a subsequent model (Niu et al. 2021).

Another class of approaches retrieves training data from weakly labeled examples and uses a trained model to select high-quality examples (Ding et al. 2021). This form of weak supervision is desirable because there are many forms of text that contain implicit signals that can be used for paraphrasing. For example, if two different sentences are surrounded by the same sentences, we can infer that they likely mean the same thing. Meng et al. (2021) use this idea to construct a synthetic training dataset. While the probability of two different sentences being surrounded by the exact same sentences is low, they can use pre-trained language models to measure the probability of generating two sentences given the same context (surrounding sentences).
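
To make the shared-context intuition concrete, here is a minimal sketch of the exact-match version of the idea. It is not the method of Meng et al., who score candidates with a language model rather than requiring identical neighbors; the function name and the input format (each document as a list of sentences) are assumptions for illustration.

```python
from collections import defaultdict


def mine_context_pairs(documents):
    """Pair up sentences that share the same (previous, next) neighboring sentences."""
    by_context = defaultdict(set)
    for sentences in documents:
        # Skip the first and last sentence: they lack a full two-sided context.
        for i in range(1, len(sentences) - 1):
            context = (sentences[i - 1], sentences[i + 1])
            by_context[context].add(sentences[i])
    pairs = []
    for candidates in by_context.values():
        ordered = sorted(candidates)
        for index, first in enumerate(ordered):
            for second in ordered[index + 1:]:
                pairs.append((first, second))
    return pairs
```

Each returned pair is a candidate training example; in practice a quality filter would still be needed before using such pairs to train a model.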

The benefits of an unsupervised approach are numerous. These paraphrasing approaches can tackle low-resource languages. They can also be applied to specific domains, which might not have a lot of training data previously annotated.

Figure 1: Synthetic training data pipeline


Controllability

The controllability of a model is becoming increasingly important. Black box approaches can offer extremely impressive outputs, but without proper controls, they are unusable in a production setting. Consumer-facing products are subject to a high level of scrutiny, where the margin for error is much smaller than what might exist in research. Generating offensive content or perpetuating inequalities could mean legal issues (in addition to moral ones). For this reason, many paraphrasing papers focus on training a model that conforms to a specified target (Sun et al. 2021). This target can be a specific syntactic structure or a set of quality controls.

For example, the syntax of the output paraphrase is often used as additional input to the model. The model is trained to follow the target syntax, as well as the semantic meaning of the input text. This allows specific structures to be generated. The downside is that additional data is required; you need to provide the target syntax. Knowing the best target syntax is not trivial.

Scoring Functions

In any ML task, measuring performance is critical. In paraphrasing, it is common to use a scoring function to control outputs and filter training data. These scoring functions are often multi-dimensional and try to capture some of the fundamental characteristics of a paraphrase. These include syntactic diversity, lexical diversity, semantic similarity, and grammaticality.

Measuring the “goodness” of a paraphrase is very helpful, but not necessarily straightforward.

There are many dimensions of a good paraphrase that can be prioritized differently depending on the domain we are in. For example, syntactic diversity might be incredibly important while lexical variety matters less. “He ate cheese” and “Cheese was eaten” are syntactically diverse but lexically very similar. “He ate cheese” and “They consumed curdled milk” are syntactically identical but lexically unique.

Scoring functions also offer a means of controllability. A human can intervene in how the different dimensions of a scoring function are prioritized. Because these affect how the model is trained or what output is chosen, they can influence the types of content being produced by the model.
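
The sketch below shows one way such a prioritization could look: a simple lexical diversity measure plus a weighted average over per-dimension scores. The function names, the Jaccard-based measure, and the weighting scheme are assumptions for illustration; the other dimensions (semantic similarity, grammaticality, syntactic diversity) are expected to come from separate models and are passed in as plain numbers.

```python
def lexical_diversity(source: str, paraphrase: str) -> float:
    """1 minus the Jaccard overlap of the two word sets (higher means more new words).

    Words are compared as surface forms, so without lemmatization
    inflected variants such as "ate" and "eaten" count as different words.
    """
    source_words = set(source.lower().split())
    paraphrase_words = set(paraphrase.lower().split())
    union = source_words | paraphrase_words
    if not union:
        return 0.0
    return 1.0 - len(source_words & paraphrase_words) / len(union)


def combined_score(dimension_scores: dict, weights: dict) -> float:
    """Weighted average of per-dimension scores, each assumed to lie in [0, 1]."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("at least one weight must be non-zero")
    return sum(weights[name] * dimension_scores[name] for name in weights) / total_weight
```

Raising the weight of one dimension, for example semantic similarity, is how a human would steer which candidate paraphrases are kept or which training pairs are filtered out.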

Figure 2: Paraphrasing scoring dimensions

Technical Details


Using pivots to generate paraphrases is a recurring idea in the literature. A pivot can be thought of as an intermediate representation shared by the two sentences of a paraphrase pair. The most traditional pivot is another language. To generate a paraphrase of an input sentence, we chain translation models: for example, we can use T5 to translate an English sentence into German and then translate that German sentence back into English. (The released T5 checkpoints were trained on translation out of English, so the return trip may need a fine-tuned or different model.) The output sentence can be considered a paraphrase of the original.
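
The round trip through a pivot language can be sketched as a function that takes two translators as arguments. The names below are hypothetical, and the toy dictionary translators stand in for real model calls so that the example runs on its own.

```python
from typing import Callable

Translator = Callable[[str], str]


def pivot_paraphrase(sentence: str, to_pivot: Translator, from_pivot: Translator) -> str:
    """Translate into the pivot language and back; the result is a paraphrase candidate."""
    pivot_text = to_pivot(sentence)
    return from_pivot(pivot_text)


# Toy stand-ins; real translators would wrap a translation model.
def toy_english_to_german(text: str) -> str:
    return {"The child drinks milk": "Das Kind trinkt Milch"}.get(text, text)


def toy_german_to_english(text: str) -> str:
    return {"Das Kind trinkt Milch": "The kid is drinking milk"}.get(text, text)


candidate = pivot_paraphrase(
    "The child drinks milk", toy_english_to_german, toy_german_to_english
)
```

Keeping the translators as parameters makes it easy to swap the pivot language or to use different models for each direction.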

There are other cases of pivots besides language. The semantics of a sentence can be considered a pivot. Abstract Meaning Representations (AMR) can be used as a semantic representation of a sentence, and are used in one approach to paraphrase sentences (Cai et al. 2021). Another example of a pivot is context. Context is defined as the sentences immediately surrounding a sentence. If multiple sentences can fit in the same context, they can be considered paraphrases, and thus the context is a pivot.

Encoder-decoder Network

Encoder-decoder models are often used for the task of conditional generation, and paraphrasing is no different. In many cases, a Conditional Variational Autoencoder is used to better map the input sequence into a usable latent space. While this approach is powerful, it is difficult to transfer directly from the sentence level to the document level.

Translating These Approaches to Document-level Paraphrasing

Many pretrained models are trained on inputs of about 500 tokens, which is too short for document-level paraphrasing. To use these models on this more challenging task, it is necessary to construct a pipeline that breaks the problem into smaller pieces, for example by generating one paragraph at a time and limiting the size of each paragraph. Alternatively, long-input models such as Longformer can be used; its encoder-decoder variant (LED) accepts inputs as long as 16K tokens. However, with such a long input sequence, it is difficult to control the scope of what the model will generate.
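
A minimal sketch of the pipeline idea: pack consecutive paragraphs into chunks that fit a token budget, then paraphrase each chunk independently. The function names are hypothetical, paragraphs are assumed to be separated by blank lines, and whitespace-separated words are used as a rough proxy for model tokens (a real pipeline would count tokens with the model’s own tokenizer).

```python
from typing import Callable, List


def chunk_paragraphs(document: str, max_tokens: int = 500) -> List[str]:
    """Greedily pack consecutive paragraphs into chunks of at most max_tokens words.

    A single paragraph longer than max_tokens is kept as its own oversized chunk.
    """
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for paragraph in paragraphs:
        length = len(paragraph.split())
        if current and current_len + length > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(paragraph)
        current_len += length
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def paraphrase_document(
    document: str, paraphrase_chunk: Callable[[str], str], max_tokens: int = 500
) -> str:
    """Apply a chunk-level paraphraser to each chunk and rejoin the results in order."""
    return "\n\n".join(
        paraphrase_chunk(chunk) for chunk in chunk_paragraphs(document, max_tokens)
    )
```

Because each chunk is rewritten in isolation, this design keeps paragraph order fixed and cannot by itself guarantee coherence across chunk boundaries.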


Sentence-level paraphrasing is a well-studied task. There are many papers focusing on improving this task along specific dimensions: controllability, unsupervised training, overall quality, etc. However, paraphrasing at the sentence level is not adequate for business-level applications where entire articles need to be generated. For this reason, we must look deeper for more sophisticated solutions that solve this more difficult problem. The sparsity of the research in this area reveals a need to pursue new approaches. Transferring advancements in sentence-level approaches to document-level research could be a good start.

With a new problem comes new metrics. While sentence-level paraphrasing is somewhat straightforward to evaluate, document-level paraphrasing is not. The increased complexity makes determining the quality of the paraphrase very difficult. The solution space for generated documents is huge, so comparing with a reference is not sufficient. Measuring the goodness of a novel piece of text is a sophisticated process that needs to be considered when solving this problem.

Written by:  Austin King, Eser Kandogan, and Megagon Labs.
