Text Style Transfer methods, with LLMs
Abstract
Transferring the written (or spoken) style of a text to another text is a challenging task in natural language processing. In the modern days (2024) with LLMs (Large Language Models) like GPT-4, this task is becoming more feasible. Still, the practical method has been via prompt engineering, for example, by asking the model to answer with an specific style (e.g. "write formal"), or by providing examples of the desired style, or by asking to write in the style of a specific author (e.g. "write like Charles Dickens"). There is a more sophisticated method, to fine-tune the model with a style dataset, but this is more complex and requires more computational resources.
In this article we are going to explore the pros and cons of these methods, and how to use them in practice. As well as how to test the output quality of the methods.
The result of this article is that the fine-tune method is the best method, achieving a 90% accuracy in a text classification task against original author texts.
The methods
When performing text style transfer what we want is a blackbox that transforms a text from one style to another. Conceptually:
original_text = 'I am a happy person'
styled_text = model(original_text)
print(styled_text) # 'I am a joyful person'
Here pesudocode is used, but actual implementations can be found in the the evaluation colab. We will use as examples the style of Charles Dickens in the book "Great Expectations" as examples of text, to display some output examples.
The original Charles Dickens text we are going to use is:
Now, if I could have believed that she favoured Drummle with any idea of making me—me—wretched, I should have been in better heart about it; but in that habitual way of hers, she put me so entirely out of the question, that I could believe nothing of the kind.
And rewritten to standard engligh by gpt-4o turns out to be:
Now, if I could have believed that she favored Drummle with any intention of making me—me—miserable, I would have felt better about it; but in her usual manner, she disregarded me so completely that I couldn't believe anything of the sort.
As we can see, asking gpt-4o to rewrite it to standard english, does not change that much the original text. Also we can notice that the Charles Dickens style is quite subtle.
Prompt engineering, asking for a specific style
The simplest method is to ask the model to rewrite the text in a specific style. This can be done by adding a preamble to the prompt, asking the model to rewrite the text in a specific style.
original_text = 'I am a happy person'
styled_text = llm_model('Rewrite this text in a joyful style:\n' + original_text)
print(styled_text) # 'I am a joyful person'
This is simple, but it will feel generic. It is hard to capture the nuances of a style by describing the style in a few words. Using this strategy the example standard english text is rewritten as:
Had I been able to persuade myself that her partiality towards Drummle arose from a desire to inflict misery upon my unfortunate self—yes, upon me—I might have derived a grim satisfaction from the notion; but in her characteristically indifferent fashion, she dismissed my very existence with such totality that no such motive could be credited.
Prompt engineering, asking for an author style
Similar to the previous one is to ask the model to write in the style of a specific author.
original_text = 'I am a happy person'
styled_text = llm_model('Write this text in the style of Charles Dickens:\n' + original_text)
print(styled_text) # 'I am a joyful person'
This is more specific, but requires the model to have a good understanding of that author's style, therefore relying on the model implicit knowledge learnt during the training. It is possible that the model does not know the author, for example, if you want a model to write like you do in your emails, the model will not be able to do it, because it has not seen your emails.
It is possible to combine this method with the previous one, by asking a model to describe the style of an author and then asking the model to write in that style.
original_text = 'I am a happy person'
style = model('Describe the writing style of Charles Dickens in few words.')
styled_text = llm_model(f'Given this style: \n {style}\n Write this text in the given style:\n' + original_text)
print(styled_text) # 'I am a joyful person'
Using this strategy the example standard english text is rewritten as:
Now, if perchance I could have entertained the notion that she bestowed her favor upon Drummle with the deliberate intent of casting me—poor, wretched me—into the depths of despondency, I might have found some perverse comfort in the thought. Yet, in her customary and maddeningly indifferent manner, she disregarded my very existence with such thorough completeness that it was beyond my capacity to harbour even the faintest belief in such cruel artifice.
Prompt engineering, providing examples
This method is more powerful, as it allows to provide examples of the desired style. This may definetly provide more context to the model to understand the style. To do this, we can provide a list of examples of the desired style, and ask the model to rewrite the text in that style. This method is also often referred as "few-shot", where each "shot" is an example. Normally, the amount of examples is from 1 to 10, but it can be more.
examples = [{
'original': 'I am a sad person',
'styled': 'I am a melancholic person'
}]
original_text = 'I am a happy person'
prompt_examples = '\n'.join([f'Original: {ex["original"]}\nStyled: {ex["styled"]}' for ex in examples])
styled_text = llm_model(f'Given these examples:\n{prompt_examples}\nRewrite this text in the style of the examples:\n' + original_text)
print(styled_text) # 'I am a joyful person'
This method requires the manual or automated creation of the original examples. Conceptually, we are helping the model to "reverse-engineer" the style. We want to , and to do that we are giving the model some examples by performing the inverse operation, . This will come more systematically later with the fine tune.
Being more powerful than the previous methods, it is still limited by the amount of examples we can provide. What if the author uses more often "nonetheless" instead of "however"? This particular lexical frequency is hard to capture with a few examples. We can provide more examples, but it is not scalable. What we can do as backup is to add the name or style of the author along the examples:
# ...
prompt_1 = f'Given these examples:\n{prompt_examples}\n'
prompt_2 = 'Rewrite this text in the style of Charles Dickens as in the provided examples:\n' + original_text'
styled_text = llm_model(prompt_1 + prompt_2)
# ...
Using this strategy the example standard english text is rewritten as:
Had I been able to believe that she showed favor to Drummle with the aim of making me—specifically me—miserable, I might have been somewhat comforted; yet, in her customary manner, she ignored me so thoroughly that I found it impossible to entertain such thoughts.
Fine-tuning with a style dataset
Here we go to the next level. Instead of providing examples, we are going to fine-tune the model with a style dataset. This is more complex, as it requires more computational resources, but it is more powerful, as it allows the model to learn the style more systematically. The idea is to provide a dataset of texts in the desired style, and ask the model to learn the style from that dataset.
This method has two distinct phases: training and inference. What we have done so far with the other methods is inference, we are asking the model to perform a task. With this new phase, the training, we are going to build a new model from a base model, that is going to be able to perform the trained task better.
The training phase itslef has two steps: generating the dataset, and training the data. Classicly we would have also the validation and test steps, but for simplicity we are going to ignore them. Also, perform the evaluation if this task is more complex that with other tasks as we will see later.
As we did with the few-shot method, we need to provide examples of the desired style. But in this case, we are going to provide a lot of examples, and we are going to ask the model to learn the style from them. It is important the amount of examples, too few and the model will not learn the style, too many and the model will overfit to the examples. We are going to do 100 examples.
To systematically generate the 100 examples, we are going to use an LLM model, asking him to perform the .
styled_texts = [...]
original_texts = []
for styled in styled_texts:
original = llm_model('Write this text as standard modern English:\n' + styled)
original_texts.append(original)
There are other ways to generate, refine and curate datasets, and possibly this is the most important step in any training process. As community says "shit in, shit out", if we feed to the training low quality data then the model will genreate low quality outputs.
Using this strategy the example standard english text is rewritten as:
Now, if I could have believed that she favoured Drummle with any idea of making me—me—wretched, I could have borne it; but in the selfishly cruel manner that she had, she was indifferent to me as if I were a great dog in the street.
Full example of dataset prep, training and inference can be found in this colab. It uses gpt-4o as base training model, and also gpt-4o to generate the training dataset.
Nonetheless this method introduces other challenges, like overfitting the content or the style, or being unable to extrapolate to other texts the style. Steering a fine tunning is complex and is going to be the theme on future articles.
Evaluation: Testing the output quality
There are two big questions: quality and cost. We are going to explore quality here.
How do we define that a text has captured the style of text of an author (or of a set of texts)? One way would be to ask an expert of that author, "Hey, is this text written by that author, or not?". But often we do not have access to those experts, so we are going to train an expert, that is, a text classifier.
Training a text classifier
We want a text classifier such that given a text from Charles Dickens (our author) it outputs , otherwise it outputs . To do that we are going to train a text classifier with a dataset of texts from Charles Dickens labelled as , and a dataset of texts derived from Charles' texts labelled as . The idea is that the texts derived from Charles' texts are going to be similar to Charles' texts in content, but not in style. We don't want the classifier to learn to classify the content, but the style, this is why, theoretically, it is important that the content is similar.
To generte that training set we can ask any LLM to genreate text in other random styles for each text of Charles Dickens. In this colab we demonstrate how to do that, with OpenAI gpt-4o, and a bunch of defined styles. The colab takes 100 paragraphs from Charles Dickens "Great Expectations" book, and generates 1000 paragraphs in 10 different styles. Therefore the dataset has a 1 to 10 ratio of Charles Dickens to other styles. We split this dataset in train and test datasets, stratisfied by style, to ensure that the test dataset has the same distribution of styles as the train dataset. The dataset can be found in huggingface.
Now let's train a text classifier. We can follow the huggingface guide on how to train a text classifier, which uses distilbert/distilbert-base-uncased as base model. The training process can be found in this colab, and the model can be found in huggingface.
Evaluating the methods quality
With that classifier we can evaluate the quality of the generated texts. To do so, we are going to take Charles Dicken's texts from the test dataset used to train the classifier, then rewrite them in plain standard English, and use each method to rewrite it in the style of Charles Dickens. We are going to use the classifier to evaluate the quality of each method. We have to use texts from the test dataset, to ensure that the classifier has not seen the text before, and therefore it is not biased. Also, the test dataset is generated from the same book as the training dataset, so the classifier should be able to classify the style of the text, and not the content. The full evaluation of the 4 methods with the text-classifier can be found in this colab. The results are:
method | accuracy |
---|---|
original | 0.95 |
write_as_fine_tune | 0.90 |
write_as_charles_dickens | 0.10 |
write_with_shots | 0.10 |
std_english | 0.00 |
write_with_style_description | 0.00 |
Where original
is the best reference of actual Charles Dickens text, and std_english
is the
worst reference of Charles Dickens texts rewritten in plain standard English. As we see, the best
method is the write_as_fine_tune
. The other methods provide some quality, but they are not
as good as the fine-tune method.
Conclusion
In this article we have explored the methods to perform text style transfer with LLMs, and how to evaluate the quality of the output. We have seen that the fine-tune method is the best method, but it requires more computational resources and engineering efforts. The other methods are simpler, but they are not as good as the fine-tune method.
Next steps
There are two clear vectors of improvement:
- The training dataset for the text classification model can be improved, by adding more examples, or by using a more sophisticated method to generate the examples.
- The fine tunned model has been trained with random examples from the same book where the text-classification dataset has been generated. This is a limitation, as the model may have memorized some of the test dataset. A better approach would be to split the book paragraphs in 3 parts: classifier train, fine tunning train, and validation
- The fine tunned model has many hyperparameters that have to be tunned. This is a complex task, and it is going to be the theme of future articles.