
Plug & Play: Text style transfer
A collaboration between KPI6 and the University of Rome 3 to create a new kind of artificial intelligence capable of generating entire stories from a short sentence.
Not long ago, artificial intelligence models became able to generate whole stories from a short prompt. While the generated text shows great fluency, producing it requires a lot of resources just to train the model, which is financially costly. Research is now looking for new ways to generate text that do not involve building huge and complex training corpora: instead of training a new model from scratch, we want to reuse existing models to solve new tasks.
At KPI6 we are always eager to meet new talent, collaborate with important institutions, and take on innovative challenges.
To develop new methods and models, Andrea Salvoni, Chief Research Officer @KPI6, collaborated with Alfredo Rubin, a graduating student at the University of Rome 3, to build a new controllable generative model that solves Text Style Transfer by reusing an existing model, thus skipping the training step. Given an input sentence, and having defined the distinction between style and content, the goal of the task is to generate a new sentence that preserves the content but is expressed in a different style.
For example, given the sentence “This chicken is terrible, its meat is tasteless”, the style is captured by the fact that the meal is not good, which carries a negative sentiment, while the content is the act of eating the chicken. A possible style transfer output could be: “This chicken is awesome, its skin is so crispy”.
The model can be called Plug and Play because defining a new style does not require retraining the whole model, only a small portion of it, which is fast. We call this portion the “auxiliary classifier”, and we mainly use it to compute a gradient that edits the generation of the text at runtime. This process is called gradient steering.
We took inspiration from recent work by Uber AI, the Plug and Play Language Model. They showed that with gradient steering it is possible to edit the sentence at runtime so that the generated text exhibits certain properties specified at the start of the generation.
At every generation step, they perturb the internal representation of GPT-2 a fixed number of times so that the new representation moves closer to those properties. The perturbation uses the gradient produced by the auxiliary classifier, which steers the meaning of the sentence towards the classifier’s domain and the selected label. For example, if we use a classifier trained to recognize whether a sentence talks about the military, and we compute the gradient that maximizes the probability of the sentence being about the military, the new sentence will have a military context.
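As a rough illustration, here is a minimal sketch of what one steering step could look like in PyTorch, assuming a Hugging Face `transformers` GPT-2 model and a linear auxiliary classifier. For simplicity it perturbs the last hidden state, whereas the original PPLM implementation perturbs the past key-value history; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def steer_step(model, classifier, input_ids, target_label,
               num_iterations=3, step_size=0.02):
    """One gradient-steering step: nudge GPT-2's representation so the
    auxiliary classifier assigns higher probability to target_label."""
    outputs = model(input_ids, output_hidden_states=True)
    hidden = outputs.hidden_states[-1].detach()   # (1, seq_len, hidden_dim)
    delta = torch.zeros_like(hidden, requires_grad=True)

    for _ in range(num_iterations):
        # Score the perturbed representation with the auxiliary classifier.
        pooled = (hidden + delta).mean(dim=1)     # simple mean pooling
        loss = F.cross_entropy(classifier(pooled),
                               torch.tensor([target_label]))
        loss.backward()
        # Step against the loss gradient, i.e. towards the target label.
        with torch.no_grad():
            delta -= step_size * delta.grad / (delta.grad.norm() + 1e-10)
            delta.grad.zero_()

    return (hidden + delta).detach()              # perturbed representation
```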
The main difference between our study and Uber AI’s is that we start from a sentence that must be rewritten in a different style, while they start from a prompt that must be completed in the requested style.
We used the OpenAI GPT-2 model with 345M parameters. We chose a language model as a starting point because it makes generating coherent sentences easier: language models learn to predict the most probable next word given the preceding text. We chose sentiment as the style for our experiments and used an auxiliary classifier already pretrained by Uber, a linear layer with about 6k parameters.
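A sketch of this setup, assuming the Hugging Face `transformers` package, where the 345M checkpoint is published as `gpt2-medium`; the classifier head shown is illustrative and its exact parameter count depends on the number of sentiment labels.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# "gpt2-medium" is the 345M-parameter GPT-2 checkpoint.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")
model.eval()

# Auxiliary sentiment classifier: a single linear layer over GPT-2's
# hidden state (n_embd = 1024 for gpt2-medium). With 5 sentiment labels
# this is 1024 * 5 + 5 = 5,125 weights, on the order of the ~6k
# parameters mentioned above.
classifier = torch.nn.Linear(model.config.n_embd, 5)
```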
At generation time, we selected the nouns from the input sentence and encoded them into a Bag of Words $W$, taking care not to include words that may evoke the style. At every generation step we maximize $\log \sum_{w \in W} p_{t+1}(w)$, the log of the total probability mass the model assigns to the bag-of-words tokens. In this way, the model can retrieve those nouns when they are needed to maintain the content without producing sentences with poor grammar.
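In code, this objective could look like the following sketch, where `logits` are GPT-2’s next-token logits and `bow_ids` (an assumed name) holds the vocabulary ids of the selected nouns.

```python
import torch

def bow_loss(logits, bow_ids):
    """Negative log of the probability mass the model assigns to the
    bag-of-words tokens: -log sum_{w in W} p_{t+1}(w)."""
    probs = torch.softmax(logits[0, -1, :], dim=-1)  # p_{t+1} over the vocabulary
    return -torch.log(probs[bow_ids].sum())          # minimizing this maximizes the objective
```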
We used the Kullback-Leibler divergence in our loss to keep the perturbation in check, ensuring the new perturbed representation does not stray too far from what GPT-2 recognizes as its own internal state. Last but not least, we introduced a new loss we call the future loss: having fixed a predicted token at the i-th timestep, we compute n future steps in advance to measure how much that token will influence the rest of the generated sentence. To apply this loss, we need to set n appropriately, because high values of n introduce noise, while small values make the estimate too approximate. The final loss is a weighted average of the BoW loss and the future loss.
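The exact formulation of the future loss belongs to the original work; the sketch below only mirrors the description above, with illustrative names, a greedy n-step rollout scored by the auxiliary classifier, and the `bow_loss` helper from the previous snippet.

```python
import torch
import torch.nn.functional as F

def kl_term(perturbed_logits, unperturbed_logits):
    """KL divergence keeping the perturbed next-token distribution close
    to the one unmodified GPT-2 would produce."""
    log_q = F.log_softmax(unperturbed_logits[0, -1], dim=-1)
    p = F.softmax(perturbed_logits[0, -1], dim=-1)
    return F.kl_div(log_q, p, reduction="sum")       # KL(perturbed || unperturbed)

def future_loss(model, classifier, input_ids, chosen_token, target_label, n=3):
    """Fix the token chosen at the current step, greedily roll out n more
    steps, and score the continuation with the auxiliary classifier."""
    ids = torch.cat([input_ids, chosen_token.view(1, 1)], dim=1)
    with torch.no_grad():                            # the rollout itself is not differentiated
        for _ in range(n):
            next_id = model(ids).logits[0, -1].argmax().view(1, 1)
            ids = torch.cat([ids, next_id], dim=1)
    out = model(ids, output_hidden_states=True)
    pooled = out.hidden_states[-1].mean(dim=1)
    return F.cross_entropy(classifier(pooled), torch.tensor([target_label]))

def total_loss(l_bow, l_future, l_kl, alpha=0.5, kl_scale=0.1):
    # Weighted average of the BoW and future losses, plus the KL penalty;
    # alpha and kl_scale are illustrative hyperparameters.
    return alpha * l_bow + (1 - alpha) * l_future + kl_scale * l_kl
```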
From our results, we observed that we can generally solve the task as long as no style-evoking word is included in the prompt passed to the network. A few additional conditions must be met, as specified in the paper. In conclusion, we explored this task in depth by experimenting with different models. This work could be extended by switching to GPT-3 and by refining the loss functions we presented.
How can this innovation be applied in business and marketing? The study lends itself to several uses, such as giving different personalities to a chatbot, content rewriting, and more. Switching applications only requires changing the auxiliary classifier, without retraining the whole model, which makes it easy to adapt to new problems.