Generating new sentences from an input sentence using Variational Autoencoders

Vivek A
10 min read · Jul 5, 2021

Let us see how to generate new sentences that are grammatically correct and related to an input sentence. This has wide applications in many fields, including language translation, sentence completion, and machine-assisted story writing. We will do sentence generation using variational autoencoders (VAEs).

Idea

We will initially train the model with training data. Once the model is ready, we will provide it with an input sentence, and the model will output a new sentence produced from the input using the knowledge it gained during the training phase.

In the training phase, input sentences are converted to their corresponding word embeddings and then passed to the encoder, which produces an encoded hidden-state output. This output is used to find the mean and variance of the distribution of the latent variable in the latent space. We then sample a point from this distribution and feed it to the decoder. The decoder output is converted from its vector representation back into a sentence, which gives us the newly generated sentence.

What are other similar approaches and what advantages does the VAE approach have?

A naive approach is to try generating sentences with a feedforward neural network. This fails to produce good sentences because data flows in only one direction through a feedforward network and it keeps no memory across time steps, so discovering language rules and applying them to generate new sentences is very difficult.

A better approach is to use recurrent neural networks to build a sequence-to-sequence model that can generate sentences. Here, the encoder takes the training data and produces an output that is fed into the decoder, which produces the final output. The issue is that even though an RNN carries information forward through its hidden state, it struggles to remember patterns it saw long before. It can remember what it processed recently, but it fails to retain patterns over long stretches and apply them when generating sentences.

Long Short-Term Memory (LSTM) models are the solution. Why? Because LSTMs have the ability to selectively remember whatever is required and discard useless information. An LSTM contains multiple gates that each perform a specific task: removing less important information from the state that is carried forward, selecting useful information from the current input to keep in memory, and mixing this old and new information to form the useful information that is sent forward.
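For reference, these are the standard LSTM cell equations from Hochreiter and Schmidhuber (1997), where σ is the sigmoid function, ⊙ is element-wise multiplication, and f_t, i_t and o_t are the forget, input and output gates acting on the cell state c_t and hidden state h_t:

```
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) \\
\tilde{c}_t &= \tanh(W_c [h_{t-1}, x_t] + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```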

Now, a more advanced approach is to use an autoencoder with an LSTM encoder and decoder to generate new sentences. Here, training data is fed into the encoder, which creates a latent representation, and the decoder then tries to reconstruct the sentence from this latent representation. The issue with this approach is that the latent representation is treated as a fixed vector rather than as a probability distribution. The decoder therefore has no option of sampling a random point from a distribution to generate output, so we get less variety in the sentences produced.

Our approach in this blog uses a variational autoencoder, which produces a Gaussian distribution over the latent variable in the latent space. This lets the decoder sample a point from that distribution, giving more variety in the final sentences that are generated.

Modifications made on basic VAE model

Dropout layers are added at the encoder and the decoder. Dropout improved the model in the following way. Before introducing dropout, every time the same sentence was fed to the model in a new epoch, it processed that sentence in the same manner and produced the same distribution in the latent space. This led to the model overfitting on the data and producing outputs with less variety. Adding dropout to the encoder and decoder gave a double benefit. First, every time the same sentence was input, a slightly different distribution was formed compared to the earlier case. Second, even if the model produced the same distribution and the decoder sampled the same point again, the dropout layer in the decoder still helped produce sentences with more variety.

The KL divergence was made to decrease at a lower rate by setting its coefficient in the loss function to a very small positive value. Because of this, the approximate distribution did not exactly match the target distribution; instead, it produced distributions that stayed close to it, which helped us achieve more variety. While we are talking about variety, keep in mind that we still drove both the reconstruction loss and the KL divergence to low values during training, so the model did learn the language rules it needs to come up with new sentences.

(The terms used in the above paragraphs will be clear towards the end of this blog. This was meant to give an introduction to what the model is and what modifications are to be expected on the basic model in this implementation.)

Components of our model

Let us first see different components of our model. Then we will see the approach used and the process of building the model.

A variational autoencoder

1. Embedding layer

This layer takes the input words and converts them into their corresponding word embeddings. We have chosen 300-dimensional word embeddings, so each input word is converted into a 300-dimensional vector.

2. Encoder

The encoder takes the embedded input and converts it into an encoded representation. We have chosen a bidirectional LSTM encoder with 2 hidden layers, each of size 256.

3. Latent space

The latent space is where the Gaussian distribution of the latent variable is formed. From the output of the encoder, we find the mean and variance of the Gaussian distribution of the corresponding latent variable.

4. Decoder

The decoder samples a point from this distribution and produces an output. We are using a unidirectional LSTM decoder with 4 hidden layers, each of size 256.

5. Dropout layer

The dropout layer is used to bring some uncertainty into the network. We have set the dropout probability to 0.3, which means every neuron in the dropout layer is switched off with a probability of 0.3 during training. This restricts our network from overfitting by learning the exact patterns of the training data.
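To make these components concrete, here is a minimal PyTorch sketch of the architecture. The layer sizes come from the description above (300-dimensional embeddings, a bidirectional 2-layer encoder and a unidirectional 4-layer decoder with hidden size 256, dropout 0.3), while the vocabulary size and latent dimension are placeholders and the exact wiring in the original implementation may differ:

```python
import torch
import torch.nn as nn

class SentenceVAE(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=256, latent_dim=64):
        super().__init__()
        # 1. Embedding layer: maps word indices to 300-dimensional vectors
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # 5. Dropout with probability 0.3, applied to encoder and decoder inputs
        self.dropout = nn.Dropout(p=0.3)
        # 2. Encoder: bidirectional LSTM, 2 layers, hidden size 256
        self.encoder = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                               bidirectional=True, batch_first=True)
        # 3. Latent space: linear layers map the encoder state to mean and log-variance
        self.hidden_to_mu = nn.Linear(2 * hidden_dim, latent_dim)
        self.hidden_to_logvar = nn.Linear(2 * hidden_dim, latent_dim)
        # 4. Decoder: unidirectional LSTM, 4 layers, hidden size 256, conditioned on the
        #    sampled latent vector concatenated to each input word embedding
        self.decoder = nn.LSTM(embed_dim + latent_dim, hidden_dim, num_layers=4,
                               batch_first=True)
        # Output projection from the decoder's hidden states back to vocabulary logits
        self.output = nn.Linear(hidden_dim, vocab_size)
```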

Process

Now, let’s see the process of constructing the variational autoencoder network.

Data preprocessing

The first step is to prepare the data so that it can be used for training the model. This includes converting the data into word vectors and loading it into data loaders so that it can be fed to the encoder.

We can read the data sentence by sentence, tokenize the sentences with the NLTK package, split them into training and validation sets, and load them into a data loader using the Field and BucketIterator classes from the torchtext package. We will also use GloVe (a precompiled collection of words and their vector representations) to build a better vocabulary. This also helps us find words that are similar to the words in our sentence, since similar words have vector representations that are close to each other.
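Here is a minimal sketch of that preprocessing, assuming the legacy torchtext Field/BucketIterator API (moved to torchtext.legacy in later releases and removed in the newest ones) and a hypothetical plain-text file sentences.txt with one sentence per line:

```python
import nltk
from torchtext.legacy import data  # in older torchtext versions: from torchtext import data

nltk.download("punkt")  # tokenizer models used by nltk.word_tokenize

# Field defines how raw text is tokenized and numericalized
TEXT = data.Field(tokenize=nltk.word_tokenize, lower=True,
                  init_token="<sos>", eos_token="<eos>", batch_first=True)

# Treat each line as a single "text" column (tsv format, assuming no tabs in the sentences)
dataset = data.TabularDataset(path="sentences.txt", format="tsv",
                              fields=[("text", TEXT)])
train_data, valid_data = dataset.split(split_ratio=0.9)

# Build the vocabulary and attach 300-dimensional GloVe vectors to it
TEXT.build_vocab(train_data, vectors="glove.6B.300d")

# BucketIterator groups sentences of similar length into batches of 128
train_iter, valid_iter = data.BucketIterator.splits(
    (train_data, valid_data), batch_size=128,
    sort_key=lambda ex: len(ex.text), sort_within_batch=True)
```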

The approach used

Our input is a batch of sentences. The VAE takes them one by one and passes them through the embedding layer. We initialize the embedding layer with the vocabulary (and GloVe vectors) that we built earlier. The embedding layer can now create an embedded vector for each word, which is passed to the encoder.

The encoder takes the output of the embedding layer as input and produces a hidden-state output. The next task is to find the Gaussian distribution of the latent variable from the encoder's hidden state. A Gaussian distribution is characterized by its mean and variance, so we have to find these from the hidden-state output produced by the encoder. This is done by passing the hidden state through two linear layers that produce a mean vector and a variance vector of the same size as the latent variable's dimension. These linear layers learn to produce the mean and variance as we train the model.
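As a minimal sketch of this step (written as a method of the hypothetical SentenceVAE class above), the final layer's forward and backward hidden states of the bidirectional encoder are concatenated and passed through the two linear layers; as is common in practice, the second layer here predicts the log-variance rather than the variance itself for numerical stability:

```python
    def encode(self, x):
        # x: (batch, seq_len) tensor of word indices
        embedded = self.dropout(self.embedding(x))    # (batch, seq_len, 300)
        _, (h_n, _) = self.encoder(embedded)          # h_n: (num_layers * 2, batch, 256)
        # Concatenate the last layer's forward and backward hidden states
        h = torch.cat([h_n[-2], h_n[-1]], dim=1)      # (batch, 512)
        mu = self.hidden_to_mu(h)                     # (batch, latent_dim)
        logvar = self.hidden_to_logvar(h)             # (batch, latent_dim)
        return mu, logvar
```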

Once we have the mean and variance, we have found the Gaussian distribution corresponding to the latent variable. This is our approximation of the latent variable's true distribution. We will compute the KL divergence between this distribution and the original distribution of the latent variable, which tells us how close our approximation is to the original.

Now, a random point is sampled from this distribution and fed into the decoder. The decoder takes this input and produces an output that contains a vector representation of every single word in every sentence. We then have to identify the words from these vector representations and compare them with the initial input given to the encoder.
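In practice this sampling is done with the reparameterization trick so that gradients can flow through the random draw. Below is a minimal sketch, written as two more methods of the hypothetical SentenceVAE class above; conditioning the decoder by concatenating z to every input word embedding is one common choice, and the original implementation may wire this differently (for example, using z to initialize the decoder's hidden state):

```python
    def reparameterize(self, mu, logvar):
        # Sample z = mu + sigma * eps with eps ~ N(0, I), keeping the graph differentiable
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def decode(self, z, x):
        # Condition the decoder on z by concatenating it to every input word embedding
        embedded = self.dropout(self.embedding(x))               # (batch, seq_len, 300)
        z_rep = z.unsqueeze(1).expand(-1, embedded.size(1), -1)  # (batch, seq_len, latent_dim)
        outputs, _ = self.decoder(torch.cat([embedded, z_rep], dim=2))
        return self.output(outputs)                              # (batch, seq_len, vocab_size)
```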

In the training phase, we train the VAE so that it can reproduce the exact sentence provided at the input as its final output. The error in achieving this is called the reconstruction error, and we will use the cross-entropy error as the error function. When we provide the decoder output and the original encoder input to the cross-entropy function, it returns a value that measures how well we were able to reconstruct the sentence.

Our loss function is not complete yet; there is one more contributor, the KL divergence that we saw earlier. Our aim is to decrease the reconstruction loss as well as the KL divergence. If we looked only at the reconstruction loss, our model might simply start replicating the given inputs, but with the KL divergence term it is able to come up with new sentences. Our total loss is the weighted sum of the reconstruction loss and the KL divergence.

Here we are finding the KL divergence between q(z|x) and p(z|x), where z is our latent variable and q(z|x) is an approximation of p(z|x). It is difficult to find p(z|x) directly, because p(z|x) = p(x|z)p(z)/p(x) and computing p(x) is computationally infeasible. Hence we approximate it with q(z|x) and try to keep the KL divergence between q(z|x) and p(z|x) to a minimum so that q(z|x) remains a good approximation of p(z|x).
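For reference, this is the standard result from the variational autoencoder literature (Kingma and Welling): minimizing the KL divergence between q(z|x) and the intractable p(z|x) is equivalent to maximizing the evidence lower bound, whose negative is exactly the "reconstruction loss plus KL divergence to the prior p(z)" form of the loss used below:

```
\log p(x) \;\ge\; \mathbb{E}_{q(z|x)}\big[\log p(x|z)\big] \;-\; \mathrm{KL}\big(q(z|x)\,\|\,p(z)\big)
```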

Loss = a × (Reconstruction loss) + b × (KL divergence)

In the above loss function, a and b denote the coefficients of the reconstruction loss and the KL divergence respectively. In our implementation, we use the cross-entropy loss as the reconstruction loss, with a coefficient of 8 for the reconstruction term and 0.001 for the KL divergence term.
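A minimal sketch of this loss, assuming the coefficient values from the text (a = 8, b = 0.001); the KL term uses the standard closed form for a diagonal Gaussian q(z|x) against a standard normal prior, and pad_idx (the padding-token index to ignore in the cross entropy) is a placeholder:

```python
import torch
import torch.nn.functional as F

def vae_loss(logits, targets, mu, logvar, pad_idx, a=8.0, b=0.001):
    # Reconstruction loss: cross entropy between predicted word logits and the input sentence
    recon = F.cross_entropy(logits.view(-1, logits.size(-1)),
                            targets.view(-1),
                            ignore_index=pad_idx, reduction="sum")
    # KL divergence of N(mu, sigma^2) from the standard normal prior, in closed form
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return a * recon + b * kl, recon, kl
```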

This error is backpropagated through our neural network to compute the gradient of the error with respect to each of the network weights and to update them. This backpropagation and weight update is done at the end of each batch. We are using a batch size of 128, which means the weights are updated after every batch of 128 sentences has been processed. For optimization, we are using the Adam optimizer with a learning rate of 0.0001.

At the end of every batch, after the weights have been updated, we evaluate the model against the validation data to see how it is improving. This gives us a good insight into how the model is developing.

We will run this whole process for 50 epochs. This trains our model to a good extent and helps it learn language patterns.
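Putting the pieces together, here is a minimal sketch of this training loop under the settings described above (batch size 128 from the iterators, Adam with learning rate 0.0001, 50 epochs). It reuses the hypothetical SentenceVAE, vae_loss, TEXT and iterator names from the earlier sketches, and omits details such as shifting the decoder inputs and targets for teacher forcing:

```python
import torch

pad_idx = TEXT.vocab.stoi[TEXT.pad_token]        # padding index, ignored by the loss
model = SentenceVAE(vocab_size=len(TEXT.vocab))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(50):
    for batch in train_iter:
        model.train()
        x = batch.text                           # (batch, seq_len) word indices
        mu, logvar = model.encode(x)
        z = model.reparameterize(mu, logvar)
        logits = model.decode(z, x)
        loss, recon, kl = vae_loss(logits, x, mu, logvar, pad_idx)

        optimizer.zero_grad()
        loss.backward()                          # backpropagate through decoder, latent space and encoder
        optimizer.step()                         # weights updated at the end of every batch of 128 sentences

        # Quick check against validation data after each update
        model.eval()
        with torch.no_grad():
            val = next(iter(valid_iter))
            v_mu, v_logvar = model.encode(val.text)
            v_logits = model.decode(model.reparameterize(v_mu, v_logvar), val.text)
            val_loss, _, _ = vae_loss(v_logits, val.text, v_mu, v_logvar, pad_idx)
```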

Evaluation of model

Let us see how our model performed. The IMDB movie review dataset was used to train the model. We can see from the graph below that we were able to decrease the reconstruction loss on the training data throughout the training process. The validation data also showed a decreasing trend in reconstruction loss, which means our model was being trained in the right direction. After a point, we start seeing very little improvement; thereafter we have to find the point at which to stop training before the model starts overfitting.

Reconstruction loss vs. number of epochs

Coming to the KL divergence, we see a very large value in the initial epochs, but by the end of training the KL divergence has decreased to a much lower value. This means we are now in a position to reconstruct sentences with modifications, so that we can generate new sentences.

KL divergence vs. number of epochs

Sentence generation

Now, let us put our model to the test. We will input a random sentence to the model, pass it through the encoder, produce a latent-space distribution, sample a point from that distribution, feed it to the decoder, and see what sentence it finally produces as output. Here are a few outputs that were generated:

Sentence generation example 1
Sentence generation example 2

As we are using a variational autoencoder, giving the same input multiple times produces completely different outputs. Let us see an example.

Sentence generation example 3
Sentence generation example

Conclusion

We have achieved our aim and managed to develop a model that can produce new sentences from given sentences while keeping grammatical accuracy and a sense of the topic. Further work can be done to improve grammatical accuracy by training the model with more data and by tuning hyperparameters through further experiments. This model can also be modified for other purposes such as language translation, machine-assisted story writing, sentence completion, modification of write-ups, etc.

Here is the GitHub link to the code used:

https://github.com/vivek1kerala7/Sentence-generation-using-VAE

References

Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A. M., Jozefowicz, R., and Bengio, S. Generating sentences from a continuous space. 2016

Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate, 2016

Bahuleyan, H., Mou, L., Vechtomova, O., and Poupart, P. Variational attention for sequence-to-sequence models, 2018

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. doi: 10.1162/neco.1997.9.8.1735

Kingma, D. P. and Welling, M. An introduction to variational autoencoders, 2019
