Samuel R. Bowman $^{*}$
NLP Group and Dept. of Linguistics
Stanford University
[email protected]
Luke Vilnis $^{*}$
CICS
University of Massachusetts Amherst
[email protected]
Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz & Samy Bengio
Google Brain
{vinyals, adai, rafalj, bengio}@google.com
$^{*}$ First two authors contributed equally. Work was done when all authors were at Google, Inc.
The standard recurrent neural network language model (rnnlm) generates sentences one word at a time and does not work from an explicit global sentence representation. In this work, we introduce and study an rnn-based variational autoencoder generative model that incorporates distributed latent representations of entire sentences. This factorization allows it to explicitly model holistic properties of sentences such as style, topic, and high-level syntactic features. Samples from the prior over these sentence representations remarkably produce diverse and well-formed sentences through simple deterministic decoding. By examining paths through this latent space, we are able to generate coherent novel sentences that interpolate between known sentences. We present techniques for solving the difficult learning problem presented by this model, demonstrate its effectiveness in imputing missing words, explore many interesting properties of the model's latent sentence space, and present negative results on the use of the model in language modeling.
Executive Summary: This document addresses a key limitation in natural language processing: traditional recurrent neural network language models (RNNLMs) generate text one word at a time, relying on evolving states that capture local patterns but not overall sentence properties like topic, style, or high-level structure. This gap hinders applications needing holistic understanding, such as generating diverse sentences or filling in missing context, which are increasingly vital for AI systems in translation, summarization, and content creation amid growing demand for more interpretable and controllable text models.
The work aims to evaluate a new generative model based on a variational autoencoder (VAE), which uses a continuous "latent" representation to encode entire sentences, allowing explicit modeling of global features while generating text via RNN decoders.
Researchers built the model by adapting VAE architecture—originally successful for images—to text sequences, using RNNs for encoding sentences into latent vectors (drawn from a Gaussian distribution) and decoding them back into words. They trained it on datasets like the Penn Treebank (a standard corpus of about 1 million words) for language modeling and the larger Books Corpus (80 million sentences from fiction e-books) for word imputation, incorporating techniques like gradually increasing the penalty on latent deviations (KL annealing) and randomly dropping input words during training (word dropout) to ensure the latent space encodes useful information. Assumptions included a simple Gaussian prior for latents and right-to-left decoding to ease dependencies, with hyperparameters tuned via automated search over development data.
Key results show the VAE matches RNNLM performance on standard language modeling, achieving test perplexities around 119 versus 116, but with a small yet non-zero contribution from the latent space (about 2% of the loss). For imputing missing words in the Books Corpus—simulating gaps like the last 20% of sentences—the VAE generated completions 15-20% harder for classifiers to distinguish from real text (adversarial error of 22-36% versus 28-39% for RNNLM), producing more varied outputs like "through the driver's door" instead of repetitive phrases like "it, '' I said." Qualitatively, sampling from the latent prior with deterministic decoding yielded diverse, grammatical sentences capturing topics like emotions or settings. Interpolation paths between sentence encodings produced smooth transitions, such as shifting "I want to talk to you" to "She didn't want to be with him" via coherent intermediates, unlike abrupt jumps in simpler models. Encodings also aided downstream tasks, boosting paraphrase detection accuracy to 77% when combined with other features.
These findings mean the VAE creates a structured, continuous space for sentences that better handles global context, reducing reliance on local word patterns and improving tasks like text completion where directionality limits traditional models. This could lower risks in AI outputs, such as repetitive or off-topic generations, and enhance performance in real-world uses like chatbots or automated editing, though it differs from expectations by not outperforming RNNLMs on raw likelihoods—likely because local modeling suffices for many cases, but global features shine in structured tasks.
Next, teams should pilot the VAE in applications like conditional text generation (e.g., specifying style) or integrate it with supervised tasks for better embeddings in question answering. Factor the latent space into content and style components for finer control, or explore adversarial training to refine evaluations without likelihood computations. Trade-offs include higher training complexity versus gains in interpretability; further data from diverse genres could help.
Limitations include training instability without annealing and dropout, leading to models sometimes ignoring the latent space, and assumptions like Gaussian priors that may not capture all nuances in discrete text. Confidence is strong in imputation and latent analysis benefits, supported by qualitative examples and classifier baselines, but cautious on broad superiority—readers should test in specific domains before scaling.
Section Summary: Recurrent neural network language models are the leading tools for generating natural language sentences in an unsupervised way and excel in supervised tasks like translation and image captioning, producing words one at a time based on an evolving internal state that captures complex patterns, including long-range dependencies. However, they struggle to represent broader sentence features like topics or syntax in an interpretable manner. To address this, the authors propose extending these models with a variational autoencoder that introduces a continuous latent variable for global features, enabling smoother interpolation between sentences, better performance on tasks like filling in missing words, and the generation of diverse, coherent text, as shown through new evaluation methods and qualitative analyses.
Recurrent neural network language models ($\textsc{rnnlm}$ s, [1]) represent the state of the art in unsupervised generative modeling for natural language sentences. In supervised settings, $\textsc{rnnlm}$ decoders conditioned on task-specific features are the state of the art in tasks like machine translation ([2, 3]) and image captioning ([4, 5, 6]). The $\textsc{rnnlm}$ generates sentences word-by-word based on an evolving distributed state representation, which makes it a probabilistic model with no significant independence assumptions, and makes it capable of modeling complex distributions over sequences, including those with long-term dependencies. However, by breaking the model structure down into a series of next-step predictions, the $\textsc{rnnlm}$ does not expose an interpretable representation of global features like topic or of high-level syntactic properties.
We propose an extension of the $\textsc{rnnlm}$ that is designed to explicitly capture such global features in a continuous latent variable. Naively, maximum likelihood learning in such a model presents an intractable inference problem. Drawing inspiration from recent successes in modeling images ([7]), handwriting, and natural speech ([8]), our model circumvents these difficulties using the architecture of a variational autoencoder and takes advantage of recent advances in variational inference ([9, 10]) that introduce a practical training technique for powerful neural network generative models with latent variables.
:Table 1: Sentences produced by greedily decoding from points between two sentence encodings with a conventional autoencoder. The intermediate sentences are not plausible English.
| i went to the store to buy some groceries . |
|---|
| i store to buy some groceries . |
| i were to buy any groceries . |
| horses are to buy any groceries . |
| horses are to buy any animal . |
| horses the favorite any animal . |
| horses the favorite favorite animal . |
| horses are my favorite animal . |
Our contributions are as follows: We propose a variational autoencoder architecture for text and discuss some of the obstacles to training it as well as our proposed solutions. We find that on a standard language modeling evaluation where a global variable is not explicitly needed, this model yields similar performance to existing $\textsc{rnnlm}$ s. We also evaluate our model using a larger corpus on the task of imputing missing words. For this task, we introduce a novel evaluation strategy using an adversarial classifier, sidestepping the issue of intractable likelihood computations by drawing inspiration from work on non-parametric two-sample tests and adversarial training. In this setting, our model's global latent variable allows it to do well where simpler models fail. We finally introduce several qualitative techniques for analyzing the ability of our model to learn high-level features of sentences. We find that our models can produce diverse, coherent sentences through purely deterministic decoding and that they can interpolate smoothly between sentences.
Section Summary: Standard recurrent neural network language models predict words sequentially but fail to produce a fixed vector representation for entire sentences, prompting the need for unsupervised methods like sequence autoencoders, skip-thought, and paragraph vectors to create such encodings. However, these approaches have limitations: sequence autoencoders produce disjoint or ungrammatical interpolations between sentences and lack a prior for sampling new ones, while the others are not truly generative. The variational autoencoder addresses these issues by regularizing latent codes with a Gaussian prior, using a probabilistic recognition model to enable smooth, interpretable representations and the generation of novel sentences through an objective that balances reconstruction accuracy with divergence from the prior.
A standard $\textsc{rnn}$ language model predicts each word of a sentence conditioned on the previous word and an evolving hidden state. While effective, it does not learn a vector representation of the full sentence. In order to incorporate a continuous latent sentence representation, we first need a method to map between sentences and distributed representations that can be trained in an unsupervised setting. While no strong generative model is available for this problem, three non-generative techniques have shown promise: sequence autoencoders, skip-thought, and paragraph vector.
Sequence autoencoders have seen some success in pre-training sequence models for supervised downstream tasks ([11]) and in generating complete documents ([12]). An autoencoder consists of an encoder function $\varphi_{enc}$ and a probabilistic decoder model $p(x| \vec{z} =\varphi_{enc}(x))$, and maximizes the likelihood of an example $x$ conditioned on $\vec{z}$, the learned code for $x$. In the case of a sequence autoencoder, both encoder and decoder are $\textsc{rnn}$ s and examples are token sequences.
Standard autoencoders are not effective at extracting global semantic features. In Table 1, we present the results of computing a path or homotopy between the encodings for two sentences and decoding each intermediate code. The intermediate sentences are generally ungrammatical and do not transition smoothly from one to the other. This suggests that these models do not generally learn a smooth, interpretable feature system for sentence encoding. In addition, since these models do not incorporate a prior over $\vec{z}$, they cannot be used to assign probabilities to sentences or to sample novel sentences.
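A path between two encodings can be sketched as follows. This is a minimal illustration that assumes the path is a straight line between the two codes (linear interpolation); the function name `linear_path` and the toy two-dimensional codes are hypothetical, and the decoding of each intermediate code is left out.

```python
import numpy as np

def linear_path(z_start, z_end, num_points=5):
    """Return num_points codes evenly spaced on the segment from z_start to z_end."""
    z_start = np.asarray(z_start, dtype=float)
    z_end = np.asarray(z_end, dtype=float)
    # t runs from 0 (the first encoding) to 1 (the second encoding), endpoints included.
    ts = np.linspace(0.0, 1.0, num_points)
    return np.stack([(1.0 - t) * z_start + t * z_end for t in ts])

# Toy 2-dimensional codes standing in for two sentence encodings (hypothetical values).
path = linear_path([0.0, 2.0], [4.0, 0.0], num_points=5)
```

Each row of `path` would then be passed to the decoder to produce one intermediate sentence, as in Table 1.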
Two other models have shown promise in learning sentence encodings, but cannot be used in a generative setting: Skip-thought models ([13]) are unsupervised learning models that take the same model structure as a sequence autoencoder, but generate text conditioned on a neighboring sentence from the target text, instead of on the target sentence itself. Finally, paragraph vector models ([14]) are non-recurrent sentence representation models. In a paragraph vector model, the encoding of a sentence is obtained by performing gradient-based inference on a prospective encoding vector with the goal of using it to predict the words in the sentence.
The variational autoencoder ($\textsc{vae}$, [9, 10]) is a generative model that is based on a regularized version of the standard autoencoder. This model imposes a prior distribution on the hidden codes $\vec{z}$ which enforces a regular geometry over codes and makes it possible to draw proper samples from the model using ancestral sampling.
The $\textsc{vae}$ modifies the autoencoder architecture by replacing the deterministic function $\varphi_{enc}$ with a learned posterior recognition model, $q(\vec{z}| x)$. This model parametrizes an approximate posterior distribution over $\vec{z}$ (usually a diagonal Gaussian) with a neural network conditioned on $x$. Intuitively, the $\textsc{vae}$ learns codes not as single points, but as soft ellipsoidal regions in latent space, forcing the codes to fill the space rather than memorizing the training data as isolated codes.
If the $\textsc{vae}$ were trained with a standard autoencoder's reconstruction objective, it would learn to encode its inputs deterministically by making the variances in $q(\vec{z}| x)$ vanishingly small ([15]). Instead, the $\textsc{vae}$ uses an objective which encourages the model to keep its posterior distributions close to a prior $p(\vec{z})$, generally a standard Gaussian ($\mu=\vec{0}$, $\sigma=\vec{1}$). Additionally, this objective is a valid lower bound on the true log likelihood of the data, making the $\textsc{vae}$ a generative model. This objective takes the following form:
$ \begin{split} \mathcal{L}(\theta; x) &= -\textsc{kl}(q_\theta(\vec{z}| x)||p(\vec{z})) \\ &\quad\; + \mathbb{E}_{q_\theta(\vec{z}| x)}[\log p_\theta(x| \vec{z})] \\ & \le \log p(x)\;\;. \end{split}\tag{1} $
This forces the model to be able to decode plausible sentences from every point in the latent space that has a reasonable probability under the prior.
In the experiments presented below using $\textsc{vae}$ models, we use diagonal Gaussians for the prior and posterior distributions $p(\vec{z})$ and $q(\vec{z}|x)$, using the Gaussian reparameterization trick of [9]. We train our models with stochastic gradient descent, and at each gradient step we estimate the reconstruction cost using a single sample from $q(\vec{z}| x)$, but compute the $\textsc{kl}$ divergence term of the cost function in closed form, again following [9].
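The two ingredients named here, a reparameterized sample and a closed-form $\textsc{kl}$ term, can be sketched in a few lines. This is an illustration rather than the authors' code: it assumes the recognition model outputs a mean and a log-variance per latent dimension (the log-variance parametrization is an assumption) and that the prior is a standard Gaussian. The function names are hypothetical.

```python
import numpy as np

def sample_z(mu, log_var, rng):
    """Reparameterized sample: z = mu + sigma * eps, with eps ~ N(0, I)."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    eps = rng.standard_normal(mu.shape)
    # sigma = exp(0.5 * log_var); the noise eps does not depend on the parameters.
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dimensions."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])       # toy posterior mean (hypothetical values)
log_var = np.array([0.0, 0.0])   # unit variance in each dimension
z = sample_z(mu, log_var, rng)   # single sample used for the reconstruction term
kl = kl_to_standard_normal(mu, log_var)
```

When the posterior equals the prior (zero mean, unit variance), the $\textsc{kl}$ term is exactly zero, which is the degenerate solution discussed later in the paper.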
Section Summary: Researchers adapted a variational autoencoder, a type of neural network that compresses and reconstructs data, to handle sentences by using recurrent neural networks called LSTMs for both encoding text into a hidden code and decoding it back into words, with the code helping to regularize the process like a standard language model when it carries no extra information. They tested various tweaks to the design, such as different ways to incorporate the hidden code or more advanced sampling methods, but found little improvement, and noted similarities to prior work on music and other sequences, though their focus was on global sentence features rather than step-by-step ones. A key challenge was that the model often ignored the hidden code during training to achieve quick results with just the decoder, leading to optimization fixes like gradually increasing the importance of a regularization term to encourage meaningful use of the code.

We adapt the variational autoencoder to text by using single-layer $\textsc{lstm}$ $\textsc{rnn}$ s ([16]) for both the encoder and the decoder, essentially forming a sequence autoencoder with the Gaussian prior acting as a regularizer on the hidden code. The decoder serves as a special $\textsc{rnn}$ language model that is conditioned on this hidden code, and in the degenerate setting where the hidden code incorporates no useful information, this model is effectively equivalent to an $\textsc{rnnlm}$. The model is depicted in Figure 1, and is used in all of the experiments discussed below.
We explored several variations on this architecture, including concatenating the sampled $\vec{z}$ to the decoder input at every time step, using a softplus parametrization for the variance, and using deep feedforward networks between the encoder and latent variable and the decoder and latent variable. We noticed little difference in the model's performance when using any of these variations. However, when including feedforward networks between the encoder and decoder we found that it is necessary to use highway network layers ([17]) for the model to learn. We discuss hyperparameter tuning in the appendix.
We also experimented with more sophisticated recognition models $q(\vec{z}| x)$, including a multistep sampling model styled after $\textsc{draw}$ ([7]), and a posterior approximation using normalizing flows ([18]). However, we were unable to reap significant gains over our plain $\textsc{vae}$.
While the strongest results with $\textsc{vae}$ s to date have been on continuous domains like images, there has been some work on discrete sequences: a technique for doing this using $\textsc{rnn}$ encoders and decoders, which shares the same high-level architecture as our model, was proposed under the name Variational Recurrent Autoencoder ($\textsc{vrae}$) for the modeling of music in [19]. While there has been other work on including continuous latent variables in $\textsc{rnn}$-style models for modeling speech, handwriting, and music ([20, 8]), these models include separate latent variables per timestep and are unsuitable for our goal of modeling global features.
In a recent paper with goals similar to ours, [21] introduce an effective VAE-based document-level language model that models texts as bags of words, rather than as sequences. They mention briefly that they have to train the encoder and decoder portions of the network in alternation rather than simultaneously, possibly as a way of addressing some of the issues that we discuss in Section 3.1.
Our model aims to learn global latent representations of sentence content. We can quantify the degree to which our model learns global features by looking at the variational lower bound objective (Equation 1). The bound breaks into two terms: the data likelihood under the posterior (expressed as cross entropy), and the $\textsc{kl}$ divergence of the posterior from the prior. A model that encodes useful information in the latent variable $\vec{z}$ will have a non-zero $\textsc{kl}$ divergence term and a relatively small cross entropy term. Straightforward implementations of our $\textsc{vae}$ fail to learn this behavior: except in vanishingly rare cases, training runs across most hyperparameter settings yield models that consistently set $q(\vec{z}| x)$ equal to the prior $p(\vec{z})$, bringing the $\textsc{kl}$ divergence term of the cost function to zero.
When the model does this, it is essentially behaving as an $\textsc{rnnlm}$. Because of this, it can express arbitrary distributions over the output sentences (albeit with a potentially awkward left-to-right factorization) and can thereby achieve likelihoods that are close to optimal. Previous work on $\textsc{vae}$ s for image modeling ([9]) used a much weaker independent pixel decoder model $p(x| \vec{z})$, forcing the model to use the global latent variable to achieve good likelihoods. In a related result, recent approaches to image generation that use $\textsc{lstm}$ decoders are able to do well without $\textsc{vae}$-style global latent variables ([22]).
This problematic tendency in learning is compounded by the $\textsc{lstm}$ decoder's sensitivity to subtle variation in the hidden states, such as that introduced by the posterior sampling process. This causes the model to initially learn to ignore $\vec{z}$ and go after low hanging fruit, explaining the data with the more easily optimized decoder. Once this has happened, the decoder ignores the encoder and little to no gradient signal passes between the two, yielding an undesirable stable equilibrium with the $\textsc{kl}$ cost term at zero. We propose two techniques to mitigate this issue.
KL cost annealing
In this simple approach to this problem, we add a variable weight to the $\textsc{kl}$ term in the cost function at training time. At the start of training, we set that weight to zero, so that the model learns to encode as much information in $\vec{z}$ as it can. Then, as training progresses, we gradually increase this weight, forcing the model to smooth out its encodings and pack them into the prior. We increase this weight until it reaches 1, at which point the weighted cost function is equivalent to the true variational lower bound. In this setting, we do not optimize the proper lower bound on the training data likelihood during the early stages of training, but we nonetheless see improvements on the value of that bound at convergence. This can be thought of as annealing from a vanilla autoencoder to a $\textsc{vae}$. The rate of this increase is tuned as a hyperparameter.
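The weighting scheme can be sketched as below. The text only states that the weight starts at zero, rises gradually, and stops at 1, so the linear ramp used here is an assumption made for simplicity, and `warmup_steps` is a hypothetical hyperparameter standing in for the tuned rate of increase.

```python
def kl_weight(step, warmup_steps=10000):
    """Linear ramp from 0 at step 0 to 1 at warmup_steps, then held at 1."""
    return min(1.0, max(0.0, step / warmup_steps))

def annealed_cost(reconstruction_nll, kl, step, warmup_steps=10000):
    """Training cost with a weighted KL term.

    With weight 0 this is a plain autoencoder cost; with weight 1 it equals
    the negative variational lower bound.
    """
    return reconstruction_nll + kl_weight(step, warmup_steps) * kl
```

Early in training the $\textsc{kl}$ term is nearly free, so the encoder can place information in $\vec{z}$; once the weight reaches 1 the cost is the true bound again.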

Figure 2 shows the behavior of the $\textsc{kl}$ cost term during the first 50k steps of training on Penn Treebank ([23]) language modeling with $\textsc{kl}$ cost annealing in place. This example reflects a pattern that we observed often: $\textsc{kl}$ spikes early in training while the model can encode information in $\vec{z}$ cheaply, then drops substantially once it begins paying the full $\textsc{kl}$ divergence penalty, and finally slowly rises again before converging as the model learns to condense more information into $\vec{z}$.
Word dropout and historyless decoding
In addition to weakening the penalty term on the encodings, we also experiment with weakening the decoder. As in $\textsc{rnnlm}$ s and sequence autoencoders, during learning our decoder predicts each word conditioned on the ground-truth previous word. A natural way to weaken the decoder is to remove some or all of this conditioning information during learning. We do this by randomly replacing some fraction of the conditioned-on word tokens with the generic unknown word token $\textsc{unk}$. This forces the model to rely on the latent variable $\vec{z}$ to make good predictions. This technique is a variant of word dropout ([24, 25]), applied not to a feature extractor but to a decoder. We also experimented with standard dropout ([26]) applied to the input word embeddings in the decoder, but this did not help the model learn to use the latent variable.
This technique is parameterized by a keep rate $k\in[0, 1]$. We tune this parameter both for our $\textsc{vae}$ and for our baseline $\textsc{rnnlm}$. Taken to the extreme of $k=0$, the decoder sees no input, and is thus able to condition only on the number of words produced so far, yielding a model that is extremely limited in the kinds of distributions it can model without using $\vec{z}$.
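A minimal sketch of this word dropout step, applied to the decoder's conditioning tokens. The function name and the `<unk>` string are hypothetical; only the keep rate $k$ comes from the text.

```python
import numpy as np

def word_dropout(tokens, keep_rate, rng, unk="<unk>"):
    """Replace each conditioned-on token with unk with probability 1 - keep_rate."""
    # rng.random draws from [0, 1), so keep_rate=1 keeps every token
    # and keep_rate=0 replaces every token.
    keep = rng.random(len(tokens)) < keep_rate
    return [tok if k else unk for tok, k in zip(tokens, keep)]

rng = np.random.default_rng(0)
history = ["i", "went", "to", "the", "store"]
dropped = word_dropout(history, keep_rate=0.25, rng=rng)
inputless = word_dropout(history, keep_rate=0.0, rng=rng)
```

Only the decoder inputs are corrupted; the prediction targets stay the original words, so the model must recover the dropped information from $\vec{z}$.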
::: {caption="Table 2: Penn Treebank language modeling results, reported as negative log likelihoods and as perplexities. Lower is better for both metrics. For the $\textsc{vae}$, the $\textsc{kl}$ term of the likelihood is shown in parentheses alongside the total likelihood."}

:::
Section Summary: Researchers tested whether adding a global hidden variable to language models improves performance on the Penn Treebank dataset, focusing on models that actually use this variable. In standard tests, the new variational autoencoder (VAE) model performed slightly worse than the basic recurrent neural network language model (RNNLM), but it did capture some sentence-wide patterns rather than just local word predictions. When removing input words entirely to simulate heavy dropout, the VAE showed stronger results by relying more on the hidden variable for generating coherent text, though overall it was hard to make this variable the main driver of predictions instead of simpler word-by-word patterns.
In this section, we report on language modeling experiments on the Penn Treebank in an effort to discover whether the inclusion of a global latent variable is helpful for this standard task. For this reason, we restrict our $\textsc{vae}$ hyperparameter search to those models which encode a non-trivial amount in the latent variable, as measured by the $\textsc{kl}$ divergence term of the variational lower bound.
Results
We used the standard train–test split for the corpus, and report test set results in Table 2. The results shown reflect the training and test set performance of each model at the training step at which the model performs best on the development set. Our reported figures for the $\textsc{vae}$ reflect the variational lower bound on the test likelihood, while for the $\textsc{rnnlm}$ s, which can be evaluated exactly, we report the true test likelihood. This discrepancy puts the $\textsc{vae}$ at a potential disadvantage.
In the standard setting, the $\textsc{vae}$ performs slightly worse than the $\textsc{rnnlm}$ baseline, though it does succeed in using the latent space to a limited extent: it has a reconstruction cost (99) better than that of the baseline $\textsc{rnnlm}$, but this gain is offset by a $\textsc{kl}$ divergence cost of 2. Training a $\textsc{vae}$ in the standard setting without both word dropout and cost annealing reliably results in models with equivalent performance to the baseline $\textsc{rnnlm}$, and zero $\textsc{kl}$ divergence.
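The relationship between the two reported metrics can be sketched as follows, assuming the usual convention that perplexity is the exponential of the average per-token negative log likelihood in nats. The sentence length below is a hypothetical value chosen only to make the example concrete, so the result is not expected to reproduce Table 2.

```python
import math

def perplexity(total_nll, num_tokens):
    """Per-token perplexity from a summed negative log likelihood in nats."""
    return math.exp(total_nll / num_tokens)

# For the VAE the reported cost is a bound with two parts (values from the text).
reconstruction_cost = 99.0
kl_cost = 2.0
vae_bound = reconstruction_cost + kl_cost

# HYPOTHETICAL number of tokens per sentence, for illustration only.
tokens_per_sentence = 21
vae_perplexity_bound = perplexity(vae_bound, tokens_per_sentence)
```

Because the $\textsc{vae}$ figure is a bound on the negative log likelihood, the perplexity derived from it is an upper bound on the model's true perplexity.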
To demonstrate the ability of the latent variable to encode the full content of sentences in addition to more abstract global features, we also provide numbers for an inputless decoder that does not condition on previous tokens, corresponding to a word dropout keep rate of $0$. In this regime we can see that the variational lower bound contains a significantly larger $\textsc{kl}$ term and shows a substantial improvement over the weakened $\textsc{rnnlm}$, which is essentially limited to using unigram statistics in this setting. While it is weaker than a standard decoder, the inputless decoder has the interesting property that its sentence generating process is fully differentiable. Advances in generative models of this kind could be promising as a means of generating text while using adversarial training methods, which require differentiable generators.
Even with the techniques described in the previous section, including the inputless decoder, we were unable to train models for which the $\textsc{kl}$ divergence term of the cost function dominates the reconstruction term. This suggests that it is still substantially easier to learn to factor the data distribution using simple local statistics, as in the $\textsc{rnnlm}$, such that an encoder will only learn to encode information in $\vec{z}$ when that information cannot be effectively described by these local statistics.
::: {caption="Table 3: Examples of using beam search to impute missing words within sentences. Since we decode from right to left, note the stereotypical completions given by the $\textsc{rnnlm}$, compared to the $\textsc{vae}$ completions that often use topic data and more varied vocabulary."}

:::
Section Summary: Researchers tested their VAE model, which captures overall sentence meaning, against a standard RNN language model for filling in missing words in sentences from a large collection of books. To make a fair comparison, they used similar computing power for generating completions and invented a new way to evaluate results by training classifiers to spot the difference between real sentence endings and the models' fakes—the smaller the gap, the better the model. The VAE produced more varied and realistic fillings that were harder for the classifiers to distinguish from the originals, highlighting its advantage in handling connections across the whole sentence.
We claim that our $\textsc{vae}$ 's global sentence features make it especially well suited to the task of imputing missing words in otherwise known sentences. In this section, we present a technique for imputation and a novel evaluation strategy inspired by adversarial training. Qualitatively, we find that the $\textsc{vae}$ yields more diverse and plausible imputations for the same amount of computation (see the examples given in Table 3), but precise quantitative evaluation requires intractable likelihood computations. We sidestep this with the adversarial evaluation described below.
While the standard $\textsc{rnnlm}$ is a powerful generative model, the sequential nature of likelihood computation and decoding makes it unsuitable for performing inference over unknown words given some known words (the task of imputation). Except in the special case where the unknown words all appear at the end of the decoding sequence, sampling from the posterior over the missing variables is intractable for all but the smallest vocabularies. For a vocabulary of size $V$, it requires $O(V)$ runs of full $\textsc{rnn}$ inference per step of Gibbs sampling or iterated conditional modes. Worse, because of the directional nature of the graphical model given by an $\textsc{rnnlm}$, many steps of sampling could be required to propagate information between unknown variables and the known downstream variables. The $\textsc{vae}$, while it suffers from the same intractability problems when sampling or computing $\textsc{map}$ imputations, can more easily propagate information between all variables, by virtue of having a global latent variable and a tractable recognition model.
For this experiment and subsequent analysis, we train our models on the Books Corpus introduced in [13]. This is a collection of text from 12k e-books, mostly fiction. The dataset, after pruning, contains approximately 80m sentences. We find that this much larger amount of data produces more subjectively interesting generative models than smaller standard language modeling datasets. We use a fixed word dropout rate of 75% when training this model and all subsequent models unless otherwise specified. Our models (the $\textsc{vae}$ and $\textsc{rnnlm}$) are trained as language models, decoding right-to-left to shorten the dependencies during learning for the $\textsc{vae}$. We use 512 hidden units.
Inference method
To generate imputations from the two models, we use beam search with beam size 15 for the $\textsc{rnnlm}$ and approximate iterated conditional modes ([27]) with 3 steps of a beam size 5 search for the $\textsc{vae}$. This allows us to compare the same amount of computation for both models. We find that breaking decoding for the $\textsc{vae}$ into several sequential steps is necessary to propagate information among the variables. Iterated conditional modes is a technique for finding the maximum joint assignment of a set of variables by alternately maximizing conditional distributions, and is a generalization of "hard- $\textsc{em}$ " algorithms like k-means ([28]). For approximate iterated conditional modes, we first initialize the unknown words to the $\textsc{unk}$ token. We then alternate assigning the latent variable to its mode from the recognition model, and performing constrained beam search to assign the unknown words. Both of our generative models are trained to decode sentences from right-to-left, which shortens the dependencies involved in learning for the $\textsc{vae}$, and we impute the final 20% of each sentence. This lets us demonstrate the advantages of the global latent variable in the regime where the $\textsc{rnnlm}$ suffers the most from its inductive bias.
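The alternation described above can be sketched as a short loop. This is a skeleton under stated assumptions, not the authors' implementation: `encode_mode` and `fill_missing` are hypothetical callables standing in for the recognition model's mode and for constrained beam search, and the `<unk>` string is a placeholder for the $\textsc{unk}$ token.

```python
def approximate_icm(tokens, missing, encode_mode, fill_missing, steps=3, unk="<unk>"):
    """Alternate between setting z to its mode and re-imputing the missing words.

    tokens: list of str, with arbitrary content at the missing positions.
    missing: set of int positions to impute.
    encode_mode(tokens) -> z: HYPOTHETICAL stand-in for the mode of q(z | x).
    fill_missing(tokens, missing, z) -> list of str: HYPOTHETICAL stand-in for
        constrained beam search that changes only the positions in `missing`.
    """
    # Initialize every unknown word to the unknown-word token.
    current = [unk if i in missing else tok for i, tok in enumerate(tokens)]
    for _ in range(steps):
        z = encode_mode(current)                      # latent variable step
        current = fill_missing(current, missing, z)   # unknown words step
    return current
```

With `steps=3` and a beam of size 5 inside `fill_missing`, this corresponds to the computation budget described for the $\textsc{vae}$ in the text.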
Adversarial evaluation
Drawing inspiration from adversarial training methods for generative models as well as non-parametric two-sample tests ([29, 30, 31, 32]), we evaluate the imputed sentence completions by examining their distinguishability from the true sentence endings. While the non-differentiability of the discrete $\textsc{rnn}$ decoder prevents us from easily applying the adversarial criterion at train time, we can define a very flexible test time evaluation by training a discriminant function to separate the generated and true sentences, which defines an adversarial error.
We train two classifiers: a bag-of-unigrams logistic regression classifier and an $\textsc{lstm}$ logistic regression classifier that reads the input sentence and produces a binary prediction after seeing the final $\textsc{eos}$ token. We train these classifiers using early stopping on a $80/10/10$ train/dev/test split of 320k sentences, constructing a dataset of 50% complete sentences from the corpus (positive examples) and 50% sentences with imputed completions (negative examples). We define the adversarial error as the gap between the ideal accuracy of the discriminator (50%, i.e. indistinguishable samples), and the actual accuracy attained.
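As a minimal sketch of the metric defined above, the adversarial error can be computed from a discriminator's predictions on a balanced test set. The function name `adversarial_error` and the sign convention (accuracy minus 0.5) are assumptions made for illustration, not taken from the original implementation.

```python
def adversarial_error(predictions, labels):
    """Gap between the discriminator's accuracy and the ideal 50% accuracy.

    predictions, labels: equal-length sequences of 0/1 values, where 1 marks
    a true sentence and 0 marks a sentence with an imputed completion.
    Assumes a balanced test set, so that chance accuracy is 0.5.
    """
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and aligned")
    correct = sum(p == y for p, y in zip(predictions, labels))
    accuracy = correct / len(labels)
    return accuracy - 0.5


# A discriminator that gets 3 of 4 examples right has adversarial error 0.25.
example_error = adversarial_error([1, 1, 0, 0], [1, 1, 0, 1])
```

A value near zero means the discriminator cannot separate imputed completions from true endings, which is the desired outcome for the generative model.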
::: {caption="Table 4: Results for adversarial evaluation of imputations. Unigram and $\textsc{lstm}$ numbers are the adversarial error (see text) and $\textsc{rnnlm}$ numbers are the negative log-likelihood given to entire generated sentence by the $\textsc{rnnlm}$, a measure of sentence typicality. Lower is better on both metrics. The $\textsc{vae}$ is able to generate imputations that are significantly more difficult to distinguish from the true sentences."}

:::
Results
As a consequence of this experimental setup, the $\textsc{rnnlm}$ cannot choose anything outside of the top 15 tokens given by the $\textsc{rnn}$'s initial unconditional distribution $P(x_1|\text{Null})$ when producing the final token of the sentence, since it has not yet generated anything to condition on and has a beam size of 15. Table 4 shows that this weakness makes the $\textsc{rnnlm}$ produce far less diverse samples than the $\textsc{vae}$ and suffer accordingly against the adversarial classifier. Additionally, we include the score given to the entire sentence with the imputed completion by a separate, independently trained language model. The likelihood results are comparable, though the $\textsc{rnnlm}$'s favoring of generic high-probability endings such as "he said," gives it a slightly lower negative log-likelihood. Measuring the $\textsc{rnnlm}$ likelihood of sentences themselves produced by an $\textsc{rnnlm}$ is not a good measure of the power of the model, but it demonstrates that the $\textsc{rnnlm}$ can produce what it sees as high-quality imputations by favoring typical local statistics, even though their repetitive nature produces easy failure modes for the adversarial classifier. Accordingly, under the adversarial evaluation our model substantially outperforms the baseline, since it is able to efficiently propagate information bidirectionally through the latent variable.
::: {caption="Table 5: Samples from a model trained with varying amounts of word dropout. We sample a vector from the Gaussian prior and apply greedy decoding to the result. Note that diverse samples can be achieved using a purely deterministic decoding procedure. Once we reach a purely inputless decoder in the 0% setting, however, the samples cease to be plausible English sentences."}

:::
::: {caption="Table 6: Greedily decoded sentences from a model with 75% word keep probability, sampling from lower-likelihood areas of the latent space. Note the consistent topics and vocabulary usage."}

:::
Section Summary: This section explores how a language model's hidden vector captures variations in text compared to its built-in language patterns, by sampling from a basic probability distribution and using a straightforward decoding method to generate sentences. Experiments with word dropout reveal that reducing the words fed into the model during training pushes more detailed information into the hidden vector, producing diverse and mostly grammatical sentences, though too much dropout leads to repetitions or errors. Sampling from the model's learned distributions for given inputs also generates similar sentences, highlighting its grasp of sentence length, word types, and overall topics.
We now turn to more qualitative analysis of the model. Since our decoder model $p(x| \vec{z})$ is a sophisticated $\textsc{rnnlm}$, simply sampling from the directed graphical model (first $p(\vec{z})$ then $p(x| \vec{z})$) would not tell us much about how much of the data is being explained by each of the latent space and the decoder. Instead, for this part of the evaluation, we sample from the Gaussian prior, but use a greedy deterministic decoder for $p(x| \vec{z})$, the $\textsc{rnnlm}$ conditioned on $\vec{z}$. This allows us to get a sense of how much of the variance in the data distribution is being captured by the distributed vector $\vec{z}$ as opposed to the decoder. Interestingly, these results qualitatively demonstrate that large amounts of variation in generated language can be achieved by following this procedure. In the appendix, we provide some results on small text classification tasks.

:Table 7: Three sentences which were used as inputs to the $\textsc{vae}$, presented with greedy decodes from the mean of the posterior distribution, and from three samples from that distribution.
| $\textsc{input}$ | we looked out at the setting sun . | i went to the kitchen . | how are you doing ? |
|---|---|---|---|
| $\textsc{mean}$ | they were laughing at the same time . | i went to the kitchen . | what are you doing ? |
| $\textsc{samp. 1}$ | ill see you in the early morning . | i went to my apartment . | " are you sure ? |
| $\textsc{samp. 2}$ | i looked up at the blue sky . | i looked around the room . | what are you doing ? |
| $\textsc{samp. 3}$ | it was down on the dance floor . | i turned back to the table . | what are you doing ? |
For this experiment, we train on the Books Corpus and test on a held-out 10k-sentence test set from that corpus. We find that train and test set performance are very similar. In Figure 3, we examine the impact of word dropout on the variational lower bound, broken down into $\textsc{kl}$ divergence and cross entropy components. We drop out words with the specified keep rate at training time, but supply all words as inputs at test time except in the 0% setting.
We do not re-tune the hyperparameters for each run, which results in the model with no dropout encoding very little information in $\vec{z}$ (i.e., the $\textsc{kl}$ component is small). We can see that as we lower the keep rate for word dropout, the amount of information stored in the latent variable increases, and the overall likelihood of the model degrades somewhat. Results from Section 4 indicate that a model with no latent variable would degrade in performance significantly more in the presence of heavy word dropout.
We also qualitatively evaluate samples, to demonstrate that the increased $\textsc{kl}$ allows meaningful sentences to be generated purely from continuous sampling. Since our decoder model $p(x| \vec{z})$ is a sophisticated $\textsc{rnnlm}$, simply sampling from the directed graphical model (first $p(\vec{z})$ then $p(x| \vec{z})$) would not tell us about how much of the data is being explained by the learned vector vs. the language model. Instead, for this part of the qualitative evaluation, we sample from the Gaussian prior, but use a greedy deterministic decoder for $x$, taking each token $x_t = \operatorname{argmax}_{x_t} p(x_t | x_{0, \ldots, t-1}, \vec{z})$. This allows us to get a sense of how much of the variance in the data distribution is being captured by the distributed vector $\vec{z}$ as opposed to by local language model dependencies.
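The greedy decoding rule can be sketched in a few lines. Here `toy_model` is a hypothetical stand-in for the conditional distribution $p(x_t | x_{0, \ldots, t-1}, \vec{z})$ of a trained decoder; it, the vocabulary, and the `<eos>` symbol are assumptions for illustration only.

```python
def greedy_decode(next_token_probs, z, eos="<eos>", max_len=20):
    """Pick the argmax token at every step, conditioning on the prefix and z."""
    tokens = []
    while len(tokens) < max_len:
        probs = next_token_probs(tokens, z)  # dict: token -> probability
        token = max(probs, key=probs.get)    # deterministic argmax, no sampling
        if token == eos:
            break
        tokens.append(token)
    return tokens


def toy_model(prefix, z):
    # Hypothetical decoder: the sign of z[0] selects which sentence is likely.
    script = ["i", "went", "home"] if z[0] > 0 else ["she", "smiled"]
    target = script[len(prefix)] if len(prefix) < len(script) else "<eos>"
    vocab = ["i", "went", "home", "she", "smiled", "<eos>"]
    return {w: (0.9 if w == target else 0.02) for w in vocab}


sentence_a = greedy_decode(toy_model, [1.0])
sentence_b = greedy_decode(toy_model, [-1.0])
```

Because decoding is deterministic, any difference between `sentence_a` and `sentence_b` comes only from the latent vector, which is the point of using this procedure to probe what $\vec{z}$ encodes.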
These results, shown in Table 5, qualitatively demonstrate that large amounts of variation in generated language can be achieved by following this procedure. At the low end, where very little of the variance is explained by $\vec{z}$, we see that greedy decoding applied to a Gaussian sample does not produce diverse sentences. As we increase the amount of word dropout and force $\vec{z}$ to encode more information, we see the sentences become more varied, but past a certain point they begin to repeat words or show other signs of ungrammaticality. Even in the case of a fully dropped-out decoder, the model is able to capture higher-order statistics not present in the unigram distribution.
Additionally, in Table 6 we examine the effect of using lower-probability samples from the latent Gaussian space for a model with a 75% word keep rate. We find lower-probability samples by applying an approximately volume-preserving transformation to the Gaussian samples that stretches some eigenspaces by up to a factor of 4. This has the effect of creating samples that are not too improbable under the prior, but still reach into the tails of the distribution. We use a random linear transformation, with matrix elements drawn from a uniform distribution over $[-c, c]$, with $c$ chosen to give the desired properties ($0.1$ in our experiments). Here we see that the sentences are far less typical, but for the most part are grammatical and maintain a clear topic, indicating that the latent variable is capturing a rich variety of global features even for rare sentences.
In addition to generating unconditional samples, we can also examine the sentences decoded from the posterior vectors $p(z|x)$ for various sentences $x$. Because the model is regularized to produce distributions rather than deterministic codes, it does not exactly memorize and round-trip the input. Instead, we can see what the model considers to be similar sentences by examining the posterior samples in Table 7. The codes appear to capture information about the number of tokens and parts of speech for each token, as well as topic information. As the sentences get longer, the fidelity of the round-tripped sentences decreases.
::: {caption="Table 8: Paths between pairs of random points in $\textsc{vae}$ space: Note that intermediate sentences are grammatical, and that topic and syntactic structure are usually locally consistent."}

:::
The use of a variational autoencoder allows us to generate sentences using greedy decoding on continuous samples from the space of codes. Additionally, the volume-filling and smooth nature of the code space allows us to examine for the first time a concept of homotopy (linear interpolation) between sentences. In this context, a homotopy between two codes $\vec{z}_1$ and $\vec{z}_2$ is the set of points on the line between them, inclusive, $\vec{z}(t) = \vec{z}_1 \cdot (1-t) + \vec{z}_2 \cdot t$ for $t \in [0, 1]$. Similarly, the homotopy between two sentences decoded (greedily) from codes $\vec{z}_1$ and $\vec{z}_2$ is the set of sentences decoded from the codes on the line. Examining these homotopies allows us to get a sense of what neighborhoods in code space look like: how the autoencoder organizes information and what it regards as a continuous deformation between two sentences.
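The interpolation itself is simple to sketch. The function name `homotopy_codes` and the choice of evenly spaced values of $t$ are assumptions for illustration; each returned code would then be passed to a greedy decoder to obtain the corresponding sentence.

```python
def homotopy_codes(z1, z2, steps=5):
    """Return `steps` evenly spaced codes on the line from z1 to z2, inclusive."""
    if steps < 2:
        raise ValueError("steps must be at least 2 to include both endpoints")
    if len(z1) != len(z2):
        raise ValueError("codes must have the same dimensionality")
    codes = []
    for k in range(steps):
        t = k / (steps - 1)  # t runs from 0 to 1
        codes.append([a * (1 - t) + b * t for a, b in zip(z1, z2)])
    return codes


path = homotopy_codes([0.0, 2.0], [1.0, 0.0], steps=3)
```

With `steps=3` the path contains the two endpoint codes and their midpoint.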
While a standard non-variational $\textsc{rnnlm}$ does not have a way to perform these homotopies, a vanilla sequence autoencoder can do so. As mentioned earlier in the paper, if we examine the homotopies created by the sequence autoencoder in Table 1, though, we can see that the transition between sentences is sharp, and results in ungrammatical intermediate sentences. This gives evidence for our intuition that the $\textsc{vae}$ learns representations that are smooth and "fill up" the space.
In Table 8 (and in additional tables in the appendix) we can see that the codes mostly contain syntactic information, such as the number of words and the parts of speech of tokens, and that all intermediate sentences are grammatical. Some topic information also remains consistent in neighborhoods along the path. Additionally, sentences with similar syntax and topic but flipped sentiment valence, e.g. "the pain was unbearable" vs. "the thought made me smile", can have similar embeddings, a phenomenon which has been observed with single-word embeddings (for example the vectors for "bad" and "good" are often very similar due to their similar distributional characteristics).
Section Summary: This paper presents a new method using a variational autoencoder to process natural language sentences, with special training techniques that enable it to fill in missing words effectively. Analysis shows the model's hidden space can produce sensible and varied sentences by sampling from continuous values, and it creates smooth transitions between different sentences for better understanding. Looking ahead, the researchers plan to separate style from content in the model, generate sentences based on outside factors, improve language tasks like determining if texts imply each other, and adopt more advanced training methods.
This paper introduces the use of a variational autoencoder for natural language sentences. We present novel techniques that allow us to train our model successfully, and find that it can effectively impute missing words. We analyze the latent space learned by our model, and find that it is able to generate coherent and diverse sentences through purely continuous sampling and provides interpretable homotopies that smoothly interpolate between sentences.
We hope in future work to investigate factorization of the latent variable into separate style and content components, to generate sentences conditioned on extrinsic features, to learn sentence embeddings in a semi-supervised fashion for language understanding tasks like textual entailment, and to go beyond adversarial evaluation to a fully adversarial training objective.
Section Summary: Researchers tested the sentence representations from a variational autoencoder, trained on a large book collection, by applying them to two classification tasks: detecting paraphrases and classifying question types. For paraphrase detection on the Microsoft Research corpus, they created features from paired sentence vectors and trained a simple classifier, achieving results that were slightly worse than advanced models like skip-thought but improved when combined with them. In question classification using the TREC dataset, the autoencoder's features underperformed basic word-bag methods but beat a plain autoencoder, while skip-thought excelled without gains from combining representations.
::: {caption="Table 9: Results for the $\textsc{msr}$ Paraphrase Corpus."}

:::
In order to further examine the structure of the representations discovered by the $\textsc{vae}$, we conduct classification experiments on paraphrase detection and question type classification. We train a $\textsc{vae}$ with a hidden state size of 1200 hidden units on the Books Corpus, and use the posterior mean of the model as the extracted sentence vector. We train classifiers on these means using the same experimental protocol as [13].
Paraphrase detection
For the task of paraphrase detection, we use the Microsoft Research Paraphrase Corpus ([33]). We compute features from the sentence vectors of sentence pairs in the same way as [13], concatenating the elementwise products and the absolute value of the elementwise differences of the two vectors. We train an $\ell_2$-regularized logistic regression classifier and tune the regularization strength using cross-validation.
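The pairwise feature construction described above can be sketched as follows. The function name `pair_features` is hypothetical, and plain Python lists stand in for the posterior mean vectors of the two sentences.

```python
def pair_features(u, v):
    """Concatenate elementwise products and absolute elementwise differences."""
    if len(u) != len(v):
        raise ValueError("sentence vectors must have the same dimensionality")
    products = [a * b for a, b in zip(u, v)]
    abs_diffs = [abs(a - b) for a, b in zip(u, v)]
    return products + abs_diffs


features = pair_features([1.0, -2.0], [3.0, 0.5])
```

Both components are symmetric in the two sentences, so the resulting feature vector does not depend on the order in which a pair is presented to the classifier.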
We present results in Table 9 and compare to several previous models for this task. Feats is the lexicalized baseline from [34]. $\textsc{rae}$ uses the recursive autoencoder from that work, and $\textsc{dp}$ adds their dynamic pooling step to calculate pairwise features. $\textsc{st}$ uses features from the unidirectional skip-thought model, bi-$\textsc{st}$ uses bidirectional skip-thought, and combine-$\textsc{st}$ uses the concatenation of those features. We also experimented with concatenating lexical features and the two types of distributed features.
We found that our features performed slightly worse than skip-thought features by themselves and slightly better than recursive autoencoder features, and were complementary and yielded strong performance when simply concatenated with the skip-thought features.
Question classification
We also conduct experiments on the TREC Question Classification dataset of [35]. Following [13], we train an $\ell_2$-regularized softmax classifier with 10-fold cross-validation to set the regularization. Note that using a linear classifier like this one may disadvantage our representations here, since the Gaussian distribution over hidden codes in a $\textsc{vae}$ is likely to discourage linear separability.
::: {caption="Table 10: Results for TREC Question Classification."}

:::
We present results in Table 10. Here, $\textsc{ae}$ is a plain sequence autoencoder. We compare with results from a bag of word vectors ($\textsc{cbow}$, [36]) and skip-thought ($\textsc{st}$). We also compare with an $\textsc{rnn}$ classifier ([36]) and a $\textsc{cnn}$ classifier ([37]) both of which, unlike our model, are optimized end-to-end. We were not able to make the $\textsc{vae}$ codes perform better than $\textsc{cbow}$ in this case, but they did outperform features from the sequence autoencoder. Skip-thought performed quite well, possibly because the skip-thought training objective of next sentence prediction is well aligned to this task: it essentially trains the model to generate sentences that address implicit open questions from the narrative of the book. Combining the two representations did not give any additional performance gain over the base skip-thought model.
Section Summary: The researchers fine-tuned the key settings of their models, known as hyperparameters, using an automated Bayesian optimization method on development set data. They ran each configuration for 10 hours, performing 12 experiments simultaneously, and selected the best options after 200 runs. The chosen hyperparameters for the language modeling experiments are detailed in Table 11.
::: {caption="Table 11: Automatically selected hyperparameter values used for the models used in the Penn Treebank language modeling experiments."}

:::
We extensively tune the hyperparameters of each model using an automatic Bayesian hyperparameter tuning algorithm (based on [38]) over development set data. We run the model with each set of hyperparameters for 10 hours, operating 12 experiments in parallel, and choose the best set of hyperparameters after 200 runs. The selected hyperparameters for our language modeling experiments are reported in Table 11.
::: {caption="Table 12: Selected homotopies between pairs of random points in the latent $\textsc{vae}$ space."}

:::
Section Summary: The section presents additional examples of homotopies from the model, illustrated in Table 12. These homotopies produce intermediate sentences that are usually grammatical and preserve consistent themes, word choices, and sentence structures in nearby sections while smoothly blending between the starting and ending sentences. Trained on fiction like romance novels, the model generates topics that are often dramatically intense.
Table 12 shows additional homotopies from our model. We observe that intermediate sentences are almost always grammatical, and often contain consistent topic, vocabulary and syntactic information in local neighborhoods as they interpolate between the endpoint sentences. Because the model is trained on fiction, including romance novels, the topics are often rather dramatic.
Section Summary: This section provides a bibliography of academic papers, mostly from 2011 to 2015, that explore advancements in artificial intelligence and machine learning. The references focus on neural networks for tasks like language modeling, machine translation, image captioning, and generative models, drawing from conferences such as NIPS, ICASSP, and ICML. These works by researchers like Yoshua Bengio and Quoc V. Le lay the groundwork for modern deep learning techniques in processing text and visuals.
[1] Tomáš Mikolov, Stefan Kombrink, Lukáš Burget, Jan Honza Černocký, and Sanjeev Khudanpur. 2011. Extensions of recurrent neural network language model. In Proc. ICASSP.
[2] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proc. NIPS.
[3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proc. ICLR.
[4] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proc. CVPR.
[5] Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan Yuille. 2015. Deep captioning with multimodal recurrent neural networks (m-RNN). In Proc. ICLR.
[6] Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2015. Long-term recurrent convolutional networks for visual recognition and description. In Proc. CVPR.
[7] Karol Gregor, Ivo Danihelka, Alex Graves, and Daan Wierstra. 2015. DRAW: A recurrent neural network for image generation. In Proc. ICML.
[8] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron Courville, and Yoshua Bengio. 2015. A recurrent latent variable model for sequential data. In Proc. NIPS.
[9] Diederik P. Kingma and Max Welling. 2015. Auto-encoding variational bayes. In Proc. ICLR.
[10] Danilo J. Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. In Proc. ICML.
[11] Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised sequence learning. In Proc. NIPS.
[12] Jiwei Li, Minh-Thang Luong, and Dan Jurafsky. 2015a. A hierarchical neural autoencoder for paragraphs and documents. In Proc. ACL.
[13] Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2015. Skip-thought vectors. arXiv preprint arXiv:1506.06726.
[14] Quoc V. Le and Tomáš Mikolov. 2014. Distributed representations of sentences and documents. In Proc. ICML.
[15] Tapani Raiko, Mathias Berglund, Guillaume Alain, and Laurent Dinh. 2015. Techniques for learning binary stochastic feedforward neural networks. In Proc. ICLR.
[16] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8).
[17] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Training very deep networks. In Proc. NIPS.
[18] Danilo J. Rezende and Shakir Mohamed. 2015. Variational inference with normalizing flows. In Proc. ICML.
[19] Otto Fabius and Joost R. van Amersfoort. 2014. Variational recurrent auto-encoders. arXiv preprint arXiv:1412.6581.
[20] Justin Bayer and Christian Osendorfer. 2015. Learning stochastic recurrent networks. arXiv preprint arXiv:1411.7610.
[21] Yishu Miao, Lei Yu, and Phil Blunsom. 2015. Neural variational inference for text processing. arXiv preprint arXiv:1511.06038.
[22] Lucas Theis and Matthias Bethge. 2015. Generative image modeling using spatial LSTMs. In Proc. NIPS.
[23] Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 19(2):313–330.
[24] Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proc. ACL.
[25] Ankit Kumar, Ozan Irsoy, Jonathan Su, James Bradbury, Robert English, Brian Pierce, Peter Ondruska, Ishaan Gulrajani, and Richard Socher. 2015. Ask me anything: Dynamic memory networks for natural language processing. arXiv preprint arXiv:1506.07285.
[26] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. JMLR 15(1):1929–1958.
[27] Julian Besag. 1986. On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society, Series B (Methodological) 48(3):259–302.
[28] Michael Kearns, Yishay Mansour, and Andrew Y. Ng. 1998. An information-theoretic analysis of hard and soft assignment methods for clustering. In Learning in Graphical Models, Springer, pages 495–520.
[29] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Proc. NIPS.
[30] Yujia Li, Kevin Swersky, and Richard Zemel. 2015b. Generative moment matching networks. In Proc. ICML.
[31] Emily Denton, Soumith Chintala, Arthur Szlam, and Rob Fergus. 2015. Deep generative image models using a Laplacian pyramid of adversarial networks. In Proc. NIPS.
[32] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. 2012. A kernel two-sample test. JMLR 13(1):723–773.
[33] Bill Dolan, Chris Quirk, and Chris Brockett. 2004. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics. Association for Computational Linguistics, page 350.
[34] Richard Socher, Eric H. Huang, Jeffrey Pennington, Christopher D. Manning, and Andrew Y. Ng. 2011. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information Processing Systems, pages 801–809.
[35] Xin Li and Dan Roth. 2002. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, Volume 1. Association for Computational Linguistics, pages 1–7.
[36] Han Zhao, Zhengdong Lu, and Pascal Poupart. 2015. Self-adaptive hierarchical sentence model. In Proc. IJCAI.
[37] Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proc. EMNLP.
[38] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. 2012. Practical Bayesian optimization of machine learning algorithms. In Proc. NIPS.