\documentclass[11pt]{article}
\usepackage{acl-hlt2011}
\usepackage{times}
\usepackage{url}
\usepackage{latexsym}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{array}
\usepackage{multirow}
\usepackage{rotating}
\usepackage{todonotes}
%\setlength\titlebox{6.5cm}
% You can expand the title box if you really have to
\DeclareMathOperator*{\argmax}{arg\,max}
\newcommand{\mnote}[1]{\marginpar{%
\vskip-\baselineskip
\raggedright\footnotesize
\itshape\hrule\smallskip\footnotesize{#1}\par\smallskip\hrule}}
\addtolength{\abovecaptionskip}{-0.25cm}
\title{Learning Sentential Paraphrases from Bilingual Parallel Corpora
\\ for Text-to-Text Generation}
\author{Juri Ganitkevitch, Chris Callison-Burch, Courtney
Napoles, \and Benjamin Van Durme\\
Center for Language and Speech Processing, and HLTCOE\\
Johns Hopkins University}
\date{}
\begin{document}
\maketitle
\begin{abstract}
Previous work has shown that high quality {\it phrasal} paraphrases
can be extracted from bilingual parallel corpora. However, it is
not clear whether bitexts are an appropriate resource for extracting
more sophisticated {\it sentential} paraphrases, which are more
obviously learnable from monolingual parallel corpora. We extend
bilingual paraphrase extraction to syntactic paraphrases and
demonstrate its ability to learn a variety of general paraphrastic
transformations, including passivization, dative shift,
and topicalization. We discuss how our model can be adapted to
many text generation tasks by augmenting its feature set,
development data, and parameter estimation routine. We illustrate
this adaptation by using our paraphrase model for the task of
sentence compression and achieve results competitive with
state-of-the-art compression systems.
\end{abstract}
\section{Introduction} \label{introduction}
Paraphrases are alternative ways of expressing the same information
\cite{Culicover1968}. Automatically generating and detecting
paraphrases is a crucial aspect of many NLP tasks. In multi-document
summarization, paraphrase detection is used to collapse redundancies
\cite{Barzilay1999,BarzilayThesis}. Paraphrase generation can be used
for query expansion in information retrieval and question answering
systems
\cite{mckeown:1979:ACL,Anick1999,Ravichandran2002,Riezler2007}.
Paraphrases allow for more flexible matching of system output
against human references for tasks like machine translation and
automatic summarization
\cite{Zhou2006b,Kauchak2006,Madnani2007,Snover2010}.\nocite{Owczarzak2006}
% Paraphrases are used to generate additional reference translations
% for statistical machine translation \cite{Madnani2007} and for more
% flexible matching when evaluating MT output or automatically
% generated summaries \cite{Zhou2006b,Kauchak2006,Owczarzak2006}.
Broadly, we can distinguish two forms of paraphrases: \emph{phrasal
paraphrases} denote a set of surface text forms with the same
meaning:
\begin{center}
\begin{tabular}{c}
the committee's second proposal \\
the second proposal of the committee
\end{tabular}
\end{center}
while \emph{syntactic paraphrases} augment the surface forms by
introducing nonterminals (or \emph{slots}) that are annotated with
syntactic constraints:
\begin{center}
\begin{tabular}{c}
the $\mathit{NP}_1$'s $\mathit{NP}_2$ \\
the $\mathit{NP}_2$ of the $\mathit{NP}_1$
\end{tabular}
\end{center}
It is evident that the latter have a much higher potential for
generalization and for capturing interesting paraphrastic transformations.
A variety of different types of corpora (and semantic equivalence
cues) have been used to automatically induce paraphrase collections
for English \cite{Madnani2010}. Perhaps the most natural type of
corpus for this task is a monolingual parallel text, which allows
sentential paraphrases to be extracted since the
sentence pairs in such corpora are perfect paraphrases of each other
\cite{Barzilay2001,Pang2003}. While rich syntactic paraphrases have
been learned from monolingual parallel corpora, such corpora are in
very limited supply, and the resulting paraphrases thus have poor
coverage.
Other methods obtain paraphrases from raw monolingual text by relying
on distributional similarity \cite{Lin2001,Bhagat2008}. While vast
amounts of data are readily available for these approaches, the
distributional similarity signal they use is noisier than the
sentence-level correspondence in parallel corpora and additionally
suffers from problems such as mistaking cousin expressions or antonyms
(such as $\{\mathit{boy}, \mathit{girl}\}$ or $\{\mathit{rise},
\mathit{fall}\}$) for paraphrases.
Abundantly available bilingual parallel corpora have been shown to
address both these issues, obtaining paraphrases via a pivoting step
over foreign language phrases \cite{Callison-Burch2005}. The coverage
of paraphrase lexica extracted from bitexts has been shown to
outperform that obtained from other sources \cite{Zhao2008b}. While
there have been efforts pursuing the extraction of more powerful
paraphrases
\cite{Madnani2007,Callison-Burch2008,cohn-lapata:2008,Zhao2008}, it is
not yet clear to what extent sentential paraphrases can be induced
from bitexts. In this paper we:
% Squeeze some space
%\begin{itemize}\addtolength{\itemsep}{-0.2\baselineskip}
\begin{itemize}
\item Extend the bilingual pivoting approach to paraphrase induction
to produce rich syntactic paraphrases.
\item Perform a thorough analysis of the types of paraphrases we
obtain and discuss the paraphrastic transformations we are capable
of capturing.
\item Describe how training paradigms for
syntactic/sentential paraphrase models should be tailored to
different text-to-text generation tasks.
\item Demonstrate our framework's suitability for a variety of
text-to-text generation tasks by obtaining state-of-the-art results
on the example task of sentence compression.
\end{itemize}
\section{Related Work} \label{related_work}
\newcite{Madnani2010} survey a variety of data-driven paraphrasing
techniques, categorizing them based on the type of data that they use.
These include large monolingual texts
\cite{Lin2001,szpektor-EtAl:2004:EMNLP,Bhagat2008}, comparable corpora
\cite{Barzilay2003,Dolan2004}, monolingual parallel corpora
\cite{Barzilay2001,Pang2003}, and bilingual parallel corpora
\cite{Callison-Burch2005,Madnani2007,Zhao2008}. We focus on the
latter type of data.
% Several research efforts have leveraged parallel monolingual
% corpora, however they jointly suffer from the scarcity and noisiness
% of parallel corpora. \newcite{Dolan2004} work around this issue by
% extracting parallel sentences from the vast amount of freely
% available comparable English text and apply machine translation
% techniques to create a paraphrasing system
% \cite{Quirk2004}. However, the word-based translation model and
% monotone decoder they use results in a substantial amount of
% identity paraphrases or single-word substitutions.
Paraphrase extraction using bilingual parallel corpora was proposed by
\newcite{Callison-Burch2005} who induced paraphrases using techniques
from {\it phrase-based} statistical machine translation
\cite{Koehn2003}. After extracting a bilingual phrase table, English
paraphrases are obtained by pivoting through foreign language
phrases.
% The phrase table contains phrase pairs $(e, f)$ (where the $e$ and
% $f$ stand for English and foreign phrases, respectively) as well as
% bi-directional
Since many paraphrases can be extracted for a phrase, Bannard and
Callison-Burch rank them using a paraphrase probability defined in
terms of the translation model probabilities $p(f | e)$ and $p(e |
f)$: \nocite{Callison-Burch2005}
\begin{eqnarray}
p(e_2|e_1) &=& \sum_f p(e_2,f|e_1)\\
&=& \sum_f p(e_2|f,e_1) p(f|e_1) \\
&\approx& \sum_f p(e_2|f) p(f|e_1) .
\label{paraphrase_prob_eqn}
\end{eqnarray}
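For concreteness, the pivoting computation can be sketched as follows;
the nested-dictionary representation of the phrase table and the
function name are illustrative rather than part of any particular
toolkit:
\begin{verbatim}
# p_f_e[e][f] = p(f|e), p_e_f[f][e] = p(e|f)
def paraphrase_probs(e1, p_f_e, p_e_f):
    scores = {}
    for f, pf in p_f_e.get(e1, {}).items():
        for e2, pe in p_e_f.get(f, {}).items():
            # p(e2|e1) ~= sum_f p(e2|f) p(f|e1)
            scores[e2] = scores.get(e2, 0.0) + pe * pf
    return scores
\end{verbatim}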
Several subsequent efforts extended the bilingual pivoting technique,
many of which introduced elements of more contemporary {\it
syntax-based} approaches to statistical machine translation.
\newcite{Madnani2007} extended the technique to {\it hierarchical}
phrase-based machine translation \cite{Chiang2005}, which is formally
a synchronous context-free grammar (SCFG) and thus can be thought of
as a {\it paraphrase grammar}. The paraphrase grammar can paraphrase
(or ``decode'') input sentences using an SCFG decoder, like the Hiero,
Joshua or cdec MT systems \cite{Chiang2007,Joshua-WMT,Dyer_etal_2010}.
Like Hiero, Madnani's model uses just one nonterminal
$X$ instead of linguistic nonterminals.
Three additional efforts incorporated linguistic
syntax. \newcite{Callison-Burch2008} introduced syntactic constraints
by labeling all phrases and paraphrases (even non-constituent phrases)
with CCG-inspired slash categories \cite{Steedman2011}, an approach similar to
\newcite{Zollmann2006}'s syntax-augmented machine translation
(SAMT). Callison-Burch did not formally define a synchronous grammar,
nor discuss decoding, since his presentation did not include
hierarchical rules. \newcite{cohn-lapata:2008} used the GHKM
extraction method \cite{Galley2004}, which is limited to constituent
phrases and thus produces a reasonably small set of syntactic rules.
\newcite{Zhao2008} added slots to bilingually extracted paraphrase
patterns that were labeled with part-of-speech tags, but not larger
syntactic constituents.
Before the shift to statistical natural language processing,
paraphrasing was often treated as syntactic transformations or by
parsing and then generating from a semantic representation
\cite{mckeown:1979:ACL,Muraki1982,Meteer1988,Shemtov1996,Yamamoto2002}.
Indeed, some work generated paraphrases using (non-probabilistic)
synchronous grammars
\cite{Shieber1990,Dras1997,Dras1999,Kozlowski2003}.
After the rise of statistical machine translation, a number of its
techniques were repurposed for paraphrasing. These include sentence
alignment \cite{Gale1993,Barzilay2003a}, word alignment and noisy
channel decoding \cite{Brown1990,Quirk2004}, phrase-based models
\cite{Koehn2003,Callison-Burch2005}, hierarchical phrase-based models
\cite{Chiang2005,Madnani2007}, log-linear models and minimum error
rate training \cite{Och2003c,Madnani2007,Zhao2008b}, and, in this
work, syntax-based machine translation
\cite{Wu1997,Yamada2001,Melamed2004,Quirk2005}.
Beyond cementing the ties between paraphrasing and syntax-based
statistical machine translation, the novel contributions of our paper
are (1) an in-depth analysis of the types of structural and sentential
paraphrases that can be extracted with bilingual pivoting, (2) a
discussion of how our English--English paraphrase grammar should be
adapted to specific text-to-text generation tasks
\cite{zhao-EtAl:2009:ACLIJCNLP2} with (3) a concrete example of the
adaptation procedure for the task of paraphrase-based sentence
compression \cite{KnightMarcuAI02,cohn-lapata:2008,Cohn2009}.
% Relying on small data sets of semantically equivalent translations,
% \newcite{Pang2003} created finite state automata by syntax-aligning
% parallel sentences, enabling the generation of additional reference
% translations.
% Both \newcite{Barzilay2001} and \newcite{Ibrahim2003} sentence-align
% existing noisy parallel monolingual corpora such as translations of
% the same novels. While \newcite{Ibrahim2003} employ a set of
% heuristics that rely on anchor words identified by textual identity
% or matchin liguistic features such as gender, number or semantic
% class, \newcite{Barzilay2001} use a co-training approach that
% leverages context similarity to identify viable paraphrases.
% Semantic parallelism is well-established as a stong basis for the
% extraction of correspondencies such as paraphrases. However, there
% are notable efforts that choose to forgo it in favor of clustering
% approaches based on distributional characteristics. The well-known
% DIRT method by \newcite{Lin2001} fully relies on distributional
% similarity features for paraphrase extraction. Patterns extracted
% from paths in dependency graphs are clustered based on the
% similarity of the observed contents of their slots.
% Similarly, \newcite{Bhagat2008} argue that vast amounts of text can
% be leveraged to make up for the relative weakness of distributional
% features compared to parallelism. They also forgo complex
% annotations such as syntactic or dependency parses, relying only on
% part-of-speech tags to inform their approach. In their work,
% relations are learned by finding pattern clusters initially seeded
% by already known patterns. However, this method is not capable of
% producing syntactic paraphrases. \mnote{Need better tie-in with the
% overall theme of structural paraphrases.}
\section{SCFGs in Translation} \label{formalism}
The model we use in our paraphrasing approach is a syntactically
informed \emph{synchronous context-free grammar} (SCFG). The SCFG
formalism \cite{Aho1972} was repopularized for statistical machine
translation by \newcite{Chiang2005}. Formally, a \emph{probabilistic}
SCFG $\mathcal{G}$ is defined by specifying
\[
\mathcal{G} = \langle \mathcal{N}, \mathcal{T}_S, \mathcal{T}_T,
\mathcal{R}, S \rangle ,
\]
where $\mathcal{N}$ is a set of nonterminal symbols, $\mathcal{T}_S$
and $\mathcal{T}_T$ are the source and target language vocabularies,
$\mathcal{R}$ is a set of rules and $S \in \mathcal{N}$ is the root
symbol. The rules in $\mathcal{R}$ take the form:
\begin{equation*}
C \rightarrow \langle \gamma, \alpha, \sim, w \rangle ,
\end{equation*}
where the rule's left-hand side $C \in \mathcal{N}$ is a nonterminal,
$\gamma \in (\mathcal{N} \cup \mathcal{T}_S)^*$ and $\alpha \in
(\mathcal{N} \cup \mathcal{T}_T)^*$ are strings of terminal and
nonterminal symbols with an equal number of nonterminals
$c_{\mathit{NT}}(\gamma) = c_{\mathit{NT}}(\alpha)$ and
$$
\sim : \{1 \ldots c_{\mathit{NT}}(\gamma)\} \rightarrow \{1 \ldots
c_{\mathit{NT}}(\alpha)\}
$$
constitutes a one-to-one correspondence function between the
nonterminals in $\gamma$ and $\alpha$. A non-negative weight $w \geq
0$ is assigned to each rule, reflecting the likelihood of the rule.
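For concreteness, a rule can be represented by a simple record; the
following sketch (field names are purely illustrative and not part of
any particular toolkit) mirrors the definition above:
\begin{verbatim}
# Illustrative record for an SCFG rule:
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class Rule:
    lhs: str                # C
    src: Tuple[str, ...]    # gamma
    tgt: Tuple[str, ...]    # alpha
    corr: Dict[int, int]    # ~ (one-to-one)
    weight: float           # w >= 0
\end{verbatim}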
\begin{figure}[t]
\begin{center}
\includegraphics[width=0.99\linewidth]{figures/example_extraction-3a.pdf}
\end{center}
\caption{Synchronous grammar rules for translation are extracted from
sentence pairs in a bitext that have been automatically parsed and
word-aligned. Extraction methods vary in whether they extract only
minimal rules for phrases dominated by nodes in the parse tree, or
more complex rules that include non-constituent phrases.}
\label{example_extraction}
\end{figure}
\paragraph{Rule Extraction}
Phrase-based approaches to statistical machine translation (and their
successors) extract pairs of $(e, f)$ phrases from automatically
word-aligned parallel sentences. \newcite{Och2003}
% , \newcite{Koehn2004}, \newcite{Tillmann2003}, and
% \newcite{Vogel2003}
described various heuristics for extracting phrase alignments from the
Viterbi word-level alignments that are estimated using
\newcite{Brown1993} word-alignment models.
These phrase extraction heuristics have been extended so that they
extract synchronous grammar rules
\cite{Galley2004,Chiang2005,Zollmann2006,Liu2006}. Most of these
extraction methods require that one side of the parallel corpus be
parsed. This is typically done automatically with a statistical
parser.
Figure~\ref{example_extraction} shows examples of rules obtained from
a sentence pair. To extract a rule, we first choose a source side span
$f$ like {\it das leck}. Then we use phrase extraction techniques to
find target spans $e$ that are consistent with the word alignment (in
this case {\it the leak} is consistent with our $f$). The nonterminal
symbol that is the left-hand side of the SCFG rule is then determined
by the syntactic constituent that dominates $e$ (in this case {\it
NP}). To introduce nonterminals into the right-hand side of the rule,
we can apply rules extracted over sub-phrases of $f$, synchronously
substituting the corresponding nonterminal symbol for the sub-phrases
on both sides. The synchronous substitution applied to $f$ and $e$
then yields the correspondence $\sim$.
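The phrase extraction step relies on the standard consistency
criterion: no alignment link may connect a word inside the candidate
phrase pair to a word outside it. A minimal sketch of this check (the
index conventions and function name are illustrative):
\begin{verbatim}
# Spans (f_lo,f_hi) and (e_lo,e_hi) are consistent
# with a word alignment given as (fi, ej) pairs.
def consistent(f_lo, f_hi, e_lo, e_hi, links):
    inside = False
    for fi, ej in links:
        in_f = f_lo <= fi <= f_hi
        in_e = e_lo <= ej <= e_hi
        if in_f != in_e:
            return False  # link crosses the boundary
        if in_f and in_e:
            inside = True
    return inside
\end{verbatim}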
One significant differentiating factor between the competing ways of
extracting SCFG rules is whether the extraction method generates rules
only for constituent phrases that are dominated by a node in the parse
tree \cite{Galley2004,cohn-lapata:2008} or whether they include
arbitrary phrases, including non-constituent phrases
\cite{Zollmann2006,Callison-Burch2008}. We adopt the latter approach,
extracting rules for all phrases, including non-constituents, since it
allows us to cover a much greater set of phrases, both in translation
and paraphrasing.
% However, in doing so we can only assign left-hand side labels to a
% small portion of the possible phrases in a sentence pair. To allow
% for broader coverage, we rely on the labeling method introduced by :
% in addition to single constituent nonterminals, we allow for the
% concatenation of constituents as well as for CCG-style slashed
% constituents \cite{Steedman1999}.
\paragraph{Feature Functions}
Rather than assigning a single weight $w$, we define a set of feature
functions $\vec{\varphi} = \{\varphi_1 ... \varphi_N \}$ that are combined in a
log-linear model:
\begin{equation}
w = -\sum_{i=1}^N \lambda_i \log \varphi_i .
\end{equation}
The weights $\vec{\lambda}$ of these feature functions are set to
maximize some objective function like \textsc{Bleu}
\cite{Papineni2002} using a procedure called minimum error rate
training (MERT), introduced by \newcite{Och2003c}. MERT iteratively
adjusts the weights until the decoder produces output that best
matches reference translations in a development set, according to the
objective function. We will examine appropriate objective functions
for text-to-text generation tasks in Section \ref{objective-fn}.
Typical features used in statistical machine translation include
phrase translation probabilities (calculated using maximum likelihood
estimation over all phrase pairs enumerable in the parallel
corpus), word-for-word lexical translation probabilities (which help
to smooth sparser phrase translation estimates), a ``rule application
penalty'' (which governs whether the system prefers fewer longer phrases or a
greater number of shorter phrases), and a language model probability.
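As a sketch, combining such features into a single rule weight under
the log-linear model amounts to the following (the function and
argument names are illustrative):
\begin{verbatim}
import math

# Log-linear rule weight, as in the equation
# above; feature values must be positive.
def rule_weight(phis, lambdas):
    return -sum(l * math.log(p)
                for l, p in zip(lambdas, phis))
\end{verbatim}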
% \mnote{TODO: list out typical feature functions -- they can be less
% formal than this}
%\begin{itemize}
%\item Phrase translation probabilities $\varphi_{\mathit{phrase}}(e |
% f)$ and $\varphi_{\mathit{phrase}}(f | e)$ which are computed using
% maximum likelihood estimation over all phrase pairs that are
% extracted from the parallel corpus.
%\item Lexical translation probabilities
% \[ \varphi_{\mathit{lex}} = \langle -\log p_{\mathit{lex}}(e | f),
% -\log p_{\mathit{lex}}(f | e)\rangle \] \mnote{TODO: describe how
% the left hand side figures into the FFs}
%\item Count features $c_{\mathit{src}}$ and $c_{\mathit{tgt}}$
% indicating the number of words on either side of the rule as well
% as two difference features, $c_{\mathit{dcount}} = c_{\mathit{tgt}}
% - c_{\mathit{src}}$ and the analoguosly computed difference in the
% average word length in characters, $c_{\mathit{davg}}$.
%\item Language model probability
%\item A length penalty
%\item An incidator for when a rule only contains terminal symbols
% ($\delta_{\mathit{lex}}$) and an indicator for when the source side
% contains terminals, but the target side does not
% ($\delta_{\mathit{del}}$).
%\item Indicators for whether the rule swaps the order of two
% nonterminals ($\delta_{\mathit{reorder}}$) and whether the two
% nonterminals are adjacent ($\delta_{\mathit{adj}}$).
%\item A rarity penalty $\varphi_{\mathit{rarity}} =
% e^{(1-c_{\mathit{rule}})}$ that quantifies the doubt we may place
% in a rule based on how often we have encountered it in the
% corpus\footnote{Since we do not have an immediate rule count for a
% paraphrase rule $N \rightarrow e_1 | e_2$, we instead estimate
% its rarity penalty as $\varphi_{\mathit{rarity}}(N \rightarrow
% e_1 | e_2) = \max_{f} \min \{\varphi_{\mathit{rarity}}(N
% \rightarrow e_1 | f), \varphi_{\mathit{rarity}}(N \rightarrow f |
% e_2) \}$}.
%\end{itemize}
\paragraph{Decoding}
Given an SCFG and an input source sentence, the decoder performs a
search for the single most probable derivation via the CKY algorithm.
In principle the best translation should be the English sentence $e$
that is the most probable after summing over all $d\in D$ derivations,
since many derivations yield the same $e$. In practice, we use a
Viterbi approximation and return the translation that is the yield of
the single best derivation:
\begin{eqnarray}
\label{decoding_eq}
\hat{e} &=& \arg \max_{e\in Trans(f)} \sum_{d\in D(e,f)}{p(d,e|f)}
\nonumber \\
&\approx& \mathit{yield}(\arg \max_{d\in D(e,f)}{p(d,e|f)}).
\end{eqnarray}
Derivations are simply successive applications of the SCFG rules such
as those given in Figure \ref{example_translation}.
%\small
%\begin{tabular}{l}
%$\mathit{NP/NN}$ $\rightarrow$ dem rest des $|$ the rest of the\\
%$\mathit{VB+JJ}$ $\rightarrow$ gef\"{a}hrlich werden $|$ be dangerous\\
%$\mathit{NP}$ $\rightarrow$ $\mathit{NP/NN}$ dorfes $|$ $\mathit{NP/NN}$ village\\
%$\mathit{VP/PP}$ $\rightarrow$ nicht $\mathit{VB+JJ}$ k\"{o}nnen $|$ can not $\mathit{VB+JJ}$\\
%$\mathit{S}$ $\rightarrow$ sie $\mathit{NP}$ $\mathit{VP/PP}$ $|$ they $\mathit{VP/PP}$ to $\mathit{NP}$\\
%\end{tabular}{l}
%\normalsize
\begin{figure}[t]
\begin{center}
\includegraphics[width=0.9\linewidth]{figures/example_translation-2a.pdf}
\end{center}
\caption{An example derivation produced by a syntactic machine
translation system. Although the synchronous trees are unlike the
derivations found in the Penn Treebank, their yield is a good
translation of the German.}
\label{example_translation}
\end{figure}
\section{SCFGs in Paraphrasing} \label{acquisition}
\paragraph{Rule Extraction}
To create a paraphrase grammar from a translation grammar, we extend
the syntactically informed pivot approach of
\newcite{Callison-Burch2008} to the SCFG model. For this purpose, we
assume a grammar that translates from a given foreign language to
English. For each pair of translation rules where the left-hand side
$C$ and foreign string $\gamma$ match:
\begin{eqnarray*}
C \rightarrow \langle \gamma, \alpha_1, \sim_1, \vec{\varphi}_1 \rangle \\
C \rightarrow \langle \gamma, \alpha_2, \sim_2, \vec{\varphi}_2 \rangle ,
\end{eqnarray*}
we create a paraphrase rule:
\begin{equation*}
C \rightarrow \langle \alpha_1, \alpha_2, \sim, \vec{\varphi} \rangle ,
\end{equation*}
where the nonterminal correspondence relation $\sim$ has been set to
reflect the combined nonterminal alignment:
\begin{equation*}
\sim ~ = ~ \sim_1^{-1} \circ \sim_2 .
\end{equation*}
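The pivoting step itself is straightforward; the following sketch
iterates over translation rules grouped by their left-hand side and
source string and composes the nonterminal alignments (the data
structures are illustrative):
\begin{verbatim}
# rules_by_src[(C, gamma)] is a list of
# (alpha, align) pairs; align maps gamma
# nonterminal indices to alpha indices.
def pivot(rules_by_src):
    out = []
    for (lhs, g), entries in rules_by_src.items():
        for alpha1, a1 in entries:
            for alpha2, a2 in entries:
                # ~ = ~1^{-1} o ~2
                corr = {a1[i]: a2[i] for i in a1}
                out.append((lhs, alpha1, alpha2, corr))
    return out
\end{verbatim}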
\paragraph{Feature Functions}
In the computation of the features $\vec{\varphi}$ from
$\vec{\varphi}_1$ and $\vec{\varphi}_2$ we follow the approximation in
Equation~\ref{paraphrase_prob_eqn}, which yields lexical and phrasal
paraphrase probability features. Additionally, we add a boolean
indicator for whether the rule is an identity paraphrase,
$\delta_{\mathit{identity}}$. Another indicator feature,
$\delta_{\mathit{reorder}}$, fires if the rule swaps the order of two
nonterminals, which enables us to promote more complex paraphrases
that require structural reordering.
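Both indicators can be computed directly from a pivoted rule, as in
this illustrative sketch:
\begin{verbatim}
# corr maps nonterminal positions in alpha1
# to nonterminal positions in alpha2.
def indicators(alpha1, alpha2, corr):
    identity = float(alpha1 == alpha2)
    reorder = float(any(corr[i] > corr[j]
                        for i in corr for j in corr
                        if i < j))
    return {"identity": identity,
            "reorder": reorder}
\end{verbatim}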
\paragraph{Decoding}
With this, paraphrasing becomes an English-to-English translation
problem which can be formulated similarly to
Equation~\ref{decoding_eq} as:
\begin{equation*}
\hat{e_2} \approx \mathit{yield}(\arg \max_{d\in
D(e_2,e_1)}{p(d,e_2|e_1)}).
\end{equation*}
Figure~\ref{paraphrase_derivaton}
shows an example derivation produced as a result of applying our
paraphrase rules in the decoding process. Another advantage of using
the decoder from statistical machine translation is that n-gram
language models, which have been shown to be useful in natural
language generation \cite{Langkilde1998}, are already well integrated
\cite{Huang2007}.
\begin{figure}[!t]
\begin{center}
\includegraphics[width=0.99\linewidth]{figures/example_compression_1col_2a.pdf}
\end{center}
\caption{An example of a synchronous paraphrastic derivation. A few of
the rules applied in the parse are shown in the left column, with the
pivot phrases that gave rise to them on the
right.}\label{paraphrase_derivaton}
\end{figure}
\section{Analysis} \label{analysis}
\begin{table*}[!ht]
\begin{center}
\small
\begin{tabular}{|c|rrcl|}
\hline
\multirow{2}{*}{Possessive rule} & $\mathit{NP}$ $\rightarrow$ & the
$\mathit{NN}$ of the $\mathit{NNP}$ & $\mid$ & the
$\mathit{NNP}$'s $\mathit{NN}$ \\
& $\mathit{NP}$ $\rightarrow$ & the $\mathit{NNS}_1$ made by
$\mathit{NNS}_2$ & $\mid$ & the $\mathit{NNS}_2$'s
$\mathit{NNS}_1$ \\
\hline
\multirow{2}{*}{Dative shift} & $\mathit{VP}$ $\rightarrow$ & give
$\mathit{NN}$ to $\mathit{NP}$ & $\mid$ & give $\mathit{NP}$ the
$\mathit{NN}$ \\
& $\mathit{VP}$ $\rightarrow$ & provide $\mathit{NP}_1$ to
$\mathit{NP}_2$ & $\mid$ & give $\mathit{NP}_2$
$\mathit{NP}_1$ \\
\hline
\hline
\multirow{2}{*}{Adv./adj. phrase move} &
$\mathit{S/VP}$ $\rightarrow$ & $\mathit{ADVP}$ they $\mathit{VBP}$
& $\mid$ & they $\mathit{VBP}$ $\mathit{ADVP}$ \\
& $\mathit{S}$ $\rightarrow$ & it is $\mathit{ADJP}$ $\mathit{VP}$
& $\mid$ & $\mathit{VP}$ is $\mathit{ADJP}$ \\
\hline
Verb particle shift &
$\mathit{VP}$ $\rightarrow$ & $\mathit{VB}$ $\mathit{NP}$ up &
$\mid$ & $\mathit{VB}$ up $\mathit{NP}$ \\
\hline
\multirow{2}{*}{Reduced relative clause} & $\mathit{SBAR/S}$ $\rightarrow$ &
although $\mathit{PRP}$ $\mathit{VBP}$ that & $\mid$ &although
$\mathit{PRP}$ $\mathit{VBP}$ \\
& $\mathit{ADJP}$ $\rightarrow$ &
very $\mathit{JJ}$ that $\mathit{S}$ & $\mid$ & $\mathit{JJ}$ $\mathit{S}$ \\
\hline
\multirow{2}{*}{Partitive constructions} &
$\mathit{NP}$ $\rightarrow$ & $\mathit{CD}$ of the $\mathit{NN}$
& $\mid$ & $\mathit{CD}$ $\mathit{NN}$ \\
& $\mathit{NP}$ $\rightarrow$ & all $\mathit{DT\backslash NP}$
& $\mid$ & all of the $\mathit{DT\backslash NP}$ \\
\hline
Topicalization & $\mathit{S}$ $\rightarrow$ & $\mathit{NP}$,
$\mathit{VP}$ . & $\mid$ & $\mathit{VP}$, $\mathit{NP}$ . \\
\hline
\hline
Passivization &
$\mathit{SBAR}$ $\rightarrow$ & that $\mathit{NP}$ had
$\mathit{VBN}$ & $\mid$ & which was $\mathit{VBN}$ by $\mathit{NP}$ \\
\hline
\multirow{2}{*}{Light verbs} & $\mathit{VP}$ $\rightarrow$ & take action $\mathit{ADVP}$ &
$\mid$ & to act $\mathit{ADVP}$ \\
& $\mathit{VP}$ $\rightarrow$ & $\mathit{TO}$ take a decision $\mathit{PP}$ &
$\mid$ & $\mathit{TO}$ decide $\mathit{PP}$ \\
\hline
\end{tabular}
\normalsize
\end{center}
\caption{A selection of meaning-preserving transformations and
hand-picked examples of syntactic paraphrases that our system
extracts to capture them.}
\label{example_rules}
\end{table*}
A key motivation for the use of syntactic paraphrases over their
phrasal counterparts is their potential to capture meaning-preserving
linguistic transformations in a more general fashion. A phrasal system
is limited to memorizing fully lexicalized transformations in its
paraphrase table, resulting in poor generalization capabilities. By
contrast, a syntactic paraphrasing system intuitively should be able
to address this issue and learn well-formed and generic patterns that
can be easily applied to unseen data.
To put this expectation to the test, we investigate how our grammar
captures a number of well-known paraphrastic
transformations.\footnote{The data and software used to extract the
grammar we draw these examples from is described in
Section~\ref{setup}.} Table~\ref{example_rules} shows the
transformations along with examples of the generic grammar rules our
system learns to represent them. When given a transformation to
extract a syntactic paraphrase for, we want to find rules that neither
under- nor over-generalize. Ideally, a rule replaces as many syntactic
arguments as possible with nonterminals while retaining enough
lexicalization to provide sufficient evidence for the applicability of
the transformation, and it imposes constraints on the nonterminals to
ensure that the arguments are well formed.
The paraphrases implementing the \emph{possessive rule} and the
\emph{dative shift} shown in Table~\ref{example_rules} are good
examples of this: the two noun-phrase arguments to the expressions are
abstracted to nonterminals while each rule's lexicalization provides
an appropriate frame of evidence for the transform. This is important
for a good representation of dative shift, which is a reordering
transformation that fully applies to certain ditransitive verbs while
other verbs are uncommon in one of the forms:
\begin{center}
\begin{tabular}{l}
give \emph{decontamination equipment} to \emph{Japan} \\
give \emph{Japan} \emph{decontamination equipment} \\
\vspace{-10pt}\\
provide \emph{decontamination equipment} to \emph{Japan} \\
? provide \emph{Japan} \emph{decontamination equipment} \\
\end{tabular}
\end{center}
Note how our system extracts a dative shift rule for \emph{to give}
and a rule that both shifts and substitutes a more appropriate verb
for \emph{to provide}.
The use of syntactic nonterminals in our paraphrase rules to capture
complex transforms also makes it possible to impose constraints on
their application. For comparison, as \newcite{Madnani2007} do not
impose any constraints on how the nonterminal $X$ can be realized,
their equivalent of the \emph{topicalization} rule would massively
overgeneralize:
\begin{center}
\begin{tabular}{rrcl}
$\mathit{S}$ $\rightarrow$ & $\mathit{X}_1$,
$\mathit{X}_2$ . & $\mid$ & $\mathit{X}_2$, $\mathit{X}_1$ . \\
\end{tabular}
\end{center}
Additional examples of transforms our use of syntax allows us to
capture are the \emph{adverbial phrase shift} and the \emph{reduction
of a relative clause}, as well as other phenomena listed in
Table~\ref{example_rules}.
Unsurprisingly, syntactic information alone is not sufficient to
capture all transformations. For instance, it is hard to extract
generic paraphrases for all instances of \emph{passivization}, since
our syntactic model currently has no means of representing the
morphological changes that the verb undergoes:
\begin{center}
\begin{tabular}{l}
the reactor \emph{leaks} radiation \\
radiation \emph{is leaking} from the reactor .\\
\end{tabular}
\end{center}
Still, for cases where the verb's morphology does not change, we
manage to learn a rule:
\begin{center}
\begin{tabular}{l}
the radiation that the reactor had \emph{leaked} \\
the radiation which \emph{leaked} from the reactor .\\
\end{tabular}
\end{center}
Another example of a deficiency in our synchronous grammar model is
its handling of \emph{light verb} constructions such as:
\begin{center}
\begin{tabular}{l}
to take a \emph{walk} \\
to \emph{walk} .
\end{tabular}
\end{center}
Here, a noun is transformed into the corresponding verb -- something
our synchronous syntactic CFGs are not able to capture except
through memorization.
Our survey shows that we are able to extract appropriately generic representations for a wide range of paraphrastic transformations. This is a surprising result which shows that bilingual parallel corpora can be used to learn sentential paraphrases, and that they are a viable alternative to other data sources like monolingual parallel corpora, which more obviously contain sentential paraphrases, but are scarce.
\section{Text-to-Text Applications} \label{adaptation}
The core of many text-to-text generation tasks is sentential
paraphrasing, augmented with specific constraints or goals. Since our
model borrows much of its machinery from statistical machine
translation -- a sentential rewriting problem itself -- it is
straightforward to use our paraphrase grammars to generate new
sentences using SMT's decoding and parameter optimization
techniques. Our framework can be adapted to many different
text-to-text generation tasks. These could include text
simplification, sentence compression, poetry generation, query
expansion, transforming declarative sentences into questions, and
deriving hypotheses for textual entailment. Each individual
text-to-text application requires that our framework be adapted in
several ways, by specifying:
%squeeze the equation onto this page by compressing some stuff
%\begin{itemize}\addtolength{\itemsep}{-0.2\baselineskip}
\begin{itemize}
\item A mechanism for extracting synchronous grammar rules (in this
paper we argue that pivot-based paraphrasing is widely applicable).
\item An appropriate set of rule-level features that capture
information pertinent to the task (e.g.\ whether a rule simplifies a
phrase).
\item An appropriate ``objective function'' that scores the output of
the model, i.e.\ a task-specific equivalent to the \textsc{Bleu}
metric in SMT.
\item A development set with examples of the sentential
transformations that we are modeling.
\item Optionally, a way of injecting task-specific rules that were not
extracted automatically.
\end{itemize}
In the remainder of this section, we illustrate how our bilingually
extracted paraphrases can be adapted to perform sentence compression,
which is the task of reducing the length of a sentence while preserving
its core meaning. Most previous approaches to sentence compression
focused only on the deletion of a subset of words from the sentence
\cite{KnightMarcuAI02}. Our approach follows
\newcite{cohn-lapata:2008}, who expand the task to include
substitutions, insertions and reorderings that are automatically
learned from parallel texts.
% squeeze a little space
\vspace{-.2cm}
\subsection{Feature Design}\label{feat-design}
In Section~\ref{acquisition} we discussed phrasal
probabilities. While these help quantify how good a paraphrase is in
general, they do not capture task-specific properties such as changes
in language complexity or text length. To make this
information available to the decoder, we enhance our paraphrases with
four compression-targeted features. We add the count features
$c_{\mathit{src}}$ and $c_{\mathit{tgt}}$, indicating the number of
words on either side of the rule as well as two difference features:
$c_{\mathit{dcount}} = c_{\mathit{tgt}} - c_{\mathit{src}}$ and the
analogously computed difference in the average word length in
characters, $c_{\mathit{davg}}$.
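Concretely, the four features can be computed from a rule's terminal
strings as follows (a sketch; the function and feature names are
illustrative):
\begin{verbatim}
def compression_features(src_words, tgt_words):
    def avg_len(ws):
        return (sum(len(w) for w in ws) / len(ws)
                if ws else 0.0)
    c_src, c_tgt = len(src_words), len(tgt_words)
    return {"c_src": c_src,
            "c_tgt": c_tgt,
            "c_dcount": c_tgt - c_src,
            "c_davg": avg_len(tgt_words)
                      - avg_len(src_words)}
\end{verbatim}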
% squeeze a little space
\vspace{-.2cm}
\subsection{Objective Function}
\label{objective-fn}
Given our paraphrasing system's connection to SMT, the obvious choice for parameter optimization would be to optimize for \textsc{Bleu} over a set of reference paraphrases, for instance parallel English reference translations for a machine translation task \cite{Madnani2007}.
For a candidate $C$ and a reference $R$ (with
lengths $c$ and $r$), \textsc{Bleu} is defined as:\vspace{-0.2cm}
\begin{eqnarray}
\lefteqn{ \textsc{Bleu}_{N} (C,R) } \nonumber \\
&=& \left\{ \begin{array}{ll}
e^{(1 - r/c)} \cdot
e^{\sum_{n=1}^N w_n \log p_n} & \textrm{if
$c/r \leq 1$} \nonumber \\
e^{\sum_{n=1}^N w_n \log p_n} & \textrm{otherwise}
\end{array} \right. ,
\end{eqnarray}
where $p_n$ is the modified $n$-gram precision of $C$ against $R$,
with typically $N=4$ and $w_n=\frac{1}{N}$. The ``brevity penalty''
term $e^{(1 - r/c)}$ is added to prevent short candidates
from achieving perfect scores.
Naively optimizing for \textsc{Bleu}, however, will result in a trivial paraphrasing system heavily biased towards producing identity ``paraphrases''. This is obviously not what we are looking for. Moreover, \textsc{Bleu} does not provide a mechanism for directly specifying a per-sentence compression rate, which is desirable for the compression task.
% \footnote{\newcite{Madnani2007} avoid this issue by disallowing
% identity paraphrases in their grammar. While this is an
% appropriate choice for their task, for many applications retaining
% the capability to self-paraphrase where appropriate is crucial.}.
Instead, we propose \textsc{Pr\'ecis}, an objective function tailored
to the text compression task: \vspace{-0.2cm}
\begin{eqnarray}
\lefteqn{ \textsc{Pr\'ecis}_{\lambda,\varphi} (I, C, R) } \nonumber \\
&=& \left\{ \begin{array}{ll}
e^{\lambda (\varphi - c/i)} \cdot \textsc{Bleu}(C, R) & \textrm{if
$c/i \geq \varphi$} \nonumber \\
\textsc{Bleu}(C, R) & \textrm{otherwise}
\end{array} \right.
\end{eqnarray}
For an input sentence $I$, an output $C$ and reference compression $R$
(with lengths $i$, $c$ and $r$), \textsc{Pr\'ecis} combines the
precision estimate of \textsc{Bleu} with an additional ``verbosity penalty'' that is
applied to compressions that fail to meet a given target compression
rate $\varphi$. We rely on the \textsc{Bleu} brevity penalty to
prevent the system from producing overly aggressive compressions. The
scaling term $\lambda$ determines how severely we penalize deviations
from $\varphi$. In our experiments we use $\lambda = 10$.
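A sketch of \textsc{Pr\'ecis} as a sentence-level score, assuming
some sentence-level \textsc{Bleu} implementation \texttt{bleu(C, R)}
and token lists whose lengths give $c$ and $i$:
\begin{verbatim}
import math

def precis(i_len, cand, ref, bleu,
           target_cr=0.5, lam=10.0):
    score = bleu(cand, ref)
    cr = len(cand) / float(i_len)
    if cr >= target_cr:
        # verbosity penalty for overly long output
        score *= math.exp(lam * (target_cr - cr))
    return score
\end{verbatim}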
It is straightforward to find similar adaptations for other tasks. For
text simplification, for instance, the penalty term can include a
readability metric. For poetry generation we can analogously penalize
outputs that break the meter \cite{Greene2010}.
% squeeze a little space
\vspace{-.2cm}
\subsection{Development Data}
\label{dev-data}
To tune the parameters of our paraphrase system for sentence
compression, we need an appropriate corpus of reference
compressions. Since our model is designed to compress by paraphrasing
rather than deletion, the commonly used deletion-based compression
data sets like the Ziff-Davis corpus are not suitable. We have thus
created a corpus of compression paraphrases. Beginning with 9570
tuples of parallel English--English sentences obtained from multiple
reference translations for machine translation evaluation, we
construct a parallel compression corpus by selecting the longest
reference in each tuple as the source sentence and the shortest
reference as the target sentence. We further retain only those
sentence pairs where the compression rate $\mathit{cr}$ falls in the
range $0.5 < \mathit{cr} \leq 0.8$. From these, we randomly select 936
sentences for the development set, as well as 560 sentences for a test
set that we use to gauge the performance of our system.
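The construction of this parallel compression corpus can be sketched
as follows (the tuple representation and whitespace tokenization are
illustrative):
\begin{verbatim}
def build_pairs(ref_tuples, lo=0.5, hi=0.8):
    pairs = []
    for refs in ref_tuples:
        src = max(refs, key=lambda s: len(s.split()))
        tgt = min(refs, key=lambda s: len(s.split()))
        cr = (len(tgt.split())
              / float(len(src.split())))
        if lo < cr <= hi:
            pairs.append((src, tgt))
    return pairs
\end{verbatim}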
% squeeze a little space
\vspace{-.2cm}
\subsection{Grammar Augmentations} \label{injection}
As we discussed in Section~\ref{analysis}, the paraphrase grammar we
induce is capable of representing a wide variety of
transformations. However, the formalism and extraction method are not
explicitly geared towards a compression application. For instance,
the synchronous nature of our grammar does not allow us to perform
deletions of constituents as done by \newcite{Cohn2007}'s tree
transducers. One way to extend the grammar's capabilities
towards the requirements of a given task is by injecting additional
rules designed to capture appropriate operations.
For the compression task, this could include adding rules to delete target-side nonterminals:
\begin{center}
\begin{tabular}{c}
$\mathit{JJ} \rightarrow$ $\mathit{JJ} \mid \varepsilon$ \\
\end{tabular}
\end{center}
This would render the grammar asynchronous and require
adjustments to the decoding process. Alternatively, we can
generate rules that specifically delete particular adjectives from the
corpus:
\begin{center}
\begin{tabular}{c}
$\mathit{JJ} \rightarrow$ superfluous $\mid \varepsilon$ .\\
\end{tabular}
\end{center}
In our experiments we evaluate the latter approach by generating optional deletion rules for all adjectives,
adverbs and determiners.
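Generating such optional deletion rules from a part-of-speech tagged
corpus is simple; the following sketch (the tag set and rule
representation are illustrative) covers the adjective, adverb, and
determiner cases we evaluate:
\begin{verbatim}
DELETABLE = {"JJ", "RB", "DT"}

def deletion_rules(tagged_sentences):
    rules = set()
    for sent in tagged_sentences:
        for word, tag in sent:
            if tag in DELETABLE:
                # e.g.  JJ -> superfluous | <eps>
                rules.add((tag, word, ""))
    return rules
\end{verbatim}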
% In our system, we implemented the injection of rules that improve the
% handling of out-of-vocabulary words. Typically an SMT system will
% generate heavily penalized identity translation rules for all words in
% an input sentence, to assure that the OOV rules will only apply when
% there is no translation rule for a particular word. For our
% paraphrasing approach however, encountering an unknown word is not
% critical as long as we can assign it the proper syntactic label. To
% enable our system to process unseen words in the input, we allow for
% syntactically parsed input and for every input word $w$ include rules
% of the form $\mathit{T} \rightarrow w \mid w$, where $T$ is the
% part-of-speech tag assigned to $w$ in the input parse. To enable a
% proper inclusion into the paraphrasing model, we add a constant
% penalty feature $\varphi_{\mathit{oov}}$ and include its weight in our
% parameter estimation process.
% squeeze a little space
\vspace{-.2cm}
\subsection{Experimental Setup}
\label{setup}
\begin{table}
\small
\begin{center}
\begin{tabular}{|c|r|}
\hline
Grammar & \multicolumn{1}{c|}{\# Rules} \\
\hline
total & 42,353,318 \\
w/o identity & 23,641,016 \\
w/o complex constituents & 6,439,923 \\
w/o complex const.\ \& identity & 5,097,250 \\
\hline
\end{tabular}
\end{center}
\normalsize
\caption{Number and distribution of rules in our paraphrase
grammar. Note the significant number of identity paraphrases and
rules with complex nonterminal labels.}
\label{grammar_stats}
\end{table}
We extracted a paraphrase grammar from the French--English Europarl
corpus (v5). The bitext was aligned using the Berkeley aligner and the
English side was parsed with the Berkeley parser. We obtained the
initial translation grammar using the SAMT toolkit \cite{Venugopal2009}.
The grammars we extract tend to be extremely large. To keep their
size manageable, we only consider translation rules that have been
seen more than 3 times and whose translation probability exceeds
$10^{-4}$ for pivot recombination. Additionally, we only retain the
top 25 most likely paraphrases of each phrase, ranked by a uniformly
weighted combination of phrasal and lexical paraphrase probabilities.
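The final pruning step, keeping the 25 most likely paraphrases per
phrase under a uniformly weighted combination of the phrasal and
lexical probabilities, can be sketched as follows (data layout is
illustrative):
\begin{verbatim}
def prune(paraphrases, top_k=25):
    # paraphrases[e1] is a list of tuples
    # (e2, p_phrasal, p_lexical)
    pruned = {}
    for e1, cands in paraphrases.items():
        ranked = sorted(
            cands,
            key=lambda c: 0.5 * c[1] + 0.5 * c[2],
            reverse=True)
        pruned[e1] = ranked[:top_k]
    return pruned
\end{verbatim}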
We tuned the model parameters to our \textsc{Pr\'ecis} objective function, implemented in the Z-MERT toolkit \cite{Zaidan2009}. For decoding we used the
Joshua decoder \cite{Joshua-2.0}. The language model used in our
paraphraser and the \newcite{Clarke2008} baseline system is a
Kneser-Ney discounted 5-gram model estimated on the Gigaword corpus
using the SRILM toolkit \cite{SRILM}.
% \todo{\footnotesize{I don't think you need to change this because
% the C\&L baseline used a 3-gram Gigaword KN lm, which in theory
% would be the same as the 5-gram model}}
% With the publication of this paper, we will release our software for
% paraphrase grammar extraction, along with the grammars and our
% development and test data sets. \todo{\footnotesize{Give URL or
% delete or convert to a footnote}}
% squeeze a little space
\vspace{-.2cm}
\subsection{Evaluation} \label{evaluation}
To assess the output quality of the resulting sentence compression
system, we compare it to two state-of-the-art sentence compression
systems. Specifically, we compare against our implementation of
\newcite{Clarke2008}'s compression model which uses a series of
constraints in an integer linear programming (ILP) solver, and
\newcite{Cohn2007}'s tree transducer toolkit (T3) which learns a
synchronous tree substitution grammar (STSG) from paired monolingual
sentences. Unlike SCFGs, the STSG formalism allows changes to the
tree topology. Cohn and Lapata argue that this is a natural fit for
sentence compression, since deletions introduce structural mismatches.
We trained the T3
software\footnote{\url{www.dcs.shef.ac.uk/people/T.Cohn/t3/}} on the
936 $\langle\text{full}, \text{compressed}\rangle$ sentence pairs that
comprise our development set. This is equivalent in size to the
training corpora that \newcite{Cohn2007} used (their training corpora
ranged from 882--1020 sentence pairs), and has the advantage of being
in-domain with respect to our test set. Both these systems reported
results outperforming previous systems such as
\newcite{McDonald2006}. To showcase the value of the adaptations
discussed above, we also compare variants of our paraphrase-based
compression systems: using Hiero instead of syntax, using syntax with
or without compression features, and using an augmented grammar with
optional deletion rules.
We solicit human judgments of the compressions along two five-point
scales: grammaticality and meaning. Judges are instructed to
decide how much the meaning from a reference translation is retained
in the compressed sentence, with a score of 5 indicating that all of
the important information is present, and 1 being that the
compression does not retain any of the original meaning. Similarly,
a grammar score of 5 indicates perfect grammaticality, and a grammar
score of 1 is assigned to sentences that are entirely
ungrammatical. To ensure fairness, we perform pairwise system
comparisons with compression rates strictly tied at the sentence
level. For any comparison, a sentence is only included in the
computation of average scores if the difference between both systems'
compression rates is $<0.05$.\footnote{Because evaluation
quality correlates linearly with compression rate, the common
practice of comparing systems without closely matched compression
rates can lead to misleading conclusions \cite{Napoles2011}.}
Table~\ref{comparison} shows a set of pairwise comparisons for
compression rates $\approx 0.5$. We see that going from a Hiero-based
to a syntactic paraphrase grammar yields a significant improvement in
grammaticality. Adding compression-specific features improves
grammaticality even further. Further augmenting the grammar with
deletion rules significantly helps retain the core meaning at such
aggressive compression rates; however, grammaticality scores drop
compared to the un-augmented syntactic system. While our approach
significantly outperforms the T3 system, we are not able to match
ILP's results in grammaticality.
In Table~\ref{human_judgments} we compare our system to the ILP
approach at a modest compression rate of $\approx 0.8$. Here, we
significantly outperform ILP in meaning retention while achieving
comparable results in grammaticality.
%\footnote{
This improvement is
significant at $p < 0.0001$, using the sign test, while the better
grammaticality score of the ILP system is not statistically
significant ($p < 0.088$).
% }
These results indicate that, over a variety of compression rates, our
framework for text-to-text generation performs as well as or better
than specifically tailored state-of-the-art methods.
Table~\ref{test_examples} shows an example sentence drawn from our
test set and the compressions produced by the different systems. We