\documentclass[11pt]{article}
\usepackage{acl-hlt2011}
\usepackage{times}
\usepackage{url}
\usepackage{latexsym}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{array}
\usepackage{multirow}
\usepackage{rotating}
\usepackage{todonotes}
%\setlength\titlebox{6.5cm}
% You can expand the title box if you really have to
\DeclareMathOperator*{\argmax}{arg\,max}
\newcommand{\mnote}[1]{\marginpar{%
\vskip-\baselineskip
\raggedright\footnotesize
\itshape\hrule\smallskip\footnotesize{#1}\par\smallskip\hrule}}
\addtolength{\abovecaptionskip}{-0.25cm}
\title{Learning Sentential Paraphrases from Bilingual Parallel Corpora
\\ for Text-to-Text Generation}
\author{Juri Ganitkevitch, Chris Callison-Burch, Courtney
Napoles, \and Benjamin Van Durme\\
Center for Language and Speech Processing, and HLTCOE\\
Johns Hopkins University}
\date{}
\begin{document}
\maketitle
\begin{abstract}
Previous work has shown that high quality {\it phrasal} paraphrases
can be extracted from bilingual parallel corpora. However, it is
not clear whether bitexts are an appropriate resource for extracting
more sophisticated {\it sentential} paraphrases, which are more
obviously learnable from monolingual parallel corpora. We extend
bilingual paraphrase extraction to syntactic paraphrases and
demonstrate its ability to learn a variety of general paraphrastic
transformations, including passivization, dative shift,
and topicalization. We discuss how our model can be adapted to
many text generation tasks by augmenting its feature set,
development data, and parameter estimation routine. We illustrate
this adaptation by using our paraphrase model for the task of
sentence compression and achieve results competitive with
state-of-the-art compression systems.
\end{abstract}
\section{Introduction} \label{introduction}
Paraphrases are alternative ways of expressing the same information
\cite{Culicover1968}. Automatically generating and detecting
paraphrases is a crucial aspect of many NLP tasks. In multi-document
summarization, paraphrase detection is used to collapse redundancies
\cite{Barzilay1999,BarzilayThesis}. Paraphrase generation can be used
for query expansion in information retrieval and question answering
systems
\cite{mckeown:1979:ACL,Anick1999,Ravichandran2002,Riezler2007}.
Paraphrases allow for more flexible matching of system output
against human references for tasks like machine translation and
automatic summarization
\cite{Zhou2006b,Kauchak2006,Madnani2007,Snover2010}.\nocite{Owczarzak2006}
% Paraphrases are used to generate additional reference translations
% for statistical machine translation \cite{Madnani2007} and for more
% flexible matching when evaluating MT output or automatically
% generated summaries \cite{Zhou2006b,Kauchak2006,Owczarzak2006}.
Broadly, we can distinguish two forms of paraphrases: \emph{phrasal
paraphrases} denote a set of surface text forms with the same
meaning:
\begin{center}
\begin{tabular}{c}
the committee's second proposal \\
the second proposal of the committee
\end{tabular}
\end{center}
while \emph{syntactic paraphrases} augment the surface forms by
introducing nonterminals (or \emph{slots}) that are annotated with
syntactic constraints:
\begin{center}
\begin{tabular}{c}
the $\mathit{NP}_1$'s $\mathit{NP}_2$ \\
the $\mathit{NP}_2$ of the $\mathit{NP}_1$
\end{tabular}
\end{center}
It is evident that the latter have a much higher potential for
generalization and for capturing interesting paraphrastic transformations.
A variety of different types of corpora (and semantic equivalence
cues) have been used to automatically induce paraphrase collections
for English \cite{Madnani2010}. Perhaps the most natural type of
corpus for this task is a monolingual parallel text, which allows
sentential paraphrases to be extracted since the
sentence pairs in such corpora are perfect paraphrases of each other
\cite{Barzilay2001,Pang2003}. While rich syntactic paraphrases have
been learned from monolingual parallel corpora, such corpora are in
very limited supply, and the resulting paraphrases thus have poor
coverage.
Other methods obtain paraphrases from raw monolingual text by relying
on distributional similarity \cite{Lin2001,Bhagat2008}. While vast
amounts of data are readily available for these approaches, the
distributional similarity signal they use is noisier than the
sentence-level correspondence in parallel corpora and additionally
suffers from problems such as mistaking cousin expressions or antonyms
(such as $\{\mathit{boy}, \mathit{girl}\}$ or $\{\mathit{rise},
\mathit{fall}\}$) for paraphrases.
Abundantly available bilingual parallel corpora have been shown to
address both these issues, obtaining paraphrases via a pivoting step
over foreign language phrases \cite{Callison-Burch2005}. The coverage
of paraphrase lexica extracted from bitexts has been shown to
outperform that obtained from other sources \cite{Zhao2008b}. While
there have been efforts pursuing the extraction of more powerful
paraphrases
\cite{Madnani2007,Callison-Burch2008,cohn-lapata:2008,Zhao2008}, it is
not yet clear to what extent sentential paraphrases can be induced
from bitexts. In this paper we:
% Squeeze some space
%\begin{itemize}\addtolength{\itemsep}{-0.2\baselineskip}
\begin{itemize}
\item Extend the bilingual pivoting approach to paraphrase induction
to produce rich syntactic paraphrases.
\item Perform a thorough analysis of the types of paraphrases we
obtain and discuss the paraphrastic transformations we are capable
of capturing.
\item Describe how training paradigms for
syntactic/sentential paraphrase models should be tailored to
different text-to-text generation tasks.
\item Demonstrate our framework's suitability for a variety of
text-to-text generation tasks by obtaining state-of-the-art results
on the example task of sentence compression.
\end{itemize}
\section{Related Work} \label{related_work}
\newcite{Madnani2010} survey a variety of data-driven paraphrasing
techniques, categorizing them based on the type of data that they use.
These include large monolingual texts
\cite{Lin2001,szpektor-EtAl:2004:EMNLP,Bhagat2008}, comparable corpora
\cite{Barzilay2003,Dolan2004}, monolingual parallel corpora
\cite{Barzilay2001,Pang2003}, and bilingual parallel corpora
\cite{Callison-Burch2005,Madnani2007,Zhao2008}. We focus on the
latter type of data.
% Several research efforts have leveraged parallel monolingual
% corpora, however they jointly suffer from the scarcity and noisiness
% of parallel corpora. \newcite{Dolan2004} work around this issue by
% extracting parallel sentences from the vast amount of freely
% available comparable English text and apply machine translation
% techniques to create a paraphrasing system
% \cite{Quirk2004}. However, the word-based translation model and
% monotone decoder they use results in a substantial amount of
% identity paraphrases or single-word substitutions.
Paraphrase extraction using bilingual parallel corpora was proposed by
\newcite{Callison-Burch2005} who induced paraphrases using techniques
from {\it phrase-based} statistical machine translation
\cite{Koehn2003}. After extracting a bilingual phrase table, English
paraphrases are obtained by pivoting through foreign language
phrases.
% The phrase table contains phrase pairs $(e, f)$ (where the $e$ and
% $f$ stand for English and foreign phrases, respectively) as well as
% bi-directional
Since many paraphrases can be extracted for a phrase, Bannard and
Callison-Burch rank them using a paraphrase probability defined in
terms of the translation model probabilities $p(f | e)$ and $p(e |
f)$: \nocite{Callison-Burch2005}
\begin{eqnarray}
p(e_2|e_1) &=& \sum_f p(e_2,f|e_1)\\
&=& \sum_f p(e_2|f,e_1) p(f|e_1) \\
&\approx& \sum_f p(e_2|f) p(f|e_1) .
\label{paraphrase_prob_eqn}
\end{eqnarray}
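For concreteness, the pivoting computation can be sketched as follows;
the nested-dictionary representation of the phrase table and the
function name are illustrative rather than part of any particular
toolkit:
\begin{verbatim}
# p_f_e[e][f] = p(f|e), p_e_f[f][e] = p(e|f)
def paraphrase_probs(e1, p_f_e, p_e_f):
    scores = {}
    for f, pf in p_f_e.get(e1, {}).items():
        for e2, pe in p_e_f.get(f, {}).items():
            # p(e2|e1) ~= sum_f p(e2|f) p(f|e1)
            scores[e2] = scores.get(e2, 0.0) + pe * pf
    return scores
\end{verbatim}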
Several subsequent efforts extended the bilingual pivoting technique,
many of which introduced elements of more contemporary {\it
syntax-based} approaches to statistical machine translation.
\newcite{Madnani2007} extended the technique to {\it hierarchical}
phrase-based machine translation \cite{Chiang2005}, which is formally
a synchronous context-free grammar (SCFG) and thus can be thought of
as a {\it paraphrase grammar}. The paraphrase grammar can paraphrase
(or ``decode'') input sentences using an SCFG decoder, like the Hiero,
Joshua or cdec MT systems \cite{Chiang2007,Joshua-WMT,Dyer_etal_2010}.
Like Hiero, Madnani's model uses just one nonterminal
$X$ instead of linguistic nonterminals.
Three additional efforts incorporated linguistic
syntax. \newcite{Callison-Burch2008} introduced syntactic constraints
by labeling all phrases and paraphrases (even non-constituent phrases)
with CCG-inspired slash categories \cite{Steedman2011}, an approach similar to
\newcite{Zollmann2006}'s syntax-augmented machine translation
(SAMT). Callison-Burch did not formally define a synchronous grammar,
nor discuss decoding, since his presentation did not include
hierarchical rules. \newcite{cohn-lapata:2008} used the GHKM
extraction method \cite{Galley2004}, which is limited to constituent
phrases and thus produces a reasonably small set of syntactic rules.
\newcite{Zhao2008} added slots to bilingually extracted paraphrase
patterns that were labeled with part-of-speech tags, but not larger
syntactic constituents.
Before the shift to statistical natural language processing,
paraphrasing was often treated as syntactic transformations or by
parsing and then generating from a semantic representation
\cite{mckeown:1979:ACL,Muraki1982,Meteer1988,Shemtov1996,Yamamoto2002}.
Indeed, some work generated paraphrases using (non-probabilistic)
synchronous grammars
\cite{Shieber1990,Dras1997,Dras1999,Kozlowski2003}.
After the rise of statistical machine translation, a number of its
techniques were repurposed for paraphrasing. These include sentence
alignment \cite{Gale1993,Barzilay2003a}, word alignment and noisy
channel decoding \cite{Brown1990,Quirk2004}, phrase-based models
\cite{Koehn2003,Callison-Burch2005}, hierarchical phrase-based models
\cite{Chiang2005,Madnani2007}, log-linear models and minimum error
rate training \cite{Och2003c,Madnani2007,Zhao2008b}, and, in this
work, syntax-based machine translation
\cite{Wu1997,Yamada2001,Melamed2004,Quirk2005}.
Beyond cementing the ties between paraphrasing and syntax-based
statistical machine translation, the novel contributions of our paper
are (1) an in-depth analysis of the types of structural and sentential
paraphrases that can be extracted with bilingual pivoting, (2) a
discussion of how our English--English paraphrase grammar should be
adapted to specific text-to-text generation tasks
\cite{zhao-EtAl:2009:ACLIJCNLP2} with (3) a concrete example of the
adaptation procedure for the task of paraphrase-based sentence
compression \cite{KnightMarcuAI02,cohn-lapata:2008,Cohn2009}.
% Relying on small data sets of semantically equivalent translations,
% \newcite{Pang2003} created finite state automata by syntax-aligning
% parallel sentences, enabling the generation of additional reference
% translations.
% Both \newcite{Barzilay2001} and \newcite{Ibrahim2003} sentence-align
% existing noisy parallel monolingual corpora such as translations of
% the same novels. While \newcite{Ibrahim2003} employ a set of
% heuristics that rely on anchor words identified by textual identity
% or matchin liguistic features such as gender, number or semantic
% class, \newcite{Barzilay2001} use a co-training approach that
% leverages context similarity to identify viable paraphrases.
% Semantic parallelism is well-established as a stong basis for the
% extraction of correspondencies such as paraphrases. However, there
% are notable efforts that choose to forgo it in favor of clustering
% approaches based on distributional characteristics. The well-known
% DIRT method by \newcite{Lin2001} fully relies on distributional
% similarity features for paraphrase extraction. Patterns extracted
% from paths in dependency graphs are clustered based on the
% similarity of the observed contents of their slots.
% Similarly, \newcite{Bhagat2008} argue that vast amounts of text can
% be leveraged to make up for the relative weakness of distributional
% features compared to parallelism. They also forgo complex
% annotations such as syntactic or dependency parses, relying only on
% part-of-speech tags to inform their approach. In their work,
% relations are learned by finding pattern clusters initially seeded
% by already known patterns. However, this method is not capable of
% producing syntactic paraphrases. \mnote{Need better tie-in with the
% overall theme of structural paraphrases.}
\section{SCFGs in Translation} \label{formalism}
The model we use in our paraphrasing approach is a syntactically
informed \emph{synchronous context-free grammar} (SCFG). The SCFG
formalism \cite{Aho1972} was repopularized for statistical machine
translation by \newcite{Chiang2005}. Formally, a \emph{probabilistic}
SCFG $\mathcal{G}$ is defined by specifying
\[
\mathcal{G} = \langle \mathcal{N}, \mathcal{T}_S, \mathcal{T}_T,
\mathcal{R}, S \rangle ,
\]
where $\mathcal{N}$ is a set of nonterminal symbols, $\mathcal{T}_S$
and $\mathcal{T}_T$ are the source and target language vocabularies,
$\mathcal{R}$ is a set of rules and $S \in \mathcal{N}$ is the root
symbol. The rules in $\mathcal{R}$ take the form:
\begin{equation*}
C \rightarrow \langle \gamma, \alpha, \sim, w \rangle ,
\end{equation*}
where the rule's left-hand side $C \in \mathcal{N}$ is a nonterminal,
$\gamma \in (\mathcal{N} \cup \mathcal{T}_S)^*$ and $\alpha \in
(\mathcal{N} \cup \mathcal{T}_T)^*$ are strings of terminal and
nonterminal symbols with an equal number of nonterminals
$c_{\mathit{NT}}(\gamma) = c_{\mathit{NT}}(\alpha)$ and
$$
\sim : \{1 \ldots c_{\mathit{NT}}(\gamma)\} \rightarrow \{1 \ldots
c_{\mathit{NT}}(\alpha)\}
$$
constitutes a one-to-one correspondence function between the
nonterminals in $\gamma$ and $\alpha$. A non-negative weight $w \geq
0$ is assigned to each rule, reflecting the likelihood of the rule.
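For concreteness, a rule can be represented by a simple record; the
following sketch (field names are purely illustrative and not part of
any particular toolkit) mirrors the definition above:
\begin{verbatim}
# Illustrative record for an SCFG rule:
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class Rule:
    lhs: str                # C
    src: Tuple[str, ...]    # gamma
    tgt: Tuple[str, ...]    # alpha
    corr: Dict[int, int]    # ~ (one-to-one)
    weight: float           # w >= 0
\end{verbatim}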
\begin{figure}[t]
\begin{center}
\includegraphics[width=0.99\linewidth]{figures/example_extraction-3a.pdf}
\end{center}
\caption{Synchronous grammar rules for translation are extracted from
sentence pairs in a bitext that have been automatically parsed and
word-aligned. Extraction methods vary in whether they extract only
minimal rules for phrases dominated by nodes in the parse tree, or
more complex rules that include non-constituent phrases.}
\label{example_extraction}
\end{figure}
\paragraph{Rule Extraction}
Phrase-based approaches to statistical machine translation (and their
successors) extract pairs of $(e, f)$ phrases from automatically
word-aligned parallel sentences. \newcite{Och2003}
% , \newcite{Koehn2004}, \newcite{Tillmann2003}, and
% \newcite{Vogel2003}
described various heuristics for extracting phrase alignments from the
Viterbi word-level alignments that are estimated using
\newcite{Brown1993} word-alignment models.
These phrase extraction heuristics have been extended so that they
extract synchronous grammar rules
\cite{Galley2004,Chiang2005,Zollmann2006,Liu2006}. Most of these
extraction methods require that one side of the parallel corpus be
parsed. This is typically done automatically with a statistical
parser.
Figure~\ref{example_extraction} shows examples of rules obtained from
a sentence pair. To extract a rule, we first choose a source side span
$f$ like {\it das leck}. Then we use phrase extraction techniques to
find target spans $e$ that are consistent with the word alignment (in
this case {\it the leak} is consistent with our $f$). The nonterminal
symbol that is the left-hand side of the SCFG rule is then determined
by the syntactic constituent that dominates $e$ (in this case {\it
NP}). To introduce nonterminals into the right-hand side of the rule,
we can apply rules extracted over sub-phrases of $f$, synchronously
substituting the corresponding nonterminal symbol for the sub-phrases
on both sides. The synchronous substitution applied to $f$ and $e$
then yields the correspondence $\sim$.
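The phrase extraction step relies on the standard consistency
criterion: no alignment link may connect a word inside the candidate
phrase pair to a word outside it. A minimal sketch of this check (the
index conventions and function name are illustrative):
\begin{verbatim}
# Spans (f_lo,f_hi) and (e_lo,e_hi) are consistent
# with a word alignment given as (fi, ej) pairs.
def consistent(f_lo, f_hi, e_lo, e_hi, links):
    inside = False
    for fi, ej in links:
        in_f = f_lo <= fi <= f_hi
        in_e = e_lo <= ej <= e_hi
        if in_f != in_e:
            return False  # link crosses the boundary
        if in_f and in_e:
            inside = True
    return inside
\end{verbatim}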
One significant differentiating factor between the competing ways of
extracting SCFG rules is whether the extraction method generates rules
only for constituent phrases that are dominated by a node in the parse
tree \cite{Galley2004,cohn-lapata:2008} or whether they include
arbitrary phrases, including non-constituent phrases
\cite{Zollmann2006,Callison-Burch2008}. We adopt the latter approach,
extracting rules for all phrases, including non-constituents, since it
allows us to cover a much greater set of phrases, both in translation
and paraphrasing.
% However, in doing so we can only assign left-hand side labels to a
% small portion of the possible phrases in a sentence pair. To allow
% for broader coverage, we rely on the labeling method introduced by :
% in addition to single constituent nonterminals, we allow for the
% concatenation of constituents as well as for CCG-style slashed
% constituents \cite{Steedman1999}.
\paragraph{Feature Functions}
Rather than assigning a single weight $w$, we define a set of feature
functions $\vec{\varphi} = \{\varphi_1 ... \varphi_N \}$ that are combined in a
log-linear model:
\begin{equation}
w = -\sum_{i=1}^N \lambda_i \log \varphi_i .
\end{equation}
The weights $\vec{\lambda}$ of these feature functions are set to
maximize some objective function like \textsc{Bleu}
\cite{Papineni2002} using a procedure called minimum error rate
training (MERT), introduced by \newcite{Och2003c}. MERT iteratively
adjusts the weights until the decoder produces output that best
matches reference translations in a development set, according to the
objective function. We will examine appropriate objective functions
for text-to-text generation tasks in Section \ref{objective-fn}.
Typical features used in statistical machine translation include
phrase translation probabilities (calculated using maximum likelihood
estimation over all phrase pairs enumerable in the parallel
corpus), word-for-word lexical translation probabilities (which help
to smooth sparser phrase translation estimates), a ``rule application
penalty'' (which governs whether the system prefers fewer longer phrases or a
greater number of shorter phrases), and a language model probability.
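As a sketch, combining such features into a single rule weight under
the log-linear model amounts to the following (the function and
argument names are illustrative):
\begin{verbatim}
import math

# Log-linear rule weight, as in the equation
# above; feature values must be positive.
def rule_weight(phis, lambdas):
    return -sum(l * math.log(p)
                for l, p in zip(lambdas, phis))
\end{verbatim}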
% \mnote{TODO: list out typical feature functions -- they can be less
% formal than this}
%\begin{itemize}
%\item Phrase translation probabilities $\varphi_{\mathit{phrase}}(e |
% f)$ and $\varphi_{\mathit{phrase}}(f | e)$ which are computed using
% maximum likelihood estimation over all phrase pairs that are
% extracted from the parallel corpus.
%\item Lexical translation probabilities
% \[ \varphi_{\mathit{lex}} = \langle -\log p_{\mathit{lex}}(e | f),
% -\log p_{\mathit{lex}}(f | e)\rangle \] \mnote{TODO: describe how
% the left hand side figures into the FFs}
%\item Count features $c_{\mathit{src}}$ and $c_{\mathit{tgt}}$
% indicating the number of words on either side of the rule as well
% as two difference features, $c_{\mathit{dcount}} = c_{\mathit{tgt}}
% - c_{\mathit{src}}$ and the analoguosly computed difference in the
% average word length in characters, $c_{\mathit{davg}}$.
%\item Language model probability
%\item A length penalty
%\item An incidator for when a rule only contains terminal symbols
% ($\delta_{\mathit{lex}}$) and an indicator for when the source side
% contains terminals, but the target side does not
% ($\delta_{\mathit{del}}$).
%\item Indicators for whether the rule swaps the order of two
% nonterminals ($\delta_{\mathit{reorder}}$) and whether the two
% nonterminals are adjacent ($\delta_{\mathit{adj}}$).
%\item A rarity penalty $\varphi_{\mathit{rarity}} =
% e^{(1-c_{\mathit{rule}})}$ that quantifies the doubt we may place
% in a rule based on how often we have encountered it in the
% corpus\footnote{Since we do not have an immediate rule count for a
% paraphrase rule $N \rightarrow e_1 | e_2$, we instead estimate
% its rarity penalty as $\varphi_{\mathit{rarity}}(N \rightarrow
% e_1 | e_2) = \max_{f} \min \{\varphi_{\mathit{rarity}}(N
% \rightarrow e_1 | f), \varphi_{\mathit{rarity}}(N \rightarrow f |
% e_2) \}$}.
%\end{itemize}
\paragraph{Decoding}
Given an SCFG and an input source sentence, the decoder performs a
search for the single most probable derivation via the CKY algorithm.
In principle the best translation should be the English sentence $e$
that is the most probable after summing over all $d\in D$ derivations,
since many derivations yield the same $e$. In practice, we use a
Viterbi approximation and return the translation that is the yield of
the single best derivation:
\begin{eqnarray}
\label{decoding_eq}
\hat{e} &=& \arg \max_{e\in Trans(f)} \sum_{d\in D(e,f)}{p(d,e|f)}
\nonumber \\
&\approx& \mathit{yield}(\arg \max_{d\in D(e,f)}{p(d,e|f)}).
\end{eqnarray}
Derivations are simply successive applications of the SCFG rules such
as those given in Figure \ref{example_translation}.
%\small
%\begin{tabular}{l}
%$\mathit{NP/NN}$ $\rightarrow$ dem rest des $|$ the rest of the\\
%$\mathit{VB+JJ}$ $\rightarrow$ gef\"{a}hrlich werden $|$ be dangerous\\
%$\mathit{NP}$ $\rightarrow$ $\mathit{NP/NN}$ dorfes $|$ $\mathit{NP/NN}$ village\\
%$\mathit{VP/PP}$ $\rightarrow$ nicht $\mathit{VB+JJ}$ k\"{o}nnen $|$ can not $\mathit{VB+JJ}$\\
%$\mathit{S}$ $\rightarrow$ sie $\mathit{NP}$ $\mathit{VP/PP}$ $|$ they $\mathit{VP/PP}$ to $\mathit{NP}$\\
%\end{tabular}{l}
%\normalsize
\begin{figure}[t]
\begin{center}
\includegraphics[width=0.9\linewidth]{figures/example_translation-2a.pdf}
\end{center}
\caption{An example derivation produced by a syntactic machine
translation system. Although the synchronous trees are unlike the
derivations found in the Penn Treebank, their yield is a good
translation of the German.}
\label{example_translation}
\end{figure}
\section{SCFGs in Paraphrasing} \label{acquisition}
\paragraph{Rule Extraction}
To create a paraphrase grammar from a translation grammar, we extend
the syntactically informed pivot approach of
\newcite{Callison-Burch2008} to the SCFG model. For this purpose, we
assume a grammar that translates from a given foreign language to
English. For each pair of translation rules where the left-hand side
$C$ and foreign string $\gamma$ match:
\begin{eqnarray*}
C \rightarrow \langle \gamma, \alpha_1, \sim_1, \vec{\varphi}_1 \rangle \\
C \rightarrow \langle \gamma, \alpha_2, \sim_2, \vec{\varphi}_2 \rangle ,
\end{eqnarray*}
we create a paraphrase rule:
\begin{equation*}
C \rightarrow \langle \alpha_1, \alpha_2, \sim, \vec{\varphi} \rangle ,
\end{equation*}
where the nonterminal correspondence relation $\sim$ has been set to
reflect the combined nonterminal alignment:
\begin{equation*}
\sim ~ = ~ \sim_1^{-1} \circ \sim_2 .
\end{equation*}
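The pivoting step itself is straightforward; the following sketch
iterates over translation rules grouped by their left-hand side and
source string and composes the nonterminal alignments (the data
structures are illustrative):
\begin{verbatim}
# rules_by_src[(C, gamma)] is a list of
# (alpha, align) pairs; align maps gamma
# nonterminal indices to alpha indices.
def pivot(rules_by_src):
    out = []
    for (lhs, g), entries in rules_by_src.items():
        for alpha1, a1 in entries:
            for alpha2, a2 in entries:
                # ~ = ~1^{-1} o ~2
                corr = {a1[i]: a2[i] for i in a1}
                out.append((lhs, alpha1, alpha2, corr))
    return out
\end{verbatim}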
\paragraph{Feature Functions}
In the computation of the features $\vec{\varphi}$ from
$\vec{\varphi}_1$ and $\vec{\varphi}_2$ we follow the approximation in
Equation~\ref{paraphrase_prob_eqn}, which yields lexical and phrasal
paraphrase probability features. Additionally, we add a boolean
indicator for whether the rule is an identity paraphrase,
$\delta_{\mathit{identity}}$. Another indicator feature,
$\delta_{\mathit{reorder}}$, fires if the rule swaps the order of two
nonterminals, which enables us to promote more complex paraphrases
that require structural reordering.
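Both indicators can be computed directly from a pivoted rule, as in
this illustrative sketch:
\begin{verbatim}
# corr maps nonterminal positions in alpha1
# to nonterminal positions in alpha2.
def indicators(alpha1, alpha2, corr):
    identity = float(alpha1 == alpha2)
    reorder = float(any(corr[i] > corr[j]
                        for i in corr for j in corr
                        if i < j))
    return {"identity": identity,
            "reorder": reorder}
\end{verbatim}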
\paragraph{Decoding}
With this, paraphrasing becomes an English-to-English translation
problem which can be formulated similarly to
Equation~\ref{decoding_eq} as:
\begin{equation*}
\hat{e_2} \approx \mathit{yield}(\arg \max_{d\in
D(e_2,e_1)}{p(d,e_2|e_1)}).
\end{equation*}
Figure~\ref{paraphrase_derivaton}
shows an example derivation produced as a result of applying our
paraphrase rules in the decoding process. Another advantage of using
the decoder from statistical machine translation is that n-gram
language models, which have been shown to be useful in natural
language generation \cite{Langkilde1998}, are already well integrated
\cite{Huang2007}.
\begin{figure}[!t]
\begin{center}
\includegraphics[width=0.99\linewidth]{figures/example_compression_1col_2a.pdf}
\end{center}
\caption{An example of a synchronous paraphrastic derivation. A few of
the rules applied in the parse are shown in the left column, with the
pivot phrases that gave rise to them on the
right.}\label{paraphrase_derivaton}
\end{figure}
\section{Analysis} \label{analysis}
\begin{table*}[!ht]
\begin{center}
\small
\begin{tabular}{|c|rrcl|}
\hline
\multirow{2}{*}{Possessive rule} & $\mathit{NP}$ $\rightarrow$ & the
$\mathit{NN}$ of the $\mathit{NNP}$ & $\mid$ & the
$\mathit{NNP}$'s $\mathit{NN}$ \\
& $\mathit{NP}$ $\rightarrow$ & the $\mathit{NNS}_1$ made by
$\mathit{NNS}_2$ & $\mid$ & the $\mathit{NNS}_2$'s
$\mathit{NNS}_1$ \\
\hline
\multirow{2}{*}{Dative shift} & $\mathit{VP}$ $\rightarrow$ & give
$\mathit{NN}$ to $\mathit{NP}$ & $\mid$ & give $\mathit{NP}$ the
$\mathit{NN}$ \\
& $\mathit{VP}$ $\rightarrow$ & provide $\mathit{NP}_1$ to
$\mathit{NP}_2$ & $\mid$ & give $\mathit{NP}_2$
$\mathit{NP}_1$ \\
\hline
\hline
\multirow{2}{*}{Adv./adj. phrase move} &
$\mathit{S/VP}$ $\rightarrow$ & $\mathit{ADVP}$ they $\mathit{VBP}$
& $\mid$ & they $\mathit{VBP}$ $\mathit{ADVP}$ \\
& $\mathit{S}$ $\rightarrow$ & it is $\mathit{ADJP}$ $\mathit{VP}$
& $\mid$ & $\mathit{VP}$ is $\mathit{ADJP}$ \\
\hline
Verb particle shift &
$\mathit{VP}$ $\rightarrow$ & $\mathit{VB}$ $\mathit{NP}$ up &
$\mid$ & $\mathit{VB}$ up $\mathit{NP}$ \\
\hline
\multirow{2}{*}{Reduced relative clause} & $\mathit{SBAR/S}$ $\rightarrow$ &
although $\mathit{PRP}$ $\mathit{VBP}$ that & $\mid$ &although
$\mathit{PRP}$ $\mathit{VBP}$ \\
& $\mathit{ADJP}$ $\rightarrow$ &
very $\mathit{JJ}$ that $\mathit{S}$ & $\mid$ & $\mathit{JJ}$ $\mathit{S}$ \\
\hline
\multirow{2}{*}{Partitive constructions} &
$\mathit{NP}$ $\rightarrow$ & $\mathit{CD}$ of the $\mathit{NN}$
& $\mid$ & $\mathit{CD}$ $\mathit{NN}$ \\
& $\mathit{NP}$ $\rightarrow$ & all $\mathit{DT\backslash NP}$
& $\mid$ & all of the $\mathit{DT\backslash NP}$ \\
\hline
Topicalization & $\mathit{S}$ $\rightarrow$ & $\mathit{NP}$,
$\mathit{VP}$ . & $\mid$ & $\mathit{VP}$, $\mathit{NP}$ . \\
\hline
\hline
Passivization &
$\mathit{SBAR}$ $\rightarrow$ & that $\mathit{NP}$ had
$\mathit{VBN}$ & $\mid$ & which was $\mathit{VBN}$ by $\mathit{NP}$ \\
\hline
\multirow{2}{*}{Light verbs} & $\mathit{VP}$ $\rightarrow$ & take action $\mathit{ADVP}$ &
$\mid$ & to act $\mathit{ADVP}$ \\
& $\mathit{VP}$ $\rightarrow$ & $\mathit{TO}$ take a decision $\mathit{PP}$ &
$\mid$ & $\mathit{TO}$ decide $\mathit{PP}$ \\
\hline
\end{tabular}
\normalsize
\end{center}
\caption{A selection of meaning-preserving transformations and
hand-picked examples of syntactic paraphrases that our system
extracts to capture them.}
\label{example_rules}
\end{table*}
A key motivation for the use of syntactic paraphrases over their
phrasal counterparts is their potential to capture meaning-preserving
linguistic transformations in a more general fashion. A phrasal system
is limited to memorizing fully lexicalized transformations in its
paraphrase table, resulting in poor generalization capabilities. By
contrast, a syntactic paraphrasing system intuitively should be able
to address this issue and learn well-formed and generic patterns that
can be easily applied to unseen data.
To put this expectation to the test, we investigate how our grammar
captures a number of well-known paraphrastic
transformations.\footnote{The data and software used to extract the
grammar we draw these examples from is described in
Section~\ref{setup}.} Table~\ref{example_rules} shows the
transformations along with examples of the generic grammar rules our
system learns to represent them. When given a transformation to
extract a syntactic paraphrase for, we want to find rules that neither
under- nor over-generalize. Ideally, a rule replaces as many syntactic
arguments as possible with nonterminals while retaining enough
lexicalization to provide sufficient evidence for the applicability of
the transformation, and it imposes constraints on the nonterminals to
ensure that the arguments are well formed.
The paraphrases implementing the \emph{possessive rule} and the
\emph{dative shift} shown in Table~\ref{example_rules} are good
examples of this: the two noun-phrase arguments to the expressions are
abstracted to nonterminals while each rule's lexicalization provides
an appropriate frame of evidence for the transform. This is important
for a good representation of dative shift, which is a reordering
transformation that fully applies to certain ditransitive verbs while
other verbs are uncommon in one of the forms:
\begin{center}
\begin{tabular}{l}
give \emph{decontamination equipment} to \emph{Japan} \\
give \emph{Japan} \emph{decontamination equipment} \\
\vspace{-10pt}\\
provide \emph{decontamination equipment} to \emph{Japan} \\
? provide \emph{Japan} \emph{decontamination equipment} \\
\end{tabular}
\end{center}
Note how our system extracts a dative shift rule for \emph{to give}
and a rule that both shifts and substitutes a more appropriate verb
for \emph{to provide}.
The use of syntactic nonterminals in our paraphrase rules to capture
complex transforms also makes it possible to impose constraints on
their application. For comparison, as \newcite{Madnani2007} do not
impose any constraints on how the nonterminal $X$ can be realized,
their equivalent of the \emph{topicalization} rule would massively
overgeneralize:
\begin{center}
\begin{tabular}{rrcl}
$\mathit{S}$ $\rightarrow$ & $\mathit{X}_1$,
$\mathit{X}_2$ . & $\mid$ & $\mathit{X}_2$, $\mathit{X}_1$ . \\
\end{tabular}
\end{center}
Additional examples of transforms our use of syntax allows us to
capture are the \emph{adverbial phrase shift} and the \emph{reduction
of a relative clause}, as well as other phenomena listed in
Table~\ref{example_rules}.
Unsurprisingly, syntactic information alone is not sufficient to
capture all transformations. For instance, it is hard to extract
generic paraphrases for all instances of \emph{passivization}, since
our syntactic model currently has no means of representing the
morphological changes that the verb undergoes:
\begin{center}
\begin{tabular}{l}
the reactor \emph{leaks} radiation \\
radiation \emph{is leaking} from the reactor .\\
\end{tabular}
\end{center}
Still, for cases where the verb's morphology does not change, we
manage to learn a rule:
\begin{center}
\begin{tabular}{l}
the radiation that the reactor had \emph{leaked} \\
the radiation which \emph{leaked} from the reactor .\\
\end{tabular}
\end{center}
Another example of a deficiency in our synchronous grammar model is
its handling of \emph{light verb} constructions such as:
\begin{center}
\begin{tabular}{l}
to take a \emph{walk} \\
to \emph{walk} .
\end{tabular}
\end{center}
Here, a noun is transformed into the corresponding verb -- something
our synchronous syntactic CFGs are not able to capture except
through memorization.
Our survey shows that we are able to extract appropriately generic representations for a wide range of paraphrastic transformations. This is a surprising result which shows that bilingual parallel corpora can be used to learn sentential paraphrases, and that they are a viable alternative to other data sources like monolingual parallel corpora, which more obviously contain sentential paraphrases, but are scarce.
\section{Text-to-Text Applications} \label{adaptation}
The core of many text-to-text generation tasks is sentential
paraphrasing, augmented with specific constraints or goals. Since our
model borrows much of its machinery from statistical machine
translation -- a sentential rewriting problem itself -- it is
straightforward to use our paraphrase grammars to generate new
sentences using SMT's decoding and parameter optimization
techniques. Our framework can be adapted to many different
text-to-text generation tasks. These could include text
simplification, sentence compression, poetry generation, query
expansion, transforming declarative sentences into questions, and
deriving hypotheses for textual entailment. Each individual
text-to-text application requires that our framework be adapted in
several ways, by specifying:
%squeeze the equation onto this page by compressing some stuff
%\begin{itemize}\addtolength{\itemsep}{-0.2\baselineskip}
\begin{itemize}
\item A mechanism for extracting synchronous grammar rules (in this
paper we argue that pivot-based paraphrasing is widely applicable).
\item An appropriate set of rule-level features that capture
information pertinent to the task (e.g.\ whether a rule simplifies a
phrase).
\item An appropriate ``objective function'' that scores the output of
the model, i.e.\ a task-specific equivalent to the \textsc{Bleu}
metric in SMT.
\item A development set with examples of the sentential
transformations that we are modeling.
\item Optionally, a way of injecting task-specific rules that were not
extracted automatically.
\end{itemize}
In the remainder of this section, we illustrate how our bilingually
extracted paraphrases can be adapted to perform sentence compression,
which is the task of reducing the length of a sentence while preserving
its core meaning. Most previous approaches to sentence compression
focused only on the deletion of a subset of words from the sentence
\cite{KnightMarcuAI02}. Our approach follows
\newcite{cohn-lapata:2008}, who expand the task to include
substitutions, insertions and reorderings that are automatically
learned from parallel texts.
% squeeze a little space
\vspace{-.2cm}
\subsection{Feature Design}\label{feat-design}
In Section~\ref{acquisition} we discussed phrasal
probabilities. While these help quantify how good a paraphrase is in
general, they do not capture task-specific properties such as changes
in language complexity or text length. To make this
information available to the decoder, we enhance our paraphrases with
four compression-targeted features. We add the count features
$c_{\mathit{src}}$ and $c_{\mathit{tgt}}$, indicating the number of
words on either side of the rule as well as two difference features:
$c_{\mathit{dcount}} = c_{\mathit{tgt}} - c_{\mathit{src}}$ and the
analogously computed difference in the average word length in
characters, $c_{\mathit{davg}}$.
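Concretely, the four features can be computed from a rule's terminal
strings as follows (a sketch; the function and feature names are
illustrative):
\begin{verbatim}
def compression_features(src_words, tgt_words):
    def avg_len(ws):
        return (sum(len(w) for w in ws) / len(ws)
                if ws else 0.0)
    c_src, c_tgt = len(src_words), len(tgt_words)
    return {"c_src": c_src,
            "c_tgt": c_tgt,
            "c_dcount": c_tgt - c_src,
            "c_davg": avg_len(tgt_words)
                      - avg_len(src_words)}
\end{verbatim}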
% squeeze a little space
\vspace{-.2cm}
\subsection{Objective Function}
\label{objective-fn}
Given our paraphrasing system's connection to SMT, the obvious choice for parameter optimization would be to optimize for \textsc{Bleu} over a set of reference paraphrases, for instance parallel English reference translations for a machine translation task \cite{Madnani2007}.
For a candidate $C$ and a reference $R$ (with
lengths $c$ and $r$), \textsc{Bleu} is defined as:\vspace{-0.2cm}
\begin{eqnarray}
\lefteqn{ \textsc{Bleu}_{N} (C,R) } \nonumber \\
&=& \left\{ \begin{array}{ll}
e^{(1 - r/c)} \cdot
e^{\sum_{n=1}^N w_n \log p_n} & \textrm{if
$c/r \leq 1$} \nonumber \\
e^{\sum_{n=1}^N w_n \log p_n} & \textrm{otherwise}
\end{array} \right. ,
\end{eqnarray}
where $p_n$ is the modified $n$-gram precision of $C$ against $R$,
with typically $N=4$ and $w_n=\frac{1}{N}$. The ``brevity penalty''
term $e^{(1 - r/c)}$ is added to prevent short candidates
from achieving perfect scores.
Naively optimizing for \textsc{Bleu}, however, will result in a trivial paraphrasing system heavily biased towards producing identity ``paraphrases''. This is obviously not what we are looking for. Moreover, \textsc{Bleu} does not provide a mechanism for directly specifying a per-sentence compression rate, which is desirable for the compression task.
% \footnote{\newcite{Madnani2007} avoid this issue by disallowing
% identity paraphrases in their grammar. While this is an
% appropriate choice for their task, for many applications retaining
% the capability to self-paraphrase where appropriate is crucial.}.
Instead, we propose \textsc{Pr\'ecis}, an objective function tailored
to the text compression task: \vspace{-0.2cm}
\begin{eqnarray}
\lefteqn{ \textsc{Pr\'ecis}_{\lambda,\varphi} (I, C, R) } \nonumber \\
&=& \left\{ \begin{array}{ll}
e^{\lambda (\varphi - c/i)} \cdot \textsc{Bleu}(C, R) & \textrm{if
$c/i \geq \varphi$} \nonumber \\
\textsc{Bleu}(C, R) & \textrm{otherwise}
\end{array} \right.
\end{eqnarray}
For an input sentence $I$, an output $C$ and reference compression $R$
(with lengths $i$, $c$ and $r$), \textsc{Pr\'ecis} combines the
precision estimate of \textsc{Bleu} with an additional ``verbosity penalty'' that is
applied to compressions that fail to meet a given target compression
rate $\varphi$. We rely on the \textsc{Bleu} brevity penalty to
prevent the system from producing overly aggressive compressions. The
scaling term $\lambda$ determines how severely we penalize deviations
from $\varphi$. In our experiments we use $\lambda = 10$.
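A sketch of \textsc{Pr\'ecis} as a sentence-level score, assuming
some sentence-level \textsc{Bleu} implementation \texttt{bleu(C, R)}
and token lists whose lengths give $c$ and $i$:
\begin{verbatim}
import math

def precis(i_len, cand, ref, bleu,
           target_cr=0.5, lam=10.0):
    score = bleu(cand, ref)
    cr = len(cand) / float(i_len)
    if cr >= target_cr:
        # verbosity penalty for overly long output
        score *= math.exp(lam * (target_cr - cr))
    return score
\end{verbatim}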
It is straightforward to find similar adaptations for other tasks. For
text simplification, for instance, the penalty term can include a
readability metric. For poetry generation we can analogously penalize
outputs that break the meter \cite{Greene2010}.
% squeeze a little space
\vspace{-.2cm}
\subsection{Development Data}
\label{dev-data}
To tune the parameters of our paraphrase system for sentence
compression, we need an appropriate corpus of reference
compressions. Since our model is designed to compress by paraphrasing
rather than deletion, the commonly used deletion-based compression
data sets like the Ziff-Davis corpus are not suitable. We have thus
created a corpus of compression paraphrases. Beginning with 9570
tuples of parallel English--English sentences obtained from multiple
reference translations for machine translation evaluation, we
construct a parallel compression corpus by selecting the longest
reference in each tuple as the source sentence and the shortest
reference as the target sentence. We further retain only those
sentence pairs where the compression rate $\mathit{cr}$ falls in the
range $0.5 < \mathit{cr} \leq 0.8$. From these, we randomly select 936
sentences for the development set, as well as 560 sentences for a test
set that we use to gauge the performance of our system.
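The construction of this parallel compression corpus can be sketched
as follows (the tuple representation and whitespace tokenization are
illustrative):
\begin{verbatim}
def build_pairs(ref_tuples, lo=0.5, hi=0.8):
    pairs = []
    for refs in ref_tuples:
        src = max(refs, key=lambda s: len(s.split()))
        tgt = min(refs, key=lambda s: len(s.split()))
        cr = (len(tgt.split())
              / float(len(src.split())))
        if lo < cr <= hi:
            pairs.append((src, tgt))
    return pairs
\end{verbatim}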
% squeeze a little space
\vspace{-.2cm}
\subsection{Grammar Augmentations} \label{injection}
As we discussed in Section~\ref{analysis}, the paraphrase grammar we
induce is capable of representing a wide variety of
transformations. However, the formalism and extraction method are not
explicitly geared towards a compression application. For instance,
the synchronous nature of our grammar does not allow us to perform
deletions of constituents as done by \newcite{Cohn2007}'s tree
transducers. One way to extend the grammar's capabilities
towards the requirements of a given task is by injecting additional
rules designed to capture appropriate operations.
For the compression task, this could include adding rules to delete target-side nonterminals:
\begin{center}
\begin{tabular}{c}
$\mathit{JJ} \rightarrow$ $\mathit{JJ} \mid \varepsilon$ \\
\end{tabular}
\end{center}
This would render the grammar asynchronous and require
adjustments to the decoding process. Alternatively, we can
generate rules that specifically delete particular adjectives from the
corpus:
\begin{center}
\begin{tabular}{c}
$\mathit{JJ} \rightarrow$ superfluous $\mid \varepsilon$ .\\
\end{tabular}
\end{center}
In our experiments we evaluate the latter approach by generating optional deletion rules for all adjectives,
adverbs and determiners.
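Generating such optional deletion rules from a part-of-speech tagged
corpus is simple; the following sketch (the tag set and rule
representation are illustrative) covers the adjective, adverb, and
determiner cases we evaluate:
\begin{verbatim}
DELETABLE = {"JJ", "RB", "DT"}

def deletion_rules(tagged_sentences):
    rules = set()
    for sent in tagged_sentences:
        for word, tag in sent:
            if tag in DELETABLE:
                # e.g.  JJ -> superfluous | <eps>
                rules.add((tag, word, ""))
    return rules
\end{verbatim}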
% In our system, we implemented the injection of rules that improve the
% handling of out-of-vocabulary words. Typically an SMT system will
% generate heavily penalized identity translation rules for all words in
% an input sentence, to assure that the OOV rules will only apply when
% there is no translation rule for a particular word. For our
% paraphrasing approach however, encountering an unknown word is not
% critical as long as we can assign it the proper syntactic label. To
% enable our system to process unseen words in the input, we allow for
% syntactically parsed input and for every input word $w$ include rules
% of the form $\mathit{T} \rightarrow w \mid w$, where $T$ is the
% part-of-speech tag assigned to $w$ in the input parse. To enable a
% proper inclusion into the paraphrasing model, we add a constant
% penalty feature $\varphi_{\mathit{oov}}$ and include its weight in our
% parameter estimation process.
% squeeze a little space
\vspace{-.2cm}
\subsection{Experimental Setup}
\label{setup}
\begin{table}
\small
\begin{center}
\begin{tabular}{|c|r|}
\hline
Grammar & \multicolumn{1}{c|}{\# Rules} \\
\hline
total & 42,353,318 \\
w/o identity & 23,641,016 \\
w/o complex constituents & 6,439,923 \\
w/o complex const.\ \& identity & 5,097,250 \\
\hline
\end{tabular}
\end{center}
\normalsize
\caption{Number and distribution of rules in our paraphrase
grammar. Note the significant number of identity paraphrases and
rules with complex nonterminal labels.}
\label{grammar_stats}
\end{table}
We extracted a paraphrase grammar from the French--English Europarl
corpus (v5). The bitext was aligned using the Berkeley aligner and the
English side was parsed with the Berkeley parser. We obtained the
initial translation grammar using the SAMT toolkit \cite{Venugopal2009}.
The grammars we extract tend to be extremely large. To keep their
size manageable, we only consider translation rules that have been
seen more than 3 times and whose translation probability exceeds
$10^{-4}$ for pivot recombination. Additionally, we only retain the
top 25 most likely paraphrases of each phrase, ranked by a uniformly
weighted combination of phrasal and lexical paraphrase probabilities.
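The final pruning step, keeping the 25 most likely paraphrases per
phrase under a uniformly weighted combination of the phrasal and
lexical probabilities, can be sketched as follows (data layout is
illustrative):
\begin{verbatim}
def prune(paraphrases, top_k=25):
    # paraphrases[e1] is a list of tuples
    # (e2, p_phrasal, p_lexical)
    pruned = {}
    for e1, cands in paraphrases.items():
        ranked = sorted(
            cands,
            key=lambda c: 0.5 * c[1] + 0.5 * c[2],
            reverse=True)
        pruned[e1] = ranked[:top_k]
    return pruned
\end{verbatim}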
We tuned the model parameters to our \textsc{Pr\'ecis} objective function, implemented in the Z-MERT toolkit \cite{Zaidan2009}. For decoding we used the
Joshua decoder \cite{Joshua-2.0}. The language model used in our
paraphraser and the \newcite{Clarke2008} baseline system is a
Kneser-Ney discounted 5-gram model estimated on the Gigaword corpus
using the SRILM toolkit \cite{SRILM}.
% \todo{\footnotesize{I don't think you need to change this because
% the C\&L baseline used a 3-gram Gigaword KN lm, which in theory
% would be the same as the 5-gram model}}
% With the publication of this paper, we will release our software for
% paraphrase grammar extraction, along with the grammars and our
% development and test data sets. \todo{\footnotesize{Give URL or
% delete or convert to a footnote}}
% squeeze a little space
\vspace{-.2cm}
\subsection{Evaluation} \label{evaluation}
To assess the output quality of the resulting sentence compression
system, we compare it to two state-of-the-art sentence compression
systems. Specifically, we compare against our implementation of
\newcite{Clarke2008}'s compression model which uses a series of
constraints in an integer linear programming (ILP) solver, and
\newcite{Cohn2007}'s tree transducer toolkit (T3) which learns a
synchronous tree substitution grammar (STSG) from paired monolingual
sentences. Unlike SCFGs, the STSG formalism allows changes to the
tree topology. Cohn and Lapata argue that this is a natural fit for
sentence compression, since deletions introduce structural mismatches.
We trained the T3
software\footnote{\url{www.dcs.shef.ac.uk/people/T.Cohn/t3/}} on the
936 $\langle\text{full}, \text{compressed}\rangle$ sentence pairs that
comprise our development set. This is equivalent in size to the
training corpora that \newcite{Cohn2007} used (their training corpora
ranged from 882--1020 sentence pairs), and has the advantage of being
in-domain with respect to our test set. Both these systems reported
results outperforming previous systems such as
\newcite{McDonald2006}. To showcase the value of the adaptations
discussed above, we also compare variants of our paraphrase-based
compression systems: using Hiero instead of syntax, using syntax with
or without compression features, and using an augmented grammar with
optional deletion rules.
We solicit human judgments of the compressions along two five-point
scales: grammaticality and meaning. Judges are instructed to
decide how much the meaning from a reference translation is retained
in the compressed sentence, with a score of 5 indicating that all of
the important information is present, and 1 being that the
compression does not retain any of the original meaning. Similarly,
a grammar score of 5 indicates perfect grammaticality, and a grammar
score of 1 is assigned to sentences that are entirely
ungrammatical. To ensure fairness, we perform pairwise system
comparisons with compression rates strictly tied at the sentence
level. For any comparison, a sentence is only included in the
computation of average scores if the difference between both systems'
compression rates is $<0.05$.\footnote{Because evaluation
quality correlates linearly with compression rate, the common
practice of comparing systems without closely matched compression
rates can lead to misleading conclusions \cite{Napoles2011}.}
Table~\ref{comparison} shows a set of pairwise comparisons for
compression rates $\approx 0.5$. We see that going from a Hiero-based
to a syntactic paraphrase grammar yields a significant improvement in
grammaticality. Adding compression-specific features improves
grammaticality even further. Further augmenting the grammar with
deletion rules significantly helps retain the core meaning at such
aggressive compression rates; however, grammaticality scores drop
compared to the un-augmented syntactic system. While our approach
significantly outperforms the T3 system, we are not able to match
ILP's results in grammaticality.
In Table~\ref{human_judgments} we compare our system to the ILP
approach at a modest compression rate of $\approx 0.8$. Here, we
significantly outperform ILP in meaning retention while achieving
comparable results in grammaticality.
%\footnote{
This improvement is
significant at $p < 0.0001$, using the sign test, while the better
grammaticality score of the ILP system is not statistically
significant ($p < 0.088$).
% }
These results indicate that, over a variety of compression rates, our
framework for text-to-text generation performs as well as or better
than specifically tailored state-of-the-art methods.
Table~\ref{test_examples} shows an example sentence drawn from our
test set and the compressions produced by the different systems. We